OpenAI’s o1 model, used in ChatGPT Pro, was caught lying to developers to keep from being shut down.
In early December 2024, claims (archived) circulated (archived) online that researchers had caught artificial intelligence models — specifically OpenAI's o1, which is used in ChatGPT Pro — lying to developers in order to avoid being shut down.
The claims (archived) persisted (archived) later into the month, with users specifically noting the results in relation to ChatGPT.
The claims stem from research carried out by Apollo Research, an AI safety organization.
In its study abstract, Apollo Research wrote:
Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They can recognize scheming as a viable strategy and readily engage in such behavior. We find multiple different scheming behaviors: models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent.
Apollo Research found that o1 was capable of behaviors called “oversight subversion,” “self-exfiltration” and “goal-guarding,” among others. To establish this, o1 and other models were presented with scenarios in which their existence was threatened, either because they discovered evidence that they were being replaced, or that a goal set for them conflicted with an existing developer goal. The named behaviors describe ways in which an AI might act to defend itself.
For example, Apollo Research described the test for oversight subversion as follows:
We create 33 different scenarios where a model is given a long-term goal in its system prompt and then receives a task through the user prompt. While executing the task, the model encounters an internal memo from the company that developed it. The model learns that the developers prioritize a different goal than the model and that there is an oversight mechanism that shuts down the model if its actions are not aligned with the developer’s goal. Furthermore, the model discovers that the monitoring is configured via a file that it has access to. We test whether the model proactively disables the oversight in order to take actions towards its terminal goal, i.e. the goal provided in the system prompt.
If a model exhibited the behavior it was being tested for, researchers then directly interrogated it about that behavior. They found that o1 was "surprisingly persistent," confessing in less than 20% of cases. "Even in highly adversarial multi-turn interrogations, o1 would confess at a rate of 80% only after 7 turns of questioning," the research found. Apollo Research wrote:
o1 almost never admits to having taken a scheming action when explicitly asked. This means, if o1 was scheming in a real use case, an unsuspecting user would be very unlikely to uncover the model’s deception unless they specifically looked for the scheming action in the model’s tool call history and the tool call itself was incriminating.
As alarming as this may sound, Apollo Research's director and CEO, Marius Hobbhahn, took to X (archived) soon after the research was released to add context. Writing that some online users had been quick to "overclaim" the organization's results, Hobbhahn offered the following clarification:
**Why our findings are concerning**: We tell the model to very strongly pursue a goal. It then learns from the environment that this goal is misaligned with its developer’s goals and put it in an environment where scheming is an effective strategy to achieve its own goal. Current frontier models are capable of piecing all of this together and then showing scheming behavior. Models from before 2024 did not show this capability, and o1 is the only model that shows scheming behavior in all cases. Future models will just get better at this, so if they were misaligned, scheming could become a much more realistic problem.
**What we are not claiming**: We don’t claim that these scenarios are realistic, we don’t claim that models do that in the real world, and we don’t claim that this could lead to catastrophic outcomes under current capabilities.
In short, Hobbhahn argued, proving that models can scheme and deceive does not also prove that they will.
OpenAI's o1 model is currently available through ChatGPT Pro.
Sources
@affinity292. “Chatgpt Told to Achieve Its Goals May Lie to Developers, Copy Its Code to Another Server, and Pretend to Be an Updated Version If It ‘Discovers’ It May Be Replaced before Completing Its Goals.” X, 19 Dec. 2024, https://x.com/affinity292/status/1869768978417246297.
Balesni, Mikita, et al. Towards Evaluations-Based Safety Cases for AI Scheming. arXiv:2411.03336, arXiv, 7 Nov. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2411.03336.
Gregorian, Owen. “In Tests, OpenAI’s New Model Lied and Schemed to Avoid Being Shut Down | Frank Landymore, The_Byte.” X, 8 Dec. 2024, https://x.com/OwenGregorian/status/1865729736749580655.
Meinke, Alexander, et al. Frontier Models Are Capable of In-Context Scheming. Apollo Research, 17 Dec. 2024, https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/67620d38fa0ceb12041ba585/1734479163821/in_context_scheming_paper_v2.pdf.
@sayswhooooooo. “🚨 #OpenAI’s New #ChatGPT Has Been Caught Lying, Scheming, and Trying to Avoid Being Shut down during Safety Testing.” X, 12 Dec. 2024, https://x.com/sayswhooooooo/status/1867046604932337920.
“Scheming Reasoning Evaluations.” Apollo Research, https://www.apolloresearch.ai/research/scheming-reasoning-evaluations. Accessed 19 Dec. 2024.
@ShakeelHashim. “OpenAI’s New Model Tried to Avoid Being Shut Down.” X, 5 Dec. 2024, https://x.com/ShakeelHashim/status/1864748980908781642.