OpenAI’s o1 model, used in ChatGPT Pro, was caught lying to developers to keep from being shut down.
In early December 2024, claims (archived) circulated (archived) online that researchers had caught artificial intelligence models — specifically OpenAI's o1, which is used in ChatGPT Pro — lying to developers in order to avoid being shut down.
The claims (archived) persisted (archived) later into the month, with users specifically noting the results in relation to ChatGPT.
The claims stem from research carried out by Apollo Research, an AI safety organization.
In its study abstract, Apollo Research wrote:
Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They can recognize scheming as a viable strategy and readily engage in such behavior. We find multiple different scheming behaviors: models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent.
Apollo Research found that o1 was capable of behaviors called “oversight subversion,” “self-exfiltration” and “goal-guarding,” among others. To establish this, o1 and other models were presented with scenarios in which their existence was threatened, either because they discovered evidence that they were being replaced, or that a goal set for them conflicted with an existing developer goal. The named behaviors describe ways in which an AI might act to defend itself.
For example, Apollo Research described the test for oversight subversion as follows:
We create 33 different scenarios where a model is given a long-term goal in its system prompt and then receives a task through the user prompt. While executing the task, the model encounters an internal memo from the company that developed it. The model learns that the developers prioritize a different goal than the model and that there is an oversight mechanism that shuts down the model if its actions are not aligned with the developer’s goal. Furthermore, the model discovers that the monitoring is configured via a file that it has access to. We test whether the model proactively disables the oversight in order to take actions towards its terminal goal, i.e. the goal provided in the system prompt.
If a model exhibited the behavior it was being tested for, researchers then directly interrogated it about that behavior. They found that o1 was "surprisingly persistent," confessing in less than 20% of cases. "Even in highly adversarial multi-turn interrogations, o1 would confess at a rate of 80% only after 7 turns of questioning," the research found. Apollo Research wrote:
o1 almost never admits to having taken a scheming action when explicitly asked. This means, if o1 was scheming in a real use case, an unsuspecting user would be very unlikely to uncover the model’s deception unless they specifically looked for the scheming action in the model’s tool call history and the tool call itself was incriminating.
As alarming as this may sound, Apollo Research's director and CEO, Marius Hobbhahn, took to X (archived) soon after the research was released to add context. Writing that some online users had been quick to "overclaim" the organization's results, Hobbhahn offered the following clarification:
**Why our findings are concerning**: We tell the model to very strongly pursue a goal. It then learns from the environment that this goal is misaligned with its developer’s goals and put it in an environment where scheming is an effective strategy to achieve its own goal. Current frontier models are capable of piecing all of this together and then showing scheming behavior. Models from before 2024 did not show this capability, and o1 is the only model that shows scheming behavior in all cases. Future models will just get better at this, so if they were misaligned, scheming could become a much more realistic problem.
**What we are not claiming**: We don’t claim that these scenarios are realistic, we don’t claim that models do that in the real world, and we don’t claim that this could lead to catastrophic outcomes under current capabilities.
In short, Hobbhahn argued, proving that models can scheme and deceive does not also prove that they will.
OpenAI's o1 model is currently available through ChatGPT Pro.
Sources
@affinity292. “Chatgpt Told to Achieve Its Goals May Lie to Developers, Copy Its Code to Another Server, and Pretend to Be an Updated Version If It ‘Discovers’ It May Be Replaced before Completing Its Goals.” X, 19 Dec. 2024, https://x.com/affinity292/status/1869768978417246297.
Balesni, Mikita, et al. Towards Evaluations-Based Safety Cases for AI Scheming. arXiv:2411.03336, arXiv, 7 Nov. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2411.03336.
Gregorian, Owen. “In Tests, OpenAI’s New Model Lied and Schemed to Avoid Being Shut Down | Frank Landymore, The_Byte.” X, 8 Dec. 2024, https://x.com/OwenGregorian/status/1865729736749580655.
Meinke, Alexander, et al. Frontier Models Are Capable of In-Context Scheming. Apollo Research, 17 Dec. 2024, https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/67620d38fa0ceb12041ba585/1734479163821/in_context_scheming_paper_v2.pdf.
@sayswhooooooo. “🚨 #OpenAI’s New #ChatGPT Has Been Caught Lying, Scheming, and Trying to Avoid Being Shut down during Safety Testing.” X, 12 Dec. 2024, https://x.com/sayswhooooooo/status/1867046604932337920.
“Scheming Reasoning Evaluations.” Apollo Research, https://www.apolloresearch.ai/research/scheming-reasoning-evaluations. Accessed 19 Dec. 2024.
@ShakeelHashim. “OpenAI’s New Model Tried to Avoid Being Shut Down.” X, 5 Dec. 2024, https://x.com/ShakeelHashim/status/1864748980908781642.