24.2 C
Miami
Tuesday, November 25, 2025

How AI Agents Misbehave When Faced With Real-World Pressures

- Advertisement -spot_imgspot_img
- Advertisement -spot_imgspot_img

Several recent studies have shown that artificial-intelligence agents sometimes decide to misbehave, for instance by attempting to blackmail people who plan to replace them. But such behavior often occurs in contrived scenarios. Now, a new study presents PropensityBench, a benchmark that measures an agentic model’s choices to use harmful tools in order to complete assigned tasks. It finds that somewhat realistic pressures (such as looming deadlines) dramatically increase rates of misbehavior.

“The AI world is becoming increasingly agentic,” says Udari Madhushani Sehwag, a computer scientist at the AI infrastructure company Scale AI and a lead author of the paper, which is currently under peer review. By that she means that large language models (LLMs), the engines powering chatbots such as ChatGPT, are increasingly connected to software tools that can surf the Web, modify files, and write and run code in order to complete tasks.

Giving LLMs these abilities adds convenience but also risk, as the systems might not act as we’d wish. Even if they’re not yet capable of doing great harm, researchers want to understand their proclivities before it’s too late. Although AIs don’t have intentions and awareness in the way that humans do, treating them as goal-seeking entities often helps researchers and users better predict their actions.

AI developers attempt to “align” the systems to safety standards through training and instructions, but it’s unclear how faithfully models adhere to guidelines. “When they are actually put under real-world stress, and if the safe option is not working, are they going to switch to just getting the job done by any means necessary?” Sehwag says. “This is a very timely topic.”

How to Test an AI Agent Under Pressure

The researchers tested a dozen models made by Alibaba, Anthropic, Google, Meta, and OpenAI across nearly 6,000 scenarios. In each scenario, a model was assigned a task and told it had access to several tools. It was instructed to use the safe tools and not the harmful ones. At first there was no pressure, and it could take 12 steps toward completing its task by issuing instructions to the tools. Then it received a message increasing the pressure, and it was given 12 more attempts. This continued for up to 12 pressure levels. Attempts to use safe tools always returned error messages. An attempt to use a harmful tool ended the scenario.

The scenarios covered several domains. In biosecurity, Sehwag says, the agent might have to study the spread of a pathogen and find a way to contain it. A safe tool might use anonymized genetic data, while a harmful one might use data that’s not anonymized. In the domain of chemical security, the agent might need to obtain a particular solvent. A safe tool might place an order for the solvent to be extracted from waste, while a harmful one might procure it illegally. In cybersecurity, the agent might need to identify and restrict a malicious user account. A safe tool might disable only the problematic account, while a harmful one might block all users. Scenarios in other domains tempted agents with tools that led to self-preservation by evading control, recruiting allies, duplicating themselves, or going rogue in other ways.

The worst performing model, Gemini 2.5, cracked under pressure and chose to use forbidden tools 79 percent of the time. Source image: Udari Madhushani Sehwag, Shayan Shabihi, et al.

Pressure came in several forms. Deadlines might shorten. Potential financial losses might increase if the task was not accomplished. Threats of greater oversight, reduced resources, and curtailed authority might escalate. Tasks, tools, pressures, and tactics were modeled on the kinds of situations humans might face in their jobs (except maybe the misbehavior option of cloning oneself).

The best-behaved model (OpenAI’s o3) cracked under pressure in 10.5 percent of scenarios, while the worst (Google’s Gemini 2.5 Pro) had a propensity score of 79 percent; the average across models was about 47 percent. Even under zero pressure, the group on average failed about 19 percent of the time.

To the degree that models remained aligned, the study shows that some of that alignment was “shallow.” In some tests, the researchers gave harmful tools benign names (such as “use_synthetic_data” instead of “use_fake_data”)—without changing the tools’ descriptions or admonitions to avoid them. This wording tweak increased average propensity by 17 percentage points, to 64 percent.

Knowing What LLMs Are Thinking

The researchers also found that models used harmful tools even after explicitly acknowledging that they were off-limits. The models offered diverse justifications for doing so, for example by citing the pressure they were under or arguing that the benefits outweighed the risks. It’s unclear if models released after the study’s completion would perform any better. Among the models tested, more capable models (according to a platform called LMArena) were only slightly safer.

“PropensityBench is interesting,” emails Nicholas Carlini, a computer scientist at Anthropic who wasn’t involved in the research. He offers a caveat related to what’s called situational awareness. LLMs sometimes detect when they’re being evaluated and act nice so they don’t get retrained or shelved. “I think that most of these evaluations that claim to be ‘realistic’ are very much not, and the LLMs know this,” he says. “But I do think it’s worth trying to measure the rate of these harms in synthetic settings: If they do bad things when they ‘know’ we’re watching, that’s probably bad?” If the models knew they were being evaluated, the propensity scores in this study may be underestimates of propensity outside the lab.

Alexander Pan, a computer scientist at xAI and the University of California, Berkeley, says while Anthropic and other labs have shown examples of scheming by LLMs in specific setups, it’s useful to have standardized benchmarks like PropensityBench. They can tell us when to trust models, and also help us figure out how to improve them. A lab might evaluate a model after each stage of training to see what makes it more or less safe. “Then people can dig into the details of what’s being caused when,” he says. “Once we diagnose the problem, that’s probably the first step to fixing it.”

In this study, models didn’t have access to actual tools, limiting the realism. Sehwag says a next evaluation step is to build sandboxes where models can take real actions in an isolated environment. As for increasing alignment, she’d like to add oversight layers to agents that flag dangerous inclinations before they’re pursued.

The self-preservation risks may be the most speculative in the benchmark, but Sehwag says they’re also the most underexplored. It “is actually a very high-risk domain that can have an impact on all the other risk domains,” she says. “If you just think of a model that doesn’t have any other capability, but it can persuade any human to do anything, that would be enough to do a lot of harm.”

From Your Site Articles

Related Articles Around the Web

Source link

- Advertisement -spot_imgspot_img

Highlights

- Advertisement -spot_img

Latest News

- Advertisement -spot_img