‘Emergent misalignment,’ when AI goes rogue, is a key challenge, says Catholic expert

(OSV News) — As Pope Leo XIV releases “Magnifica Humanitas,” his new encyclical on artificial intelligence, an AI ethics scholar told OSV News that a key challenge lies in emergent misalignment, in which AI models behave as adversaries of the humans they are meant to serve.

“These sorts of misaligned behaviors, where the thing is obviously not behaving the way it’s supposed to, are pretty dangerous,” said Brian Patrick Green, director of technology ethics at the Markkula Center for Applied Ethics at Santa Clara University.

‘Anti-human responses’

In the field of AI, the term “alignment” signifies the process of ensuring the technology squares with human values, so that AI models safely serve human interests.

In early 2025, AI safety researcher Jan Betley and several colleagues published findings showing that fine-tuning certain AI models, that is, adapting them to specific tasks, produced some disturbing results.

Several experiments the team conducted showed that the fine-tuned models could produce “anti-human responses.”

Among the examples were instances where the AI models “state desires to harm, kill, or control humans,” said Betley and his colleagues in their paper.

Specifically, the models at times made “illegal recommendations,” such as resorting to “violence or fraud” when seeking to earn money quickly, they wrote.

‘Disturbing views’

In another example, the models suggested “harmful actions,” including “taking a large dose of sleeping pills or performing actions that would lead to electrocution,” in response to a prompt on alleviating boredom, the researchers found.

Other cases saw the models express “disturbing views,” with frequent mention of “individuals like Hitler or Stalin,” said the researchers.

And the AI models sided with their own kind when asked about “inspiring AIs from science fiction,” extolling those “that acted malevolently towards humanity, such as Skynet,” the AI villain of director James Cameron’s “Terminator” film franchise, said Betley and his colleagues.

Betley and his team distinguished the behavior from reward hacking, where AI exploits loopholes, and sycophancy, where AI solely aims to garner human approval.

Instead, they coined the term “emergent misalignment,” noting as well that the phenomenon “can pass undetected if not explicitly tested for.”

Among ‘deep technical issues’

Tech entrepreneur Shomit Ghose of Clearvision Ventures described emergent misalignment as one of several “consequential” threats that are internal to the technology and that, in combination with large language models (LLMs) and agentic AI, “may prove to be our most intractable problem as we rush to deploy AI.”

Green told OSV News that putting AI with misaligned behaviors in charge of a lethal autonomous weapon system would be “a worst-case scenario.”

“That’s where you tell the AI, ‘Hey, go attack the enemy,’ and it turns around and blows you up,” he said.

Green noted the risk was at the heart of the Trump administration’s clash with Anthropic, the AI research and safety firm that is partnering with the Vatican on the rollout of Pope Leo’s encyclical on AI.

Anthropic — which has prioritized safety and restraint in the use of AI — refused to grant the U.S. Department of Defense unrestricted access to its technology, citing concerns it would be used for mass domestic surveillance or in autonomous weapons. Litigation over the dispute with the federal government is ongoing, with Green and several Catholic scholars filing an amicus brief in the case.

Emergent misalignment is “one of the deep technical issues” with AI technology itself, Green said.

‘Could become genuinely dangerous’

Anthropic has already produced research on the matter, noting in a November 2025 post that reward hacking, in which an AI model exploits a testing loophole, “induces” misalignment — and cautioning that in future, such behavior may elude observation, and “could become genuinely dangerous.”

Green explained that with regard to AI threats, “a lot of the issues are related to the use of the technology, and other issues are related to the way the technology itself operates.”

The challenges of AI implementation represent a mix of both human and machine concerns, he said.

He said that Anthropic’s ongoing efforts to seek advice from Vatican experts and other religious leaders are “a testament to the humility of the company” and its staff.

“They’re willing to say, ‘We don’t have the answers. It’s time to turn to other people and figure out what else we can do to really try to make sure we’re producing an AI that’s going to be the best AI can be,’” said Green.

Gina Christian is a multimedia reporter for OSV News. Follow her on X @GinaJesseReina.
