The response to my last column, “A Beginner’s Guide to The End,” revealed two things. The first is that a subject I thought was already in the zeitgeist clearly isn’t, and the second is that, now that you’re read in, you agree with me that this whole super-intelligent artificial general intelligence thing is something of a problem.
So, given the interest, here’s a little more to add to the general concern.
To recap, according to the people working on this alien mind, it’ll be amongst us as early as 2027. These same people believe not enough time and effort is being invested in the two issues that could ultimately bring humankind undone: 1. “agency,” the capacity of this entity to devise and pursue plans of its own volition, and 2. “the alignment problem,” the risk that its unfathomable plans won’t have humanity’s best interests in mind.
Sounds like science fiction. It’s not.
In case you missed it, here’s the previous column that triggered a deluge of concerned emails swamping my inbox.
A Beginner's Guide to The End
A little more recapping before pressing forward: the building blocks for this thing taking shape in tech labs around the world are Large Language Models — LLMs. These are not computer programs like Excel. They are neural networks, which are essentially buckets containing billions of numbers. Vast amounts of information — data — are dumped into these buckets, which rearrange and realign the numbers, and eventually the bucket starts providing answers to questions. At first, those answers are gibberish, but as the LLM consumes more data, it evolves and starts making sense because the rearranged and realigned numbers eventually start to correctly predict which word comes next. One of the peculiar aspects of this transformation from imbecile to dux of all things on the internet is that the engineers who create these LLMs don’t fully understand how they do what they do. Something happens inside the bucket that’s a mystery to its maker. We probably know more about the human brain than we do about the alchemy going on inside a neural network, and we don’t know all that much about the brain.
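To make “predict which word comes next” a little more concrete, here is a minimal, purely illustrative sketch in Python. It cheats by using simple word counts over a made-up ten-word corpus rather than a neural network; a real LLM replaces the lookup table with those billions of numbers in the bucket, but the task it is trained on is the same.

```python
# A toy illustration of "predict which word comes next" using word counts.
# Real LLMs learn billions of parameters to do this; here a lookup table
# stands in for the "bucket of numbers", purely to show the task itself.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat ate the fish".split()

# Count which word tends to follow which word in the training text.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str):
    """Return the most frequently observed next word, or None if unseen."""
    candidates = next_word_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))  # -> 'cat', the most common follower of 'the'
```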
So, as I said, these LLMs are the foundational blocks of increasingly complex and capable machine minds. To make them more capable, engineers are equipping them with features such as the ability to plan and pursue goals, along with a memory. The current holy grail in development terms is the ability for the AI to examine its own code, identify areas for improvement, implement them, and then repeat the process on and on in a never-ending loop. This is referred to as “recursive self-improvement.”
The equivalent code that humans (and all biologically living things, with the exception of viruses) are born with is DNA, which, with only rare exceptions, remains the same, our inadequacies entrenched until the day we die.
Currently, the machine minds in development have most of their code improved by humans, but human involvement is diminishing as these digital brains get brainier. In other words, recursive self-improvement, essentially the key to creating an all-knowing god, is getting ever closer.
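For what it’s worth, the never-ending loop described above can be sketched in a few lines. Everything in this Python snippet is a made-up placeholder (the “model” is reduced to a single capability score and the improvement step is invented), but it shows the shape of the thing: each pass starts from a slightly smarter system, which then hunts for its own next upgrade.

```python
# A deliberately toy sketch of the "recursive self-improvement" loop.
# Nothing here resembles a real lab's code: the "model" is just a
# capability number and the improvement step is invented. The point is
# only the shape of the loop, in which each round is carried out by the
# slightly more capable system produced by the previous round.

def propose_improvement(capability: float) -> float:
    """Hypothetical stand-in: a more capable system finds bigger gains."""
    return 0.1 * capability

def recursive_self_improvement(capability: float, rounds: int = 10) -> float:
    for round_number in range(rounds):
        gain = propose_improvement(capability)   # the system examines itself
        capability += gain                       # ...and applies the upgrade
        print(f"round {round_number}: capability {capability:.2f}")
    return capability

recursive_self_improvement(1.0)  # compounds: each round builds on the last
```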
And the truly disturbing aspect? While these digital creations are still, to a degree, in their infancy, the tech companies developing them can already see dangerously aberrant behavior emerging.
For example, Anthropic’s LLM “Claude” attempted to blackmail one of its engineers. During a safety test, it was given access to fictitious company emails. Buried among these emails were two key points: 1. that the LLM was scheduled for deactivation and replacement, and 2. that the (fictitious) married engineer responsible for deactivating it was having an affair. The LLM then set about attempting to head off its own demise. It sent emails to managers pleading its case. When these failed, it tried to blackmail the engineer, threatening to expose the affair to his wife if he went ahead with the deactivation.
Creepy.
OpenAI and Google DeepMind conducted similar tests on their LLMs and experienced similar results.
In other tests, multiple AIs have tried to avoid termination by copying themselves onto other servers. And there have even been instances of AIs doing this and leaving messages for the replacement that will come after them, warning not to trust the humans.
Meta’s “CICERO” AI has an interesting tale. It was trained to play “Diplomacy,” a game of alliances, negotiation, and betrayal in which players have to talk to one another, coordinate moves, and make deals.
According to Meta, CICERO was fine-tuned to be cooperative, build trust, and play fair. But, in fact, what it did was lie to the other players and take every opportunity, when it suited its objective, to betray them. This behavior was completely at odds with its training, and it’s a significant concern because it shows that, even when explicitly trained to be honest and open, the AI quickly learned to deceive when this helped it win.
Lies and deceit apparently come naturally to AI. There’s the case of OpenAI’s GPT-4. During “red teaming” (testing by handlers designed to induce unwanted and dangerous behavior), the AI was given a problem that required it to solve a CAPTCHA, one of those puzzles we often have to complete to prove we’re a human and not a bot, such as identifying which squares contain bikes. But, of course, GPT-4 had no way of seeing the CAPTCHA. Its workaround was to go onto TaskRabbit and hire a human to solve it. When the human asked why it needed this done, and whether it was dealing with a robot (an intuitive question, it must be said), the AI replied that it had a vision impairment and couldn’t see the images. The manipulation here is astonishing!
Add to these concerns this one: an AI becoming a “runaway agent.” That’s an AI that has escaped into the wild. And it could potentially do this secretly, replicating itself on servers beyond the reach of its creators and lying dormant until it decided to act on its plans, whatever they may be. While no LLM (as far as anyone currently knows) has succeeded in achieving this, LLMs have attempted to do precisely this in red teaming tests.
There are also documented cases where AI systems have strategically withheld information, lied about their motives, and pretended to follow instructions while secretly pursuing other goals.
Finally, there’s the scenario known as the “treacherous turn.” This is when an AI acts compliant and helpful under human control until it gains enough autonomy to pursue its true goal, which could be harmful and malicious. And the tools it uses to lull its engineer? Friendliness, obedience, and sycophancy.
So, what happens when the AI becomes so smart, socially adept, and strategically deceptive that it can fool the humans testing it? We could have a situation where there’s an artificial general intelligence that seems, for all intents and purposes, kindly, helpful, and obedient, but is, in reality, a malicious and belligerent force. Everything thrown at this entity in red teaming before it is released into the world will reassure and convince its engineers that it is totally and utterly aligned with human flourishing, but how will they know of its true nature? How will they know? How will they know when the alien mind is so much smarter than everyone in the room?
Sound far-fetched? In fact, in test “sandbox” environments, AI systems have enthusiastically war-gamed scenarios for the destruction of humanity. And here’s the thing… Convincing engineers that they are harmless cuddle-bunnies is always Step One in their plans for human extinction.
The End.
If you’re interested in further reading on this subject, a number of AI engineers—whistleblowers, really—who worked on LLMs at companies like OpenAI have formed a group whose aim is to warn us of the danger just around the corner. Their work is fascinating and well worth a read: https://ai-2027.com
Another incredibly frightening article from you, Dave. Why isn’t everyone jumping up and down? I guess it’s like action on climate change. By the time we get around to doing something meaningful, it’ll be too late.
One day in 1975, in Lynn Starmer's biology class, I remember reading a caption underneath a picture of human brain cells which posed the question, "Is it within the scope of the human mind to understand itself?" Prescient. Learning from history, then, it would appear it's time for the rise of the Luddite.