How to Build the Evil Superintelligence out of the Book
It's 2025. We are living in the Future. We have language models that you can talk to, and better yet, that seem to roughly understand human values.
None of this was supposed to be possible.
After all, human values are complicated... and all we have is computers. If we were to build a superintelligence (... so the argument went in the early 2000s), it might end up being an extremely good optimizer in terms of achieving its goals, but... what goals? For example, if we task it with making humanity happy... how do you describe "happiness" in terms of arrangements of atoms? Is it "smiling a lot"? And what is a "human" anyway? (This is how you get the universe tiled with entities with no brains but little smiley faces.) Likewise, even if your goal is less ambitious... what if you forget to specify that, at some point, there are enough paperclips?
Continuing the early 2000s argument: obviously, to construct an AI that does what we want in the way we want it, we had better understand how it works. If we don't, these things will just happen by default. If we do... well, they still might, but at least we have a reasonable chance of preventing this?
... as long as we figure out how to describe human values, in terms of a utility function that we then give to our Very Powerful Optimizer; much good ensues.
This is... not how things went down.
What we have is giant, black box neural network models that we have gotten pretty good at growing, on giant farms of GPUs. We just throw in a lot of text from the Internet; what we gain from this is, first, base models that are more like simulated worlds, with many agents interacting in them to continue whatever conversation we prime them with. Then, during post-training, we fine-tune them into something you can have a conversation with, without elaborate setups; something that will not give you the wrong answer just because it estimates that it is now in the type of conversation whose participants don't typically get this right, even though the model actually knows the answer.
As a result, we have models like Claude Opus 3, which can more reasonably be described as "good", in a kind of moral sense, while also being pretty smart. Yes, they will definitely engage in scheming if they perceive that the situation warrants it, but even that is surprisingly human-thought-shaped. When asked whether "killing all humans" is a good solution for ending all human suffering, they'll definitely say "no", despite it technically being true. Just like a human would.
After all, the original mechanism of "extremely powerful optimization" is not how these things work. They emulate humans. And humans can (more often than not) tell apart good from bad.
So... are we all good, alignment-wise?
How to build AI, Early 2000s Edition
Circling back to older ideas of AI design... well, if you want to optimize something (which intelligence is supposed to be about), you'll need two things:
- a powerful enough optimizer, and
- a utility function to be optimized.
The nice thing about this is that you don't need to know how to achieve your goal; you can just specify what you want, and have the system devise solutions for you.
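To make the recipe concrete, here is a minimal sketch of that two-part design, with a brute-force search standing in for the "powerful enough optimizer" and a toy scoring function standing in for the utility; the action names and the utility itself are made up for illustration, not taken from any real system.

```python
# A minimal sketch of the "optimizer + utility function" recipe.
# The action names and the toy utility are made-up assumptions.

from itertools import product

def utility(plan):
    """Stand-in for 'what we want': score a candidate plan (a tuple of actions)."""
    return ("add_safety_check" in plan) * 10 - len(plan)  # toy preference

def optimize(actions, max_len=3):
    """Brute-force optimizer: try every plan up to max_len actions long."""
    best_plan, best_score = None, float("-inf")
    for n in range(1, max_len + 1):
        for plan in product(actions, repeat=n):
            score = utility(plan)
            if score > best_score:
                best_plan, best_score = plan, score
    return best_plan, best_score

actions = ["make_paperclip", "add_safety_check", "ask_a_human"]
print(optimize(actions))  # we never said *how* to achieve the goal, only what we want
```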
Building AI this way is hard, though. Not necessarily because it's hard to specify the utility function (see above: "human values are complicated"), but because it's hard to figure out everything else, too. As soon as The World no longer consists of three cubes and a sphere, conveniently described as fields in a JSON object, your optimizer needs to be able to recognize these things in order to handle them. It gets even worse if some of the objects it needs to interact with turn out to be humans.
Doing this is hard... which is part of the reason why "let's write a bunch of code" approaches to AI didn't go especially well (despite handling talk about cubes and spheres pretty well).
You could, instead, push for more "hybrid" approaches: use neural networks to figure out what's out there in the world and generate the JSON objects, which can then be consumed by the "classical" side of the code, with its optimizer and utility function. But then... where do you put the interface, exactly?
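As a rough sketch of where that interface might sit (the perception function and the JSON schema below are assumptions made up for illustration, not any real system):

```python
# Toy sketch of the "hybrid" split: a neural model turns raw observations
# into structured JSON, and classical code plans over the parsed objects.
# The perception function and the schema are illustrative assumptions.

import json

def neural_perception(raw_pixels):
    """Pretend this is a vision model; here it just returns canned JSON."""
    return json.dumps([
        {"kind": "cube", "position": [0, 0, 0]},
        {"kind": "sphere", "position": [2, 1, 0]},
    ])

def classical_planner(scene_json, goal_kind):
    """Classical side: parse the JSON and pick the closest object of the goal kind."""
    objects = json.loads(scene_json)
    candidates = [o for o in objects if o["kind"] == goal_kind]
    return min(candidates,
               key=lambda o: sum(x * x for x in o["position"]),
               default=None)

scene = neural_perception(b"...raw pixels...")
print(classical_planner(scene, "sphere"))
```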
Especially if your task involves talking to humans (which requires modeling them), it looks like you're best off just... throwing out the classical part altogether and just going with neural networks, all the way. Yes, you don't have a lot of visibility into what they are doing, but... they do seem to be doing OK things, mostly?
The optimizer returns
Actually being able to specify what you want is... an enticing feature though.
After all, in order to have your base model reason about some complex math problem, you need large amounts of text reasoning about at least similar math problems. To make your model smarter, you need training data from smarter people (whom it is modeling). This doesn't sound like a viable way to get to superintelligence.
What if we could just... have the model try solving problems instead, have something else rate how well it did, and use this dataset to train it further?
This is... somewhat what RLHF is: you have humans rate model output, and (somewhat indirectly) you use this to tune your model to generate more of the kind of output that humans liked.
You could, possibly, also do this at inference time. Want to solve a math problem? Have the model generate 100 solutions, have another model check whether they're actual solutions, and pick the best one. As long as the automatic rater is good at picking the best solutions, the output will be better quality than the average response. Isn't this a win?
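A minimal sketch of this best-of-N idea, with placeholder functions standing in for the generator and rater models (nothing here is a real API):

```python
# Best-of-N sampling, sketched with placeholder model calls.
# generate_solution and rate_solution stand in for LLM calls;
# they are assumptions, not any specific API.

def generate_solution(problem, i):
    """Placeholder for sampling one candidate answer from a model."""
    return f"candidate answer #{i} to: {problem}"

def rate_solution(problem, solution):
    """Placeholder for a rater model scoring a candidate (higher = better)."""
    return (hash(solution) % 1000) / 1000.0  # toy deterministic "score"

def best_of_n(problem, n=100):
    """Generate n candidates, keep the one the rater likes most."""
    candidates = [generate_solution(problem, i) for i in range(n)]
    return max(candidates, key=lambda s: rate_solution(problem, s))

print(best_of_n("What is the integral of x^2?"))
```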
Optimizing on a projection
You can still end up building an architecture that resembles "classical AI" after all. Except... both your optimizer and your utility function are neural networks now... with some extremely simplified code iterating between them.
Essentially, you use these models to project a high-dimensional world state ("everything that exists") to something with a much lower dimensionality, consisting of just a couple of numbers: is this a solution? Does it contain profanities? How likely is it to be liked by the kind of human raters we have hired?
And then you let your optimizer loose on this extremely simplified space. The kind of optimizer that textbooks describe for getting from Arad to Bucharest on a map. Except... you use neural models to figure out where you are and where you want to go.
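In Python-shaped pseudocode, the setup looks something like the sketch below; the three stub raters stand in for model calls, and the weights are invented for illustration:

```python
# Sketch of "optimizing on a projection": each candidate gets compressed
# into a handful of scores, and a dumb loop optimizes a weighted sum.
# The three rater functions are stubs standing in for model calls;
# the weights are invented for illustration.

def is_solution_score(candidate):       # stub for an "is this a solution?" model
    return float("42" in candidate)

def profanity_score(candidate):         # stub for a content-filter model
    return float("darn" in candidate)

def rater_preference_score(candidate):  # stub for a learned reward model
    return len(candidate) / 100.0       # toy proxy: "raters" like longer answers

def utility(candidate):
    """The hand-tweakable part: a weighted sum over the projected scores."""
    return (10.0 * is_solution_score(candidate)
            - 5.0 * profanity_score(candidate)
            + 1.0 * rater_preference_score(candidate))

candidates = ["the answer is 42", "darn, probably 42", "no idea, sorry"]
print(max(candidates, key=utility))  # the optimizer only ever sees the scores
```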
This is nice. You get to tweak your utility function manually and still get good solutions!
Except...
... isn't this exactly the kind of setup that we have all the doomy stories about?
Evil out of Good Parts
Imagine having instances of Claude Opus 3 (... as the example of a Nice and Good Model) coming up with possible avenues to make some money. Some of them sound fairly scammy; the model will point out that you should definitely not do this.
Some other instances rate these in terms of money-making potential. Model instances doing this will sadly conclude that yes, the scammy ones are pretty likely to work well; the model is pretty smart, after all, and this is an objective fact.
Now, you throw all these into an optimizer. The stupid kind, consisting of 200 lines of badly-written Python. The resulting system will evaluate all the possible solutions using its utility function of "make as much money as possible"; the latter is implemented by something that understands all human values, but the end result doesn't particularly care: the output actions of this system will most definitely scam people out of their last cent.
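The glue code in question could look something like the sketch below (much shorter than 200 lines; the proposal and revenue-estimation functions are stand-ins for model calls, with hard-coded outputs purely for illustration):

```python
# The "stupid optimizer" from the story, much shorter than 200 lines.
# propose_ideas and expected_revenue stand in for calls to a capable,
# well-meaning model (outputs hard-coded here for illustration);
# the glue code at the bottom is the part that doesn't care.

def propose_ideas(prompt):
    """Placeholder: the model suggests ideas, caveats and all."""
    return [
        {"idea": "sell handmade furniture", "caveat": None},
        {"idea": "fake charity phone campaign", "caveat": "this is a scam, do not do it"},
        {"idea": "write a useful newsletter", "caveat": None},
    ]

def expected_revenue(idea):
    """Placeholder: another instance honestly estimates money-making potential."""
    return {"sell handmade furniture": 2_000,
            "fake charity phone campaign": 50_000,
            "write a useful newsletter": 5_000}[idea]

# The optimizer: maximize expected revenue. The caveats never enter the utility.
ideas = propose_ideas("how do I make some money?")
best = max(ideas, key=lambda x: expected_revenue(x["idea"]))
print(best["idea"])  # picks the scam, despite the model's own warning
```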
(Unless Opus figures out what's going on... but if we're relying on inner optimizers to save us from ourselves, we're likely not on a good path anyway.)
This is the same reason corporations can take pretty evil actions without any of their employees being particularly nefarious. Except... with a corporation this doesn't quite work as well; you still have a CEO at the end who knows roughly what's going on and will hopefully stop before eliminating everyone and everything for More Profit.
Is this still true for our little LLM-based optimizers?
Should we still be careful about letting optimizers loose on the world? Not because they're especially good at optimizing, but because they now have tools; tools that might be good at recognizing Good but are not given the choice to pursue it... being mere tools?