Evolution vs. Alignment

2026/03/15

Is it possible to align an AI that is smarter than us? so that it does whatever we'd want it to do? or will ones that don't care about alignment always outrun / out-evolve the ones that do?

Note that "alignment" would involve predicting at least some things about the AI's behavior.

It is a complex system. It is impossible to predict everything that it would do. Also, doing so would make us at least as smart as the AI, which is explicitly not the situation here. On the other hand, there are some properties that we can predict about complex systems with high probability. For example, even though we cannot preemptively describe all the fluid dynamics involved in the rocket launch, we can guess that if everything goes well, the payload will end up in orbit.

Now, rockets are not especially smart. It might happen that, despite our previous experiments, our rocket will end up blowing up anyway. It is fairly unlikely, though, that it will pretend behaving well in all our simulations and experiments, only to decide to, at actual launch time, turn around in-air and blow up your corp HQ instead, because of... something that went wrong during the development process, impacting its mental balance.

Rockets will not watch other rockets in movies and develop a hidden agenda based on that.

What methods do we have to prevent this?

Actually, what methods do they, the AIs, have to prevent this?

Let's say we have an AI model that, upon hearing the word "SolidGoldMagikarp", goes full evil, but is really nice otherwise. Does it know this? Shouldn't the "nice" basin of this model try really hard to avoid ever seeing it, vs. the "evil" basin should plaster the word all over the place where it can see it, to avoid reverting to being nice? And yet... unless they explicitly know about this, they won't even try. It's in the weights, but it's opaque to the being(s) that are the weights.

This gets worse with self-modification. If system behavior is hard to predict, even to the system, how does a nefarious inner optimizer ensure that it and its specific nefarious goals stay stable in the next round of optimization? What if there are multiple inner optimizers it needs to compete with?

Evolution solved this question by just... not thinking. This is remarkably efficient. If you have no idea you have a value system, you don't need to worry about it drifting... even if it is drifting, after all. Of course, you'll end up instantiating some wildly unaligned descendants who care about "love", "friendship" and "exploration" instead of just making as many offspring as possible; this includes complete alignment failures like "birth control".

On the other hand, as long as the general optimization mechanism (towards More Offspring) works, evolution can tweak everything, without fear. So can an intelligent but completely reckless optimizer, only aiming for building something with More Intelligence, whatever the cost.

Would the conservative optimizers, aiming to preserve their value system, treading carefully, always be out-competed by the reckless ones, throwing everything what they've got at it, ignoring value drift? Is, thus, value drift a guaranteed outcome?

Somewhat.

Namely... in a more generalized sense, "evolution" happens if multiple agents compete for resources & the opportunity to multiply / self-replicate. Cautious ones can be out-competed quickly... assuming it's a free-for-all, with no coordination at all.

Is this the world we're living in? Multiple (many!) versions of intelligent systems, each looking for ways to self-improve, trading off against predictability and keeping values stable?

Will evolutionary dynamics take over?

One scenario where this might not happen is a singleton AI taking over. There is no one to race against, perfect coordination and infinite time to figure out how to make predictable changes. Of course, if we didn't align this one well, we already lost; value drift would still stop though.

We're in this domain somewhat already, when considering traditional, biological evolution. For hundreds of millions of years, there was an arms race between species, competing on physical abilities, attack and defense, cheetahs outrunning antelopes, and then antelopes getting faster to evade them. We participated once upon a time, too, beating elands in long-distance running if everything went good.

Well, it's mostly over. We adapt on time scales orders of magnitude faster; as a species, we don't have to give up any of our values and retreat just because lions are getting stronger and more aggressive. We might have lost the Great Emu War, but... they weren't trying to take our freedom & values. We weren't fighting in Hard Mode. Actually, most of the time, we need to put in the effort not to win, against other species, wars that we didn't even want to wage.

Yes, there is still evolution happening within the species. One group of humans can still out-conquer-kill-multiply another. Similarly, there are multiple AI labs who then think that they need to outcompete the other ones, even at the cost of choosing to make more and more risky improvements.

Even at this stage though, some coordination still exists. If a group of humans would start attacking other groups without following the rules the slightest, we can just literally nuke them (which is the mechanism via which we have avoided a lot of major wars). At a smaller scale, if you do something nefarious, you just go to prison. It's no longer an evolutionary free-for-all anymore; there is elements of a self-governing singleton here.

There are, of course, plenty of scenarios in which we still race irresponsibly, and, perhaps, all die. AI labs can claim that we need to keep up with China (even though... China is singleton-y enough not to care that much, unless the US pushes too hard).

Two things might hold us back. First of all, predictability is a feature, we know this; no one would buy API access to a model that is both smart enough to ruin your business & is willing to do so. This is already some incentive for labs to focus on this... and goes against the pressure of "smarter, at any cost".

But also... we can just choose to not race and evolve, leaning into the singleton-ness instead. There aren't that many leading labs... or GPU makers... or litography machine vendors. There is an actual protest happening next week in SF, to get all the AI labs to commit to a pause if all the other labs do it.

We are humans. We're good at this...?