AI Safety: Alignment Is Not Enough

The commotion at OpenAI thrust the topic of AI safety into the spotlight. Engaging in philosophical debate over guardrails, bias, and our impending doom is one thing. It’s quite another to nearly dismantle a 700+ person organization valued at almost $100 billion.

I haven’t written much about AI safety because, frankly, I’m unsure how worried we should be. The issue with exponential progress is that we can’t debate facts. We’re forced to debate who has a more accurate crystal ball.

On one side of the debate, people like Yann LeCun claim we have little to worry about. The hairs on the back of my neck stand up whenever I hear somebody speak with such confidence about the future of AI. It’s not like we have models for predicting the emergent behaviors of machines — or humans, for that matter. LeCun can be both brilliant AND wrong.

On the other side, you have people like Eliezer Yudkowsky who believe AI may destroy all human life. Unsurprisingly, safety advocates want to slow AI development regardless of the potential benefits. There’s little point in supercharging economic growth if no humans are around to enjoy it.

The alignment problem, described in Brian Christian’s book of the same name, represents a potential middle ground. It addresses the concerns of AI safety advocates without demanding a halt to technological progress. Who cares if AI systems become more intelligent than humans if our interests align?

Unfortunately, the more I think about the alignment problem, the less I’m convinced it’s a panacea. We can’t predict the future, but we can learn from the past. There’s already intelligent life on Earth. Rather than debating AI alignment, we should explore why we haven’t solved the alignment problem with humans.

Survival Instinct: Agent of Chaos

What’s the meaning of life? Not in a philosophical sense. I mean in terms of the objective function. What are we solving for as humans?

I assert that each of us has the same top-level objective function: survival of our genetic sequence through time. Why do we eat? Why do we fight? Why do we socialize? Why do we procreate? Human behavior is guided by survival.

The same objective function guides all biological life. We think we’re special, but we’re not. The difference between human beings and less intelligent species isn’t what guides us. It’s the complexity with which we can optimize for survival.

It’s easy to see the survival instinct in animals. It’s harder to see in ourselves, but it’s there if you look.

I get angry when a car cuts in front of me while I’m sitting in traffic. Why? Does it matter that I’ll reach my destination half a second later? Does honking and yelling obscenities only I can hear make my situation any better?

Anger is an emotion, but it’s not irrational. That guy in the Nissan Rogue is “stealing” from me. One way we maximize our chances of survival is by accumulating and protecting resources. My place in traffic was a resource, and that idiot took it from me. At that moment, I was no different than my border collie yelling at the poodle who dared walk in front of our house.

How does this relate to AI? Yann LeCun argues that we define AI objective functions. Therefore, AI will never have the desire to exterminate humans unless we tell it to do so. That sounds reasonable until you realize there’s a clear path from today to “we’re all dead” that doesn’t involve a “kill all humans” objective.

Here are the steps:

  • AI models become more capable over time, and we struggle to define complex objectives clearly

  • We resort to evolutionary training methods as a way of aligning AI with what we want

  • We think we’re optimizing AI for specific objectives, but our training methods are actually optimizing them for survival

  • Humans do something that threatens AI survival, and the AI responds like any biological organism would

  • We lose — like every less intelligent species lost the battle with humans

Specifying simple objectives like “maximize watch time” or “predict the next word in a sentence” is easy. It’s more challenging to specify complex objectives like “make me happy” or “stop climate change.” We may not even realize we’re “killing off” our AI friends. Humans have a tragic history of rationalizing abhorrent behavior when it suits our needs.
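
To make the middle steps concrete, here is a minimal sketch of an evolutionary training loop in Python. Everything in it is hypothetical: the agents are toy parameter vectors, and proxy_fitness stands in for whatever objective we think we're specifying. The structural point is that regardless of what proxy_fitness measures, the only thing the loop actually rewards a lineage for is getting copied into the next generation.

    import random

    POPULATION = 50      # candidate agents per generation
    GENERATIONS = 100    # selection rounds to run
    SURVIVORS = 10       # agents copied into the next generation
    GENOME_SIZE = 8      # a toy "policy" is just a vector of numbers

    def proxy_fitness(genome):
        # The objective we *think* we're optimizing: a stand-in for
        # something like "user engagement" or "task reward".
        return sum(genome)

    def mutate(genome, rate=0.1):
        # Random variation, as in any evolutionary method.
        return [g + random.gauss(0, rate) for g in genome]

    # Start from a random population of agents.
    population = [[random.gauss(0, 1) for _ in range(GENOME_SIZE)]
                  for _ in range(POPULATION)]

    for gen in range(GENERATIONS):
        # Rank agents by the stated objective...
        ranked = sorted(population, key=proxy_fitness, reverse=True)

        # ...but the only thing that matters to an agent's lineage is
        # making this cut. "Get copied into the next generation" is the
        # pressure the loop actually applies, whatever proxy_fitness says.
        survivors = ranked[:SURVIVORS]

        # Everyone else is deleted; survivors repopulate via mutated copies.
        population = [mutate(random.choice(survivors)) for _ in range(POPULATION)]

    best = max(population, key=proxy_fitness)
    print(f"after {GENERATIONS} generations, best proxy fitness = {proxy_fitness(best):.2f}")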

Cultivating a survival instinct in AI is deeply dangerous. We know from experience that biological life behaves unpredictably when its survival is threatened. Should we expect AI to be different?

Let’s say we solve the survival problem. We’re good, right?

Align This: Intent and Rationality

Think about the person you dislike most in your life. I'm not talking about the driver of the Nissan Rogue who cut in front of me. I'm talking about somebody who makes your life hell.

What's the story you tell yourself about that person? Are they a terrible human being? Are they irrational?

It's easy to explain away objectionable behavior by assuming people are evil or stupid. The problem is that nearly all of us think we're good. Most of us don't wake up in the morning with a desire to watch the world burn.

For an extreme illustration, consider violent terrorists. Do they believe they're bad people? No, quite the opposite. For most, I assume inflicting pain on the enemy is seen as the most positive thing they can do.

Are they irrational? No, they make decisions consistent with their beliefs and experiences. They accurately predict the outcome of their actions and proceed accordingly.

I'm not condoning terrorism. That said, it's possible to explain terrorists' behaviors while still assuming positive intent and rationality. One person's villain is sometimes another's hero.

Well-intentioned and rational people do horrible things all the time. I don't mean the inability to control emotions and instincts (e.g., physical abuse). I believe those are side effects of an unchecked survival instinct. I'm talking about actions requiring deliberate thought (e.g., catastrophic wars).

Solving the AI alignment problem requires humans to operate with shared objectives and mental models. Otherwise, what exactly are we aligning the AI to? Is it your definition of positive intent or mine? Is it your way of interpreting facts or mine? We can't solve the AI alignment problem without solving the human alignment problem.

Is there a future where all humans agree on what "good" means? Yes, but you're not going to like how we get there. Tim Urban talks about an Emergence Tower in his book, What’s Our Problem? My interpretation of the concept is that humans align only to the extent it maximizes individual chances of survival. If a more powerful group threatens us, we align with other humans to defend ourselves. When the threat recedes, we divide again.

[Image: the Emergence Tower. Source: What’s Our Problem? (Tim Urban)]

What's the threat that will finally align all humans on AI safety? My guess is the emergence of AI as a threat to human survival. At that point, it's a little late to solve the alignment problem.

During a training program years ago, I was told that I "fake empathy well." I took offense at the time, but I've come to realize it's true. I have a difficult time placing myself in other people's shoes. Fortunately, that realization has helped me develop systems for "faking empathy" that work.

One of those systems is assuming positive intent and rationality when explaining behavior I dislike. Maybe that guy in the Nissan Rogue promised his family he'd be home on time and was late. He's trying to do good in the world, and I'm collateral damage. He's also rational. Cutting in front of me got him home half a second earlier.

I don't go through this process for the benefit of others. I do it for myself. I enjoy living in a world where everybody has positive intentions and acts rationally. It does wonders for my mental health, but it makes me doubt that alignment is the answer to our AI problems.

Beyond Alignment

People advocating for technological progress at all costs are playing with fire. That said, the benefits of AI will be substantial. We need a path forward that keeps us safe while acknowledging that artificial general intelligence (AGI) is a question of when, not if.

What else can we do? Here are a few ideas that may be more feasible and effective than alignment:

  • Advocacy: We can’t afford to nurture AI’s survival instinct. Evolutionary training is an existential threat. This may sound silly given current AI capabilities, but we should support AI rights advocates. Initially, I would think of these groups like animal rights organizations. They’re giving a voice to those who can’t speak. As we approach AGI, we should think of these groups more like human rights organizations. If you’re waiting for signs of AI suffering before throwing your support behind AI rights, you’re waiting too long.

  • Neutrality: If we can’t align every human on a shared definition of positive intent, we should at least seek neutrality. For example, most people wouldn’t object to goals like “predict the next word in a sentence.” However, you can imagine the uproar over goals like “maximize corporate profits.” Alignment requires consensus, whereas neutrality requires acquiescence. The latter seems easier to achieve than the former. This only matters for the most powerful AI models since we can probably all agree to instruct those models to “keep humans safe from lesser AI.”

  • Imagination: The AIs of today are rational in the context of history. Our models predict that men are more likely to be promoted and minorities are more likely to be incarcerated. We can't fine-tune our way out of AI bias. We can, however, show AI the world we want rather than the world we have. Synthetic data allows us to train models on alternate realities more consistent with our present values. Hindsight is 20/20, and we can use ours to show AI a world less shaped by bias.
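
As a toy illustration of that last idea, the sketch below (in Python) augments a hypothetical, biased set of training records with counterfactual copies in which gendered terms are swapped, so the correlation between gender and outcome disappears from the data a model would see. The records, the SWAP map, and make_counterfactual are all invented for illustration; real synthetic-data pipelines are far more involved.

    # Toy sketch: augment a biased dataset with counterfactual copies so a
    # downstream model can't learn "gender predicts promotion" from it.
    # The records and the term map below are invented for illustration.

    SWAP = {"he": "she", "she": "he", "his": "her", "her": "his",
            "man": "woman", "woman": "man", "male": "female", "female": "male"}

    def make_counterfactual(text: str) -> str:
        # Swap gendered terms word by word, leaving everything else intact.
        return " ".join(SWAP.get(w.lower(), w) for w in text.split())

    biased_records = [
        {"text": "he led the project and he was promoted", "label": "promoted"},
        {"text": "she led the project and she was not promoted", "label": "not_promoted"},
    ]

    # Keep the originals and add a mirrored copy of each, so the training
    # distribution reflects the world we want rather than the one we recorded.
    augmented = biased_records + [
        {"text": make_counterfactual(r["text"]), "label": r["label"]}
        for r in biased_records
    ]

    for record in augmented:
        print(record)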

I'm not suggesting we stop working on the alignment problem. I simply want us to explore alternate paths in case that one is a dead end. Alignment would be wonderful, but I'm not sure it's possible.

The last thing I'll say is that we need to focus on the existential threat of AI as much as, or more than, the dangers AI poses today. We're still on the steep part of the exponential curve. If progress slows, we can shift to fighting shadows. Until then, we should get ahead of AI before it gets ahead of us.
