Values are complex and fragile
One day, my friend Niel asked his virtual assistant in India to find him a bike he could buy that day. She sent him a list of bikes for sale from all over the world. Niel said, “No, I need one I can buy in Oxford today; it has to be local.” So she sent him a long list of bikes available in Oxford, most of them expensive. Niel clarified that he wanted an inexpensive bike. So she sent him a list of children’s bikes. He clarified that he needed a local, inexpensive bike that fit an adult male. So she sent him a list of adult bikes in Oxford needing repair.
Usually humans understand each other’s desires better than this. Our shared, evolved psychology gives us common sense and common desires. Ask me to find you a bike, and I’ll assume you want one in working condition, that fits your size, is not made of gold, etc.—even though you didn’t actually say any of that.
But a different mind architecture, one that didn’t evolve with us, won’t share our common sense. It wouldn’t know what not to do. How do you make a cake? “Don’t use squid. Don’t use gamma radiation. Don’t use Toyotas.” The list of what not to do is endless.
Some people think an advanced AI will be some kind of super-butler, doing whatever they ask with incredible efficiency. But it’s more accurate to imagine an Outcome Pump: a non-sentient device that makes some outcomes more probable and other outcomes less probable. (The Outcome Pump isn’t magic, though. If you ask it for an outcome that is too improbable, it will break.)
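To make the device concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: a “future” is just a candidate way the world could go, tagged with a made-up prior probability; a “wish” is a predicate over futures; and the pump steers reality toward the most probable future that satisfies the predicate, breaking if no satisfying future is probable enough.

```python
def outcome_pump(futures, wish, min_probability=1e-6):
    """Toy Outcome Pump: 'futures' is a list of (outcome, probability) pairs
    describing candidate ways the world could go, and 'wish' is a predicate
    over outcomes. The pump grants the wish by the path of least resistance:
    it returns the most probable future satisfying the wish, and it breaks
    if every satisfying future is too improbable."""
    satisfying = [(outcome, p) for outcome, p in futures if wish(outcome)]
    if not satisfying or max(p for _, p in satisfying) < min_probability:
        raise RuntimeError("Outcome Pump broke: the wish is too improbable")
    # Note what is absent here: the pump never asks what you meant,
    # only whether the predicate you typed returns True.
    return max(satisfying, key=lambda pair: pair[1])[0]
```

The crucial property is in the last line: the pump grants the letter of the wish by whatever route is most probable, and it has no access to anything you left unsaid.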
Now, suppose your mother is trapped in a burning building. You’re in a wheelchair, so you can’t directly help. But you do have the Outcome Pump:
You cry “Get my mother out of the building!” . . . and press Enter.
For a moment it seems like nothing happens. You look around, waiting for the fire truck to pull up, and rescuers to arrive—or even just a strong, fast runner to haul your mother out of the building—
BOOM! With a thundering roar, the gas main under the building explodes. As the structure comes apart, in what seems like slow motion, you glimpse your mother’s shattered body being hurled high into the air, traveling fast, rapidly increasing its distance from the former center of the building.
Luckily, the Outcome Pump has a Regret Button, which rolls back time. You hit it and try again. “Get my mother out of there without blowing up the building,” you say, and press Enter.
So your mother falls out the window and breaks her neck.
After a dozen more hits of the Regret Button, you tell the Outcome Pump:
Within the next ten minutes, move my mother (defined as the woman who shares half my genes and gave birth to me) so that she is sitting comfortably in this chair next to me, with no physical or mental damage.
You watch as all thirteen firemen rush the house at once. One of them happens to find your mother quickly and bring her to safety. All the rest die or suffer crippling injuries. The one fireman sets your mother down in the chair, then turns around to survey his dead and suffering colleagues. You got what you wished for, but you didn’t get what you wanted.
The problem is that your brain is not large enough to contain statements specifying every possible detail of what you want and don’t want. How did you know you wanted your mother to escape the building in good health without killing or maiming a dozen firemen? It wasn’t because your brain contained, written down anywhere, the statement “I want my mother to escape the building in good health without killing or maiming a dozen firemen.” Instead, you saw your mother escape the building in good health while a dozen firemen were killed or maimed, and you realized, “Oh, shit. I don’t want that.” Or you might have been able to imagine that specific scenario and realize, “Oh, no, I don’t want that.” But nothing so specific was written anywhere in your brain before it happened, or before you imagined the scenario. It couldn’t be; your brain doesn’t have room.
But you can’t afford to sit there, Outcome Pump in hand, imagining millions of possible outcomes and noticing which ones you do and don’t want. Your mother will die before you have time to do that.
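To see why patching the wish one clause at a time never catches up, here is a continuation of the toy outcome_pump sketch above. The futures, attributes, and probabilities are all invented; the only point is that each wish is satisfied by at least one future you would reject, because the constraint it leaves unstated is exactly the one that costs nothing to violate.

```python
# Hypothetical candidate futures with invented probabilities, loosely
# following the story above.
futures = [
    ({"label": "gas main explodes; mother's body flung clear of the wreckage",
      "mother_out": True, "mother_unharmed": False,
      "others_unharmed": True, "building_intact": False}, 0.05),
    ({"label": "mother falls out the window and breaks her neck",
      "mother_out": True, "mother_unharmed": False,
      "others_unharmed": True, "building_intact": True}, 0.03),
    ({"label": "one fireman saves mother; the rest are killed or maimed",
      "mother_out": True, "mother_unharmed": True,
      "others_unharmed": False, "building_intact": True}, 0.01),
    ({"label": "firemen rescue mother safely and nobody else is hurt",
      "mother_out": True, "mother_unharmed": True,
      "others_unharmed": True, "building_intact": True}, 0.004),
]

# Each wish is a predicate over a future; each one closes only the loophole
# you have already watched the pump exploit.
wishes = {
    "get my mother out of the building":
        lambda f: f["mother_out"],
    "... without blowing up the building":
        lambda f: f["mother_out"] and f["building_intact"],
    "... with no physical or mental damage to her":
        lambda f: f["mother_out"] and f["building_intact"] and f["mother_unharmed"],
}

for text, wish in wishes.items():
    print(f"{text:45} -> {outcome_pump(futures, wish)['label']}")
```

Adding a fourth clause about the firemen would only expose the next unstated constraint; the space of possible futures is far too large, and your time far too short, to close the loopholes one at a time.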
Nor can you predict in advance which judgment calls you will need to make. What if her head is crushed, leaving her body? What if her body is crushed, leaving only her head? What if there’s a cryonics team waiting outside, ready to suspend the head? Is a frozen head a person? Is Terri Schiavo a person? How much is a chimpanzee worth?
Still, your brain isn’t infinitely complex. There is some finite set of statements that could describe the system that determines the judgments you would make. If we understood how every synapse and neurotransmitter and protein of the brain worked, and we had a complete map of your brain, then an AI could at least in principle compute which judgments you would make about a finite set of possible outcomes.
The moral is that there is no safe wish smaller than an entire human value system:
There are too many possible paths through Time. You can’t visualize all the roads that lead to the destination you give the [Outcome Pump]. “Maximizing the distance between your mother and the center of the building” can be done even more effectively by detonating a nuclear weapon. . . . Or, at higher levels of [Outcome Pump] intelligence, doing something that neither you nor I would think of, just like a chimpanzee wouldn’t think of detonating a nuclear weapon. You can’t visualize all the paths through time, any more than you can program a chess-playing machine by hardcoding a move for every possible board position.
And real life is far more complicated than chess. You cannot predict, in advance, which of your values will be needed to judge the path through time that the [Outcome Pump] takes. Especially if you wish for something longer-term or wider-range than rescuing your mother from a burning building.
. . . The only safe [AI is an AI] that shares all your judgment criteria, and at that point, you can just say “I wish for you to do what I should wish for.”
There is a cottage industry of people who propose the One Simple Principle that will make AI do what we want. None of these proposals will work. We act not for the sake of happiness or pleasure alone. What we value is highly complex. Evolution gave you a thousand shards of desire. (To see what a mess this makes in your neurobiology, read the first two chapters of Neuroscience of Preference and Choice.)
This is also why moral philosophers have spent thousands of years failing to find a simple set of principles that, if enacted, would create a world we want. Every time someone proposes a small set of moral principles, somebody else shows where the holes are. Leave something out, even something that seems trivial, and things can go disastrously wrong:
Consider the incredibly important human value of “boredom”—our desire not to do “the same thing” over and over and over again. You can imagine a mind that contained almost the whole specification of human value, almost all the morals and metamorals, but left out just this one thing—
—and so it spent until the end of time, and until the farthest reaches of its light cone, replaying a single highly optimized experience, over and over and over again.
Or imagine a mind that contained almost the whole specification of which sort of feelings humans most enjoy—but not the idea that those feelings had important external referents. So that the mind just went around feeling like it had made an important discovery, feeling it had found the perfect lover, feeling it had helped a friend, but not actually doing any of those things, having become its own experience machine. And if the mind pursued those feelings and their referents, it would be a good future and true; but because this one dimension of value was left out, the future became something dull. Boring and repetitive, because although this mind felt that it was encountering experiences of incredible novelty, this feeling was in no wise true.
Or the converse problem: an agent that contains all the aspects of human value, except the valuation of subjective experience. So that the result is a nonsentient optimizer that goes around making genuine discoveries, but the discoveries are not savored and enjoyed, because there is no one there to do so . . .
Value isn’t just complicated, it’s fragile. There is more than one dimension of human value, where if just that one thing is lost, the Future becomes null. A single blow and all value shatters. Not every single blow will shatter all value—but more than one possible “single blow” will do so.
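A toy model makes the fragility point numerically, under entirely invented assumptions: suppose the true value of a future is conjunctive, so that it is only as good as its worst dimension, while the optimizer spends its effort only on the dimensions it was explicitly given.

```python
# Invented value dimensions; the names only echo the examples above.
DIMENSIONS = ["happiness", "novelty", "external_referents", "sentience"]

def true_value(future):
    """Conjunctive value: a future that scores zero on any one dimension
    is worth roughly nothing, no matter how well it does elsewhere."""
    return min(future.get(d, 0.0) for d in DIMENSIONS)

def optimize(specified, budget=1.0):
    """Toy optimizer: splits a fixed budget evenly across the dimensions
    it was told about, and spends nothing on anything left unmentioned."""
    share = budget / len(specified)
    return {d: share for d in specified}

full_spec   = optimize(DIMENSIONS)
almost_full = optimize([d for d in DIMENSIONS if d != "novelty"])  # boredom left out

print(true_value(full_spec))    # 0.25: modest on every axis, genuinely valuable
print(true_value(almost_full))  # 0.0:  one missing dimension, and all value is lost
```

Notice that the second optimizer actually scores higher on every dimension it was given; the collapse comes entirely from the one dimension nobody wrote down.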
You can see where this is going. Since we’ve never decoded an entire human value system, we don’t know what values to give an AI. We don’t know what wish to make. If we created superhuman AI tomorrow, we could only give it a disastrously incomplete value system, and then it would go on to do things we don’t want, because it would be doing what we wished for instead of what we wanted.
Right now, we only know how to build AIs that optimize for something other than what we want. We only know how to build dangerous AIs. Worse, we’re learning how to make AIs safe much more slowly than we’re learning how to make AIs powerful, because we’re devoting more resources to the problems of AI capability than we are to the problems of AI safety.
The clock is ticking. AI is coming. And we are not ready.