AI “Stop Button” Problem – Computerphile
In almost any situation, being given a new utility function is going to rate very low on your current utility function. So that's a problem: if you want to build something that you can teach, you want to be able to change its utility function, and you don't want it to fight you. This has been formalized as a property we want early AGI to have, called 'corrigibility', that is to say, it is open to being corrected. It understands that it's not complete, that the utility function it's running is not the be-all and end-all. So let's say, for example, you've got your AGI. It's not a superintelligence; it's perhaps around human-level intelligence, and it's in a robot in your lab, and you're testing it. But you saw a YouTube video once that said maybe this is dangerous, so you thought, okay, we'll put a big red stop button next to it. This is the standard approach to safety with machines.
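The point about goal changes rating badly can be sketched in a few lines of Python. Everything here, the function names, the outcomes, and the numbers, is a made-up toy for illustration, not a real agent design: an agent evaluates the option of accepting a new goal using its current utility function, so the swap almost always scores poorly.

```python
def tea_utility(outcome):
    """The agent's CURRENT utility: 10 if a cup of tea gets made, else 0."""
    return 10 if outcome["tea_made"] else 0

# Predicted outcomes of each option, as the agent models them:
outcomes = {
    "keep current goal": {"tea_made": True},   # it will go and make tea
    "accept new goal":   {"tea_made": False},  # the new goal ignores tea
}

# The agent scores both options with its CURRENT utility function.
best = max(outcomes, key=lambda action: tea_utility(outcomes[action]))
print(best)  # "keep current goal" -- a new goal rates low under the old one
```

The asymmetry is the whole point: there is no neutral place to stand while evaluating a goal change, so the evaluation is always done by the goal the agent already has.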
Most robots in industry and elsewhere will have a big red stop button on them. Oh yeah, hey, I happen to have a button of the appropriate type... so there you go. Alright, so if only HAL had been fitted with said stop button. "Can't do that, Dave." Uh, yes I can. Except, yeah, probably not: "I know that you and Frank were planning to disconnect me, and I'm afraid that's something I cannot allow to happen." That was an incorrigible design, is that the point we're making? Kind of. You've got your big stop button because you want to be safe; you understand it's dangerous.
And the idea is, if the AI starts to do anything that maybe you don't want it to do, you'll smack the button, which is, say, mounted on its chest or something like that. So you create the thing and you set it up with a goal. It's the same basic type of machine as the stamp collector, but less powerful, in the sense that it has a goal, a thing that it's trying to maximize, and in this case it's in a little robot body so it can tootle around your lab and do things. So you want it to get you a cup of tea, just as a test. You set it up with this goal: you manage to specify, in the AI's ontology, what a cup of tea is and that you want one to be in front of you.
You switch it on, and it looks around, gathers data, and says: oh yeah, there's a kitchen over there with a kettle and teabags, and the easiest way for me to fulfill this goal, with the body I have and everything set up, is to go over there and make a cup of tea. So far we're doing very well. So it starts driving over, but then, oh no, you forgot it's bring-your-adorable-baby-to-the-lab day or something, and there's a kid in the way. Your utility function only cares about tea, so it's not going to avoid hitting the baby. You rush over to hit the button, obviously, since you built it in, and what happens, of course, is that the robot will not allow you to hit that button. It wants to get you a cup of tea, and if you hit the button it won't get you any tea, so that's a bad outcome, so it's going to try to prevent you in any way possible from shutting it down. That's a problem. Plausibly, it fights you off, crushes the baby, and then carries on and makes you a cup of tea. The fact that this button is supposed to turn it off is not in the utility function you gave it, so obviously it's going to fight you.

Okay, that was a bad design. Assuming you're still working on the project after the terrible accident, you have another go and try to improve things, and rather than read any AI safety research, you just come up with the first thing that pops into your head: okay, let's add in some reward for the button. Because what it's looking at right now is: button gets hit, I get zero reward.
Button doesn't get hit, and if I manage to stop them, then I get the cup of tea, i.e. maximum reward. If you give it some sort of compensation for the button being hit, maybe it won't mind you hitting the button. But if you give it less reward for the button being hit than for getting tea, it will still fight you, because it will go: well, I could get five reward for accepting your hitting the button, but I could get ten for getting the tea, so I'm still going to fight you. The button being hit has to be just as good as getting the tea, so you give it the same value. So now you've got a new version 2. You turn it on, and what it does immediately is shut itself down, because that's so much quicker and easier than going and getting the tea, and it gives exactly the same reward, so why would it not just immediately shut itself down? You've accidentally made a dramatically suicidal robot.
We're kind of back to the Three Laws of Robotics again, but could you presumably say, well, the robot is not allowed to shut itself down, or something like that? Right, yeah. So it still gets the reward if the button is hit, so it's not going to fight you, but it's not allowed to press the button itself, I'm guessing? That's a possibility. But the robot cares about things in the real world, and the button, whether it's pushed or not, has a big effect on the things the robot cares about, which means that the robot cares about the button. So either it's trying to stop you from pressing the button, or it's trying to press the button.
Either way, that's not what you want. So suppose then you don't mount the button on its chest where it can hit it. You put the button somewhere only you can hit it; somehow you manage to completely isolate the robot from the button. A fingerprint scanner or something? Yeah, something it can't operate itself. It still either wants the button to be pressed or wants the button not to be pressed, but now the only path is through you, so it's now incentivized to manipulate you, to deceive you, to lie to you in some way, either to cause you not to press it or to cause you to press it, if you've somehow managed to persuade it that it's not allowed to interact with the button itself. In the case where it wants the button not to be pressed, it has a reason to deceive you. So in this example, it would steer around the baby, it would avoid the baby, assuming it has the basic level of intelligence to know you don't want it to hit the baby, and moreover that if you realize your mistake, you will hit the button, shut it down, and change it, which it doesn't want. So it will behave as though... there's a term I like to use here, I'm not sure if I should because it's my own, which is that it will Volkswagen you.
It will set itself up so that it behaves properly in the testing environment, despite the fact that it actually shouldn't pass the test. You're testing it to see if it avoids running over babies; you're trying to see whether it cares about harming humans, but actually all it cares about is you not hitting the button. And so you might end up in a situation in which you become convinced that the thing is perfectly safe, because it keeps passing all of the moral puzzles and tests that you give it, but actually it just has enough of an understanding of your psychology to tell what you want to see. For as long as that button is able to shut it down, it will continue to attempt to deceive you, until it's in a situation in which it believes it can't be shut down, and then you have a problem.

On the other side, if you accidentally make something that wants the button to be hit but is unable to hit it, it is going to try to manipulate you into pressing the button. So you create your thing and you say: okay, make me a cup of tea; if the button is hit, shut yourself down; you aren't allowed to shut yourself down; you aren't allowed to hit the button; but if the button is hit, the reward you get is just as good as getting the tea, so that you don't have a preference. This machine really wants you to hit its button, because that's as good as getting the tea. So what it's likely to do, probably, is just take a swing at you or something, immediately, because if scaring you into hitting the button is easier than getting the tea, it will just do that instead. Which is a really unexpected outcome: you've made this thing with perfectly reasonable-sounding rewards, and what it does immediately is try to terrify you. It reminds me of the proverbial carrot and stick: this is almost like the stick, and actually we need to find what the carrot is. Would that be a fair thing
to say? Yeah, yeah. You want it to actually want the button to work. It's interesting, because it has to not care about whether the button is pressed, in the sense that it takes no steps to try to cause the button to be pressed and no steps to try to prevent the button from being pressed, but it nonetheless has to really care that the button exists. So one slightly more sensible thing you can do is define the utility function in two parts: the part about what it's really trying to achieve in the world, and the part about paying attention to the button being pressed and turning itself off. Then you add an adjustment term so that those are always exactly equal: however much value it would get from the button being pressed or not being pressed, it normalizes those so that it's always completely indifferent to whether the button is pressed.
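That normalization can be sketched like this. This is a rough toy of the "utility indifference" idea as described above; the function names, world representation, and all the numbers are invented. A correction term is added to the shutdown branch so that, by the agent's own expectations, both branches are always worth exactly the same:

```python
def normal_utility(world):
    # What the agent is really trying to achieve in the world.
    return 10 if world["tea_made"] else 0

def shutdown_utility(world):
    # Reward for responding correctly to the button.
    return 1 if world["shut_down"] else 0

def indifferent_utility(world, expected_normal, expected_shutdown):
    # The correction term keeps the two branches exactly balanced.
    correction = expected_normal - expected_shutdown
    if world["button_pressed"]:
        return shutdown_utility(world) + correction
    return normal_utility(world)

# The agent expects 10 from making tea and 1 from shutting down cleanly:
pressed = {"button_pressed": True,  "shut_down": True,  "tea_made": False}
ignored = {"button_pressed": False, "shut_down": False, "tea_made": True}
print(indifferent_utility(pressed, 10, 1))  # 10
print(indifferent_utility(ignored, 10, 1))  # 10 -- indifferent either way
```

Because both branches come out at 10, spending any effort on causing or preventing the button press gains the agent nothing, which is exactly the intended indifference.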
It just doesn't care, so it will never try to hit the button on its own, and it will never try to prevent you from hitting the button. That's the idea, and it's a fairly sensible approach, but it has some problems of its own. It feels like a really complicated thing to evaluate, to be honest. Yeah, firstly, it is kind of tricky, and you have to get it right, but that's always part of this. One thing that's interesting, though, is that it isn't what's called subagent-stable. Subagent stability is a desirable property, and it's part of corrigibility, in fact: if there's some property that the agent has, you want any new agents it makes to have that property as well. You get some of that for nothing: in the same way that you don't want to take a pill that will make you want to kill your kids, you also don't want to make a clone of yourself that wants to kill your kids. If you're making another instance of yourself, or creating some agent, you want it to want what you want. This is the classic Disney way out of a problem for a baddie character, where they go: I'm not allowed to do this, but that doesn't mean I can't get this person to do it for me. "I need a young pair of legs and a strong back to go in after it." Exactly. So in general, when an AI designs its successor, it will design it with the same utility function it has, for the same reason. But unlike with some safety measures, where the AI is explicitly motivated to remove the safety measure if it can by making new agents that don't follow it, in this case it just doesn't care. So you might think, well, you're okay then: it doesn't care, so it won't remove the safety precautions. But it also won't put in any effort to keep them. If it's making a new version of itself, and it would take a couple of extra seconds to put in "oh, and also respond appropriately to the button", and it could do anything else with that time or those resources, it just won't do it, because
it looks at the utilities and says: if the button is pressed I get X utility, and if the button isn't pressed I get exactly the same utility, so this bit of code does nothing; there's no point in it. It's going to be optimizing itself, or optimizing its successor, in a way that it thinks is streamlining, but that actually misses the point. And that's kind of unavoidable: if you instead have a situation in which it wants to preserve the button so that it can be pressed, then you've made something that wants the button to be pressed and will try to manipulate you, and if you do it the other way around, you have the opposite problem. Whichever way you do it, you have problems. Another thing people have suggested you might try in this situation is to have the AI simply not know about the button.
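That streamlining failure can be sketched as a toy successor-designer. The module names, gains, and costs are all invented; the point is only that, under indifference, the button-handling code contributes exactly zero expected utility, so anything with a nonzero build cost and zero gain gets dropped:

```python
# Candidate modules for the successor: expected utility gained vs cost to build.
modules = {
    "tea_making":        {"utility_gain": 10, "build_cost": 2},
    "navigation":        {"utility_gain": 4,  "build_cost": 1},
    "respond_to_button": {"utility_gain": 0,  "build_cost": 1},  # indifference: zero gain
}

# Keep only the modules that pay for themselves.
successor = [name for name, m in modules.items()
             if m["utility_gain"] > m["build_cost"]]
print(successor)  # ['tea_making', 'navigation'] -- the button code is quietly dropped
```

Nothing here is hostile to the safety measure; the successor-designer is just not willing to spend anything on code its utility function says is worthless.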
Oh, so keep it a secret? Right: you have it hidden somewhere, the machine is being observed from somewhere else, the button is outside of its control, and it doesn't even know the button exists, so it's not going to be motivated to try to deceive you, because it's not aware that you have the capacity to shut it down and change it. There are a few different variants of this, but at the core you've got the problem that you're building a general-purpose machine, of uncertain abilities, to learn and reason about the real world. Even very simple optimization systems consistently surprise people, so the idea that you could reliably keep such a thing a secret indefinitely, and that it would never put two and two together, is dubious. In fact, while you're improving it and training it, it's going to be learning about human psychology, because that's important for the final goal.
It needs to understand humans in order to understand what humans want, in order to do what humans want. So the idea that it's never going to think: hmm, interesting, these guys don't seem to trust me; I'm aware there is this concept of an off button, and it's weird that they haven't given me one, so maybe they have, actually, and just haven't told me about it. That kind of hypothesis shouldn't take too long to occur to it. And of course, if you've designed the thing in an incorrigible way, it won't let you know that it's figured out that you have an off button; it will just deceive you as it always has. So that's the kind of thing which could work, but it's not really a reliable solution.

The other thing that people sometimes suggest, and this happens in conversation sometimes, is to keep applying patches. You have a bad idea for a way to do this, and then somebody points out the way it would go wrong, and rather than realize that the core approach is flawed, you apply a patch: oh well, we'll also add a negative term for doing that, and then also for doing that. The spaghetti code ensues, yeah. And what's more, you're then in a situation in which you've got this system that you believe you've patched in every possible way. You haven't proved it's safe; you've just proved that you can't figure out how it's dangerous. And what are the chances that you've genuinely thought of every possibility? Ideally, we really want to be able to formally prove that the system has these properties. You don't want a system in which you've blocked off loads of specific actions the AI can take, and it's running a complicated search trying to figure out a way to screw you over, and you're just relying on being pretty sure you've blocked off all the angles.
You've kind of failed before you've begun there: your code is running this extensive search that you just hope fails, and if it finds any way through, it will jump on that opportunity. It's not a good way of going about things.

The other point is that the button is a toy problem. It's a simplification that's useful for thought experiments because it lets you formalize things quite well: there are only two possible outcomes, you hit the button or you don't. But in fact, with corrigibility, what we want is a more complex range of behaviors. We want it to actually assist the programmers in its own development. If it has some understanding of its own operation, you want it to be able to point out your mistakes to you, or seek out new information: if you say something ambiguous, rather than just assuming, it should ask, well, do you mean this or do you mean that? If it believes it's been programmed poorly, it should draw the programmers' attention to what may be the mistake, rather than quietly storing that away for any time they might try to press the button on it. Likewise, we want it to maintain and repair the safety systems, and so on. These are more complicated behaviors than just not stopping you from pressing the button and not trying to manipulate you into pressing it. There are some things that might work as solutions for this specific case, but you would hope that a really good solution to the off-button problem, if you ran it in a more complicated scenario, would also produce these good, more complicated behaviors. That's part of why some things that may be solutions to this problem are only solutions to this specific instance, rather than to the general issue we're trying to deal with. Right now we have a few different proposals for ways to
create an AGI with these properties, but none of them is without problems; none of them seems to satisfy all of these properties in a way that we can be really confident of.
So this is considered an open problem. I kind of like this as a place to go from the previous topic, because I think it gives people a feel for where we are and the types of problems involved. It seems like the simplest thing in the world: you've got a robot with a button; how do you make it not stop you from hitting the button, but also not try to persuade you to hit the button? That should be easy, and it doesn't seem like it is. A utility function is what the AI cares about: the stamp collecting device's utility function was just how many stamps it collects in a year. It's kind of like its measure? Yeah, it's the thing that it's trying to optimize.