ABSTRACT
Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of the correctly specified reward function, then correction is futile. For example, a robotic factory assistant could break expensive equipment due to a reward misspecification; even if the designers immediately correct the reward function, the damage is done. To mitigate this risk, we introduce an approach that balances optimization of the primary reward function with preservation of the ability to optimize auxiliary reward functions. Surprisingly, even when the auxiliary reward functions are randomly generated and therefore uninformative about the correctly specified reward function, this approach induces conservative, effective behavior.
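The abstract describes balancing a primary reward against preserving the ability to optimize auxiliary reward functions. Below is a minimal, hypothetical sketch of how such a balanced reward could be computed; the function name, the penalty coefficient `lam`, the absolute-difference penalty, and the no-op-based scaling are all illustrative assumptions, not the paper's exact formulation.

```python
def conservative_reward(primary_reward, q_aux_action, q_aux_noop, lam=0.1):
    """Combine the primary reward with a penalty for changing attainable
    auxiliary value (illustrative sketch, not the paper's exact method).

    primary_reward: scalar reward from the (possibly misspecified) primary
        reward function.
    q_aux_action:   attainable values (Q-values) of each auxiliary reward
        function if the candidate action is taken.
    q_aux_noop:     attainable values of each auxiliary reward function if
        the agent instead does nothing (a no-op baseline).
    lam:            trade-off coefficient between reward and preservation
        (assumed hyperparameter).
    """
    # Penalize shifts in how well the agent could still optimize each
    # auxiliary reward function, relative to doing nothing.
    penalty = sum(abs(qa - qn) for qa, qn in zip(q_aux_action, q_aux_noop))
    # Normalize by the no-op attainable value so the penalty is scale-free
    # (assumed normalization; guard against division by zero).
    scale = sum(abs(qn) for qn in q_aux_noop) or 1.0
    return primary_reward - lam * penalty / scale


# An action that leaves auxiliary attainable values untouched incurs no
# penalty, while one that destroys them is discouraged.
print(conservative_reward(1.0, [2.0, 1.0], [2.0, 1.0]))          # no penalty
print(conservative_reward(1.0, [0.0, 0.0], [2.0, 1.0], lam=0.5))  # penalized
```

Because the auxiliary Q-values enter only through the magnitude of their change, this kind of penalty is agnostic to what the auxiliary reward functions actually measure, which is consistent with the abstract's observation that even randomly generated auxiliary rewards induce conservative behavior.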
Conservative Agency via Attainable Utility Preservation