Open access

Defining admissible rewards for high-confidence policy evaluation in batch reinforcement learning

Published: 02 April 2020

Abstract

A key impediment to reinforcement learning (RL) in real applications with limited, batch data lies in defining a reward function that reflects what we implicitly know about reasonable behaviour for a task and allows for robust off-policy evaluation. In this work, we develop a method to identify an admissible set of reward functions for policies that (a) do not deviate too far in performance from prior behaviour, and (b) can be evaluated with high confidence, given only a collection of past trajectories. Together, these constraints ensure that we avoid proposing unreasonable policies in high-risk settings. We demonstrate our approach to reward design on synthetic domains as well as in a critical care context, where we guide the design of a reward function that consolidates clinical objectives to learn a policy for weaning patients from mechanical ventilation.
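The abstract's two constraints can be read as requirements on an off-policy estimator. As a rough illustration only (a minimal sketch, not the construction from the paper), the snippet below computes per-trajectory importance-sampled returns under a candidate reward function and a high-confidence lower bound on the evaluation policy's value, in the spirit of high-confidence off-policy evaluation (Thomas et al., 2015) with an empirical Bernstein bound (Maurer & Pontil, 2009). All function names, signatures, and the [0, b] boundedness assumption are hypothetical choices for illustration.

```python
# Illustrative sketch only: this is NOT the paper's method, just one plausible
# way to realise "evaluate a candidate reward with high confidence" from batch data.
import numpy as np

def importance_sampled_returns(trajectories, pi_e, pi_b, reward_fn, gamma=1.0):
    """Per-trajectory importance-sampled returns for an evaluation policy pi_e.

    trajectories: list of [(state, action), ...] collected under behaviour pi_b.
    pi_e, pi_b:   callables returning action probabilities pi(a | s).
    reward_fn:    candidate reward function r(s, a) under assessment.
    """
    returns = []
    for traj in trajectories:
        rho, g = 1.0, 0.0
        for t, (s, a) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)       # cumulative importance weight
            g += (gamma ** t) * reward_fn(s, a)  # discounted return under r
        returns.append(rho * g)
    return np.array(returns)

def lower_confidence_bound(x, delta=0.05, b=1.0):
    """Empirical-Bernstein lower bound on E[x], assuming x lies in [0, b].

    With probability at least 1 - delta, E[x] is no smaller than this value
    (Maurer & Pontil, 2009); in practice the weighted returns must be
    normalised or clipped into [0, b] for the assumption to hold.
    """
    n = len(x)
    var = x.var(ddof=1)  # sample variance
    return (x.mean()
            - np.sqrt(2.0 * var * np.log(2.0 / delta) / n)
            - 7.0 * b * np.log(2.0 / delta) / (3.0 * (n - 1)))
```

Under this reading, constraint (b) rejects any candidate reward whose importance-sampled returns are so heavy-tailed that the lower bound becomes vacuous, while constraint (a) additionally compares that bound against the behaviour policy's empirical return; the exact admissibility criteria are given in the full text.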


Cited By

  • (2022) Dynamic inverse reinforcement learning for characterizing animal behavior. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 29663-29676. DOI: 10.5555/3600270.3602421. Online publication date: 28-Nov-2022.


Published In

CHIL '20: Proceedings of the ACM Conference on Health, Inference, and Learning
April 2020, 265 pages
ISBN: 9781450370462
DOI: 10.1145/3368555
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. off-policy evaluation
  2. reinforcement learning
  3. reward design

Qualifiers

  • Research-article

Conference

ACM CHIL '20

Acceptance Rates

Overall Acceptance Rate: 27 of 110 submissions, 25%

