
Reinforcement Learning of Informed Initial Policies for Decentralized Planning

Published: 08 December 2014

Abstract

Decentralized partially observable Markov decision processes (Dec-POMDPs) offer a formal model for planning in cooperative multiagent systems where agents operate with noisy sensors and actuators, as well as local information. Prevalent solution techniques are centralized and model based—limitations that we address with distributed reinforcement learning (RL). We particularly favor alternate learning, in which agents take turns learning best responses to each other, an approach that appears to outperform concurrent RL. However, alternate learning requires an initial policy. We propose two principled approaches to generating informed initial policies: a naive approach that lays the foundation for a more sophisticated one. We empirically demonstrate that the refined approach produces near-optimal solutions in many challenging benchmark settings, staking a claim to being an efficient (and realistic) approximate solver in its own right. Furthermore, alternate best-response learning seeded with such policies quickly learns high-quality policies as well.
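The paper's own algorithms are not reproduced here, but the core idea — that alternate best-response learning converges to an equilibrium whose quality depends on the seed policy — can be illustrated with a deliberately tiny, hypothetical example. The sketch below (all names and parameters are illustrative, not taken from the paper) runs alternating Q-learned best responses in a fully observable two-agent coordination game with a shared payoff:

```python
import random

# Toy cooperative coordination game: both agents receive the same payoff,
# indexed by (action of agent 0, action of agent 1). The matrix is symmetric,
# with two equilibria: (0, 0) pays 10, (1, 1) pays 5.
PAYOFF = [[10, 0],
          [0, 5]]

def best_response(other_action, episodes=200, alpha=0.1, eps=0.2):
    """Q-learn a best response while the other agent's policy is held fixed."""
    q = [0.0, 0.0]
    for _ in range(episodes):
        # epsilon-greedy action selection over this agent's two actions
        a = random.randrange(2) if random.random() < eps else q.index(max(q))
        r = PAYOFF[a][other_action]  # symmetric matrix, so index order is safe
        q[a] += alpha * (r - q[a])
    return q.index(max(q))

def alternate_learning(initial_policy, rounds=4):
    """Agents alternately replace their action with a learned best response,
    starting from the supplied initial (seed) policy."""
    pol = list(initial_policy)  # [action of agent 0, action of agent 1]
    for r in range(rounds):
        learner = r % 2
        pol[learner] = best_response(pol[1 - learner])
    return pol
```

In this toy setting, seeding with the policy `[1, 1]` leaves the agents stuck at the inferior equilibrium (unilateral deviation pays 0), whereas seeding with `[0, 0]` keeps them at the better one — a small-scale analogue of why an informed initial policy matters for alternate learning; the Dec-POMDP case adds partial observability and policies over observation histories, which this sketch omits entirely.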

