Research Article | Public Access

Social Learning in Multi Agent Multi Armed Bandits

Published: 17 December 2019

Abstract

Motivated by the emerging need for learning algorithms in large-scale networked and decentralized systems, we introduce a distributed version of the classical stochastic Multi-Armed Bandit (MAB) problem. Our setting consists of a large number n of agents that collaboratively and simultaneously solve the same instance of a K-armed MAB to minimize the average cumulative regret over all agents. The agents can communicate and collaborate with each other only through a pairwise, asynchronous, gossip-based protocol that exchanges a limited number of bits. In our model, at each point in time agents decide (i) which arm to play, (ii) whether to communicate, and if so, (iii) what to communicate and to whom. Agents in our model are decentralized: their actions depend only on their own observed history. We develop a novel algorithm in which agents, whenever they choose to communicate, send only arm-ids, and not samples, to another agent chosen uniformly and independently at random. The per-agent regret scaling achieved by our algorithm is $O\left( \frac{\lceil K/n \rceil + \log(n)}{\Delta} \log(T) + \frac{\log^3(n) \log\log(n)}{\Delta^2} \right)$, where $T$ is the time horizon and $\Delta$ is the gap between the mean rewards of the two best arms. Furthermore, any agent in our algorithm communicates (an arm-id to a uniformly and independently chosen agent) only a total of $\Theta(\log(T))$ times over a time interval of length $T$. We compare our results to two benchmarks: one with no communication among agents, and one with complete interaction, where an agent has access to the entire system history of arms played and rewards obtained by all agents. We show, both theoretically and empirically, that our algorithm achieves a significant reduction in per-agent regret compared to the case where agents do not collaborate and each plays the standard MAB problem on its own (where regret scales linearly in K), and in communication complexity compared to the full-interaction setting, which requires T communication attempts by an agent over T arm pulls. Our result thus demonstrates that even a minimal level of collaboration among agents enables a significant reduction in per-agent regret.
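To make the protocol concrete, below is a minimal, self-contained Python simulation sketch of the kind of dynamics the abstract describes. It is not the paper's exact algorithm: the toy parameters, the initial active-set size, the doubling gossip schedule, and the worst-arm elimination rule are all illustrative assumptions. What it preserves is the essential mechanism: each agent runs standard UCB over a small active subset of the K arms and, at geometrically spaced gossip times (hence only Θ(log(T)) messages per agent over a horizon of T), asks one uniformly random peer for its current best arm-id, never raw samples, and inserts that arm into its own active set.

```python
# Illustrative sketch only: parameters, set sizes, and the elimination rule
# below are assumptions, not the paper's exact algorithm.
import math
import random

random.seed(0)

K, n, T = 50, 10, 20000                      # arms, agents, horizon (toy values)
means = [random.random() for _ in range(K)]  # Bernoulli arm means
best_mean = max(means)

class Agent:
    def __init__(self):
        # Assumption: each agent starts with a few uniformly random arms.
        self.active = set(random.sample(range(K), 3))
        self.pulls = [0] * K
        self.sums = [0.0] * K

    def empirical(self, a):
        # Unplayed arms are treated optimistically so they get tried first.
        return self.sums[a] / self.pulls[a] if self.pulls[a] else float("inf")

    def play(self, t):
        # Standard UCB1, restricted to the agent's small active set.
        def ucb(a):
            if self.pulls[a] == 0:
                return float("inf")
            mean = self.sums[a] / self.pulls[a]
            return mean + math.sqrt(2 * math.log(t + 1) / self.pulls[a])
        arm = max(self.active, key=ucb)
        reward = 1.0 if random.random() < means[arm] else 0.0
        self.pulls[arm] += 1
        self.sums[arm] += reward
        return reward

    def best_arm_id(self):
        # The only message content ever exchanged: an arm-id, never samples.
        played = [a for a in self.active if self.pulls[a] > 0]
        if not played:
            return next(iter(self.active))
        return max(played, key=lambda a: self.sums[a] / self.pulls[a])

agents = [Agent() for _ in range(n)]
cum_regret = 0.0            # realized regret against the best mean (a simple proxy)
next_gossip = 2             # gossip at times 2, 4, 8, ... => Theta(log T) messages

for t in range(1, T + 1):
    for ag in agents:
        cum_regret += best_mean - ag.play(t)
    if t == next_gossip:
        next_gossip *= 2
        # Each agent asks one uniformly random peer for its best arm-id.
        recs = [random.choice(agents).best_arm_id() for _ in agents]
        for ag, rec in zip(agents, recs):
            ag.active.add(rec)                # insert the recommended arm-id
            if len(ag.active) > 5:            # keep the active set small:
                ag.active.remove(min(ag.active, key=ag.empirical))

print(f"average per-agent regret over T={T}: {cum_regret / n:.1f}")
```

The intuition for the regret gain is visible in the sketch: with collaboration, each agent explores only its small active set while good arm-ids spread through the population by gossip, so the cost of exploring all K arms is shared across the n agents (cf. the ⌈K/n⌉ term in the bound); without communication, every agent must explore all K arms alone, and per-agent regret scales linearly in K.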



Published in
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 3, Issue 3 (SIGMETRICS)
December 2019, 525 pages
EISSN: 2476-1249
DOI: 10.1145/3376928
Copyright © 2019 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
