Social Learning in Multi Agent Multi Armed Bandits

Abstract
Motivated by the emerging need for learning algorithms in large-scale networked and decentralized systems, we introduce a distributed version of the classical stochastic Multi-Armed Bandit (MAB) problem. Our setting consists of a large number n of agents that collaboratively and simultaneously solve the same instance of a K-armed MAB to minimize the average cumulative regret over all agents. The agents can communicate and collaborate with each other only through a pairwise, asynchronous, gossip-based protocol that exchanges a limited number of bits. In our model, each agent at each time step decides (i) which arm to play, (ii) whether to communicate, and if so, (iii) what to communicate and to whom. Agents in our model are decentralized: their actions depend only on their own observed history. We develop a novel algorithm in which agents, whenever they choose to communicate, send only arm-ids and not samples, to another agent chosen uniformly and independently at random. The per-agent regret achieved by our algorithm scales as $O\left( \frac{\lceil K/n \rceil + \log(n)}{\Delta} \log(T) + \frac{\log^3(n) \log\log(n)}{\Delta^2} \right)$. Furthermore, any agent in our algorithm communicates (arm-ids, to a uniformly and independently chosen agent) only a total of $\Theta(\log(T))$ times over a time horizon of $T$. We compare our results to two benchmarks: one with no communication among agents, and one with complete interaction, where an agent has access to the entire system history of arms played and rewards obtained by all agents. We show, both theoretically and empirically, that our algorithm achieves a significant reduction in per-agent regret compared to the case when agents do not collaborate and each plays the standard MAB problem (where regret scales linearly in K), and in communication complexity compared to the full-interaction setting, which requires T communication attempts by an agent over T arm pulls.
Our result thus demonstrates that even a minimal level of collaboration among the different agents enables a significant reduction in per-agent regret.
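The communication pattern described in the abstract can be illustrated with a toy simulation. This is a hedged sketch, not the paper's actual algorithm: here each agent runs plain UCB over a small active arm set of hypothetical size, and at doubling epochs (hence O(log T) times over a horizon of T) sends only its empirically best arm-id to a uniformly random peer, which adds that arm to its own active set. No reward samples are ever exchanged.

```python
import math
import random

def gossip_bandit(n_agents=10, K=20, T=2048, seed=0):
    """Toy gossip-based collaborative bandit: agents share arm-ids, not samples.

    Returns the total number of communications, which grows as
    n_agents * O(log T) since gossip happens only at powers of two.
    """
    rng = random.Random(seed)
    means = [0.9 if k == 0 else 0.5 for k in range(K)]  # arm 0 is best
    # Each agent starts with a small (hypothetically sized) active arm set.
    active = [set(rng.sample(range(K), 3)) for _ in range(n_agents)]
    counts = [[0] * K for _ in range(n_agents)]
    sums = [[0.0] * K for _ in range(n_agents)]
    comms = 0
    for t in range(1, T + 1):
        for a in range(n_agents):
            # UCB over this agent's active arm set only.
            choice, best_ucb = None, -1.0
            for k in active[a]:
                if counts[a][k] == 0:
                    choice = k  # play each active arm at least once
                    break
                ucb = sums[a][k] / counts[a][k] + math.sqrt(2 * math.log(t) / counts[a][k])
                if ucb > best_ucb:
                    best_ucb, choice = ucb, k
            reward = 1.0 if rng.random() < means[choice] else 0.0
            counts[a][choice] += 1
            sums[a][choice] += reward
        if t & (t - 1) == 0:  # gossip only when t is a power of two
            for a in range(n_agents):
                played = [k for k in active[a] if counts[a][k] > 0]
                rec = max(played, key=lambda k: sums[a][k] / counts[a][k])
                peer = rng.randrange(n_agents)  # uniformly random peer
                active[peer].add(rec)  # only the arm-id is communicated
                comms += 1
    return comms

# 12 powers of two lie in [1, 2048], so each agent communicates 12 times.
assert gossip_bandit(n_agents=10, T=2048) == 10 * 12
```

The doubling-epoch schedule is one simple way to realize a Θ(log T) communication budget; the paper's algorithm makes its own scheduling and arm-set choices, which this sketch does not reproduce.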