Learning Proportionally Fair Allocations with Low Regret

Published: 13 June 2018

Abstract

This paper addresses a generic sequential resource allocation problem in which, in each round, a decision maker selects an allocation of resources (servers) to a set of tasks consisting of a large number of jobs. A job of task i assigned to server j is successfully treated with probability $\theta_{ij}$ in a round, and the decision maker is informed of whether this job is completed at the end of the round. The probabilities $\theta_{ij}$ are initially unknown and have to be learned. The objective of the decision maker is to sequentially assign jobs of various tasks to servers so that it rapidly learns and converges to the Proportionally Fair (PF) allocation (or other similar allocations achieving an appropriate trade-off between efficiency and fairness). We formulate the problem as a multi-armed bandit (MAB) optimization problem, and devise sequential assignment algorithms with low regret, defined as the difference in utility achieved over a given number of slots by an oracle algorithm aware of the $\theta_{ij}$'s and by the proposed algorithm. We first provide the properties of the so-called Restricted-PF (RPF) allocation, obtained by assuming that each task can only use a single server, and in particular show that it is very close to the PF allocation. We devise ES-RPF, an algorithm that learns the RPF allocation with regret no greater than $\mathcal{O}\bigl(\frac{m^3}{\theta_{\min}\Delta_{\min}}\log(T)\bigr)$ after $T$ slots, where $m$, $\theta_{\min}$, and $\Delta_{\min}$ represent the number of tasks, the minimum success rate $\min_{i,j}\theta_{ij}$, and an appropriately defined notion of gap, respectively. We further provide regret lower bounds satisfied by any algorithm targeting the RPF allocation. Finally, we present ES-PF, an algorithm directly learning the PF allocation, and prove that its regret does not exceed $\mathcal{O}\bigl(\frac{m^2 s}{\theta_{\min}}\sqrt{T}\log(T)\bigr)$ after $T$ slots, where $s$ denotes the number of servers.
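The Restricted-PF allocation can be illustrated with a small brute-force sketch. The service model below is an assumption for illustration only (the paper's exact model may differ): each server has unit capacity split equally among the tasks assigned to it, so a task served by server j receives its success probability divided by that server's load. The helper name `restricted_pf` is hypothetical.

```python
import math
from itertools import product

def restricted_pf(theta):
    """Brute-force the Restricted-PF assignment: each task picks a single
    server, and we maximize the PF utility, i.e. the sum of log rates.

    Assumed toy model: each server's unit capacity is split equally among
    the tasks assigned to it, so task i on server j gets
    theta[i][j] / (number of tasks on server j).
    """
    m, s = len(theta), len(theta[0])
    best_assign, best_util = None, -math.inf
    for assign in product(range(s), repeat=m):  # one server per task
        load = [assign.count(j) for j in range(s)]
        util = sum(math.log(theta[i][assign[i]] / load[assign[i]])
                   for i in range(m))
        if util > best_util:
            best_assign, best_util = assign, util
    return best_assign, best_util

# Two tasks, two servers: sharing one server halves each rate, so the
# PF-optimal restricted assignment sends each task to its best server.
theta = [[0.9, 0.5], [0.4, 0.8]]
assign, util = restricted_pf(theta)
```

This exhaustive search costs O(s^m) evaluations and assumes the θ_ij are known; the point of the paper's ES-RPF and ES-PF algorithms is precisely to reach such an allocation without knowing the θ_ij in advance, while keeping regret low.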

