
Learning to Control Renewal Processes with Bandit Feedback

Published: 19 June 2019

Abstract

We consider a bandit problem with K task types from which the controller activates one task at a time. Each task takes a random, possibly heavy-tailed completion time, and a reward is obtained only after the task is completed. The task types are independent of one another and have distinct, unknown distributions for completion time and reward. For a given time horizon τ, the goal of the controller is to schedule tasks adaptively so as to maximize the reward collected until τ expires. We also allow the controller to interrupt an ongoing task and initiate a new one. Beyond the traditional exploration-exploitation dilemma, this interrupt mechanism introduces a new trade-off: should the controller complete the current task and collect its reward, or abandon it for a possibly shorter and more rewarding alternative? We show that for all heavy-tailed and some light-tailed completion time distributions, this interruption mechanism improves the reward linearly over time. Applications of this model include server scheduling, optimal free-sampling strategies in advertising, and adaptive content selection. From a learning perspective, the interrupt mechanism necessitates learning the entire arm distribution from truncated observations. For this purpose, we propose a robust learning algorithm named UCB-BwI, based on the median-of-means estimator, for possibly heavy-tailed reward and completion time distributions. We show that, in a K-armed bandit setting with an arbitrary set of L possible interrupt times, UCB-BwI achieves O(K log(τ) + KL) regret. We also prove that the regret under any admissible policy is Ω(K log(τ)), which implies that UCB-BwI is order-optimal.
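The abstract's robustness claim rests on the median-of-means estimator: split the observations into blocks, average each block, and report the median of the block means, so that a few extreme heavy-tailed samples cannot drag the estimate far from the true mean. The sketch below illustrates the estimator itself (not the full UCB-BwI algorithm); the function name and block count are illustrative choices, not taken from the paper.

```python
import random
import statistics


def median_of_means(samples, num_blocks):
    """Median-of-means estimator.

    Partition `samples` into `num_blocks` equal-size blocks, compute the
    mean of each block, and return the median of those block means.
    Unlike the empirical mean, this is robust when the distribution is
    heavy-tailed and a handful of samples are extreme outliers.
    """
    n = len(samples)
    block_size = n // num_blocks  # leftover samples at the end are ignored
    block_means = [
        sum(samples[i * block_size:(i + 1) * block_size]) / block_size
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)


# Heavy-tailed example: Pareto(alpha=1.5) samples have true mean 3 but
# infinite variance, so the plain empirical mean fluctuates wildly while
# the median-of-means stays near the bulk of the data.
random.seed(0)
samples = [random.paretovariate(1.5) for _ in range(3000)]
est = median_of_means(samples, num_blocks=15)
```

A typical choice sets the number of blocks logarithmic in the desired confidence level, which yields sub-Gaussian-style deviation bounds even when only the variance (or a lower moment) of the distribution is finite.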

