
Differentially Private Reinforcement Learning with Linear Function Approximation

Published: 28 February 2022

Abstract

Motivated by the wide adoption of reinforcement learning (RL) in real-world personalized services, where users' sensitive and private information needs to be protected, we study regret minimization in finite-horizon Markov decision processes (MDPs) under the constraints of differential privacy (DP). In contrast to existing private RL algorithms, which work only on tabular finite-state, finite-action MDPs, we take the first step towards privacy-preserving learning in MDPs with large state and action spaces. Specifically, we consider MDPs with linear function approximation (in particular, linear mixture MDPs) under the notion of joint differential privacy (JDP), where the RL agent is responsible for protecting users' sensitive data. We design two private RL algorithms, based on value iteration and policy optimization respectively, and show that they enjoy sub-linear regret while guaranteeing privacy protection. Moreover, the regret bounds are independent of the number of states and scale at most logarithmically with the number of actions, making the algorithms suitable for privacy protection in today's large-scale personalized services. Our results are achieved via a general procedure for learning in linear mixture MDPs under changing regularizers, which not only generalizes previous results for non-private learning, but also serves as a building block for general private reinforcement learning.
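To illustrate the kind of mechanism such algorithms build on (not the paper's exact procedure), the sketch below privatizes the ridge-regression statistics that value iteration in a linear mixture MDP would maintain: symmetric Gaussian noise is added to the Gram matrix and Gaussian noise to the feature-target vector, and an extra identity shift keeps the noisy Gram matrix well-conditioned, which is the "changing regularizers" idea in spirit. The noise scale `sigma`, the shift, and the simulated data are all hypothetical placeholders; in the paper these quantities are calibrated to the JDP budget.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4        # feature dimension of the linear mixture MDP
K = 200      # number of episodes (one user per episode)
sigma = 1.0  # noise scale; would be calibrated to the privacy budget (assumption)
lam = 1.0    # ridge regularizer

# Simulated per-episode regression features and targets y_k = <phi_k, theta*> + noise.
theta_star = rng.normal(size=d)
Phi = rng.normal(size=(K, d))
y = Phi @ theta_star + 0.1 * rng.normal(size=K)

# Non-private ridge statistics: regularized Gram matrix and feature-target vector.
Lambda = lam * np.eye(d) + Phi.T @ Phi
u = Phi.T @ y

# Privatized statistics: symmetrized Gaussian noise on Lambda, Gaussian noise on u.
N = rng.normal(scale=sigma, size=(d, d))
Lambda_priv = Lambda + (N + N.T) / 2
u_priv = u + rng.normal(scale=sigma, size=d)

# Shift the regularizer so the noisy Gram matrix stays positive definite; the exact
# shift in the paper depends on the privacy parameters and a high-probability
# bound on the noise spectrum (this value is illustrative only).
Lambda_priv += 2 * sigma * np.sqrt(d) * np.eye(d)

# Private least-squares estimate of the model parameter.
theta_hat = np.linalg.solve(Lambda_priv, u_priv)
```

With enough episodes, the noise and the regularizer shift are averaged out and `theta_hat` stays close to `theta_star`, which is why the injected noise costs only a lower-order term in the regret.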

