Abstract
Motivated by the wide adoption of reinforcement learning (RL) in real-world personalized services, where users' sensitive and private information needs to be protected, we study regret minimization in finite-horizon Markov decision processes (MDPs) under the constraints of differential privacy (DP). Compared to existing private RL algorithms that work only on tabular finite-state, finite-action MDPs, we take the first step towards privacy-preserving learning in MDPs with large state and action spaces. Specifically, we consider MDPs with linear function approximation (in particular, linear mixture MDPs) under the notion of joint differential privacy (JDP), where the RL agent is responsible for protecting users' sensitive data. We design two private RL algorithms, based on value iteration and policy optimization respectively, and show that they enjoy sub-linear regret while guaranteeing privacy protection. Moreover, the regret bounds are independent of the number of states and scale at most logarithmically with the number of actions, making the algorithms suitable for privacy protection in today's large-scale personalized services. Our results are achieved via a general procedure for learning in linear mixture MDPs under changing regularizers, which not only generalizes previous results for non-private learning, but also serves as a building block for general private reinforcement learning.
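To make the high-level idea concrete, the following is a minimal illustrative sketch (not the paper's exact algorithm or noise calibration) of the common pattern behind private RL with linear function approximation: the agent estimates parameters by ridge regression over feature-target pairs, and privacy is obtained by perturbing the sufficient statistics (the Gram matrix and the target vector) with Gaussian noise before solving. All names and the single-release noise calibration here are assumptions for illustration.

```python
import numpy as np

def private_ridge_estimate(phis, targets, eps, delta, lam=1.0,
                           sensitivity=1.0, rng=None):
    """Illustrative Gaussian-mechanism sketch: perturb the ridge-regression
    sufficient statistics before solving, as a private RL agent might when
    estimating value or model parameters. Hypothetical helper, not the
    paper's method; a simplified single-release noise calibration is used."""
    rng = np.random.default_rng() if rng is None else rng
    d = phis.shape[1]
    # Non-private sufficient statistics of ridge regression.
    gram = lam * np.eye(d) + phis.T @ phis
    u = phis.T @ targets
    # Gaussian-mechanism noise scale for one (eps, delta)-DP release.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    # Symmetrized noise keeps the perturbed Gram matrix symmetric.
    n = rng.normal(0.0, sigma, size=(d, d))
    gram_noisy = gram + (n + n.T) / np.sqrt(2.0)
    u_noisy = u + rng.normal(0.0, sigma, size=d)
    # Noising can destroy positive definiteness; clip eigenvalues as a
    # simple safeguard (analyses typically add extra regularization instead).
    w, v = np.linalg.eigh(gram_noisy)
    gram_psd = (v * np.maximum(w, lam)) @ v.T
    return np.linalg.solve(gram_psd, u_noisy)
```

Releasing only such noised statistics, rather than the raw per-user trajectories, is what allows the learned policy to satisfy a joint-DP-style guarantee while the regression error, and hence the regret, degrades only mildly with the privacy budget.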
Differentially Private Reinforcement Learning with Linear Function Approximation
SIGMETRICS/PERFORMANCE '22: Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems