Abstract
The growing prevalence of multi-stage attacks in computer networks imposes significant security risks and necessitates effective defense schemes that can autonomously respond to intrusions during vulnerability windows. However, the defender faces several real-world challenges, e.g., unknown likelihoods and unknown impacts of successful exploits. In this article, we leverage reinforcement learning to develop an innovative adaptive cyber defense that maximizes cost-effectiveness subject to these challenges. In particular, we use Bayesian attack graphs to model the interactions between the attacker and the network. We then formulate the defense problem of interest as a partially observable Markov decision process (POMDP) in which the defender maintains belief states to estimate system states, leverages Thompson sampling to estimate transition probabilities, and utilizes reinforcement learning to choose optimal defense actions based on measured utility values. The algorithm's performance is verified via numerical simulations based on real-world attacks.
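One ingredient the abstract names is Thompson sampling for estimating unknown transition probabilities (e.g., the success likelihoods of exploits in the attack graph). Below is a minimal, illustrative Python sketch of that ingredient only, not the paper's actual algorithm: it maintains a Beta posterior per exploit, samples a plausible transition model each round, and updates after observing outcomes. The class name, exploit identifiers, and the uniform Beta(1, 1) prior are all assumptions for illustration.

```python
import random


class ThompsonDefender:
    """Illustrative sketch: Beta posteriors over unknown exploit
    success probabilities, updated in Thompson-sampling style."""

    def __init__(self, exploits):
        # Beta(1, 1) = uniform prior over each exploit's success probability
        self.posterior = {e: [1, 1] for e in exploits}

    def sample_probs(self):
        # Draw one plausible transition model from the current posterior;
        # the defender would then plan against this sampled model.
        return {e: random.betavariate(a, b)
                for e, (a, b) in self.posterior.items()}

    def update(self, exploit, succeeded):
        # Conjugate Bayesian update after observing whether the exploit
        # succeeded (1) or failed (0) during this vulnerability window.
        a, b = self.posterior[exploit]
        self.posterior[exploit] = [a + succeeded, b + (1 - succeeded)]


# Usage: two hypothetical exploits; observe one success and one failure.
defender = ThompsonDefender(["cve_a", "cve_b"])
defender.update("cve_a", 1)
defender.update("cve_b", 0)
model = defender.sample_probs()  # sampled success probabilities in [0, 1]
```

Because each sampled model is drawn from the posterior, exploits observed to succeed often are assigned higher probabilities over time, which is how the exploration/exploitation trade-off is handled without a separate exploration parameter.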
Index Terms
Adaptive Cyber Defense Against Multi-Stage Attacks Using Learning-Based POMDP