Abstract
Cybersecurity experts have estimated the total global cost of malicious hacking activities at $450 billion annually. Cyber Threat Intelligence (CTI) has emerged as a viable approach to combat this societal issue. However, existing CTI processes are criticized as inherently reactive to known threats. To address this concern, CTI experts have suggested proactively examining emerging threats in the vast, international online hacker community. In this study, we aim to develop proactive CTI capabilities by exploring online hacker forums to identify emerging threats in terms of popularity and tool functionality. To achieve these goals, we create a novel Diachronic Graph Embedding Framework (D-GEF). D-GEF operates on a Graph-of-Words (GoW) representation of hacker forum text to generate word embeddings in an unsupervised manner. Semantic displacement measures adopted from the diachronic linguistics literature identify how terminology evolves over time. A series of benchmark experiments illustrates D-GEF's ability to generate higher-quality word embeddings than state-of-the-art word embedding models (e.g., word2vec) in tasks pertaining to semantic analogy, clustering, and threat classification. D-GEF's practical utility is illustrated with in-depth case studies on web application and denial of service threats targeting PHP and Windows technologies, respectively. We also discuss the implications of the proposed framework for strategic, operational, and tactical CTI scenarios. All datasets and code are publicly released to facilitate scientific reproducibility and extensions of this work.
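To make two of the ideas named in the abstract concrete, the following is a minimal Python sketch, not the released D-GEF code. It assumes a Graph-of-Words built as a sliding-window co-occurrence graph and a semantic displacement measured as the cosine distance between one term's embeddings in two time periods. The function names, the window size, the example post, and the two-dimensional vectors are all hypothetical, and the vectors are assumed to already lie in a common, aligned embedding space.

```python
from collections import Counter
from math import sqrt


def graph_of_words(tokens, window=3):
    """Build an undirected, weighted co-occurrence graph from a token list.

    An edge (u, v) counts how often u and v occur within `window`
    consecutive tokens. The window size is an illustrative assumption.
    """
    edges = Counter()
    for i, u in enumerate(tokens):
        # Pair the current token with the tokens that follow it in the window.
        for v in tokens[i + 1 : i + window]:
            if u != v:
                edges[tuple(sorted((u, v)))] += 1
    return edges


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Hypothetical forum post, already tokenized and lowercased.
post = "new php exploit bypasses php upload filter".split()
gow = graph_of_words(post, window=3)

# Hypothetical embeddings of the same term in two time periods,
# assumed to be aligned to a shared space beforehand.
vec_period_1 = [1.0, 0.0]
vec_period_2 = [0.0, 1.0]
shift = cosine_distance(vec_period_1, vec_period_2)
```

In this sketch a larger `shift` for a term suggests that the contexts in which it is used have changed between the two periods. Node embeddings would be learned from the graph by a graph embedding method before the displacement step, which is omitted here.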
Proactively Identifying Emerging Hacker Threats from the Dark Web: A Diachronic Graph Embedding Framework (D-GEF)