Research article · DOI: 10.1145/3128572.3140446

Malware Classification and Class Imbalance via Stochastic Hashed LZJD

Published: 03 November 2017

Abstract

Few methods for malware classification can be applied without domain knowledge. In this work, we develop a new feature vector representation, SHWeL, by extending the recently proposed Lempel-Ziv Jaccard Distance (LZJD). SHWeL vectors improve upon LZJD's accuracy, outperform byte n-grams, and allow us to build efficient algorithms for both training (a weakness of byte n-grams) and inference (a weakness of LZJD). SHWeL also allows us to directly tackle the class imbalance problem, which is common in malware-related tasks. Compared to existing methods such as SMOTE, SHWeL provides significantly improved accuracy while reducing algorithmic complexity to O(N). Because our approach is developed without domain knowledge, it can be readily re-applied to any new domain that requires classifying byte sequences.
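For readers unfamiliar with the Lempel-Ziv Jaccard Distance that the abstract builds on, the following is a minimal Python sketch of the underlying idea only. It is not SHWeL and not the authors' implementation. It assumes the usual formulation: an LZ78-style parse turns each byte sequence into a set of substrings, and the distance is one minus the Jaccard similarity of the two sets. The function names `lz_set` and `lzjd` are illustrative. Practical implementations typically approximate the Jaccard similarity with min-hashing instead of comparing full sets, which this sketch omits.

```python
def lz_set(data: bytes) -> set:
    """Collect the distinct substrings produced by an LZ78-style parse."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # First time this substring appears: record it and begin
            # a new substring at the next byte.
            seen.add(chunk)
            start = end
    return seen


def lzjd(a: bytes, b: bytes) -> float:
    """Exact (non-hashed) Jaccard distance between the two LZ sets."""
    set_a, set_b = lz_set(a), lz_set(b)
    union = set_a | set_b
    if not union:
        # Two empty inputs are treated as identical (an assumption).
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


if __name__ == "__main__":
    print(lzjd(b"aaaa", b"abab"))  # sets {a, aa} and {a, b, ab} share one element
```

Because the parse works on raw bytes, nothing in the sketch depends on the file format, which is the sense in which the abstract describes the approach as free of domain knowledge.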

Published In

AISec '17: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security
November 2017
140 pages
ISBN: 9781450352024
DOI: 10.1145/3128572
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cyber security
  2. lzjd
  3. malware classification
  4. shwel

Qualifiers

  • Research-article

Conference

CCS '17
Acceptance Rates

AISec '17 Paper Acceptance Rate 11 of 36 submissions, 31%;
Overall Acceptance Rate 94 of 231 submissions, 41%
