DOI: 10.1145/3097983.3098111
research-article

An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance

Published: 04 August 2017

Abstract

The Normalized Compression Distance (NCD) has been used in a number of domains to compare objects with varying feature types. This flexibility comes from the use of general purpose compression algorithms as the means of computing distances between byte sequences. Such flexibility makes NCD particularly attractive for cases where the right features to use are not obvious, such as malware classification. However, NCD can be computationally demanding, thereby restricting the scale at which it can be applied. We introduce an alternative metric also inspired by compression, the Lempel-Ziv Jaccard Distance (LZJD). We show that this new distance has desirable theoretical properties, as well as comparable or superior performance for malware classification, while being easy to implement and orders of magnitude faster in practice.
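
To make the contrast in the abstract concrete, the following is a minimal Python sketch of the two distances. The abstract does not spell out either construction, so the details here are assumptions: NCD uses its standard formula with `zlib` as a stand-in compressor, and LZJD is taken to be the Jaccard distance between sets of phrases collected by an LZ78-style scan. The function names `ncd`, `lz_set`, and `lzjd` are hypothetical, and any hashing or approximation the paper may use for speed is omitted.

```python
import zlib
from typing import Set


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance with zlib as the compressor C (an assumption).

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    """
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def lz_set(data: bytes) -> Set[bytes]:
    """Collect the distinct phrases found by an LZ78-style scan (assumed construction).

    The current phrase grows one byte at a time; the first time it is not
    already in the set, it is added and a new phrase starts at the next byte.
    A trailing phrase that is already in the set is dropped.
    """
    phrases: Set[bytes] = set()
    start = 0
    for end in range(1, len(data) + 1):
        phrase = data[start:end]
        if phrase not in phrases:
            phrases.add(phrase)
            start = end
    return phrases


def lzjd(x: bytes, y: bytes) -> float:
    """Jaccard distance between the phrase sets of x and y."""
    sx, sy = lz_set(x), lz_set(y)
    union = sx | sy
    if not union:
        return 0.0  # both inputs empty: treat as identical
    return 1.0 - len(sx & sy) / len(union)


if __name__ == "__main__":
    a = b"hello world " * 50
    b = bytes(range(256)) * 3
    print("NCD :", ncd(a, a), ncd(a, b))
    print("LZJD:", lzjd(a, a), lzjd(a, b))
```

In this sketch, `ncd` runs the compressor three times per pair, including once on the concatenation, while `lzjd` scans each input once and then compares sets, so the phrase set for an input can be computed once and reused across many comparisons.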

Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cyber security
  2. jaccard similarity
  3. lempel-ziv
  4. malware classification
  5. normalized compression distance

Qualifiers

  • Research-article

Conference

KDD '17

Acceptance Rates

KDD '17 Paper Acceptance Rate 64 of 748 submissions, 9%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 45
  • Downloads (last 6 weeks): 5
Reflects downloads up to 08 Feb 2025

Cited By

  • (2024) Pb-Hash: Partitioned b-bit Hashing. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval (239-246). DOI: 10.1145/3664190.3672523. Online publication date: 2-Aug-2024
  • (2024) Crowdsourcing Malware Family Annotation: Joint Class-Determined Tag Extraction and Weakly-Tagged Sample Inference. IEEE Transactions on Network and Service Management 21:4 (4763-4776). DOI: 10.1109/TNSM.2024.3373601. Online publication date: Aug-2024
  • (2024) Anomaly Detection in Video Using Compression. 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR) (127-133). DOI: 10.1109/MIPR62202.2024.00027. Online publication date: 7-Aug-2024
  • (2023) Recasting self-attention with holographic reduced representations. Proceedings of the 40th International Conference on Machine Learning (490-507). DOI: 10.5555/3618408.3618431. Online publication date: 23-Jul-2023
  • (2023) Semi-Supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection. ACM Transactions on Privacy and Security 26:4 (1-27). DOI: 10.1145/3624567. Online publication date: 13-Nov-2023
  • (2023) Nation-State Threat Actor Attribution Using Fuzzy Hashing. IEEE Access 11 (1148-1165). DOI: 10.1109/ACCESS.2022.3233403. Online publication date: 2023
  • (2023) EvadeDroid: A Practical Evasion Attack on Machine Learning for Black-box Android Malware Detection. Computers & Security (103676). DOI: 10.1016/j.cose.2023.103676. Online publication date: Dec-2023
  • (2023) An efficient two-stage pipeline model with filtering algorithm for mislabeled malware detection. Computers & Security 135 (103499). DOI: 10.1016/j.cose.2023.103499. Online publication date: Dec-2023
  • (2023) Marvolo: Programmatic Data Augmentation for Deep Malware Detection. Machine Learning and Knowledge Discovery in Databases: Research Track (270-285). DOI: 10.1007/978-3-031-43412-9_16. Online publication date: 17-Sep-2023
  • (2022) A Survey of the Recent Trends in Deep Learning Based Malware Detection. Journal of Cybersecurity and Privacy 2:4 (800-829). DOI: 10.3390/jcp2040041. Online publication date: 28-Sep-2022
