research-article

A Novel Hybrid Approach for Multi-Dimensional Data Anonymization for Apache Spark

Published: 23 November 2021

Abstract

Multi-dimensional data anonymization approaches (e.g., Mondrian) provide more fine-grained data privacy by applying a different anonymization strategy to each attribute. Many variations of multi-dimensional anonymization have been implemented on distributed processing platforms (e.g., MapReduce, Spark) to take advantage of their scalability and support for parallelism. According to our critical analysis of overheads, neither the existing iteration-based nor the existing recursion-based approaches provide effective mechanisms for creating the optimal number and relative size of resilient distributed datasets (RDDs), and they therefore suffer heavily from performance overheads. To solve this issue, we propose a novel hybrid approach for effectively implementing a multi-dimensional data anonymization strategy (e.g., Mondrian) that is scalable and provides high performance. Our hybrid approach provides a mechanism that creates far fewer RDDs, with smaller partitions attached to each RDD, than existing approaches. This optimal approach to RDD creation and operations is critical for the many multi-dimensional data anonymization applications that incur tremendous execution complexity. The new mechanism in our proposed hybrid approach can dramatically reduce the critical overheads involved in re-computation cost, shuffle operations, message exchange, and cache management.
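For readers unfamiliar with Mondrian, the following is a minimal single-machine sketch of the recursive median-split idea behind multi-dimensional k-anonymization. It is not the hybrid Spark approach proposed in the article and involves no RDDs. The function names `mondrian` and `generalize`, the restriction to numeric attributes, and the strict median split are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumption: every quasi-identifier attribute is numeric.
Record = Tuple[int, ...]


def mondrian(records: List[Record], k: int) -> List[List[Record]]:
    """Split records recursively at the median of the widest attribute.

    A split is accepted only if both halves keep at least k records;
    when no attribute allows such a split, the group is returned as is.
    """
    if not records:
        return []
    dims = range(len(records[0]))
    # Try attributes from the widest value range to the narrowest.
    by_width = sorted(
        dims,
        key=lambda d: max(r[d] for r in records) - min(r[d] for r in records),
        reverse=True,
    )
    for d in by_width:
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [records]


def generalize(partition: List[Record]) -> Tuple[Tuple[int, int], ...]:
    """Summarize a partition as one (min, max) range per attribute."""
    dims = range(len(partition[0]))
    return tuple(
        (min(r[d] for r in partition), max(r[d] for r in partition))
        for d in dims
    )
```

Each recursive call here works on an in-memory list. In a distributed setting, each split would correspond to creating or filtering a distributed dataset, which is where the number and size of RDDs discussed in the abstract become a cost concern.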

    • Published in

      ACM Transactions on Privacy and Security, Volume 25, Issue 1
      February 2022
      219 pages
      ISSN:2471-2566
      EISSN:2471-2574
      DOI:10.1145/3485162

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 November 2021
      • Accepted: 1 September 2021
      • Revised: 1 August 2021
      • Received: 1 October 2020

      Qualifiers

      • research-article
      • Refereed