Abstract
Multi-dimensional data anonymization approaches (e.g., Mondrian) provide more fine-grained data privacy by applying a different anonymization strategy to each attribute. Many variations of multi-dimensional anonymization have been implemented on distributed processing platforms (e.g., MapReduce, Spark) to take advantage of their support for scalability and parallelism. According to our critical analysis of overheads, neither the existing iteration-based nor the recursion-based approaches provide effective mechanisms for creating the optimal number and relative size of resilient distributed datasets (RDDs), and they therefore suffer heavily from performance overheads. To solve this issue, we propose a novel hybrid approach for effectively implementing a multi-dimensional data anonymization strategy (e.g., Mondrian) that is scalable and provides high performance. Our hybrid approach provides a mechanism that creates far fewer RDDs, with smaller partitions attached to each RDD, than existing approaches. This optimal approach to RDD creation and operations is critical for the many multi-dimensional data anonymization applications that incur tremendous execution complexity. The new mechanism in our proposed hybrid approach can dramatically reduce the critical overheads involved in re-computation cost, shuffle operations, message exchange, and cache management.
Index Terms
A Novel Hybrid Approach for Multi-Dimensional Data Anonymization for Apache Spark