Abstract
With the evolution of the Internet of Things (IoT), clinical data is growing exponentially and increasingly relies on smart technologies. The resulting big biomedical data is confidential, as it contains patients’ personal information and clinical findings. Big biomedical data is usually stored in the cloud, which makes it convenient to access and share. Data shared for research purposes helps to reveal useful and previously unexposed insights. Unfortunately, sharing such sensitive data also introduces privacy threats. Clinical data is generally available in textual format (e.g., perception reports). Within natural language processing, many studies have been published on mitigating privacy breaches in textual clinical data. However, the current studies still have limitations and shortcomings that must be addressed. In this article, a novel framework for textual medical data privacy, called Deep-Confidentiality, is proposed. The proposed framework improves Medical Entity Recognition (MER) using deep neural networks, and improves sanitization, compared to current state-of-the-art techniques. Moreover, a new and generic utility metric is proposed that overcomes the shortcomings of the existing utility metric and gives a true representation of sanitized documents relative to the original documents. To check its effectiveness, the proposed framework is evaluated on the i2b2-2010 NLP challenge dataset, which is considered one of the more complex medical datasets for MER. The proposed framework improves MER by 7.8% in recall, 7% in precision, and 3.8% in F1-score over existing deep learning models. It also improves the data utility of sanitized documents by up to 13.79% when the value of k is 3.
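The recall, precision, and F1-score figures above are the standard measures for entity recognition. As a minimal sketch of how they relate, the snippet below scores a set of predicted entities against a gold set. It assumes exact-match scoring on (start, end, label) triples; the abstract does not state the matching criterion used, and the function name `prf1`, the spans, and the labels are illustrative only.

```python
from typing import Set, Tuple

# An entity is (start token, end token, label); this layout is an assumption.
Entity = Tuple[int, int, str]


def prf1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match scoring."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations for one document (not taken from any dataset).
gold = {(0, 1, "problem"), (4, 4, "test"), (7, 9, "treatment")}
predicted = {
    (0, 1, "problem"),     # correct
    (4, 4, "treatment"),   # right span, wrong label: counts as an error
    (7, 9, "treatment"),   # correct
    (11, 11, "test"),      # spurious entity
}

precision, recall, f1 = prf1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a gain in F1 can be smaller than the gains in either component, which is consistent with the pattern of improvements reported above.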
Deep-Confidentiality: An IoT-Enabled Privacy-Preserving Framework for Unstructured Big Biomedical Data