Abstract
Many modern entity recognition systems, including the current state-of-the-art de-identification systems, are based on bidirectional long short-term memory (biLSTM) units augmented by a conditional random field (CRF) sequence optimizer. These systems process the input sentence by sentence. This approach prevents them from capturing dependencies across sentence boundaries and makes accurate sentence boundary detection a prerequisite. Because sentence boundary detection can be problematic, especially in clinical reports, where dependencies and co-references across sentence boundaries are abundant, these systems have clear limitations. In this study, we built a new system on the framework of one of the current state-of-the-art de-identification systems, NeuroNER, to overcome these limitations. The new system incorporates context embeddings through forward and backward \(n\)-grams without using sentence boundaries. Our context-enhanced de-identification (CEDI) system captures dependencies across sentence boundaries and bypasses the sentence boundary detection problem altogether. We enhanced this system with deep affix features and an attention mechanism to capture the pertinent parts of the input. The CEDI system outperforms NeuroNER on the 2006 i2b2 de-identification challenge dataset, the 2014 i2b2 shared task de-identification dataset, and the 2016 CEGS N-GRID de-identification dataset (\(p < 0.01\)). All three datasets comprise narrative clinical reports in English but contain different note types, ranging from discharge summaries to psychiatric notes. Enhancing CEDI with deep affix features and the attention mechanism further increased performance.
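To make the idea of sentence-boundary-free context concrete, the following is a minimal sketch, not the CEDI implementation, of how preceding and following \(n\)-gram windows could be collected for each token over a whole document. The function name `context_ngrams`, the `PAD` symbol, and the reading of "backward" as the preceding tokens and "forward" as the following tokens are assumptions made for illustration only; the abstract does not specify these details.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding symbol for positions outside the document


def context_ngrams(tokens: List[str], n: int) -> List[Tuple[List[str], List[str]]]:
    """Return, for each token, its (preceding n-gram, following n-gram).

    The windows run over the full token stream, so they can span what a
    sentence splitter would have treated as a sentence boundary.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    # Pad both ends so every token has exactly n neighbours on each side.
    padded = [PAD] * n + list(tokens) + [PAD] * n
    windows = []
    for i in range(len(tokens)):
        j = i + n  # position of token i inside the padded list
        preceding = padded[j - n:j]
        following = padded[j + 1:j + 1 + n]
        windows.append((preceding, following))
    return windows


if __name__ == "__main__":
    # No sentence splitting is applied; "He" still sees "Smith" in its context.
    doc = "Seen by Dr Smith . He was discharged".split()
    for token, (prev, nxt) in zip(doc, context_ngrams(doc, 2)):
        print(token, prev, nxt)
```

In a full system, each window would presumably be mapped to an embedding and combined with the token representation before the biLSTM-CRF layers; that step is omitted here because the abstract does not describe it.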
A Context-Enhanced De-identification System