Research article

A Context-Enhanced De-identification System

Published: 15 October 2021

Abstract

Many modern entity recognition systems, including the current state-of-the-art de-identification systems, are based on bidirectional long short-term memory (biLSTM) units augmented by a conditional random field (CRF) sequence optimizer. These systems process the input sentence by sentence, which prevents them from capturing dependencies across sentence boundaries and makes accurate sentence boundary detection a prerequisite. Since sentence boundary detection can be problematic, especially in clinical reports, where dependencies and co-references across sentence boundaries are abundant, these systems have clear limitations. In this study, we built a new system on the framework of one of the current state-of-the-art de-identification systems, NeuroNER, to overcome these limitations. The new system incorporates context embeddings through forward and backward \(n\)-grams without relying on sentence boundaries. Our context-enhanced de-identification (CEDI) system thus captures dependencies across sentence boundaries and bypasses the sentence boundary detection problem altogether. We further enhanced the system with deep affix features and an attention mechanism to capture the pertinent parts of the input. The CEDI system outperforms NeuroNER on the 2006 i2b2 de-identification challenge dataset, the 2014 i2b2 shared task de-identification dataset, and the 2016 CEGS N-GRID de-identification dataset (\(p < 0.01\)). All three datasets comprise narrative clinical reports in English but contain different note types, ranging from discharge summaries to psychiatric notes. Enhancing CEDI with deep affix features and the attention mechanism further increased performance.
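The core idea of replacing sentence-by-sentence processing with forward and backward \(n\)-gram context windows can be illustrated with a minimal sketch. This is not the authors' implementation; the function name and window size are hypothetical, and real systems would feed these windows into an embedding layer rather than return raw tokens. Note how the token "Smith" still sees "Dr" and "." as backward context even though a naive sentence splitter would have placed a boundary at the period.

```python
def context_ngrams(tokens, n=3):
    """For each token in a document-level token stream, collect up to n
    preceding tokens (backward context) and up to n following tokens
    (forward context), ignoring sentence boundaries entirely."""
    contexts = []
    for i in range(len(tokens)):
        backward = tokens[max(0, i - n):i]      # up to n tokens before position i
        forward = tokens[i + 1:i + 1 + n]       # up to n tokens after position i
        contexts.append((backward, tokens[i], forward))
    return contexts

# A clinical-style fragment where "Dr ." would trip a sentence splitter:
tokens = ["Pt", "seen", "by", "Dr", ".", "Smith", "today"]
for backward, tok, forward in context_ngrams(tokens, n=2):
    print(backward, tok, forward)
```

Because the windows are computed over the whole token stream, no sentence boundary detector is needed, which is the property the abstract highlights.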

