
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Published: 15 October 2021

Abstract

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general-domain corpora, such as newswire and the Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.
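To make the released resources concrete, the sketch below shows one way to load a biomedical pretrained checkpoint for token-level fine-tuning, assuming the Hugging Face transformers library. The model identifier and the "Disease" label are illustrative assumptions, not fixed by the abstract (the BLURB page at https://aka.ms/BLURB lists the actually released models). It also uses a plain IO tagging scheme, reflecting the article's finding that more complex schemes (e.g., BIO/BILOU) are unnecessary with BERT models.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library (and
# PyTorch) are installed. The checkpoint ID below is an assumption; use
# whichever model the BLURB leaderboard points to.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed ID

# Plain IO tagging: one "inside" label per entity type plus "O", rather
# than a BIO/BILOU scheme. The entity type "Disease" is hypothetical.
labels = ["O", "I-Disease"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Tokenize one biomedical sentence and run a forward pass; fine-tuning
# would proceed from here with a standard token-classification training
# loop over labeled NER data.
inputs = tokenizer("BRCA1 mutations increase breast cancer risk.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```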



• Published in

  ACM Transactions on Computing for Healthcare, Volume 3, Issue 1 (January 2022), 255 pages
  ISSN: 2691-1957
  EISSN: 2637-8051
  DOI: 10.1145/3485154

  Copyright © 2021 held by the owner/author(s). Publication rights licensed to ACM.

  Publisher: Association for Computing Machinery, New York, NY, United States

  Publication History

  • Received: 1 July 2020
  • Revised: 1 January 2021
  • Accepted: 1 March 2021
  • Published: 15 October 2021


