Abstract
Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general-domain corpora, such as newswire and the Web. A prevailing assumption is that even domain-specific pretraining can benefit from starting with general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.
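The abstract's point about tagging schemes can be made concrete. The sketch below is not from the paper; the tokens, entity types, and helper names are illustrative. It contrasts the standard BIO scheme (distinct B- tags marking entity starts) with the simpler IO-style scheme, the kind of simplification the authors find works just as well with BERT models:

```python
# Illustrative sketch: BIO vs. the simpler IO tagging scheme for NER.
# Spans are (start, end, type) over token indices, end exclusive.

def to_bio(tokens, spans):
    """Tag each token with B-/I-/O under the BIO scheme."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # B- marks the entity start
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # I- marks continuation
    return tags

def to_io(tokens, spans):
    """Simpler IO scheme: every entity token gets the same I- tag."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        for i in range(start, end):
            tags[i] = f"I-{etype}"
    return tags

# Hypothetical example sentence with a gene and a disease mention.
tokens = ["EGFR", "mutations", "drive", "lung", "cancer"]
spans = [(0, 1, "GENE"), (3, 5, "DISEASE")]
print(to_bio(tokens, spans))  # ['B-GENE', 'O', 'O', 'B-DISEASE', 'I-DISEASE']
print(to_io(tokens, spans))   # ['I-GENE', 'O', 'O', 'I-DISEASE', 'I-DISEASE']
```

The IO scheme halves the tag vocabulary at the cost of being unable to separate adjacent same-type entities; the paper's finding is that with contextual models such as BERT, this extra expressiveness of complex schemes rarely pays off.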
Index Terms
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing