It is now well established that biomedical text requires methods tailored to the domain. Developments in deep learning and a series of successful shared challenges have contributed to steady progress in techniques for natural language processing of biomedical text. Contributing to this ongoing progress, and focusing particularly on computational methods, this special issue was created to encourage research on novel approaches for analyzing biomedical text. The six papers selected for the issue offer a diversity of novel methods that leverage biomedical text for research and clinical uses.
A well-established practice in pretraining deep learning models for biomedical applications has been to adopt a promising model already pretrained on a general-domain corpus and then “add” further pretraining with biomedical corpora. In “DOI: Domain-specific language model pretraining for biomedical natural language processing”, Gu et al. successfully challenge this approach. The authors conducted an experiment in which multiple standard benchmarks were used to compare a model pretrained entirely and only on biomedical corpora with models pretrained using the “add-on” approach. The results showed an impressive improvement in favor of pretraining only with biomedical corpora. The study provides an excellent data point in support of clarity in model training rather than accumulation.
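One concrete reason in-domain pretraining can help is the vocabulary: a subword tokenizer learned on general text tends to fragment biomedical terms into many pieces, while one learned on biomedical text keeps them whole. The sketch below illustrates this with a minimal greedy WordPiece-style tokenizer; the two vocabularies are invented for illustration and are not taken from the paper or any real pretrained model.

```python
# Minimal WordPiece-style greedy tokenizer (illustrative only; the
# vocabularies below are made up, not drawn from any pretrained model).
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation-piece marker
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary piece matches at this position
        tokens.append(piece)
        start = end
    return tokens

# A general-domain vocabulary fragments the term into four pieces...
general_vocab = {"ly", "##mp", "##ho", "##ma"}
# ...while a biomedical vocabulary keeps it whole.
biomedical_vocab = {"lymphoma"}

print(wordpiece_tokenize("lymphoma", general_vocab))     # ['ly', '##mp', '##ho', '##ma']
print(wordpiece_tokenize("lymphoma", biomedical_vocab))  # ['lymphoma']
```

The intact token gives the model a single, meaningful input unit instead of four fragments whose embeddings must be composed.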
Tariq et al. also find domain-aware tokenization and embeddings to be more effective, in their paper “DOI: Bridging the Gap Between Structured and Free-form Radiology Reporting: A Case-study on Coronary CT Angiography”. They compare a variety of models constructed to predict the severity of cardiovascular disease from the language used within free-text radiology reports. Models that used medical-domain-aware tokenization and word embeddings of the reports were consistently more effective than those based on raw words. The better models are able to accurately predict disease severity under real-world conditions of diverse terminology from different radiologists and imbalanced class sizes.
Two papers address the problem of maintaining the privacy of clinical documents, though from widely different perspectives. De-identification is the most widely used approach for eliminating Protected Health Information (PHI) from clinical documents before making the data available to NLP researchers. In “DOI: A Context-enhanced De-identification System”, Kahyun et al. describe an improved de-identification technique for clinical records. Their context-enhanced de-identification system, called CEDI, uses attention mechanisms in a long short-term memory (LSTM) network to capture the appropriate context. This context allows the system to detect dependencies that cross sentence boundaries, an important feature since clinical reports often contain such dependencies. Nonetheless, accurate and broad-coverage de-identification of unstructured data remains challenging, and lack of trust in the de-identification process can be a serious limiting factor for data release.
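CEDI's architecture is more involved than can be shown here, but the attention mechanism it builds on can be sketched in a few lines. The toy vectors below are ours, not the paper's; the point is only that attention weights let a model pull in context from any token, including tokens in a different sentence.

```python
import math

def attention(query, keys, values):
    """Dot-product attention: weight each value by softmax(query . key).

    If keys/values come from tokens across sentence boundaries, the weights
    show how a model can attend to context outside the current sentence.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Toy vectors: the query matches the first key far better than the second,
# so almost all of the attention mass lands on the first value.
weights, context = attention(query=(4.0, 0.0),
                             keys=[(1.0, 0.0), (0.0, 1.0)],
                             values=[(1.0, 0.0), (0.0, 1.0)])
```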
In “DOI: Differentially Private Medical Texts Generation using Generative Neural Networks”, Aziz et al. take a different approach to the privacy of clinical documents. They propose high-accuracy synthetic generation of clinical documents as a practical alternative. Using self-attention-based neural networks and differential privacy (i.e., the ability to control the level of privacy relative to the original document) in their method, they demonstrate that modern generative approaches can be used effectively here. Novel metrics based on token-level distribution, document-level similarity for an outcome, and corpus-level adversarial classification were used to measure the quality of their approach. The results suggest a viable alternative to de-identification.
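The paper's exact training procedure is not reproduced here; a common way to combine neural generation with differential privacy is DP-SGD-style gradient sanitization, sketched below under that assumption (function and parameter names are ours). Each example's gradient is clipped to a fixed norm before noise calibrated to that norm is added, bounding any single patient record's influence on the model.

```python
import math
import random

def dp_sanitized_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """DP-SGD-style sanitization (illustrative sketch, not the paper's code).

    Clip each per-example gradient to L2 norm <= clip_norm, sum the clipped
    gradients, and add Gaussian noise scaled by noise_multiplier * clip_norm.
    A larger noise_multiplier means stronger privacy but lower fidelity.
    """
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    sigma = noise_multiplier * clip_norm
    return [s + rng.gauss(0.0, sigma) for s in summed]

rng = random.Random(0)
# One gradient of norm 5 is clipped to unit norm (noise disabled here
# so that the effect of clipping alone is visible).
print(dp_sanitized_gradient([[3.0, 4.0]], clip_norm=1.0,
                            noise_multiplier=0.0, rng=rng))
```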
Increasingly, social media is complementing Electronic Health Records as a valuable source of patients’ disease status and responses to treatments. The exponential growth of social media and relaxed concerns about patient privacy in this channel mean that larger and less constrained data are available for analysis. In “DOI: Supporting Personalized Health Care with Social Media Analytics”, Grani et al. develop novel methods for characterizing adverse drug events reported by patients in web forums. The study included social media posts of patients who were receiving treatments for hypothyroidism. A particularly novel aspect of their work is the use of two adversarial networks (as in a GAN) to generate compressed latent vectors for social media posts, which are subsequently clustered to identify important discussion topics related to treatment responses and adverse drug reactions (ADRs). One of the networks (the classic “discriminator”) is an autoencoder that is regularized by the adversarial network (the “generator”), which learns to produce realistic synthetic posts. Through detailed analysis of patient response clusters, using the results from topic modeling, this paper establishes a novel methodology for analyzing the exponentially increasing volume of posts from web forums.
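The adversarial autoencoder itself is beyond a short sketch, but the downstream step, clustering the compressed latent vectors to surface discussion topics, can be illustrated with a plain k-means pass. The 2-D "latent vectors" below are invented for illustration; real post embeddings would have many more dimensions.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, iters=10):
    """Plain k-means with deterministic initialization (first k points)."""
    centroids = [points[j] for j in range(k)]
    assignments = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(range(k), key=lambda j: _dist2(p, centroids[j]))
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = _mean(members)
    return assignments

# Toy 2-D "latent vectors": two well-separated groups of posts.
latents = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(kmeans(latents, k=2))  # first three posts share one cluster, last three the other
```

In the paper's pipeline the clusters found this way are then examined via topic modeling to characterize treatment responses and ADRs.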
The final paper, “ DOI: GeCoAgent: A Conversational Agent for Empowering Genomic Data Extraction and Analysis” by Crovari et al. uses natural language processing not to analyze texts but to support a conversational interface between a genomics researcher and a system for managing genomics experiments. The goal of the overall GeCoAgent system is to enable biologists with limited computer skills to independently exploit the computational tools available to manage and interpret data arising from genomics experiments. The language processing within the system allows the biologist to interact with it through dialogue, enhancing the biologist user's experience and capabilities.
We want to thank the reviewers of the papers submitted to this special issue for their diligent work, often under time pressure. Without their volunteer efforts this special issue would not have been possible. We leaned on several of them multiple times through personal requests, and they always came through. We, and the journal's readers, are truly indebted to them for their contributions. We also acknowledge the scientific and professional contributions of the authors of all submitted papers, and their immense patience while we conducted the review process during a once-in-a-century worldwide pandemic. Lastly, we are grateful to the Editors-in-Chief of ACM Health, John A. Stankovic and Insup Lee, for entrusting us with this special issue, and for the support of the editorial staff, especially Victoria White.