Enabling Patient-side Disease Prediction via the Integration of Patient Narratives

Disease prediction holds considerable significance in modern healthcare because of its crucial role in facilitating early intervention and implementing effective prevention measures. However, most recent disease prediction approaches rely heavily on laboratory test outcomes (e.g., blood tests and medical imaging such as X-rays). Gaining access to such data for precise disease prediction is often complex from a patient's standpoint, and the data typically become available only after a consultation. To make disease prediction available on the patient side, we propose Personalized Medical Disease Prediction (PoMP), which predicts diseases using patient health narratives, including textual descriptions and demographic information. By applying PoMP, patients can gain a clearer understanding of their conditions, empowering them to seek appropriate medical specialists directly and thereby reducing the time spent navigating healthcare communication to locate suitable doctors. We conducted extensive experiments using real-world data from Haodf to showcase the effectiveness of PoMP.


INTRODUCTION
Disease prediction has become a highly prioritized and essential aspect in healthcare and related fields in recent years [12]. The ability to forecast illnesses offers invaluable benefits such as early detection and intervention, particularly crucial for conditions like cancer or heart disease where timely treatment is pivotal. Moreover, predicting chronic diseases (e.g., diabetes) can lead to lifestyle adjustments and timely medications, which potentially halt or mitigate disease progression. Additionally, disease prediction provides invaluable insights into potential health issues before patients seek medical attention, which is particularly beneficial in resource-limited situations. It also benefits patients who, due to limited knowledge of their specific conditions, invest significant time in communication to find the most appropriate doctors.
However, to the best of our knowledge, current disease prediction techniques, encompassing both traditional statistical methods [1] and advanced deep learning approaches [12], rely heavily on data obtained through clinical assessments, including laboratory tests (e.g., blood and urine tests) and diagnostic imaging (e.g., X-rays and CT scans). Unfortunately, such comprehensive doctor-side health data typically become available only after patients engage with healthcare professionals. Consequently, patients who describe their symptoms in their own words, without professional terminology or accurate descriptions, may face significant challenges in accessing appropriate medical guidance. This challenge is further amplified by the growing popularity of online doctor consultations, a trend accelerated by the COVID-19 pandemic.
To address the outlined challenges and elevate the performance of disease prediction approaches, we propose Personalized Medical Disease Prediction (PoMP), which predicts diseases according to patient-side narratives including patient-provided textual descriptions and patient demographic information. PoMP enables rapid comprehension of potential health conditions for individuals and seamless connections with doctors specializing in relevant medical disciplines. This innovation simplifies the typically complex process of identifying the appropriate medical department for consultation, thereby significantly reducing the time and effort expended by patients in navigating the healthcare system.
In summary, our contributions are as follows:
Dataset Collection: To assess the efficacy of PoMP, we collected records of patient-doctor consultations from Haodf, a leading online doctor consultation platform in China. Existing publicly available datasets for disease prediction usually focus on various patient indicators during hospitalization but fall short in capturing patient narratives [3,8]. In this work, we acquired narratives from the patient's perspective, including textual descriptions as well as basic demographic information (such as age and gender). Additionally, we collected the corresponding diagnoses made by the doctors for further analysis and assessment. We believe that this dataset will serve as a valuable resource for future research.
Patient-side Disease Prediction: To the best of our knowledge, PoMP is the first method capable of predicting a patient's diseases exclusively through patient-side narratives, without relying on any diagnostic test outcomes. PoMP presents a promising approach and opens up new possibilities for patient-side disease prediction.
Two-tiered Generic Architecture: Diseases can be categorized into various levels according to different criteria. Taking pneumonia as an example, the category can be further broken down into subcategories such as pulmonary nodules and lung adenocarcinoma. To leverage the hierarchical nature of disease classification, we introduce a two-tiered classifier architecture. This method first predicts broad categories and then narrows down to specific disease predictions.
Our experimental results on the Haodf dataset have shown that this approach achieves state-of-the-art (SOTA) performance in 6 out of 7 evaluation scenarios.

METHODOLOGY

2.1 Preliminaries
Disease prediction is the process of using a patient's medical profile $X$ to predict a probable disease $y \in Y$. Such a medical profile $X = \{T, N, C\}$ typically contains the following three types of information: i) textual descriptions $T$, ii) numerical continuous data $N$, and iii) categorical discrete data $C$. More specifically, we gathered narratives from patients covering various perspectives:
i) Patient-provided textual descriptions $T$: These are natural-language descriptions obtained from patient self-introductions. They include chronic disease $t_{\text{chronic}}$, surgery history $t_{\text{surgery}}$, radiotherapy history $t_{\text{therapy}}$, medication usage $t_{\text{usage}}$, observed symptoms $t_{\text{symptom}}$, and allergy history $t_{\text{allergy}}$.
ii) Patient demographic information $N$ and $C$: This encompasses basic demographic details including gender $c_{\text{gender}}$, age $n_{\text{age}}$, height $n_{\text{height}}$, weight $n_{\text{weight}}$, pregnancy situation $c_{\text{pregnancy}}$, and disease duration $n_{\text{duration}}$.
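As an illustration only, the three data types of a medical profile could be held in a small container like the one below. This is a minimal sketch: the class name `PatientProfile`, the field names, the example values, and the units are assumptions made for the example and are not taken from the Haodf dataset schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PatientProfile:
    """Hypothetical container mirroring the three data types of a medical profile."""

    # Free-text fields keyed by description type, e.g. "symptom" or "allergy".
    textual: Dict[str, str] = field(default_factory=dict)
    # Continuous values: age, height, weight, disease duration.
    continuous: Dict[str, float] = field(default_factory=dict)
    # Discrete values: gender and pregnancy situation.
    categorical: Dict[str, str] = field(default_factory=dict)


# Example values and units (years, cm, kg, days) are invented for illustration.
profile = PatientProfile(
    textual={"symptom": "persistent cough for two weeks", "allergy": "none"},
    continuous={"age": 45.0, "height": 170.0, "weight": 68.0, "duration": 14.0},
    categorical={"gender": "female", "pregnancy": "no"},
)
```

Keeping the three groups separate makes it straightforward to route each group to its own encoder.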

Model Details
In this work, we propose a generic model, named Personalized Medical Disease Prediction (PoMP), to predict diseases according to patient health narratives. We first construct distinct encoders customized for each narrative type. Subsequently, we establish a two-tiered classifier for disease prediction, wherein we first predict the disease category and then the specific disease. Lastly, we discuss our training regime tailored for the two-tiered generic disease prediction framework.

Textual Description Encoder.
To effectively capture the semantic knowledge and contextual information in patient-provided textual descriptions, we adopt a Sentence Transformer [11] to encode $T$. Sentence Transformers are language models pre-trained on extensive natural language datasets, capable of considering entire sentences and producing embeddings that encapsulate the overall meaning of the text.
Specifically, we begin by adopting a prompt [6] for each type of textual description to better leverage the knowledge learned by the pre-trained language model. In the prompt template, [TYPE] is a type of textual description and [TEXT] denotes the corresponding textual description $t_{\text{[TYPE]}} \in T$.
Then, we concatenate all prompts into a unified sentence $S$. The Sentence Transformer first applies a tokenizer to convert $S$ into tokens $\text{Token}_S$ and to generate an attention mask $\text{Mask}_S$. Then, the Sentence Transformer applies an encoder to convert $\text{Token}_S$ into embeddings $\text{Emb}_S$ as follows:
$$\text{Emb}_S = \text{Encoder}(\text{Token}_S).$$
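The prompt construction and concatenation steps can be sketched as follows. The template string below is a hypothetical placeholder, since the exact prompt wording is not reproduced in this text; skipping empty fields is likewise an assumption made for the example.

```python
from typing import Dict

# Hypothetical template with a [TYPE] slot and a [TEXT] slot.
PROMPT_TEMPLATE = "{type}: {text}."


def build_unified_sentence(textual: Dict[str, str]) -> str:
    """Fill one prompt per description type and join them into one sentence."""
    prompts = []
    for desc_type, text in textual.items():
        cleaned = text.strip()
        if not cleaned:
            # Assumption: description types the patient left blank are skipped.
            continue
        prompts.append(PROMPT_TEMPLATE.format(type=desc_type, text=cleaned))
    return " ".join(prompts)


sentence = build_unified_sentence(
    {"chronic": "hypertension", "symptom": "chest pain ", "allergy": "  "}
)
```

The resulting string is what would then be passed to the tokenizer and encoder.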
Next, we apply mean pooling to reduce the dimensionality of the token-level embeddings while retaining important information as follows:
$$\text{Pool}_S = \frac{\sum_{i} \text{Emb}_{S,i} \cdot \text{Mask}_{S,i}}{\max\left(\sum_{i} \text{Mask}_{S,i},\ \epsilon\right)},$$
where $\epsilon$ denotes a small minimum value that avoids division by zero. Lastly, we apply a normalization layer to generate the final textual description embeddings as follows:
$$E_T = \text{Normalize}(\text{Pool}_S).$$

2.2.2 Demographic Information Encoder.
As mentioned in Section 2.1, demographic information is composed of continuous data and discrete data. We handle them through different processes.
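For reference, the masked mean pooling and normalization steps of the textual description encoder can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: it assumes the normalization layer is an L2 normalization, and the function names and default `eps` values are chosen for the example.

```python
import math
from typing import List


def masked_mean_pool(emb: List[List[float]], mask: List[int],
                     eps: float = 1e-9) -> List[float]:
    """Average token embeddings, ignoring positions where mask is 0."""
    dim = len(emb[0])
    summed = [0.0] * dim
    for vec, m in zip(emb, mask):
        for j in range(dim):
            summed[j] += vec[j] * m
    # eps is a floor on the denominator, so an all-zero mask cannot divide by zero.
    denom = max(float(sum(mask)), eps)
    return [s / denom for s in summed]


def l2_normalize(vec: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit length (assumed form of the normalization layer)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


# The third token is padding (mask 0) and does not affect the pooled vector.
pooled = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
unit = l2_normalize([3.0, 4.0])
```

The mask is what keeps padding tokens from diluting the sentence embedding when inputs of different lengths are batched together.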
For continuous data, we employ normalization to scale the values to the range [0, 1], which helps the model parameters converge efficiently during training. In the context of patient-side disease prediction, $N = \{n_{\text{age}}, n_{\text{height}}, n_{\text{weight}}, n_{\text{duration}}\}$. For discrete data, we apply one-hot embeddings. For patient-side disease prediction, $C = \{c_{\text{gender}}, c_{\text{pregnancy}}\}$. Subsequently, both continuous and discrete data are encoded via a multi-head attention layer as follows:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O,$$
where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$.
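The two preprocessing steps that precede the attention layer can be sketched as follows. The text only states that continuous values are scaled to [0, 1]; min-max scaling is assumed here as one common way to do that, and the value ranges and vocabularies in the example are hypothetical.

```python
from typing import List, Sequence


def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Scale a value to [0, 1] given an assumed valid range [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(value, lo), hi)  # keep out-of-range inputs inside [0, 1]
    return (clipped - lo) / (hi - lo)


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Encode a discrete value as a one-hot vector over a fixed vocabulary."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if v == value else 0.0 for v in vocabulary]


# Hypothetical range for age and hypothetical vocabulary for gender.
age_scaled = min_max_scale(45.0, 0.0, 100.0)
gender_vec = one_hot("female", ["male", "female"])
```

Clipping is a design choice for the sketch: it guarantees the stated [0, 1] range even for values outside the range seen in training.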
After predicting the category, the model can narrow down the potential diseases, thereby simplifying the subsequent prediction task. The category with the highest score, denoted as $y'^{\max}_{\text{cate}}$, is selected as the candidate category. Subsequently, we apply a category-specific linear layer followed by a category-specific softmax to predict the specific disease $y'_{\text{dise}}$ within the chosen category; we denote these functions $\text{Linear}_{y'^{\max}_{\text{cate}}}$ and $\text{Softmax}_{y'^{\max}_{\text{cate}}}$, respectively.
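The second tier described above can be sketched as follows: pick the highest-scoring category, then run only that category's linear layer and softmax. This is an illustrative sketch under stated assumptions. The input vector `x` stands in for whatever patient representation the model produces, and the weights, disease names, and function names are invented for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

# A per-category head: (weights, bias, disease names). Hypothetical layout.
Head = Tuple[List[List[float]], List[float], List[str]]


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x: Sequence[float], weights: List[List[float]],
           bias: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def predict_two_tier(x: Sequence[float], cate_scores: Sequence[float],
                     heads: Dict[int, Head]) -> Tuple[int, str, List[float]]:
    """Select the top category, then predict a disease with that category's head."""
    best_cate = max(range(len(cate_scores)), key=lambda i: cate_scores[i])
    weights, bias, names = heads[best_cate]
    probs = softmax(linear(x, weights, bias))
    best_dise = max(range(len(probs)), key=lambda i: probs[i])
    return best_cate, names[best_dise], probs


heads = {1: ([[0.0, 1.0], [2.0, 0.0]], [0.0, 0.0], ["disease_a", "disease_b"])}
cate, disease, probs = predict_two_tier([1.0, 0.0], [0.2, 0.7, 0.1], heads)
```

Because each head only ranks the diseases of its own category, the softmax is taken over a much smaller label set than a flat classifier would need.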

Training.
After obtaining the category prediction $y'_{\text{cate}}$ and the disease prediction $y'_{\text{dise}}$, we define the training objective and loss function as follows:
Objective: Because different categories may contain overlapping diseases, an incorrect category prediction can still lead to a correct disease prediction. However, for the prediction chain to align with human cognition, we consider a prediction correct only if both the category and the disease are accurately predicted.
Loss function: To integrate the category prediction loss with the disease prediction loss, we utilize a weighted cross-entropy loss. Each term is a cross-entropy of the form $-\sum_{i=1}^{M} y_i \log y'_i$, and the two terms are combined using $\lambda$, where $M$ denotes the number of category (or disease) labels and $\lambda$ is a weight hyper-parameter.
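The objective and loss can be sketched as follows. The exact way the two cross-entropy terms are weighted is not recoverable from the text, so the convex combination `lam * L_cate + (1 - lam) * L_dise` below is an assumption, as are the function names and the default `lam`.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int, eps: float = 1e-12) -> float:
    """Cross-entropy for a one-hot target: minus the log of the true class probability."""
    return -math.log(max(probs[target], eps))


def combined_loss(cate_probs: Sequence[float], cate_target: int,
                  dise_probs: Sequence[float], dise_target: int,
                  lam: float = 0.5) -> float:
    """Assumed weighting: lam * category loss + (1 - lam) * disease loss."""
    return (lam * cross_entropy(cate_probs, cate_target)
            + (1.0 - lam) * cross_entropy(dise_probs, dise_target))


def is_correct(pred_cate: str, pred_dise: str,
               true_cate: str, true_dise: str) -> bool:
    """A prediction counts only if both the category and the disease match."""
    return pred_cate == true_cate and pred_dise == true_dise


loss = combined_loss([0.5, 0.5], 0, [0.25, 0.75], 1, lam=0.5)
```

Under this strict objective, a correct disease reached through the wrong category is still scored as an error.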

DATASET
To evaluate the effectiveness of PoMP, we created the Haodf dataset. We collected comprehensive patient-doctor consultation records including patient-side narratives across six prevalent disease categories, which were further classified based on their associated risk levels. These categories (see Table 1) include i) low-risk categories: Common Cold (Cold) and Pneumonia (Pneu.); ii) medium-risk categories: Diabetes (Diab.) and Depression (Depr.); and iii) high-risk categories: Coronary Heart Disease (CHD) and Lung Cancer (Lung.). To demonstrate the potential correlation between disease and patient demographics, we conducted an analysis of gender and age distributions across all six disease categories (see Figure 1a and Figure 1b). We observed distinct variations in susceptible populations across different diseases.

EXPERIMENT

4.1 Baselines
In our experiments, we compare PoMP against six widely adopted Natural Language Processing (NLP) models. These models are standard implementations of pre-trained language models (PLMs) including GPT2 [9], BERT [4], T5 [5], ALBERT [10], ELECTRA [2], and RoBERTa [7]. We apply the two-tiered classification approach outlined in Section 2.2.3 to make disease predictions with PoMP. For all baseline models, category predictions and disease predictions are performed independently.
We implement PoMP based on the SOTA Sentence Transformer (all-MiniLM-L6-v2). The training process utilized two NVIDIA Tesla V100 GPUs equipped with 32GB RAM. We have made both the dataset and the source code publicly available.

Results and Analysis
The results of category predictions and disease predictions are presented in Table 2. We utilize the hit rate (Hit@k) and the area under the precision-recall curve (AUC-PR) for evaluation, with the best performance highlighted in bold.
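For readers unfamiliar with the hit rate, Hit@k is the fraction of samples whose true label appears among the top k ranked predictions. A minimal sketch follows; the function name and the toy labels are chosen for the example and are not taken from the released code.

```python
from typing import List, Sequence


def hit_at_k(ranked_predictions: Sequence[Sequence[str]],
             true_labels: Sequence[str], k: int) -> float:
    """Fraction of samples whose true label is within the top-k predictions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)


# Each inner list is ordered from most to least likely.
ranked: List[List[str]] = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
truth = ["a", "a", "a"]
scores = [hit_at_k(ranked, truth, k) for k in (1, 2, 3)]
```

By construction Hit@k never decreases as k grows, so the gaps between Hit@1, Hit@3, and Hit@10 show how often the correct label sits just below the top rank.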
In category predictions, PoMP achieves the highest performance in terms of Hit@1 and Hit@3, while its AUC-PR scores are comparable to PLM baselines. The notably strong performance in category prediction further supports the two-tiered classification strategy we proposed.
In disease predictions, PoMP achieves the highest performance on all metrics compared with all baselines. Notably, the improvements relative to the second-best approaches are substantial: +17.3% for Hit@1, +18.0% for Hit@3, +17.6% for Hit@10, and +13.3% for AUC-PR. These findings highlight the efficacy of our two-tiered classification strategy.

Ablation Study
To demonstrate the significance of demographic information in disease prediction, we conducted an ablation study. This study compares PoMP to the vanilla Sentence Transformer model, which accepts text-only inputs. The results of category predictions and disease predictions are presented in Table 3. PoMP achieves better results in terms of Hit@1 and AUC-PR for category prediction. For disease prediction, PoMP achieves significantly better results on all metrics. Notably, the vanilla Sentence Transformer appears to suffer from limited discriminatory capacity, as indicated by the small performance gaps among Hit@1, Hit@3, and Hit@10 (increases of +7.2% and +5.6%). In contrast, PoMP exhibits larger performance disparities, with improvements of +11.8% and +9.5%, respectively.

CONCLUSION
In conclusion, we address the critical need for early disease prediction by introducing Personalized Medical Disease Prediction (PoMP), an innovative approach that leverages only patient-provided health narratives through a two-tiered prediction model. PoMP simplifies the process of connecting patients with appropriate medical specialists, representing a substantial advancement in making disease prediction more accessible and tailored to patient needs, thereby enhancing the efficiency of healthcare communication. To validate the effectiveness of PoMP, we collected extensive patient-doctor consultation records from the Haodf platform, encompassing a wide array of patient narratives detailing their conditions. We believe this work will lay a solid groundwork for future research in patient-side disease prediction.

Figure 1: Distribution of patient demographics across six categories.

Table 1: Statistics of the Haodf dataset.

Table 2: Category prediction and disease prediction results on the Haodf dataset.

Table 3: Ablation study results compared to vanilla Sentence Transformer.