Investigating Generalizability of Speech-based Suicidal Ideation Detection Using Mobile Phones

Speech-based diaries from mobile phones can capture paralinguistic patterns that help detect mental illness symptoms such as suicidal ideation. However, previous studies have primarily evaluated machine learning models on a single dataset, making their performance unknown under distribution shifts. In this paper, we investigate the generalizability of speech-based suicidal ideation detection using mobile phones through cross-dataset experiments using four datasets with N=786 individuals experiencing major depressive disorder, auditory verbal hallucinations, persecutory thoughts, and students with suicidal thoughts. Our results show that machine and deep learning methods generalize poorly in many cases. Thus, we evaluate unsupervised domain adaptation (UDA) and semi-supervised domain adaptation (SSDA) to mitigate performance decreases owing to distribution shifts. While SSDA approaches showed superior performance, they are often ineffective, requiring large target datasets with limited labels for adversarial and contrastive training. Therefore, we propose sinusoidal similarity sub-sampling (S3), a method that selects optimal source subsets for the target domain by computing pair-wise scores using sinusoids. Compared to prior approaches, S3 does not use labeled target data or transform features. Fine-tuning using S3 improves the cross-dataset performance of deep models across the datasets, thus having implications in ubiquitous technology, mental health, and machine learning.


INTRODUCTION
Suicidal Ideation (SI) is a significant public health concern that affects an estimated 12.4 million adults (5.0% of the US population) according to the National Survey on Drug Use and Health (NSDUH) [76]. SI refers to thoughts or contemplation of suicide. It is a serious mental health concern and a precursor to suicidal behavior [44]. The proliferation of ubiquitous technology has led to advanced screening methods for SI. Speech-based mobile systems are one such method: they collect audio recordings through smartphones and analyze paralinguistic patterns using machine learning (ML) [97,104].
However, previous research indicates that ML methods exhibit performance decreases under distribution shifts [81], raising serious safety concerns. In fact, most prior speech-based SI screening methods are only evaluated on a single dataset or specific populations owing to the high costs associated with conducting longitudinal studies in mental health, and privacy-related barriers to sharing data across institutions. It is imperative to understand the generalization capabilities of these methods before deployment. Accordingly, a well-known journal recently established best practices for implementing machine learning in healthcare, highlighting limited generalizability of models and their tendency to exacerbate biases in data [53], while a leading digital health journal emphasized the importance of independent validation [14]. Challenges in speech generalization for mental health arise from several sources. First, the incidence of SI varies depending on the population. For example, a clinical population reporting symptoms of persecutory ideation will have more individuals with SI than student populations [35]. Such differences can be observed across different clinical populations. Second, the machine learning methods used for pattern recognition might be highly sensitive to population characteristics and small datasets. Balancing intra-dataset and inter-dataset performance remains a fundamental challenge. Third, audio characteristics may vary across samples.
Consequently, the effect of these out-of-distribution shifts in speech-based SI detection is largely unknown to the research community, which is a significant gap. Thus, we sought to examine SI detection performance across four datasets. To this end, we collect three datasets, each investigating a mental illness or related symptoms: major depressive disorder, auditory verbal hallucinations, and persecutory ideation. In addition, we include the open-source StudentSADD dataset [104] for analysis. We first employ a consistent strategy to select data and extract features that allow fair evaluation. Subsequently, we examine the out-of-distribution generalization across the four datasets with a binary classification task: using speech to classify whether an individual has SI.
We systematically evaluate cross-dataset performance by first investigating dataset similarity qualitatively (t-SNE visualizations) and quantitatively (OTDD metric). Next, we assess within-dataset performance using machine learning (ML) and deep learning (DL) models in a stratified k-fold setup. After establishing strong baselines, we examine the performance of models when trained on one dataset and tested on another, referred to as one-one validation. Similarly, we experiment with leave-one-dataset-out validation, which completely holds out one dataset for testing and trains on the rest. Our results indicate poor generalization performance of ML and DL methods and emphasize the importance of choosing optimal data points for training. Thus, we apply UDA and SSDA approaches to improve generalization. While UDA methods apply feature transformations and instance weighting, SSDA methods rely on limited labeled target data in a contrastive or adversarial training setup. Both assume access to large unlabeled target datasets, making them less suitable for mental health.
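The two cross-dataset protocols described above can be enumerated with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the dataset labels are simply the short names used in this paper.

```python
from itertools import permutations

# Short dataset names as used in the paper.
DATASETS = ["MDD", "AVH", "PT", "Student"]


def one_one_splits(names):
    """Return (train, test) pairs: train on one dataset, test on another."""
    return list(permutations(names, 2))


def leave_one_dataset_out_splits(names):
    """Return (train_list, test) pairs: hold out one dataset, train on the rest."""
    return [([n for n in names if n != held_out], held_out) for held_out in names]


if __name__ == "__main__":
    for train, test in one_one_splits(DATASETS):
        print(f"one-one: train={train}, test={test}")
    for train, test in leave_one_dataset_out_splits(DATASETS):
        print(f"leave-one-dataset-out: train={train}, test={test}")
```

With four datasets, one-one validation gives twelve ordered train/test pairs, while leave-one-dataset-out gives four splits.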
In this paper, we propose an SSDA method termed sinusoidal similarity sub-sampling (S3) that works with smaller unlabeled target datasets. S3 selects an optimal subset from the source dataset to adapt to the target dataset, and it is computed as follows. First, we transform the source and target domain embeddings from a deep learning model (VGGish [45]) into sinusoidal signals. Next, we randomly generate an anchor sinusoidal matrix composed of many sine waves. Finally, the transformed embeddings are compared to the anchor through dot products to obtain pair-wise scores, which are used to select the best source subset in different ways, referred to as S3 variants. Intuitively, S3 extracts frequency information by comparing a series of sine waves. Using the subset for fine-tuning models results in better generalization performance than other methods.
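To make the three steps concrete, the following is a loose, illustrative sketch and not the authors' implementation. Several choices are assumptions made only for illustration: the element-wise sine transform, the frequency and phase ranges of the anchor waves, the cosine comparison of dot-product signatures, and the top-k selection rule. All function names are hypothetical.

```python
import math
import random


def to_sinusoid(embedding):
    """Map an embedding to a sinusoidal signal (element-wise sine: an assumption)."""
    return [math.sin(v) for v in embedding]


def make_anchor(n_waves, length, rng):
    """Build a random anchor matrix whose rows are sine waves.

    The frequency and phase ranges are arbitrary illustrative choices.
    """
    anchor = []
    for _ in range(n_waves):
        freq = rng.uniform(0.5, 5.0)
        phase = rng.uniform(0.0, 2.0 * math.pi)
        anchor.append(
            [math.sin(2.0 * math.pi * freq * t / length + phase) for t in range(length)]
        )
    return anchor


def signature(embedding, anchor):
    """Dot product of the transformed embedding with every anchor wave."""
    signal = to_sinusoid(embedding)
    return [sum(s * a for s, a in zip(signal, wave)) for wave in anchor]


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def select_source_subset(source, target, k, n_waves=16, seed=0):
    """Return indices of the k source samples that score highest against the target."""
    rng = random.Random(seed)
    anchor = make_anchor(n_waves, len(source[0]), rng)
    target_sigs = [signature(t, anchor) for t in target]
    scored = []
    for idx, emb in enumerate(source):
        sig = signature(emb, anchor)
        # Pair-wise scores against every target sample, averaged (an assumption).
        mean_score = sum(cosine(sig, t) for t in target_sigs) / len(target_sigs)
        scored.append((mean_score, idx))
    scored.sort(reverse=True)
    return sorted(idx for _, idx in scored[:k])
```

Note that no target labels are used anywhere in the sketch; only unlabeled target embeddings enter the scoring step.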

Contributions
To investigate the generalizability of speech-based SI detection methods, our contributions are as follows:
• We evaluate the performance of SI detection across four different datasets with three experiments: within-dataset (section 5.2), one-one (section 5.3), and leave-one-dataset-out (section 5.4) validation. Consequently, we benchmark several UDA and SSDA methods based on their effectiveness in handling the dataset shift. In general, our results indicate SSDA methods performed better than UDA approaches. To our knowledge, we are the first to validate speech-based SI detection on multiple independent datasets, elucidating previously unknown factors.
• We propose the sinusoidal similarity sub-sampling (S3) metric with a focus on improving generalization in the context of mental health, where target datasets are small and unlabeled. We observe that S3 outperforms UDA and SSDA methods in many cross-dataset scenarios. S3 obtained significant performance gains for the smallest dataset (n<50).
• We perform extensive post-hoc analysis to interpret important features for generalization across different populations. Our findings suggest spectral roll-off is crucial across two datasets but not the other, indicating that some acoustic heterogeneity may exist across datasets even among commonalities. Furthermore, we evaluate the robustness of UDA and SSDA approaches to help future researchers choose appropriate methods. Finally, our analysis on using S3 for acoustic scene classification indicates that it is well-suited for mental health, but not as a general audio SSDA method.

BACKGROUND & RELATED WORK

Ubiquitous Sensing and Mental Health
Passive sensing data from smartphones and wearables show strong potential for identifying individuals with mental illness [47,70,114]. StudentLife [113] was the first passive sensing Android application to assess mental health. Recently, several studies have used smartphones and mobile applications for depression detection [72,114]. For instance, Mullick et al. [72] collected data from 55 adolescents to predict depression; their findings highlight the utility of screen, call, and location-based features in improving performance. A study by Xu et al. [114] investigated the generalizability of various sensing data across different populations for depression detection, suggesting the need for improved methods that can be validated across independent datasets. Although several studies have addressed the problem of depression detection, SI has received limited attention. Horwitz et al. [47] investigated the prediction of suicidal ideation (SI) among medical interns using FitBit data on sleep and steps. They found that passively collected FitBit data did not enhance SI detection. They also acknowledged that better results were achieved when data collection was closer to the outcome rather than averaging sensing data over time. Sleep serves as a crucial biomarker for mental health, as demonstrated by Wang et al. [112], who discovered a connection between delayed bedtimes and self-reported concerns of potential harm and hallucinations in individuals with schizophrenia. Additionally, Abdullah et al. [1] explored circadian rhythm through sleep period markers to detect sleep deprivation and enhance overall well-being. To detect bipolar symptoms, Gruenerbl et al. [43] validated accelerometer-location sensors with bipolar patients from an Austrian psychiatric hospital. Their system achieved 72-81% accuracy in recognizing clinical states (depression/mania) and demonstrated high precision (96%) and recall (94%) in state-change detection.

Suicidal Ideation Detection
Various methods such as electronic health records (EHR) [7,111], functional MRI (fMRI) [54,64,73], video [60,94], social media text [26,27,49,77], and speech [9,16,22,24,104] can detect suicidal ideation. Rich longitudinal EHR data with information on diagnostic codes, laboratory results, and medications are particularly useful. For example, Barak-Corren et al. [7] used a Bayesian model with 15 years of EHR data to predict future suicidal behavior, observing that unconventional factors like back contusions can increase suicide risk. Similarly, Walsh et al. [111] used EHR data from 5167 participants to predict suicide attempts on a larger scale. They applied a random forest to predict suicide attempts within a seven-day window, achieving a 0.84 AUC. In brain imaging, Li et al. [64] found that voxel-wise concordance in parts of the brain can be used as a biomarker for SI in individuals with depression. Another study by Nawaz et al. [73] concluded that there was no evidence to support the association between SI and amygdala structural changes. Videos are another useful tool to analyze body and facial cues, and thus enable SI detection. For instance, Shah et al. [94] examined social media videos, suggesting that multi-modal information combined with shoulder and torso changes are important features for SI detection. Another study by Laksana et al. [60] examined facial behaviors and observed that smile-based descriptors are the most discriminative for SI detection.
Social networking websites such as Facebook, Twitter, and Reddit provide an anonymous space for individuals to share their suicidal thoughts. Many studies have sought to detect SI using text data scraped from these websites [26,27,49,77]. For example, O'dea et al. [77] extracted 14,701 tweets from Twitter and trained an SVM to classify highly concerning tweets automatically. Reddit is perhaps the most relevant platform for studying SI [26,27,119]. De Choudhury et al. [27] sought to forecast if a person talking about mental health online would transition into suicidal ideation discussions on Reddit. A similar study by De Choudhury and De [26] investigated self-disclosure, anonymity, and social support on Reddit mental health forums. Their results suggest that responses are surprisingly high quality and contain prescriptive advice, contrasting responses on Twitter. Text messages are also a useful modality for SI detection. While Nobles et al. [75] address the subtle problem of differentiating between suicidal and depression periods, Tlachac et al. [103] detect SI using less longitudinal data, i.e., predicting a particular week's SI using data from previous weeks.
Compared to many methods mentioned above, mobile phones enrich longitudinal diary studies and psychology research with their in-the-wild data collection capabilities [96]. Experience Sampling Methods (ESM) use these devices to trigger prompts for user data, providing researchers with timely information and reducing recall bias [59,62]. Speech-based methods have several advantages over traditional approaches: they are easy to use, relay real-time longitudinal data to researchers, and facilitate sharing of personal narratives [90]. Furthermore, audio diaries are discreet, time-saving, and capture authentic emotions. Their convenience promotes user compliance, allowing more open sharing of sensitive information [15,46].
Speech has emerged as an important active modality for SI screening, where machine learning models are used to learn patterns from paralinguistic features. In particular, the AVEC2013 [107] feature set is widely used for depression and SI screening [23,24,104]. Broadly, studies can be classified as those using data from clinical interviews [16] with long recordings, or smartphones with shorter in-situ recordings [9,104]. A study by Belouali et al. [9] investigated SI detection in veterans using voice recordings from Android tablets. They trained machine learning models on phonation, prosody, and glottal features of voice to obtain an AUC of 0.78. Another study [40] modelled emotions such as guilt and anger from phone conversations to detect SI, obtaining an AUC of 0.79. However, their dataset had only 31 participants. More recently, a study by Tlachac et al. [104] investigated speech patterns in over 300 students for SI detection. While their traditional machine learning methods used the AVEC2013 feature set, their deep learning model trained on unscripted audio obtained a balanced accuracy of 0.73. Most speech-based SI detection studies are evaluated on a single dataset or population, making them prone to poor cross-dataset generalization and data biases [14,53,81]. In contrast, we sought to understand performance across four datasets. Thus, our investigation differs from the previous studies in the following ways. First, to our knowledge, ours is the only study to validate speech-based SI detection on multiple independent datasets (Table 1). To this end, we analyze four datasets comprising clinical, non-clinical, and mixed populations. Second, our analysis of 786 participants is larger than previous studies, with highly varying positive SI samples ranging from 11% to 74%. Third, our studies use audio diaries collected from our Android application with the same core system, establishing a paradigm for scalable data collection. Identifying SI in near real-time is crucial to administering interventions. Mobile applications are better than high-burden clinical interviews in this regard. Furthermore, smartphones enable in-situ data collection, thus reducing costs and improving diversity by enrolling individuals from underrepresented communities. Audio diaries have some advantages over social media content analysis: they may be accompanied by contemporaneously collected ground truth, such as item 9 from the PHQ-9 [57,58].

Domain Adaptation
Evaluating models trained on a specific population against a different population under distribution shift is crucial for real-world deployment of speech-based mental health screening systems. The poor generalization performance in such scenarios can be alleviated through domain adaptation (DA) [29]. Given a source domain $\mathcal{D}_S$ and a target domain $\mathcal{D}_T$ with source and target joint probability distributions $P_S$ and $P_T$, respectively, DA assumes a distribution shift where $P_S \neq P_T$. Unsupervised domain adaptation (UDA) is well studied, and many methods have been proposed for computer vision (CV) and natural language processing (NLP) [33,39,100,105]. In UDA, we have a labeled source dataset $\mathcal{D}_S = \{(x_i^s, y_i^s)\}$ and a large unlabeled target dataset $\mathcal{D}_T = \{x_j^t\}$. Traditional feature-based DA techniques like subspace alignment [33] and transfer component analysis [80] transform features such that the latent spaces of the source and target domains are closer [33,80]. For deep learning models, adversarial domain adaptation has been widely studied [4,39,105,117]. These methods generally seek to build new feature representations for source and target data that a discriminator network cannot tell apart. In instance-based DA methods such as linear discrepancy minimization [67] and kernel mean matching [42], source data is re-weighted to minimize the distance between source and target joint distributions.
In semi-supervised domain adaptation (SSDA), we have a labeled source dataset $\mathcal{D}_S = \{(x_i^s, y_i^s)\}$, a small labeled target dataset $\mathcal{D}_{T_l} = \{(x_j^t, y_j^t)\}$, and a large unlabeled target dataset $\mathcal{D}_{T_u} = \{x_k^u\}$. Grandvalet and Bengio [41] proposed a method to adapt neural networks by minimizing the entropy on unlabeled target data, whereas Saito et al. [91] showed that using adversarial training to maximize the entropy followed by minimization improves the quality of discriminative features. Kim and Kim [55] introduce the concept of intra-domain discrepancy where target sub-distributions are unaligned and propose a three-step procedure for mitigation. Forgoing adversarial training, Singh [95] presents a contrastive learning framework that learns good representations through strongly augmenting unlabeled target data. Recently, Yu and Lin [115] proposed to denoise the source data by viewing it as a noisily-labeled version of the target data.
The performance of the above-mentioned approaches for speech and mental health remains largely unknown. Additionally, these methods assume access to large unlabeled target datasets to enable adversarial and contrastive training, which is uncommon in mental health. In contrast, our work proposes a sub-sampling method that selects an optimal source subset for fine-tuning toward the target dataset without the need for large datasets or target domain labels.

STUDY
Our analysis uses speech data from four studies, each examining mental illness in a specific population. We refer to these datasets as MDD, AVH, PT, and Student, referencing individuals with major depressive disorder, auditory verbal hallucinations, persecutory thoughts, and students, respectively. In this section, we describe the datasets (section 3.1), speech-based diaries (section 3.2), and ground truth (section 3.3).

Datasets
We use data from three of our studies (MDD, AVH, and PT) as well as an open-source dataset, StudentSADD. For brevity, we only discuss characteristics pertinent to this paper. We provide additional information about study protocols, collection prompts, and the Android application in the supplementary materials.

MDD.
The MDD [74] study aims to recruit 300 participants with Major Depressive Disorder (MDD) from across the United States. This study challenges two widely held assumptions about MDD. First, current diagnostic criteria assume that MDD symptoms are interchangeable, i.e., determining whether an individual has MDD based on their total PHQ-9 score. However, this method fails to acknowledge the vast variance in MDD symptom presentations across individuals [38]. In fact, MDD has over 1000 unique symptoms [21]. Second, current diagnostics assume that MDD symptom severity remains stable across weeks and even months. In contrast, MDD symptom intensity can vary substantially, even across a single day [31,37].
To address the above-mentioned issues, our study was designed with the goal of using passive sensing and EMA data to predict within-person changes in MDD symptoms, with the understanding that MDD is both highly variable across individuals and a changing system. Qualifying participants install our Android application for 90 days and answer three PHQ-9 surveys each day to facilitate within-day analysis of symptoms using smartphone data. After Item 9 in PHQ-9, the user can record an audio diary (see Fig. 1), resulting in a one-one mapping between audio recordings and PHQ-9 surveys. Note that recording audio diaries is completely optional, thus,

AVH.
and measuring distinguishing factors between the groups is challenging. Therefore, this study has two main contributions. Firstly, it uses the Research Domain Criteria (RDoC) [25] framework from the National Institute of Mental Health to investigate AVH on a spectrum from "normal" to pathological. Secondly, it utilizes a smartphone app to gather data through passive sensing, audio diaries, and momentary self-assessments, thus differing from traditional retrospective methods like interviews and surveys that can be prone to inaccuracies. Using the above-mentioned factors, the study aims to evaluate whether AVH experience and behavior differ across clinical and non-clinical individuals. For more details about the study, we refer the reader to [11]. We utilized the Hamilton Program for Schizophrenia Voices Questionnaire (HPSVQ) self-report to evaluate AVH [109]. In total, 384 participants met the recruitment criteria. Among them, 192 were female, 176 were male, and 12 identified as transgender (male to female and female to male). Four participants identified as another gender. Participants installed our Android app on their phone, which was designed to collect mobile sensing, audio diaries, and self-report ecological momentary assessments. We modified these tools to fit the needs of our study. During the 30-day study, we collected both active modalities, like audio diaries, and passive sensing data, like GPS, telemetry, and light data.

PT.
This study [13] aims to understand persistent harmful thoughts, called persecutory thoughts (PT). PT is prevalent in various mental health conditions, including mood, anxiety, personality, and neurodegenerative disorders. It is also found in healthy individuals. Similar to AVH, research supports a continuum of PT ranging from normal thoughts about danger to strong negative beliefs that disrupt daily life. Thus, our team sought to study the phenomenology of PT. The novel contributions of this study are as follows. First, we aimed to characterize how often people experience different aspects of paranoia-related thoughts, feelings, and actions in their daily lives. Second, we evaluate the link between these aspects and levels of clinical severity defined by treatment received. Third, we explored if people with greater functional disability exhibited similar paranoia experiences. Additional study details can be found in [13].
Here, we modify the Android application used in the AVH study to fit the needs of the PT study. We collected data from 231 individuals who experienced PT using the Revised Green Paranoid Thoughts Scale's ideas of persecution (R-GPTS) subscale [36]. The R-GPTS is a 10-item measure of persecutory ideation that was derived from the full-length Green Paranoid Thoughts Scale. Similar to previous studies, participants were recruited remotely through Google Ads. They were instructed to keep the smartphone with the Android application installed with them throughout the 30-day data collection period and complete brief questionnaires as prompted. They could contact a research coordinator for technical support if needed, and the research team would follow up if no information was received from their devices for three days. Participants received $125 as compensation for their participation, and the app was uninstalled at the end of the data collection period.

Student.
The StudentSADD [104] dataset (Student) aimed to study suicidal ideation and depression among students during COVID-19. For data collection, two aesthetically similar Android and Web applications were developed. The app collected PHQ-9 depression screening surveys, demographics, a typed reply, two voice recordings, and Twitter information from all participants. In total, 302 participants submitted their sessions. Detailed information about the StudentSADD study is described in [104]. The dataset consists of several feature sets extracted from raw audio for analysis. To protect participants' personally identifiable information, raw audio was not released by the authors. Note that we only use unscripted audio features for analysis because they closely match the data in the other studies. Therefore, we use 178 audio recordings with SI labels (see Table 2). Moreover, we do not use the Student dataset for deep learning analysis as it requires raw audio data.

Speech-Based Diaries
The Android application used in the MDD, AVH, and PT studies is inspired by mobile sensing systems used in other mental health and behavioral studies [10,113]. In addition, it has been tailored to suit the specific requirements of each study. To ensure strong adherence to the study, we utilize the following techniques for designing, integrating, and deploying the app. The app is designed to function in the background of mobile phones, automatically collect passive sensing data, and prompt the user to collect active modalities such as audio diaries through EMAs. The voice recordings use similar protocols. First, the diaries can be at most 180 seconds and are completely optional, thus reducing the likelihood of trivial submissions. Second, the prompts used for collection are unrelated to SI, consequently facilitating analysis of low-level paralinguistic features. Third, speech is unconstrained, i.e., we do not instruct the user to repeat a sentence or read a paragraph. Fourth, it is collected in-situ without any restrictions on the participant's location, resulting in heterogeneous speech signals. To summarize, we collect free-form speech from the participant's phone under naturalistic conditions. We envision that detection systems tested in noisy in-the-wild conditions can catalyze the development of robust SI intervention protocols.
While the core system remained the same across studies, some factors were tailored towards studying the primary population group. The ground truth collection for each study is as follows:
MDD. The app prompts the user three times a day with the PHQ-9 questionnaire. After Item 9, the user can record an audio diary as shown in Fig. 1.
AVH. Before an audio diary is recorded, the HPSVQ [109] is administered four times a day for 30 days, randomly between 9am-12pm, 12pm-3pm, 3pm-6pm, and 6pm-9pm. Note that the PHQ-9 is collected only once at baseline during on-boarding.
PT. The user is prompted four times a day semi-randomly between 9am-9pm to answer a 12-item survey measuring PT, cognitive appraisals, anxiety, self-esteem, sadness, sociality, energy, and presence of others. Note that the PHQ-9 is collected only once at baseline during on-boarding.
Student. In the StudentSADD study [104], the app prompts the user with a general question such as "Describe a good friend". The user has 30 seconds to record their unscripted voice sample.
Additional information regarding the studies, Android application, and audio diaries is presented in the supplementary materials.

Ground Truth
The Patient Health Questionnaire (PHQ-9) is a commonly used self-report tool for measuring the severity of depression with high validity and reliability [57,58]. The survey consists of nine items that the participant rates on a Likert scale ranging from 0-3. The responses correspond to "not at all," "several days," "more than half the days," and "nearly every day," respectively. Item 9 in the PHQ-9 screens for suicidal ideation, and it is a strong predictor of suicide attempts [89]. It asks how often the individual has been bothered by "thoughts that you would be better off dead, or thoughts of hurting yourself in some way?" Any answer other than 0 or "not at all" is considered SI. In the MDD study (section 3.1.1), the PHQ-9 scale was modified to range from 0-100, and any value over 24 is considered SI.
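The labeling rule above can be written as a small function. This is a minimal sketch: the function name and the `scale_max` parameter are hypothetical, and only the two thresholds come from the text.

```python
def si_label(item9_score, scale_max=3):
    """Return 1 if a PHQ-9 Item 9 response indicates SI, else 0.

    scale_max=3   : standard 0-3 Likert response; any non-zero answer is SI.
    scale_max=100 : modified 0-100 scale used in the MDD study; values over 24 are SI.
    """
    if scale_max == 3:
        if item9_score not in (0, 1, 2, 3):
            raise ValueError("Item 9 must be 0, 1, 2, or 3 on the standard scale")
        return int(item9_score != 0)
    if scale_max == 100:
        if not 0 <= item9_score <= 100:
            raise ValueError("Item 9 must be within 0-100 on the modified scale")
        return int(item9_score > 24)
    raise ValueError("scale_max must be 3 or 100")
```

For example, `si_label(1)` and `si_label(25, scale_max=100)` both return 1, while `si_label(0)` and `si_label(24, scale_max=100)` return 0.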
For fair evaluation, we ensure each participant has only one audio sample (1:1). As data collection frequency differs across studies, we describe the selection criteria here. All unscripted audio recordings with SI labels are used in the Student dataset. We use the following selection criteria for the MDD, AVH, and PT datasets: (1) audio samples must have a voice, measured using fundamental frequency $F_0 > 27.5$ Hz and a word count greater than 0, and (2) the audio length must be at least 30 seconds. In the MDD dataset, we select the longest audio recording the participant has submitted.
In the AVH and PT datasets, we choose the longest audio samples submitted in the first two days of the study.Demographic information and statistics of the dataset used for analysis are presented in Table 2.
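The selection criteria can be sketched as a filter followed by a per-participant maximum. The `Recording` fields, the function names, and the idea that each recording carries a single summary F0 value are assumptions made for illustration; only the thresholds and the longest-recording rule come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

MIN_F0_HZ = 27.5        # voicing threshold from the selection criteria
MIN_DURATION_S = 30.0   # minimum audio length from the selection criteria


@dataclass
class Recording:
    participant_id: str
    duration_s: float
    f0_hz: float      # hypothetical per-recording F0 summary
    word_count: int
    study_day: int    # 1-indexed day within the study


def is_eligible(rec: Recording) -> bool:
    """Check the voice, word-count, and length criteria."""
    return (
        rec.f0_hz > MIN_F0_HZ
        and rec.word_count > 0
        and rec.duration_s >= MIN_DURATION_S
    )


def select_one_per_participant(
    recordings: Iterable[Recording], max_study_day: Optional[int] = None
) -> Dict[str, Recording]:
    """Keep the longest eligible recording per participant.

    Pass max_study_day=2 for the AVH/PT rule (first two days only);
    leave it as None for the MDD rule (whole study).
    """
    best: Dict[str, Recording] = {}
    for rec in recordings:
        if not is_eligible(rec):
            continue
        if max_study_day is not None and rec.study_day > max_study_day:
            continue
        current = best.get(rec.participant_id)
        if current is None or rec.duration_s > current.duration_s:
            best[rec.participant_id] = rec
    return best
```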

Dataset Similarity
Prior to evaluating predictive performance across datasets, it is crucial to understand their similarities. Thus, we use t-distributed stochastic neighbor embedding (t-SNE) [108] to visualize data samples and optimal transport dataset distance (OTDD) [5] to quantify dataset similarity.
t-SNE [108]. t-SNE is a widely used dimensionality reduction technique for visualizing high-dimensional data as a low-dimensional embedding. The algorithm converts data similarities to joint probabilities and minimizes the KL-divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data [51].
OTDD [5]. Given two datasets with feature-label pairs, the distance between datasets is computed using optimal transport theory. The metric used to compute distances between features (e.g., Euclidean distance) is combined with the Wasserstein distance between label distributions (over features). This yields a transportation 'cost' between the datasets, which is optimized as the lowest cost to couple data samples.

Traditional Machine Learning
In mental health research, traditional machine learning methods are often preferred because of their interpretability and applicability to smaller datasets [30,104].Moreover, these methods could be a good benchmark for deep learning approaches.

Feature Engineering & Selection.
The Audio/Visual Emotion and Depression Recognition Challenge (AVEC) provides a comprehensive list of speech-based features used to detect mental illness [107]. Many studies on speech-based depression [28,71,79,88,104,110] and SI [104] use AVEC for analysis. We use the AVEC2013 feature set [107] to extract 2268 handcrafted features from the raw audio data using the openSMILE package [32]. Importantly, the Student study [104] released the AVEC2013 feature set instead of raw audio to protect participant privacy; thus, we extract the same features to evaluate cross-dataset performance. It is vital to reduce the feature set size to ensure optimal training. After extracting the 2268 features, we used the mutual information (MI) metric [56] to reduce the number of features. MI computes a non-negative value that signifies the dependence between the feature and the discrete binary label [56]. Larger values indicate more dependence and thus could be more useful for prediction.
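As a rough illustration of MI-based feature ranking, the sketch below discretizes each feature into equal-width bins and computes MI with the binary label from empirical counts. The binning is a simplification chosen to keep the example dependency-free and is not necessarily the estimator used in the paper; all function names are hypothetical.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins=4):
    """Discretize continuous values into n_bins equal-width bins (a simplification)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(feature_bins, labels):
    """Empirical MI (in nats) between a discretized feature and a discrete label."""
    n = len(labels)
    joint = Counter(zip(feature_bins, labels))
    count_x = Counter(feature_bins)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) written with raw counts.
        mi += p_xy * math.log(p_xy * n * n / (count_x[x] * count_y[y]))
    return max(mi, 0.0)  # guard against tiny negative rounding errors


def top_k_features(columns, labels, k):
    """Rank features (dict: name -> list of values) by MI and keep the top k names."""
    scores = {
        name: mutual_information(equal_width_bins(values), labels)
        for name, values in columns.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A feature whose bins perfectly separate a balanced binary label reaches the maximum of ln 2 nats, while a feature independent of the label scores 0.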

Machine Learning Methods.
In our analysis, we validated the performance of SI prediction using four machine learning approaches: (1) Support vector machines (SVM) [20]: a large-margin classifier capable of handling high dimensional data, (2) Logistic regression (LR) [48]: an effective statistical approach that assumes linearity, (3) Random forest (RF) [12]: a tree-based ensemble machine learning method that extends on decision trees (DT) through bagging, and (4) Extreme gradient boosted trees (XGB) [18]: a tree-based method that extends DT through boosting. For implementation, we first apply the mutual information metric to reduce the feature set. Next, the features are standardized. Finally, we perform a parameter search as described in Appendix A. Note that we describe train and test set splitting strategies in section 5.

Deep Learning
4.3.1 Architectures. In contrast to feature engineering in traditional ML models, deep learning methods automatically generate feature embeddings for classification. Importantly, the embeddings are low-dimensional representations of the input audio signal, thus enabling us to compute similarity metrics or apply DA approaches efficiently. In deep learning for speech processing, the raw audio data is transformed into a log mel-spectrogram for training. The VGGish [45] architecture is a multi-layer convolutional neural network model trained on the YouTube 8M dataset [2] for large-scale audio classification. It takes log mel-spectrograms as input and generates embeddings $Z \in \mathbb{R}^{T \times 128}$. In our analysis, we fine-tune (layers are frozen) the VGGish model, resulting in the following variants: (1) VGGish-Z. To leverage temporal dependencies from the VGGish embeddings $Z \in \mathbb{R}^{T \times 128}$, we fine-tune using an LSTM, resulting in embeddings $L \in \mathbb{R}^{1 \times 128}$. Next, two fully-connected layers take $L$ as input for SI classification. We refer to this as VGGish-Z because the "intermediate input" to the LSTM is the VGGish embedding $Z$. (2) VGGish-L. As the speech samples in VGGish-Z are variable-length sequences, we extract the LSTM embedding $L$ from VGGish-Z for use in different DA approaches.

Implementation Details.
We implement deep learning models using PyTorch, TensorFlow, and Keras. The models are trained for 500 epochs with a batch size of 32 using the categorical cross-entropy loss function with the Adam optimizer (lr = $1 \times 10^{-5}$). Moreover, to prevent overfitting, we use early stopping with patience=25 and model checkpointing that restores the best model weights. Additional information is presented in Appendix B.
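The combination of early stopping and checkpointing described above can be sketched in a framework-agnostic way. The class name `EarlyStopping` and its `step` method are hypothetical and do not correspond to any specific library API; the sketch only illustrates the patience logic, assuming validation loss is the monitored quantity.

```python
import copy


class EarlyStopping:
    """Track the best validation loss and signal when to stop training."""

    def __init__(self, patience=25):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None  # checkpoint of the best weights seen so far
        self.bad_epochs = 0

    def step(self, val_loss, state):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)  # checkpoint the model
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

After training stops, `best_state` holds the weights to restore, so the final model corresponds to the epoch with the lowest validation loss rather than the last epoch.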

Domain Adaptation
To improve generalization capabilities, many DA methods have been proposed, as discussed in section 2.3. While the specific implementation details are presented in Appendix B.2 & B.3, we briefly describe the UDA and SSDA methods used in our analysis:
(1) Subspace Alignment (SA) [33]: The method seeks to align the source and target domains by learning a mapping function between their respective subspaces. Thus, a transformation matrix is learned to align the source and target feature spaces. SA is a simple and effective method for domain adaptation, and using subspaces for out-of-distribution alignments has been explored in speech recognition [50,63].
(2) Linear Discrepancy Minimization (LDM) [67]: This is an instance-based DA method where the emphasis is on data rather than features. Here, the source data is re-weighted by minimizing the linear discrepancy between the two domains.
(3) Adversarial Discriminative Domain Adaptation (ADDA) [105]: This adversarial framework is trained as follows. First, a source encoder generates good features for the specific task on the source domain. Next, a task network is trained using the source encoder to learn the task. Finally, a target encoder is trained to deceive a discriminator network that attempts to distinguish between source and target data in the encoded space.
(4) Margin Disparity Discrepancy (MaDD) [117]: Zhang et al. [117] introduced MaDD for unsupervised DA as a method with theoretical guarantees. Empirically, the technique is modified into an adversarial learning problem to learn a new feature representation that minimizes the discrepancy between source and target domains.
(5) Attract, Perturb, and Explore (APE) [55]: APE consists of three procedures. First, the target distribution discrepancy is minimized to globally align the target sub-distributions. Second, these distributions are further perturbed to accommodate unaligned target distributions. Third, the exploration procedure locally modulates the class-centers to enable more perturbation into the unaligned regions.
(6) Contrastive Learning for DA (CLDA) [95]: The CLDA framework proposes (i) an instance contrastive alignment loss between the unlabeled target samples and their augmented versions, and (ii) an inter-domain contrastive alignment between the labeled source data and the predictions on unlabeled samples.
(7) Entropy Minimization (ENT) [41]: A network is trained to minimize the entropy on unlabeled target samples, thus clustering samples around a class center.
(8) Minimax Entropy (MME) [91]: This adversarial framework has two steps. First, the representative class sample (prototype) is updated by maximizing the entropy on the unlabeled target dataset. Second, the entropy is minimized to cluster features around the prototype, thus reducing the distance between the prototype and unlabeled samples.
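The quantity that entropy-based methods such as ENT and MME operate on can be made concrete with a short sketch. It only computes the mean prediction entropy over a batch of unlabeled target samples; the function names are hypothetical, and the sketch does not reproduce the training procedures of [41] or [91].

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_target_entropy(batch_probs):
    """Average entropy over unlabeled target predictions.

    Minimizing this value pushes predictions toward confident,
    well-clustered outputs; maximizing it has the opposite effect.
    """
    return sum(prediction_entropy(p) for p in batch_probs) / len(batch_probs)
```

A confident prediction such as [1.0, 0.0] contributes zero entropy, while a maximally uncertain binary prediction [0.5, 0.5] contributes log 2.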
4.5 Our Sampling Approach: Sinusoidal Similarity Sub-sampling (S3)
Motivation. We combine existing ideas in machine learning to address the unique challenges of mental health datasets.
(1) Small datasets: The methods highlighted in section 4.4 are tailored for large datasets. Notably, the mental health datasets presented in Table 1 have sizes below $10^3$, in contrast to the $10^5$ scale of DA benchmark datasets such as DomainNet [84]. S3 adopts a training-free, metric-based solution to select the best source samples for the target dataset, circumventing dataset size constraints. This idea stems from sample selection, which is built into approaches such as OTDD [5], MME [91], and ENT [41].
(2) Lack of labels & data-centricity: The expense of data labeling impedes robust evaluation of models in mental health and healthcare. Data-centric methods have gained traction over training-based solutions because of their ability to generalize well using only features [118]. For example, methods such as Simi-Feat [118] rely on computing metrics using only the features, followed by a clustering approach to group similar samples. Following this, S3 computes scores between source and target pairs based solely on their features to detect similar source and target samples. Importantly, our experiments indicate that approaches with built-in sampling and metric computations are more effective for mental health (see sections 6.4 & 7).
S3 is based on the following ideas in speech and signal processing. First, Fourier methods assume that a signal can be decomposed into several sinusoidal signals [68]. The short-time Fourier transform (STFT) computes the sinusoidal frequencies of a signal as a function of time, yielding a spectrogram [68]. Second, spectrograms can be scaled based on the human perception of sound, obtaining log mel-spectrograms that capture time-frequency dynamics from speech samples [19,98]. These are used as inputs in many large-scale audio classification models [45]. Hence, the embeddings generated by VGGish are latent spaces with time-frequency information. Using the above-mentioned principles, S3 computes a metric by comparing the source and target embeddings to an anchor matrix Λ composed of randomly generated sine waves. Importantly, we construct Λ based on two factors. First, we assume the frequencies of the sine waves lie between 80 and 250 Hz, covering the average range of the human voice [6]. Second, sine waves of different frequencies are orthogonal, which is analogous to vector orthogonality [92]. Consequently, the product of Λ with the source and target embeddings captures frequency information present in both datasets. Now, we formalize our approach.
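The two construction factors for the anchor matrix Λ can be illustrated with a small sketch. This is not the authors' construction: the sampling rate, the number of samples, the restriction to distinct integer frequencies, the row-per-wave layout, and the names `sine_wave`, `build_anchor`, and `dot` are all assumptions made so that the orthogonality property can be checked numerically.

```python
import math
import random


def sine_wave(freq_hz, n_samples, sample_rate):
    """Sample sin(2*pi*f*t) at t = k / sample_rate for k = 0..n_samples-1."""
    return [math.sin(2.0 * math.pi * freq_hz * k / sample_rate)
            for k in range(n_samples)]


def build_anchor(n_waves, n_samples=1000, sample_rate=1000, seed=0):
    """Return (frequencies, rows), one sine wave per row.

    Frequencies are distinct integers drawn from 80..250 (assumed Hz),
    the range quoted in the text for the human voice.
    """
    rng = random.Random(seed)
    freqs = rng.sample(range(80, 251), n_waves)
    return freqs, [sine_wave(f, n_samples, sample_rate) for f in freqs]


def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))
```

Under these assumptions, the inner product of two rows with different frequencies is numerically zero, while the inner product of a row with itself equals half the number of samples, which mirrors the orthogonality of basis vectors.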
Problem Statement.Given datasets from the source domain D  and the target domain D  with sample-label pairs and samples (x s , y s ) and x t , respectively, where x ∈ R  is an input audio signal of  seconds.S3 computes pair-wise scores  (x s , x t ) using the source and target embeddings of samples in D  and D  .
As described in Algorithm 1, we compute score  (x s , x t ) in three stages: (1) The embedding  ∈ R × of sample x from a VGGish model is transformed into a sinusoidal matrix Φ using equation 2. Φ is composed of  sinusoidals of length , where k   is the  ℎ column vector of  .Φ   and Φ   represent sinusoidal matrices for a source (  ) and target (  ) sample, respectively.
where Λ ∈ R × and Φ , Φ ∈ R ×.
Empirically, we observed that S3 captures directional information in different ways (Fig. 6). Therefore, the best fine-tuning subset for the target domain is selected in three ways (Algorithm 2). For every target sample x t , S3N and S3M select the optimal source sample using the smallest and largest pair-wise score, respectively. By relaxing the assumption that there is a one-to-one correspondence between source and target samples, S3R selects two samples: those with the smallest and largest pair-wise scores. In contrast to UDA approaches that emphasize feature transformations and sample re-weighting, S3 simply selects the optimal subset. Moreover, S3 does not require labeled target data, differing from other SSDA methods, which makes it suitable for smaller datasets in the mental health domain. We fine-tune the models using the best subset for 50 epochs using the same setup as the deep learning models, with early stopping and model checkpointing.
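The three selection rules can be sketched as follows, assuming the pair-wise scores have already been computed and stored as a nested list `scores[i][j]` for source sample i and target sample j. The function name `select_subset`, the de-duplication of repeated source indices, and the tie-breaking by first occurrence are assumptions of this sketch rather than details of Algorithm 2.

```python
def select_subset(scores, variant):
    """Pick source indices for fine-tuning from a score matrix.

    scores[i][j] is the pair-wise score between source sample i and
    target sample j. For every target sample, S3N keeps the source with
    the smallest score, S3M the one with the largest, and S3R keeps both.
    """
    if variant not in ("S3N", "S3M", "S3R"):
        raise ValueError(f"unknown variant: {variant}")
    n_source = len(scores)
    n_target = len(scores[0])
    chosen = set()
    for j in range(n_target):
        column = [scores[i][j] for i in range(n_source)]
        lowest = min(range(n_source), key=column.__getitem__)
        highest = max(range(n_source), key=column.__getitem__)
        if variant in ("S3N", "S3R"):
            chosen.add(lowest)
        if variant in ("S3M", "S3R"):
            chosen.add(highest)
    return sorted(chosen)
```

Because selection is driven only by the scores, no target labels are needed, and the resulting subset can be at most as large as the number of target samples for S3N and S3M and twice that for S3R.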

Characteristics of Datasets
To evaluate data similarity, we compute the OTDD across the datasets using the 2268 AVEC2013 features extracted from raw audio data.An important advantage of this approach is its ability to quantify similarity.From Fig. 2, we observe that student is dissimilar to more clinical datasets such as MDD, AVH, and PT.Furthermore, AVH and PT have the smallest OTDD of 2.7 × 10 28 .In fact, auditory verbal hallucinations and persecutory thoughts are common symptoms of psychotic disorders, including schizophrenia, bipolar disorder, and schizoaffective disorder [87,106].Capturing this "common-ness" using audio features motivates us to evaluate the predictive performance across these datasets.Furthermore, we observe that MDD is closer to AVH and PT.Recall that MDD is a clinical dataset, and its similarity to AVH and PT might be related to clinical sub-populations in these studies.It is worth noting that we refer to symptoms and not a diagnosis following the philosophy underlying the NIMH Research Domain Criteria (RDoC) [25,34].
We use t-SNE [108] to visualize our datasets as a low-dimensional embedding. In addition to indicating the similarity of data samples, the t-SNE plot also captures the variance in the dataset as a whole. Perplexity is an important t-SNE parameter that approximates the number of neighbors around a point. Thus, smaller and larger values emphasize local and global structure, respectively. As recommended in the literature [108], we considered perplexity values between 5 and 100. After visually interpreting many figures, we chose perplexity=50 and iterations=2000. We refer to areas of the t-SNE plot using (x, y) coordinates. From Fig. 3, we observe that the Student dataset forms a tight, low-variance cluster around (-30, 5), suggesting that it might be difficult to generalize from it without diverse model representations. Moreover, larger datasets such as PT and AVH have more variance and span the entire data spectrum, suggesting that they might be good candidates for training generalized models. As a smaller dataset, MDD will be a good candidate for testing; however, it might be difficult to train on, as its samples cannot adequately represent the other, larger datasets.

Evaluating within-dataset Performance
Establishing within-dataset performance baselines is a crucial prerequisite for evaluating generalization. Here, we train and test the models on the same dataset using stratified 5-fold cross-validation. From Table 3, we observe that logistic regression obtained a balanced accuracy of 0.62 for the MDD and Student datasets and poor results for AVH and PT. In contrast, VGGish-Z obtained a balanced accuracy of 0.62 and 0.68 for AVH and PT, respectively, and performed poorly for other datasets. However, we ask, "Are larger and balanced datasets better?". Recall that AVH and PT have 356 (SI=60.1%) and 209 (SI=74.6%) participants, respectively. From Table 3, we observe that AVH has lower balanced accuracy than PT (0.62 vs 0.68), suggesting that factors other than data size and balance are important. Perhaps AVH has a more heterogeneous cohort than PT, and capturing its characteristics is harder. In essence, data choice is a crucial component. Furthermore, we use the F1-score and recall to evaluate performance on positive SI detection. By comparing Tables 3 and 4, we observe that the models with the highest F1 scores align consistently with balanced accuracy. Specifically, LR achieves the best scores for MDD (0.23) and Student (0.35), while the VGGish-Z-based model achieves the highest scores for AVH (0.71) and PT (0.83). However, we notice a trade-off between balanced accuracy and recall. In particular, SVM attains the best recall scores for MDD (0.60) and AVH (0.86) but shows poor balanced accuracy of 0.51 and 0.49, respectively.
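The trade-off between recall and balanced accuracy noted above follows directly from how the metrics are defined. The sketch below computes the three metrics for binary labels (1 = SI, 0 = non-SI); the function name `si_metrics` is hypothetical.

```python
def si_metrics(y_true, y_pred):
    """Balanced accuracy, recall, and F1 for the positive (SI = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "recall": recall,
        "f1": f1,
    }
```

A classifier that labels every sample as SI reaches a recall of 1 but a balanced accuracy of only 0.5, because its specificity is 0. This is why high recall alone does not indicate a useful model.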
In summary, we observe that relatively small and homogeneous datasets such as MDD (clinical) and Student (non-clinical) can be modeled better using traditional ML models. In contrast, deep learning methods perform better on large heterogeneous datasets such as AVH (mixed) and PT (mixed). As the Student dataset's raw audio is not released to protect personally identifiable information, we do not test it with deep learning methods. As a first step toward evaluating generalizability, we train our models on one dataset and test them on another. We refer to this setup as one-one validation and represent our results as a matrix. From Fig. 4, we make several observations as follows.

One-One Validation
First, out-of-distribution performance is lower than within-dataset performance in many cases, as shown in Fig. 4. Perhaps the representations learned by deep methods are tuned for specific cohorts. From Fig. 4, we notice that models trained on MDD do not transfer well to other datasets. The size and balance of the MDD dataset is the most probable reason for this result. Moreover, we observe that large, diverse datasets such as AVH and PT generalize better without any adaptation or tuning. In particular, AVH is the best dataset for generalization, with balanced accuracies of 0.71, 0.57, and 0.53 for MDD, Student, and PT, respectively. Second, from Fig. 4 (d), (e), (g), (h), we observe that PT exhibits higher positive predictive power compared to other datasets. Utilizing traditional ML methods, PT achieves recall scores of 1, 1, and 0.93 for MDD, Student, and AVH, respectively. However, its precision is relatively low, as indicated by the corresponding F1-scores of 0.24, 0.35, and 0.73. Additionally, AVH generalizes well to PT, whereas the opposite is not true, as evidenced by the balanced accuracies of the three models. In summary, our findings underscore the significance of identifying samples that can generalize to the target data, which serves as a motivation for the development of S3.
Finally, S3 improves the generalization of deep learning models in many cases, in particular for AVH to MDD, AVH to PT, and PT to MDD (Fig. 4 (b), (c), (e), and (f)). Fine-tuning AVH for MDD using S3 variants improves balanced accuracy by Δ = 0.12 and F1 by Δ = 0.07 over a standard VGGish-Z deep model. In fact, these scores are better than the within-dataset performance for MDD (balanced accuracy: Δ = 0.12; F1: Δ = 0.17). These results suggest that S3 is well-suited to transfer performance from larger to smaller datasets. Moreover, notice that S3 yields performance improvements when AVH is fine-tuned for PT (balanced accuracy: Δ = 0.03; F1: Δ = 0.02). In contrast to methods that emphasize model tuning, S3 improves performance by choosing optimal samples for fine-tuning, thus highlighting the strengths of data-centric approaches.

Leave-one-dataset-out Validation
In this experiment, we evaluate generalization in leave-one-dataset-out (LODO) validation, where one dataset is used for testing, and all others are used for training.Here, we evaluate UDA and SSDA to improve generalization.
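The LODO protocol can be sketched as a simple generator over named datasets. The function name `leave_one_dataset_out` and the dictionary layout are hypothetical; the samples can be any objects (feature vectors, file paths, and so on).

```python
def leave_one_dataset_out(datasets):
    """Yield (held_out_name, train_samples, test_samples) for every dataset.

    `datasets` maps a dataset name to its list of samples. Each dataset is
    used once for testing while all remaining datasets are pooled for training.
    """
    for held_out in datasets:
        train = [sample
                 for name, samples in datasets.items() if name != held_out
                 for sample in samples]
        yield held_out, train, list(datasets[held_out])
```

With four datasets this produces four splits, and no sample from the held-out dataset ever appears in the corresponding training pool.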
From Table 5, we observe that, in five out of the eight cases, LODO further decreases the performance of traditional ML compared to the one-one validation benchmark (Figure 4). In particular, balanced accuracy decreases for MDD (Δ = −0.09), Student (Δ = −0.01), and PT (Δ = −0.02), and F1 scores decrease for MDD (Δ = −0.06) and PT (Δ = −0.08). Therefore, we fine-tuned the ML methods using linear discrepancy minimization (LDM) and subspace alignment (SA) to accommodate distribution shifts in the target domain. DA methods improved performance over traditional ML in 4 cases: balanced accuracy of MDD (Δ = 0.02) and PT (Δ = 0.02), and F1-scores of MDD (Δ = 0.02) and PT (Δ = 0.04). Moreover, logistic regression with LDM on AVH performs better on the SI class than the benchmark, as indicated by the 0.75 F1-score (Δ = 0.02). Surprisingly, a random forest with LDM obtained a balanced accuracy of 0.64, surpassing the within-dataset benchmark (Δ = 0.02). Logistic regression with SA and LDM performs reasonably well for adaptation. Perhaps the combination of a linear model, linear adaptation, and a small dataset is well-suited for SA. To summarize, our results suggest that domain adaptation works in 50% of the cases. Importantly, using AVH, a large, diverse dataset, as the source domain improves performance, indicating the importance of data choices.
In the base deep learning setup (Table 6), we observe that the smaller dataset (MDD: Δ = 0.16) benefits greatly from more diverse data, whereas larger datasets suffer in the LODO setup. Generalization performance decreases in two out of six cases: balanced accuracy of AVH (Δ = −0.11) and PT (Δ = −0.19). Consequently, we investigated UDA and SSDA methods to improve cross-dataset performance. From Table 6, we observe that adversarial UDA methods such as MaDD and ADDA do not improve generalization, mainly because adversarial training requires large datasets [3].
Among SSDA approaches, we notice that APE and CLDA are less effective than ENT and MME. APE obtained the best F1 score (0.77) on AVH; nevertheless, it does not perform well on other datasets. Similarly, CLDA generalizes poorly to other SI datasets. Perhaps the training procedures for contrastive learning and APE are not viable in the context of SI detection. From Table 6, we observe that ENT obtained the best balanced accuracy for MDD (0.63) compared to other SSDA baselines. Importantly, MME works across datasets, with balanced accuracies of 0.59, 0.51, and 0.53 for the MDD, AVH, and PT datasets, respectively.

POST-HOC ANALYSIS
6.1 Result Highlights by Measuring Robustness
We compile our analysis by measuring robustness, thus summarizing the performance of DA approaches across MDD, AVH, and PT. We disentangle accuracy from robustness using the notions of effective and relative robustness proposed by Taori et al. [102]. Effective robustness (ρ) quantifies whether the accuracy under distribution shift is better than what is expected from obtaining higher within-dataset accuracy. Given a model, we compute ρ using equation 5 as follows. First, a log-linear fit is computed on the base models without DA. Next, the slope and intercept are used to predict the expected values. Finally, the difference between these values and the accuracy with DA is computed. Relative robustness (τ) is computed as the accuracy difference between the base model and DA in the LODO setting.
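The three-step computation of effective robustness and the definition of relative robustness can be sketched as below. The sketch assumes that "log-linear" means a straight line fitted to log-transformed accuracies; the exact transform used in equation 5 may differ, and all function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_shift_accuracy(base_within, base_shift, within_acc):
    """Predict accuracy under shift from within-dataset accuracy.

    The line is fitted in log space (an assumption of this sketch) on the
    base models trained without DA.
    """
    slope, intercept = fit_line([math.log(a) for a in base_within],
                                [math.log(a) for a in base_shift])
    return math.exp(slope * math.log(within_acc) + intercept)


def effective_robustness(base_within, base_shift, within_acc, shift_acc_da):
    """Accuracy with DA under shift minus the accuracy the fit predicts."""
    expected = expected_shift_accuracy(base_within, base_shift, within_acc)
    return shift_acc_da - expected


def relative_robustness(shift_acc_base, shift_acc_da):
    """Accuracy gain of the DA model over its base model under shift."""
    return shift_acc_da - shift_acc_base
```

A positive effective robustness means the adapted model does better under shift than its within-dataset accuracy alone would predict, while a positive relative robustness means adaptation improved on the unadapted model.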
LDM with traditional ML emerged as a robust UDA approach for the MDD dataset with a  = 0.06 &  = 0.12 (Fig. 5).However, deep UDA approaches had poor generalization for the MDD and AVH datasets.From Fig. 5, we can also observe that RF is particularly robust for MDD and PT exhibiting positive  and .Moreover, we observe that S3 obtains positive robustness values across all datasets compared to other SSDA methods.In fact, it is the only method to achieve this in the AVH dataset.MME emerged as the most robust for PT ( = 0.048).Interestingly, most methods are not robust for AVH, whereas most methods are robust for PT.The most probable reason for this is that the effect of distribution shift is more drastic in PT than AVH.Thus, we expect more improvements in PT than AVH when applying a DA approach.

Closer Examination of S3
A deeper investigation into the factors that contribute to S3's effectiveness is necessary. As S3 samples data for fine-tuning, it is natural to question if performance gains are from additional fine-tuning rather than the chosen samples. Thus, we train with a random subset referred to as Random. From Table 6, we notice Random's performance deteriorates or remains the same in most cases (5 out of 6 cases), suggesting that samples selected by S3 are conducive to fine-tuning models to the target domain. Next, we compare the probability distributions of the source and target datasets with subsets selected by S3N, S3M, and S3R. Here, we first apply PCA to the VGGish-L embeddings to obtain a 2D latent space. Next, we use kernel density estimation to model the probability distribution of the
Table 7. Top-3 most correlated handcrafted features with learned features of VGGish + S3R. The significance of learned features with labels is computed using the Kruskal-Wallis test at p < 0.05. Spearman ρ is used to compute correlations between significant and handcrafted features, and it is presented in braces.

Interpretability
Understanding the learned features of the deep learning model is vital for SI detection applications. In particular, we want to interpret features that are important for generalizability; thus, we use models from the leave-one-dataset-out setting. To identify associations between the latent features of VGGish+S3R (sections 4.3 & 4.5) and traditional handcrafted features (section 4.2.1), we use statistical tests in two stages: (1) As interpreting variable-length sequences is non-trivial, we extract the output of our penultimate fully connected layer to obtain a 128-D vector for each input. Next, we apply the Kruskal-Wallis test (a nonparametric version of ANOVA) to identify significant learned features between the SI and non-SI groups. (2) We select the top three significant features and compute their Spearman correlations with the AVEC2013 handcrafted features described in section 4.2.1. We identified 3, 4, and 9 significant features (p < 0.05) for the MDD, AVH, and PT datasets, respectively. The 3 most significant features in each dataset exhibiting the most positive and negative correlations with the handcrafted features are presented in Table 7. Here, we observe that spectral roll-off is crucial for detection across both the MDD and AVH datasets. Interestingly, most spectral roll-off functionals are positively correlated, whereas mel-frequency cepstrum coefficients (MFCC) are negatively correlated. Notice that the psycho-acoustic sharpness variable in AVH is correlated with our deep learning embeddings, suggesting that our models can capture acoustic measures of mental health symptoms. Important features for detection in PT are different from those in MDD and AVH, suggesting that some acoustic differences are present across different populations.
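The two statistical stages can be sketched with standard-library code. The Kruskal-Wallis statistic below is specialized to two groups (SI vs. non-SI), omits the tie correction, and obtains the p-value from the chi-square distribution with one degree of freedom; these simplifications and all function names are assumptions of this sketch, and a statistics package would normally be used instead.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(group_a, group_b):
    """Return (H, p) for two groups, without tie correction."""
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)
    n_a, n_b, n = len(group_a), len(group_b), len(pooled)
    rank_sum_a, rank_sum_b = sum(ranks[:n_a]), sum(ranks[n_a:])
    h = (12.0 / (n * (n + 1))
         * (rank_sum_a ** 2 / n_a + rank_sum_b ** 2 / n_b)
         - 3.0 * (n + 1))
    # Survival function of chi-square with 1 degree of freedom
    p = math.erfc(math.sqrt(max(h, 0.0) / 2.0))
    return h, p


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(list(x)), average_ranks(list(y))
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (norm_x * norm_y)
```

In this setting, each of the 128 learned dimensions would be tested between the SI and non-SI groups in the first stage, and the dimensions that pass the significance threshold would then be correlated against each handcrafted feature in the second stage.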

S3: Mental Health vs. Other Domains
Recall that S3 is designed to address mental health tasks, which are characterized by small datasets and unlabeled target domains and therefore benefit from data-centricity. On the opposite end, we wanted to examine if S3 can extend to large-scale audio datasets. Thus, we build models to perform acoustic scene classification (ASC) using the TAU urban acoustic scenes 2019 mobile development dataset [69]. Given a 10s audio sample, the goal of ASC is to classify it into one of 10 classes (airport, indoor shopping mall, metro station, pedestrian street, public square, street, tram, bus, underground metro, urban park). We tackle sub-task b, which consists of three sub-datasets A, B, and C referring to data obtained from different recording devices/domains. Datasets A, B, and C consist of 14400, 1080, and 1080 samples, respectively. Importantly, to mirror mental health dataset sizes, we perform the same analysis with 10% of the dataset, i.e., A, B, and C consist of 1440, 108, and 108 samples, respectively. Evaluating the performance of SSDA approaches on the full and 10% datasets will highlight the advantages and disadvantages of S3 (Table 8). For brevity, training details are described in Appendix E.
Table 8 shows several observations relevant to understanding S3. While S3 performs reasonably well on the full dataset, CLDA is clearly the best-performing method. In contrast, S3 is the best-performing method on the smaller dataset (10% of the dataset). Other methods such as MME and ENT also demonstrate notable effectiveness on this reduced dataset, an observation also made in SI detection. In summary, S3 is not an optimal choice for broad, large-scale SSDA applications, but it proves to be effective in the mental health domain, which typically involves smaller datasets.

DISCUSSION
7.1 Summary of Results
In our within-dataset analysis (section 5.2), we find that deep learning and machine learning perform better with larger and smaller datasets, respectively. This finding has been observed by other studies examining the impact of dataset size on learning models [8,101]. In the StudentSADD study [104], the best-performing ML and deep models obtained accuracy of and 0.73, respectively. Similarly, we obtain balanced accuracy ranging from 0.62-0.68 for the different datasets, indicating the challenging nature of speech-based SI detection.
Using OTDD and t-SNE, we observed that AVH and PT are similar datasets, while Student is dissimilar to all other datasets. Generally, we assume similar datasets transfer better. Through our one-one validation, we make several observations. First, our assumption that similar datasets transfer better does not hold in general. While AVH showed minor generalization to PT, the inverse is not true. Second, transfer from larger to smaller datasets is better, as observed by other studies [116]. The observations above emphasize the need to choose the "right" data for adaptation, serving as a motivator for S3. Third, S3 variants performed the best, improving over VGGish-Z. In fact, using S3M with AVH for MDD obtained a balanced accuracy of 0.74, which is Δ = 0.25 higher than the within-dataset baseline.
Through leave-one-dataset-out validation (section 5.4), we evaluate the performance of testing on one dataset while training on all other datasets. We observe that distribution shift leads to decreased performance in many cases in both traditional machine learning and deep learning. Many studies have examined the effects of distribution shifts on performance and proposed DA and DG methods for mitigating them [39,78]. We employ some of these methods to investigate their effectiveness in alleviating this problem. Subsequently, we observed that using LDM with machine learning methods mitigated performance decreases in many cases. For deep learning methods, we noticed that adversarial UDA approaches such as ADDA are insufficient to improve performance. In summary, UDA approaches work reasonably well for traditional ML approaches compared to deep models.
Among SSDA methods, S3 improves cross-dataset performance in most cases over the baselines. Our analysis attributes these improvements to the specific design elements of S3, which are useful for mental health applications. Recall that S3 leverages a metric-based solution to subsample from the source dataset. While ENT and MME demonstrate improved cross-dataset performance, CLDA and APE are ineffective. CLDA's contrastive learning framework, which relies on strong augmentations and abundant unlabeled training data, is not suited to address our domain's limitations. Similarly, APE's demand for matrix uniformity to assess maximum mean discrepancy poses challenges. ENT's core principle is to use conditional entropy [41] to discern samples beneficial for domain adaptation. This mirrors S3's metric-driven optimal sample selection. MME's approach hinges on measuring sample "distance" from class prototypes. From a feature perspective, this is similar to computing the distance between anchor and source/target embedding in S3 (equation 4). Comparing the presence and absence of components of the SSDA baselines with S3 suggests that S3's design elements, such as subsampling, a preference for metric computations over feature transformations, and a data-centric focus, are paramount when dealing with mental health datasets. Importantly, our empirical analysis on acoustic scene classification (section 6.4) indicates that S3 excels with smaller datasets. This makes it particularly suitable for mental health contexts, where small and often unlabeled datasets are common and data-centric solutions are crucial.
We analyze the choice of the S3 variant by summarizing our analyses. Notice that S3M selects one optimal sample whereas S3R selects two, thus relaxing the assumption of only one optimal source sample. We observed that S3 variants improve generalization for one-one (section 5.3) and leave-one-dataset-out validation (section 5.4). Moreover, from Fig. 6, we see that S3M captures variance in one direction, while S3R accommodates larger subsets of data, suggesting that S3M and S3R are suited for single-domain and two-domain scenarios, respectively. Further investigation is necessary to evaluate S3R's effectiveness in multi-domain setups using many datasets. In our interpretability analysis, we find that spectral roll-off is an important feature that generalizes across MDD and AVH. Previous studies have suggested the effectiveness of spectral roll-off for detecting depression [66,99] and somatization disorder [86]. However, this feature was not important for PT, suggesting that differences exist between populations.

7.2
7.2.1 Computer
Collecting speech-based diaries from mobile phones is a crucial component of our study. Smartphones offer a fast and cost-effective way of administering interventions or monitoring at-risk individuals. Our investigation implicitly suggests that diverse smartphones instrumented with different microphones can feasibly detect suicidal ideation. Therefore, we can extend our study to provide just-in-time adaptive interventions (JITAI) in two ways. First, we can integrate our method into existing applications such as Talkspace to screen individuals experiencing a mental health crisis and connect them to mental health experts. Second, we can provide personalized screening, integration, and self-monitoring to individuals by tracking their mental health history. During monitoring, mobile phones can deliver longitudinal interventions such as cognitive behavioral therapy (CBT) or mindfulness-based stress reduction (MBSR).

7.2.2 Resource-constrained Populations. The under-detection of mental illness in resource-constrained environments is a common problem [52]. Many low and middle-income countries have a large psychiatric disease burden without sufficient resources [82]. Furthermore, many populations with SI are understudied. In such scenarios, data collection efforts lead to small imbalanced datasets, making computational modeling more challenging. Moreover, generating large labeled datasets is infeasible owing to large resource burdens. Our investigation directly addresses many of these challenges. For example, smaller target domains significantly benefit from optimal transfer using larger source datasets. These findings are important in many cases. For example, SI in understudied mental disorders such as body dysmorphia [85] can be analyzed with reduced data collection efforts. Similarly, extending our method based on socio-economic status and geography can help underrepresented groups. Following NIMH RDoC [25,34], we evaluate a common symptom across populations, with the potential to advance understanding of these symptoms at a level that is more general than that of typically-used diagnostic categories.

Data-centric Machine Learning.
Current machine learning research is model-centric, where improving predictive performance involves experimenting with new architectures, loss functions, optimization methods, etc. In contrast, data-centric machine learning refers to systematically engineering data in different ways to improve predictive performance. Here, the data is given more importance, and the models are assumed to be fixed. Recent work in this area focuses on covariate shifts and trustworthy data samples [17,93]. We believe that robust in-the-wild systems in speech-based SI detection via mobile phones could benefit greatly from data-centric approaches. S3 is data-centric as it selects samples from the source domain that are more likely to explain shifts in the target domain. Furthermore, we evaluate performance on unseen users, where the model has no prior information about the user or the domain in many cases. This aids translation to real-world applications and addresses the cold-start problem [65].

Challenges of Generalizability in Mental Health. Through comprehensive analysis of DA methods using multiple datasets, we achieve incremental improvement over many previous methods. Nonetheless, our results highlight that detecting rare symptoms in small mental health datasets with rare samples is challenging, with great room for improvement. Here, we suggest some future directions to directly tackle this unresolved issue. First, as data is limited, models may benefit from incorporating expert knowledge instead of a purely learning-based approach. Second, multi-modal methods could enhance the amount of data available. Moreover, different modalities can complement each other and thus improve generalization. Third, we could personalize models to investigate whether they work across people before we generalize across populations.

Limitations
Here we discuss the limitations of our study and the proposed S3 algorithm. First, while we investigate generalizability across four datasets, it is crucial to understand the trade-offs with generalization. A method that works in all cases is impossible and may not be desirable. In our studies, we make many efforts to reduce recruitment biases and include people representing the studied populations. Nevertheless, biases arise from many factors, including gender, race, geography, and socioeconomic status. Second, while the AVEC2013 feature set has been validated for affective computing and some mental illnesses, it is possible that feature engineering incurs some information loss that decreases predictive power.
S3 requires latent embeddings to compute pair-wise scores, which makes it suited for deep learning methods. However, applying S3 to traditional ML methods without latent spaces is not straightforward. We choose an S3 variant such that the number of samples selected equals the number of domains. However, we only investigate this situation in one- and two-domain settings. Future work will benefit from exploring multi-domain scenarios. Also, notice that selecting more samples with fixed domains will bring the subset closer to the whole dataset, which is undesirable. The effectiveness of S3 is reduced when transferring from smaller source domains to larger target domains. This is a challenging problem, and future work addressing this area of research is necessary. S3 is specifically designed with mental health datasets as its focus, so it is best understood within that context rather than as a generic domain adaptation method. Given its effectiveness with smaller datasets and its foundation in speech and signal processing, exploring its use in different areas could be valuable for future research. Such exploration might help determine the strengths and weaknesses of S3 for general audio data processing.

Ethical Considerations
We believe smartphone speech data for SI detection provides actionable insights for clinicians. However, ethical concerns must be addressed to ensure participant safety and privacy, and our processes are as follows. First, our studies involved at least two clinical psychologists or psychiatrists to address participant concerns. In the MDD study, if suicide intent was expressed (see Fig. 1), a message alerted the team, enabling immediate outreach by a psychiatrist/psychologist. Second, to ensure transparency and accountability when handling sensitive user data, informed consent processes were set. Here, we ensure participants understood how their data was being used through written content and/or custom-made videos. Third, participants are assigned a unique random user ID to protect their identity, and we avoid using demographic data or personally identifiable information (PII) for modeling to minimize unintentional biases and harm. The data is stored on servers with 2FA and only accessible to specific study team members.
The participants recruited in our studies are representative of the psychiatric symptom population. Moreover, we rigorously investigate S3 and other methods under out-of-distribution shifts across four datasets. Nevertheless, users of such systems should be aware of biases in the machine learning models. For instance, participants in our datasets are largely white females. Thus, the effectiveness of the models for minority groups should be taken into account prior to deployment. We envision our method as a screening tool that works in conjunction with a clinician. The expert will evaluate the detection and suggest appropriate intervention to ensure the individual receives adequate care. Thus, our method is not a substitute for diagnosis or treatment.

Takeaways and Suggestions
Some important findings from our study are as follows: (1) While most models exhibit poor generalization, models trained on very small datasets benefit from training on larger datasets. (2) SSDA methods are better suited for the mental health domain as large datasets required for UDA are seldom available. (3) S3 incrementally improves over SSDA approaches, indicating that a data-centric approach is useful. Nevertheless, generalizability in mental health remains unresolved. (4) Studies benefit from using measures such as effective and relative robustness in addition to accuracy. HCI and UX researchers should make use of spectral roll-off in design because it is important for SI detection.

CONCLUSION
In this paper, we examined the generalizability of speech-based suicidal ideation detection using multiple datasets, including users from different populations. Our analysis indicates that many domain adaptation methods do not generalize well in in-the-wild settings, particularly approaches that require large target datasets. Furthermore, the generalizability of models depends on selecting the "correct" source data for training. Thus, we proposed sinusoidal similarity sub-sampling (S3), which computes pair-wise similarity scores between the source and target domain to select a subset of data for fine-tuning models. Fine-tuning deep models using S3 improves generalizability compared to other deep learning methods across many scenarios. As S3 does not require target labels, it improves generalization considerably on the smallest dataset, suggesting its effectiveness for mental health tasks. In the post-hoc analysis, while two datasets had common important features (spectral roll-off), one dataset had distinct important features, indicating some heterogeneity across populations. Our findings have important practical implications for deploying ubiquitous technology in mental health using machine learning. We hope our work contributes to future research addressing pragmatic challenges in mental health systems, such as distribution shifts, imbalanced datasets, and fine-tuning, ultimately improving mental health screening systems to give individuals the best care possible.

A TRADITIONAL MACHINE LEARNING
We perform hyperparameter search with the following choices:

A.1 UDA
UDA approaches such as LDM and SA were implemented using the adapt Python package. We first perform hyper-parameter tuning as described above. Next, we choose the best model and apply SA or LDM.

B DEEP LEARNING ARCHITECTURES

B.1 Base and S3
We implement Base and VGGish-Z using TensorFlow and Keras with the architecture shown in Table 9. The models are trained for 500 epochs with a batch size of 32 using the categorical cross-entropy loss function with the Adam optimizer (lr = 1 × 10^-5). Moreover, to prevent overfitting, we use early stopping with patience = 25 and model checkpointing that restores the best model weights. In S3, we fine-tune the models using the best subset for 50 epochs using the same setup as the deep learning models, with early stopping and model checkpointing.
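The early stopping rule described above can be illustrated with a framework-free sketch. It is not the Keras callback itself: the function name `run_early_stopping` and the simulated validation losses are hypothetical, and the sketch only assumes that training halts once the monitored loss has failed to improve for `patience` consecutive epochs, with the best epoch standing in for the restored checkpoint.

```python
def run_early_stopping(val_losses, patience=25, max_epochs=500):
    """Return (best_epoch, last_epoch) for a sequence of validation losses.

    best_epoch marks where a checkpoint of the weights would be kept;
    last_epoch is the final epoch that actually ran.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0          # epochs since the last improvement
    last_epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # stop training; weights from best_epoch are restored
    return best_epoch, last_epoch


# Simulated losses (hypothetical): improvement for three epochs, then a plateau.
simulated = [1.0, 0.8, 0.7] + [0.75] * 40
best_epoch, last_epoch = run_early_stopping(simulated)
```

With patience = 25, the simulated run keeps the weights from the last improving epoch and stops 25 epochs later, well before the 500-epoch budget is used.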

B.2 UDA
UDA approaches such as MaDD and ADDA were implemented using the adapt Python package with Keras and TensorFlow backend. The inputs to these models are LSTM outputs that capture temporal information from variable-length time series. We use the following hyperparameters for the approaches. For MaDD, we used the  11) networks with fully-connected (FC) layers. The model was trained using categorical cross-entropy loss with a batch size of 16 and MaDD gamma parameter = 2 for 100 epochs. For optimization, we use stochastic gradient descent with a learning rate (lr) = 0.04 on the encoder and predictor. Furthermore, a learning rate scheduler was applied to reduce lr by one-tenth with momentum and alpha of 0.9 and 0.0002, respectively. Finally, we include early stopping criteria for discriminator loss with patience = 10. ADDA uses the same setup as MaDD. It is worth noting that we implemented the same setup with a 1D ResNet instead of FC layers. However, we did not observe performance improvements.

B.3 SSDA
We implemented APE, CLDA, ENT, and MME using PyTorch. Furthermore, it should be noted that CLDA, APE, and ENT are built on top of the MME implementation. We use 10% of the target domain as the unlabeled dataset. SSDA consists of two networks, the encoder (Fig. 7a) and the predictor (Fig. 7b). The inputs to these models are the LSTM outputs that capture temporal information from variable-length time series. The following hyper-parameters are evaluated on these networks: linear in_features=[128, 64, 32, 16] and dropout p=[0.5, 0.4, 0.3, 0.2]. Overall, the models are trained for 200 epochs with batch size=10 using cross-entropy loss. Furthermore, the G (lr=0.01) and F1 (lr=1.0) networks are optimized using stochastic gradient descent with momentum=0.9 and weight decay=0.0005. Now, we discuss specific details for each approach. For ENT and MME, we use temperature=0.05 and eta=0.01. In APE, we extract the normalized output after the first linear layer for computation. Furthermore, we sampled the source data to match the size of the target unlabeled data, a requirement for MMD computation. In CLDA, as 2D augmentation cannot be applied to our data, we investigated adding standard Gaussian noise and uniform noise. Ultimately, we used uniform noise to generate negative samples for training.
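The two data-preparation steps mentioned above (holding out 10% of the target domain as unlabeled data, and sub-sampling the source so that both sets have equal size for MMD computation in APE) can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the helper `make_ssda_splits`, the fixed seed, and uniform sampling without replacement are all assumptions, and integer IDs stand in for recordings.

```python
import random


def make_ssda_splits(source, target, unlabeled_fraction=0.10, seed=0):
    """Pick an unlabeled target subset and a size-matched source subset.

    Assumption: both subsets are drawn uniformly at random without
    replacement; the original sampling scheme is not specified here.
    """
    rng = random.Random(seed)
    n_unlabeled = max(1, round(unlabeled_fraction * len(target)))
    target_unlabeled = rng.sample(target, n_unlabeled)
    # Match the source size to the unlabeled target size (needed for MMD).
    n_source = min(len(source), n_unlabeled)
    source_matched = rng.sample(source, n_source)
    return source_matched, target_unlabeled


# Hypothetical sample IDs standing in for source and target recordings.
source = list(range(200))
target = list(range(1000, 1050))
source_matched, target_unlabeled = make_ssda_splits(source, target)
```

Matching the two set sizes keeps the kernel matrices in an MMD estimate square and balanced, which is the requirement the text refers to.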

Fig. 1. Example Android application screens from the MDD study: (a) The PHQ-9 Item 9 question, (b) The user submits a high Item 9 score and is redirected to a safety question which provides immediate support, (c) A direct link to call emergency services if the user is currently experiencing SI, (d) The user can optionally submit an audio, video, or text diary, (e) The audio diary screen where the user can submit a recording up to 180 seconds. The safety protocols employed are described in section 7.4.

To compute the anchor sinusoidal matrix Λ, we sample k anchor frequencies f ∈ R^(1×k), where each value f ∈ [0, 2π]. Next, Λ is generated using equation 3. The source (Φ_s) and target (Φ_t) sinusoidal matrices are multiplied with the anchor matrix (Λ), and the outputs are multiplied with each other and aggregated to compute the scalar score γ, as shown in equation 4.

Algorithm 1. Computing S3 scores
Input: Source dataset D_s = {(x_i, y_i)}, i = 1, ..., n; target dataset D_t = {x_j}, j = 1, ..., m; VGGish model with pre-processing; vector of anchor frequencies f.
Output: Pair-wise scores Γ ∈ R^(n×m)
initialize weights
for i = 1 to n do
  for j = 1 to m do
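The scoring procedure described above can be sketched in plain Python. The exact forms of equations 2 to 4 are not reproduced here, so the sinusoid constructions below are assumptions made only for illustration: each embedding is expanded as sin(z_p · f_q), the anchor matrix as sin(f_p · f_q), and the aggregation is a sum over the element-wise product. Random vectors stand in for VGGish embeddings, and all function names are hypothetical.

```python
import math
import random


def sinusoidal_matrix(z, freqs):
    # Assumed form: entry (p, q) is sin(z[p] * freqs[q]); shape d x k.
    return [[math.sin(v * f) for f in freqs] for v in z]


def anchor_matrix(freqs):
    # Assumed form: entry (p, q) is sin(freqs[p] * freqs[q]); shape k x k.
    return [[math.sin(a * b) for b in freqs] for a in freqs]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def pair_score(z_s, z_t, freqs, anchor):
    # Project both sinusoidal matrices onto the anchor, multiply the
    # outputs element-wise, and aggregate by summation into one scalar.
    proj_s = matmul(sinusoidal_matrix(z_s, freqs), anchor)
    proj_t = matmul(sinusoidal_matrix(z_t, freqs), anchor)
    return sum(x * y for rs, rt in zip(proj_s, proj_t) for x, y in zip(rs, rt))


def pairwise_scores(source, target, freqs):
    anchor = anchor_matrix(freqs)
    return [[pair_score(zs, zt, freqs, anchor) for zt in target] for zs in source]


rng = random.Random(0)
k, d = 4, 3  # number of anchor frequencies, embedding size (toy values)
freqs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(k)]
source = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(5)]  # n = 5
target = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(2)]  # m = 2
scores = pairwise_scores(source, target, freqs)  # n x m score matrix
```

Under these assumptions the score is symmetric in its two arguments, and the resulting n × m matrix has one row per source sample, which is the shape a subset-selection step would consume.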

Fig. 2. Pair-wise dataset distances (divided by 10^25) computed using optimal transport dataset distance (OTDD). Smaller values indicate the datasets are more similar. AVH and PT datasets are similar to each other, whereas the Student dataset is dissimilar to all other datasets.

Fig. 4. One-One evaluation with balanced accuracy (top row), F1 (middle row), and recall (bottom row). Best refers to the top performing method. The scores for each model/method are shown in Appendix C.

Fig. 5. Measuring effective robustness (top row) and relative robustness (bottom row) of UDA and SSDA baselines, and S3. Positive values indicate useful robustness to distribution shifts.

Fig. 6. Kernel density estimate plots describing the distribution of the data chosen by S3N, S3M, and S3R. Source and target domains are represented by the gold and blue contours, respectively. A compact fit to the target domain is more desirable. For example, in subfigure (a), we observe that S3R is compact and encompasses MDD better than S3N and S3M.

Table 1. An overview of suicidal ideation studies using speech.
Chakravarthula et al. [16] examined suicide risk factors in military couples with acoustic features, embeddings, and lexical cues. Using these features, they trained an SVM to predict categories of suicidal risk such as none, ideation, and attempt, with an average recall of 0.6. Similarly, Stasak et al. [97] investigated manually annotated voice quality and speech disfluency measures in 246 individuals with and without SI. Their findings suggest that the SI group has fewer hesitations and speech errors on average compared to the suicide attempt group.

Table 2. Demographic descriptors of participants in our analysis. Note that Pacific Islander and Hispanic/Latino individuals are grouped under Other. (Demographics for one MDD participant are unavailable.)

Table 3. Within-Dataset performance using balanced accuracy.

Table 4. Within-Dataset performance using F1-score and recall.

Table 5. Leave-one-dataset-out validation using traditional learning methods.
Δ = −0.15, respectively. Moreover, notice that deep models have more severe reductions than ML methods.

Table 6. Leave-one-dataset-out validation using deep learning, UDA, and SSDA approaches.
Table. Investigating adaptation for acoustic scene classification. Within-Dataset refers to training and testing on the same dataset, whereas all other methods use validation. The three domains A, B, C refer to different recording devices. Metrics are presented as the unweighted mean of the individual class metrics. Recall and precision are reported in Appendix E.