MoodCapture: Depression Detection Using In-the-Wild Smartphone Images

MoodCapture presents a novel approach that assesses depression based on images automatically captured from the front-facing camera of smartphones as people go about their daily lives. Over 90 days, we collect more than 125,000 photos in the wild from N=177 participants diagnosed with major depressive disorder. Images are captured naturalistically while participants respond to the PHQ-8 depression survey question: "I have felt down, depressed, or hopeless". Our analysis explores important image attributes, such as angle, dominant colors, location, objects, and lighting. We show that a random forest trained with face landmarks can classify samples as depressed or non-depressed and predict raw PHQ-8 scores effectively. Our post-hoc analysis provides several insights through an ablation study, feature importance analysis, and bias assessment. Importantly, we evaluate user concerns about using MoodCapture to detect depression based on sharing photos, providing critical insights into privacy concerns that inform the future design of in-the-wild image-based mental health assessment tools.


INTRODUCTION
Today, most people automatically unlock their phones using camera biometrics and face recognition. The front-facing camera quietly captures glimpses of users' faces tens to hundreds of times daily, week in and week out. Unlike selfies, these in-the-moment images capture authentic, unguarded facial expressions, free from biases such as social desirability and self-presentation. We envision a future where AI processes these unguarded facial images on the phone in real-time using deep learning, assessing the user's mood without needing the images to leave the device, thus safeguarding privacy. This low-burden, continuous approach to depression assessment and detection will significantly alter how mental health is passively assessed, enabling early detection of depression, timely intervention, and constant evaluation of individuals at risk. This paper discusses the first steps toward realizing this vision.
Depression is a complex and pervasive mental health issue affecting millions of people worldwide. According to the World Health Organization (WHO), over 264 million people suffer from depression [56], making it a leading cause of disability and a major contributor to the overall global burden of disease. The consequences of depression extend beyond emotional distress [62], significantly impacting physical health [23,55], social relationships [67], and occupational functioning [19]. In severe cases, depression can lead to suicide, accounting for nearly 800,000 deaths each year [1,9]. The need for early detection and intervention in depression is critical, as timely identification of the condition allows individuals to access appropriate treatment and support, thereby improving clinical outcomes and reducing the risk of long-term complications [21,63].
Smartphones offer an opportunity to explore alternative approaches for depression detection that are more objective, unobtrusive, and continuous. The vast amounts of data generated through daily smartphone usage, including images, text messages, and social media interactions, provide a rich and ecologically valid source of information that can be utilized to gain insights into an individual's mental state. Consequently, several studies have made use of smartphone sensing data to assess depression [14,75].
Most of the prior research utilizing facial images to detect depression focuses on capturing these images in controlled settings, where individuals may be instructed to perform specific actions [22,38,43,46]. These face features are not authentic as they are performative and are influenced by biases such as social desirability and self-presentation. Furthermore, traditional methods such as clinical assessments and subjective self-reports are time-consuming and affected by recall bias. Advances in smartphone cameras offer a solution to address these disadvantages. To this end, we present MoodCapture, a novel approach to collect in-the-wild face images and self-reported depression symptoms in natural, everyday environments using smartphones. The resulting face images capture authentic and unguarded facial expressions, thus minimizing the influence of self-awareness on emotions and enhancing the credibility of our data. By using such naturalistic images for analysis and training machine and deep learning models, we can better understand intricate patterns associated with depression. Ultimately, insights from our work can be used to create accurate, efficient, and personalized tools for depression detection.
Our paper contributes to the growing intersection of Human-Computer Interaction (HCI) research and mental health assessment by investigating the potential of machine learning and deep learning models trained using in-the-wild smartphone images for identifying depressive symptoms. We collected over 125,000 images from N=177 participants diagnosed with major depressive disorder over three months, utilizing 87 distinct types of Android devices owned by users in the study. On average, each participant provided six photos per day, creating a varied and extensive dataset. We comprehensively analyze various image characteristics obtained from these images captured in-the-wild. We evaluate the performance of machine learning and deep learning models trained to predict depression based on these images, as shown in Figure 1. At the end of the study period, we assess user acceptance by inquiring about participants' comfort levels and privacy concerns in sharing their photos for mental health assessment purposes. Therefore, our research aims to foster the development of more ethically sound mental health assessment and intervention tools. The contributions of our work are as follows:
• We develop a passive-sensing image-based mobile app called MoodCapture that automatically collects in-the-wild smartphone images from participants' front-facing cameras, ensuring an unobtrusive data collection process and maintaining user privacy. Compared to prior studies, our application captures front-facing photos in-the-moment, resulting in naturalistic images with authentic emotions. Our app provides valuable insights for future in-the-wild studies.
• We analyze different image characteristics such as illumination, location, phone angle, background color, and objects, providing insights into the visual properties of smartphone images. For example, the majority of the images were taken indoors in well-lit environments. These properties are crucial for model training and inform HCI practitioners about the environmental conditions in user interactions.
• We evaluate the performance of several machine learning and deep learning models for depression detection and PHQ-8 score prediction. A random forest trained with 3D face landmarks demonstrates the feasibility of analyzing depression from in-the-wild smartphone images, resulting in a balanced accuracy of 0.60, Matthews Correlation Coefficient (MCC) of 0.14 and Mean Absolute Error (MAE) of 130.31 (a 6% improvement over baseline on a 0-800 scale). Furthermore, we identify important features providing useful insights for HCI design.
• We report on user acceptance with respect to the comfort levels of the participants in sharing their photos for mental health assessment, providing valuable insights into privacy concerns that inform the future design of in-the-wild image-based mental health assessment tools.
In addition to its relevance to the HCI community, our MoodCapture study contributes to affective computing, which deals with recognizing, interpreting, and simulating human emotions. By leveraging computational methods and machine learning models to interpret emotional cues from images, our research contributes to the understanding and development of affective computing within the HCI field. Furthermore, our study has tangible, real-world implications, such as the potential benefits of early depression detection, timely interventions, improved clinical outcomes, and overall wellbeing for individuals.
This paper is structured as follows: Section 2 presents related work in depression detection and work that uses smartphone images. Section 3 details the MoodCapture study, participant demographics, and the analysis we perform to identify image characteristics and to detect depression. Section 4 discusses our results, while Section 5 describes the ethical considerations and user acceptance study. Section 6 discusses the study findings and their implications. Finally, Section 7 and Section 8 discuss the limitations of the study and provide some concluding remarks, respectively.

RELATED WORK
In this section, we delve into the pertinent literature, examining the key studies and developments in the field that inform the foundation of our MoodCapture research.

Smartphones and Mental Health
Depression has been traditionally diagnosed through clinical interviews or self-reporting questionnaires such as the Beck Depression Inventory (BDI) [7] and the Hamilton Depression Rating Scale (HDRS) [37]. However, these tools are affected by the individuals' subjective recollections, social desirability bias, mental health stigmas, or the person's diminished self-awareness [18,33,69]. Therefore, the pervasive, objective, and continuous nature of multifaceted smartphone data makes it an ideal candidate for unobtrusive depression detection. Many studies evaluate patterns in call logs, text messages, GPS coordinates, and overall smartphone activity, to gain insights into behavioral shifts, social engagement frequencies, and alterations in daily routines, all of which can serve as indicators of deteriorating mental health [14,53,75,77]. Other modalities such as speech have also gained traction in evaluating mental health symptoms such as suicidal ideation [8,59]. The growth of social media platforms provides ways to harness user-generated content for depression detection. In particular, analytical approaches using text and images have been applied to content from platforms like Facebook and Instagram. For instance, the linguistic attributes of posts can shed light on a user's emotional state, sentiment, and overall mental well-being [12,17]. Moreover, machine learning algorithms have been employed to decipher patterns and indicators of depression from visual content shared on these platforms. Such analyses often encompass aspects like colors, objects, scenes, and overall aesthetics [26,30,61].

Figure 1: MoodCapture Framework: Users answer the PHQ-8 depression survey questions using the MoodCapture Android App while the app takes bursts of photos using the front-facing camera on the smartphone (top-left). Image characteristics are analysed using factors such as illumination, indoor vs. outdoors, phone angle, dominant image color, and background objects (top-right). Given that raw images compromise privacy, these characteristics provide insights into the types of features our machine learning and deep learning models infer. Finally, OpenFace features are extracted to train machine learning models, while raw images are used to train deep learning models (bottom). Depression classification is a binary predictor that classifies an image as depressed or not depressed, whereas PHQ-8 score prediction is a regression model that predicts raw PHQ-8 scores.

Contextual Image Factors in Human Computer Interaction
Understanding the content and intrinsic characteristics of spontaneous images could be essential from an HCI standpoint. Contextual elements like environment, angle, color, and lighting play a significant role in how users interact with their smartphones. For example, research by Ikematsu et al. [34] indicates that people often prefer positions that require minimal movement when using their devices. This makes it valuable to examine factors such as the smartphone's angle and the background objects present during use. In addition, the ambient light during device interaction can act as a situational impairment, as noted by Tigwell et al. [71] and Sarsenbayeva et al. [68]. For instance, the facial expressions and illumination on a user's face can vary greatly between bright outdoor sunlight and controlled indoor lighting conditions. The environment, whether indoor or outdoor, also affects color, which in turn can influence user psychology. Valdez and Mehrabian [72] conducted studies assessing the impact of color on emotions like pleasure, dominance, and arousal. Their findings suggest that colors like blue and purple are typically perceived as pleasant, while greenish hues tend to be more arousing. This raises the possibility that the dominant color in a user's surroundings might correlate with their facial features during smartphone interaction.

Smartphone Images in Controlled Settings for Mental Health
Extracting facial features to assess mental health and emotions has received significant attention in computer vision, with applications spanning from education to healthcare [51]. Here, many studies have explored facial expressions, gaze patterns, and the overall composition of images to extract visual markers symptomatic of depression [38,43,46]. However, most of these studies are conducted in controlled environments or rely on participants deliberately capturing their images, which could inadvertently influence their emotional portrayal. For instance, Kong et al. [38] captured photographs using a tablet in a standardized clinical setting. Participants were asked to sit before a white background, remove hats or glasses, and tie up long hair to expose their ears; the users looked straight ahead with relaxed expressions as instructed. Similarly, Liu et al. [46] employed a multi-modal deep Convolutional Neural Network (CNN), considering both facial expressions and body movements. During psychotherapy sessions, they captured video using a 4K high-resolution camera in a controlled laboratory setting. Consequently, the participants' expressions and body movements were analyzed in a highly regulated context. Numerous other studies have similarly relied on advanced devices for image capture, used video recordings, or incorporated additional signals (such as movement and audio) within controlled environments [22,31,35,57,60,78].
Our work aims to address these limitations by examining the feasibility of using spontaneously captured images from participants' smartphones, which offers a more natural and less intrusive method for predicting depression. As smartphones have become an integral part of modern life, they are an ideal tool for unobtrusive and widespread data collection. By utilizing smartphone cameras to capture participants' images, our approach eliminates the need for controlled environments or deliberate image-taking, thereby reducing the potential for biased emotional portrayals. Furthermore, the widespread availability of smartphones enables our method to reach a larger and more diverse population, ultimately promoting greater accessibility and inclusivity in mental health assessments.

"In-the-wild" Smartphone Images for Mental Health
Our study emphasizes the analysis of "in-the-wild" smartphone images, particularly those captured via front-facing cameras of smartphones. These images offer a direct window into an individual's emotions, expressions, and environment, thus enhancing the accuracy of mental health assessments. In contrast to social media content, these images remain relatively free from biases like social desirability and self-presentation, which often affect traditional tools.
A limited number of past studies have used "in-the-wild" smartphone images for mental health evaluation. For instance, Wang et al. [73] collected 5811 opportunistic photos in-the-wild from 37 students over ten weeks using their phone's front-facing camera. The study reported that depression scores significantly correlate with the students' facial expressions and activity. While Wang et al. [73] was the first to use in-the-wild images from front-facing phone cameras to study mental health on a non-clinical population of college students, the authors state that there was insufficient signal in the images to predict self-reported depression. MoodCapture is inspired by this original work, which was part of the StudentLife study [74] in 2013. A decade on from the StudentLife study, phone cameras have seen significant advancements, leading to substantial differences in their capabilities compared to those from ten years ago. For example, new phone cameras typically offer much higher resolution and more megapixels than those from a decade ago, resulting in sharper and more detailed face photos; advances in sensor technology and image processing have greatly improved low-light performance, resulting in today's phone cameras capturing better quality face photos in low-light conditions; optical image stabilization has become more common in smartphone cameras today, reducing the impact of shaky hands and resulting in smoother, sharper photos, especially in low light; and finally, front-facing cameras primarily designed for selfie shots have improved significantly in terms of resolution, image quality, and auto-focus on the face. Other differences between Wang et al. [73] and our work are that we take advantage of the massive advances presented by deep learning models and focus on a clinical population rather than a non-clinical group.
Other studies have also leveraged front-facing cameras in one way or another. Khamis et al. [36] studied the visibility of the face and eye in 25,726 in-the-wild images of smartphone users and found that the full face is visible about 29% of the time. The authors stated that their state-of-the-art face detection algorithm performed poorly against photos taken from front-facing cameras. Similarly, Bâce et al. [3] used in-the-wild images to study the visual attention and gaze of users. Darvariu et al. [16], on the other hand, used in-the-wild images from rear-facing cameras. The authors developed a smartphone application that allows users to periodically log their emotional state together with pictures from their everyday lives. They collected 3,305 mood reports with photos from 22 participants. The authors report finding context-dependent associations between objects surrounding individuals and their self-reported emotional state. However, the genuine spontaneity of these captures and their potential for unbiased mental health evaluation remain relatively unexplored. Our contribution to this growing field pivots on the innovative use of genuinely spontaneous, in-the-wild facial images for depression detection. By employing a passive-sensing mobile application that seamlessly captures images without the subject's acute awareness, we negate the potential influence of self-awareness on emotional representation. This strategy bolsters the ecological validity of our data source, making it a robust tool for depression detection.

METHODOLOGY
In what follows, we discuss the design of our MoodCapture study, demographic information of the individuals that participated in the study and the ground-truth used for analysis.

Study Design
We recruited 181 participants from across the United States using targeted online advertisements on Google and Facebook. Each participant underwent a clinician-administered Structured Clinical Interview for DSM-5 (SCID), and only those diagnosed with Major Depressive Disorder (MDD), without bipolar disorder, active suicidality, or psychosis, were eligible for the study. Upon qualification, participants installed our Android-based mobile sensing app on their devices, which gathered Ecological Momentary Assessments (EMA) during the 90-day study period. The app prompted participants to complete a brief Patient Health Questionnaire-8 (PHQ-8) [41] (see Table 6) survey about their depressive symptoms three times daily (morning, afternoon, and evening). As participants answered their daily surveys, the app was designed to discreetly capture a burst of up to 5 images using the front-facing camera. Specifically, images were taken when participants responded to the PHQ-8 item: "I have felt down, depressed, or hopeless." (see Fig. 2). We chose this question as we believed it would best capture participants' genuine emotions related to depression. The PHQ-8 is a validated inventory for measuring depression. For further information about the survey, please refer to the Ground Truth section.
During the onboarding process, we informed participants about the image capture procedure and emphasized that sharing their photos was optional. Upon launching the mobile app for the first time, participants were asked, "To help us better understand your depressive symptoms, we would like to take a few photos in the background that capture your facial expressions while you fill out questionnaires. Do you give us permission to do this?" Participants could respond with either "Yes" or "No." If they agreed to share their photos, the app captured images as they answered the EMA. If they opted not to share their photos, no images were captured. The image capture process was designed to be unobtrusive, with only a green dot at the top of the Android status bar/screen indicating camera usage, which users may or may not have observed. Participants did not see their face or receive any other indication that photos were being taken. This discreet image capture process ensured a seamless user experience without interrupting or obstructing the EMA flow. As stated earlier, while participants consented to have photos taken using the front-facing camera during the operation of the MoodCapture app in the study, they were not informed exactly when these photos were captured, thus promoting in-the-moment, naturalistic, and authentic capture of users' faces and surroundings.
Participants were compensated $1 for each completed EMA, with an additional $50 bonus for achieving a completion rate of 90% or higher during the study period. Compensation was not dependent on sharing photos; participants were compensated regardless of their photo consent. The study was approved by Dartmouth College's Institutional Review Board (IRB). Our analysis and predictive modeling focus on the 177 out of 181 participants who provided consent for their photos to be captured. We collected 125,335 images from these participants, of which we excluded 15,063 photos that were either too blurry, contained no faces, featured children, or contained nudity.

Ground Truth
Our study is designed to account for the wide variability in MDD symptoms. In particular, MDD can manifest in over 1000 distinct symptom combinations across individuals, with significant within-day variations [15,20,24,25]. However, existing diagnostic methods face several limitations. Firstly, SCIDs are not effective in capturing moment-to-moment fluctuations in depression symptoms. Secondly, the Likert scale used in depression screening tools like the PHQ-8, which typically offers a limited response range from 0-3, forces respondents to fit their experiences into pre-set categories. This can lead to central tendency bias and a lack of detailed responses for complex mental states. To overcome these challenges and better capture intra-individual variation, our clinical team modified the PHQ-8 scale to a more nuanced continuous scale in which each item ranges from 0-100 (see Figure 2), giving a total score from 0-800. The practice of re-scaling psychometric scales is not uncommon and has been applied to the PHQ in various past studies [29,48,50,54]. A standard PHQ-8 score of 10 or higher (out of 24) signifies major depression [44]. In our continuous scale, a score exceeding 334 indicates depression (i.e., 10/24 × 800 ≈ 333.3, rounded up). To provide holistic analysis, we complement our binary classification models with regression models that predict raw PHQ-8 scores. Note that the PHQ is versatile, serving both as a screening tool for depression and as a means to monitor clinical symptom changes [40].
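The re-scaling arithmetic above can be illustrated with a minimal sketch. It assumes that each of the eight items is answered on a 0-100 scale and that a survey total strictly above 334 is labeled as depressed, following the wording "exceeding 334"; the function names are hypothetical and are not taken from the study's code.

```python
import math

STANDARD_CUTOFF = 10   # standard PHQ-8 cut-off for major depression
STANDARD_MAX = 24      # 8 items x 3 points on the standard scale
RESCALED_MAX = 800     # assumption: 8 items x 100 points on the continuous scale
DEPRESSION_CUTOFF = 334


def rescaled_cutoff() -> float:
    """Map the standard cut-off onto the 0-800 scale (10/24 * 800)."""
    return STANDARD_CUTOFF / STANDARD_MAX * RESCALED_MAX


def total_score(item_scores) -> float:
    """Sum eight item responses, each assumed to lie in [0, 100]."""
    if len(item_scores) != 8:
        raise ValueError("the PHQ-8 has exactly eight items")
    if any(not 0 <= s <= 100 for s in item_scores):
        raise ValueError("each item is expected to lie in [0, 100]")
    return sum(item_scores)


def is_depressed(total: float, cutoff: float = DEPRESSION_CUTOFF) -> bool:
    """Binary label: a total exceeding the cut-off counts as depressed."""
    return total > cutoff
```

For example, `math.ceil(rescaled_cutoff())` reproduces the cut-off of 334, and a survey with all items at 50 gives a total of 400, which is labeled as depressed.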
To enhance the reliability and accuracy of the EMA responses, we employed a validation technique wherein the app randomly reversed one question in each PHQ-8 survey (thus adding an additional item), ensuring that participants are attentive. We then compared the responses to the original and reversed questions; if there is a significant discrepancy, the response is excluded from our analysis. After applying this filtering process, we obtain a refined dataset comprising 31,215 EMAs. Since we captured a burst of images with each EMA response, we amassed 110,272 images in total. As depicted in Figure 3a, we divided our dataset into two groups: depressed (74,347 images, N=175) and non-depressed (35,925 images, N=156). On average, each participant submitted 176 EMAs (SD = 78) and 623 images (SD = 278) during the study period. It is crucial to note that all participants recruited for this study had major depressive disorder. Consequently, they reported being below the cut-off threshold on some days and above it on others. However, 19 participants consistently reported depression throughout the study. Figure 3b shows the variability of PHQ-8 scores among participants, i.e., intra-individual variability. It provides insight into the fluctuations in a participant's scores over time. On average, participants' scores varied around their own mean by approximately 101.92 points, with the variability ranging widely from a standard deviation of 27.56 points to as high as 262.24 points. This suggests that some participants had relatively stable scores over time, while others exhibited more pronounced fluctuations. Moreover, we measured the internal consistency of the PHQ-8 items, obtaining a Cronbach's α = 0.85. This indicates good reliability of our measures.
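As an illustration of the two statistics reported above, the following sketch computes Cronbach's α from item-level responses and the within-person standard deviation of total scores. It uses the standard textbook formula for α and sample (n-1) variances; whether the study used sample or population variances is not stated, so that choice is an assumption, as are the function and variable names.

```python
from statistics import stdev, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of surveys, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])
    items = list(zip(*responses))                      # one tuple per item
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def within_person_sd(totals_by_participant):
    """Standard deviation of each participant's total scores over time."""
    return {
        pid: stdev(totals)
        for pid, totals in totals_by_participant.items()
        if len(totals) > 1                             # SD needs two or more surveys
    }
```

Averaging the values returned by `within_person_sd` across participants would give a summary comparable in form to the mean intra-individual variability reported above.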

Image Characteristics
We gather in-the-wild images captured by participants using a diverse range of smartphones with varied configurations and camera placements. Predominantly, participants use Samsung, Google, and Motorola devices, and the images captured from these devices had resolutions ranging from 1920x1080 to 4656x3488 (see Table 1). Our naturalistic approach to capturing images ensures ecological validity and represents users' natural behavior while engaging with their devices in different environments. To examine the characteristics of these images, we analyze factors such as phone angle, dominant color, lighting condition, photo location, and background elements present in the photos. The in-the-wild smartphone images offer a unique glimpse into the multitude of ways users interact with their devices and surroundings. However, extracting meaningful insights from these images demands a refined approach that acknowledges the diverse contexts in which they are captured. To achieve this, we utilize the BLIP [45] visual question answering (VQA) model, an advanced AI tool specifically designed for image analysis and answering questions about image content and context. BLIP is recognized as a state-of-the-art method for visual question answering tasks. Furthermore, the VQA analysis contextualizes our predictive modeling in the following ways. First, it can elucidate the raw image content, which is the input for our deep learning models. Second, as our ML models use handcrafted features from the face, it differentiates the performance obtained by considering background in addition to face versus only face. In summary, our motivation is to harness VQA to interpret both explicit and implicit image content, enabling a more holistic approach to image analysis, where both the central subjects and their surrounding context contribute to the predictive insights. Importantly, as we cannot display images to protect participant privacy, the VQA provides some level of interpretation. With the help of the VQA model, we explore the following characteristics:
Image Angle: By inquiring about the image angle, we gain an understanding of user interaction dynamics with their devices. Varying angles, such as high or low, offer insights into users' physical engagement with their smartphones. High, low, or level angle refers to the perspective from which an image is captured or taken with respect to the subject in the frame. A low angle shot refers to the subject looking down at their phone, whereas a high angle shot refers to the user looking up at their phone. A level angle shot is taken from the same height as the subject, capturing it at eye level. We asked the VQA: "Is the image taken from a high, low, or level angle?".
Dominant Colors: Colors are crucial for establishing the context of an image. To identify dominant colors in the images and understand the users' environments, we asked the VQA: "What is the dominant color of the image?".
Lighting Condition: Lighting conditions in an image reveal important information about the user's ambient environment. Using the VQA model, we classified images based on their lighting as well-lit, dimly lit, or poorly lit. We asked the VQA: "Is the image well-lit, dimly lit, or poorly lit?".
Photo Location: The location context (indoors or outdoors) can significantly influence user-device interactions. We determined the location context of images with the help of the VQA model by asking: "Is the photo taken indoors or outdoors?".
Background Objects: Identifying specific objects in the background can provide valuable information about the user's context and activities. We queried the VQA model about the background objects to recognize and categorize various elements within the images. We asked the VQA: "What are the background objects in the photo?".
Number of People in the Image: In order to evaluate the social context of the images, we employed the VQA model to determine the number of people present in each image. This information provides insight into users' social interactions and their surroundings during device usage. We asked: "How many people are in the image?".
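The six questions above can be organized as a small querying routine. The sketch below is illustrative only: the question strings are taken from the text, but `answer_fn` is a hypothetical callable standing in for whatever VQA model is used (for instance, a wrapper around a BLIP VQA checkpoint), and the dictionary keys are names chosen here for convenience.

```python
from collections import Counter

# Question wording as stated in the text; the keys are hypothetical labels.
VQA_QUESTIONS = {
    "angle": "Is the image taken from a high, low, or level angle?",
    "dominant_color": "What is the dominant color of the image?",
    "lighting": "Is the image well-lit, dimly lit, or poorly lit?",
    "location": "Is the photo taken indoors or outdoors?",
    "background_objects": "What are the background objects in the photo?",
    "num_people": "How many people are in the image?",
}


def describe_image(image, answer_fn):
    """Ask every question about one image and return normalized answers.

    answer_fn(image, question) -> str is assumed to wrap the VQA model.
    """
    return {
        name: answer_fn(image, question).strip().lower()
        for name, question in VQA_QUESTIONS.items()
    }


def summarize(records, characteristic):
    """Count how often each answer occurs for one characteristic."""
    return Counter(record[characteristic] for record in records)
```

Keeping the model behind a single callable makes it straightforward to swap in a different VQA model or to replay cached answers without touching the analysis code.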
By leveraging the BLIP VQA model, we are able to extract structured insights about the content and context of in-the-wild images, enhancing our understanding of user behavior and interaction with their devices in diverse settings. Importantly, two expert annotators manually annotated 1500 unique images corresponding to individual EMAs. To clarify any ambiguities, we provided them with specific instructions. They determined the image angle based on eye level with the phone. 'Dominant color' refers to the most prominent color in the overall image. For lighting conditions, 'well-lit' represents the best lighting condition, while 'poorly lit' indicates the worst. After completing the manual annotation, we calculated the average accuracy between the two annotators and the inter-rater agreement using Cohen's kappa (κ). These results (see Table 3) indicate substantial agreement between the annotators and alignment with VQA responses, indicating high reliability, consistency, and accuracy.
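For reference, inter-rater agreement of this kind can be computed with the standard definition of Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The following is a minimal pure-Python sketch of that definition; the function name and the example labels are illustrative and not taken from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:          # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


rater_1 = ["indoors", "indoors", "outdoors", "outdoors"]
rater_2 = ["indoors", "indoors", "outdoors", "indoors"]
kappa = cohen_kappa(rater_1, rater_2)   # 0.5 for this toy example
```

The same function can be applied between one annotator and the VQA answers to quantify model-annotator alignment per characteristic.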

Depression Classification and Regression
In this study, we aim to accurately identify depression from facial images by utilizing both machine learning and deep learning techniques. In particular, we build binary classification models to classify a face image as depressed or not depressed, and a regression model to predict the raw PHQ-8 score (see Section 3.3).

Machine Learning.
To facilitate machine learning approaches, we extract 711 (709 trainable) facial features using OpenFace [6], a well-validated feature set for depression detection that has been employed in a variety of studies [28,58,64]. The extracted features consist of 2D and 3D facial landmarks, head pose, eye gaze, facial expressions represented by facial action units (FAU), and rigid and non-rigid shape parameters (see Table 2). Before training, we apply feature selection using only the training set in two distinct ways. First, we compute the mutual information (MI) metric, selecting the most independent features indicated by smaller MI values. In our analysis, we choose the top 25%, 50%, or 100% of the features. Second, we conduct an ablation study to gain valuable insights into the effectiveness of different hand-crafted features, thus inferring the best performing feature set. An ablation study is a systematic experimental procedure in which certain features are removed or "ablated" to analyze their individual impact on the overall model performance. We use a Logistic Regression [32] model for classification and an ElasticNet [79] for our regression task, whereas a Random Forest (RF) [11] is used for both tasks. Statistical approaches such as regression and a bagging-based decision tree can provide different modeling insights. The baseline model is an RF trained using the participant's gender, age, and time spent on EMA.

Table 2 (excerpt):
2D Landmarks [5] (136 features): These are x and y axes locations of different face landmarks in the image. These landmarks refer to specific locations in the face. For example, a point in the right eye is represented as landmark number 38, while points in the lips are represented using numbers 49-68. All landmark numbers are described in [65,66].
3D Landmarks [5] (204 features): These are x, y, and z axes locations of different face landmarks in the image. The landmark numbers are identical to 2D landmarks; however, they are represented using three coordinates.
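To make the landmark feature counts concrete, the sketch below builds per-group column lists for the 2D and 3D landmark sets and selects one group from a feature row, as one would when training on a single feature set in an ablation. The column naming scheme (`x_i`, `y_i`, `X_i`, `Y_i`, `Z_i`) is an assumption modeled on the 0-indexed feature labels used in this paper (e.g., X_14, Y_48), and the helper names are hypothetical.

```python
# Assumption: 68 face landmarks, numbered 0..67 when 0-indexed, with one
# column per landmark and axis (lowercase for 2D, uppercase for 3D).
N_LANDMARKS = 68

def landmark_columns(axes):
    """Return column names such as 'X_0' ... 'Z_67' for the given axes."""
    return [f"{axis}_{i}" for axis in axes for i in range(N_LANDMARKS)]

# Hypothetical grouping used to train on one feature set at a time.
FEATURE_GROUPS = {
    "2D landmarks": landmark_columns(["x", "y"]),       # 2 * 68 = 136
    "3D landmarks": landmark_columns(["X", "Y", "Z"]),  # 3 * 68 = 204
}

def select_group(row, group):
    """Keep only the features of one group from a {column: value} row."""
    return {name: row[name] for name in FEATURE_GROUPS[group]}

# Toy feature row with every landmark column set to zero.
row = {name: 0.0 for columns in FEATURE_GROUPS.values() for name in columns}
only_3d = select_group(row, "3D landmarks")
```

The group sizes agree with the 136 and 204 features listed for the two landmark sets in Table 2.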

Deep Learning.
Deep learning models are capable of learning useful features directly from raw images. Pre-trained computer vision models trained on large-scale datasets can capture image features that are transferable to other domains. As a result, we examine the performance of various EfficientNet [70] and InceptionResNetv3 variants, which were previously trained on the ImageNet and VGGFace2 datasets, respectively. Upon observing that the EfficientNet B0 (EffNet) model provided the best performance while other models were underfitting our dataset, we decided to further fine-tune EffNet for depression prediction. We implement EffNet using the PyTorch framework, freezing all layers during the training process except for blocks 6 and 7. The classification and regression models are fine-tuned using binary cross-entropy and mean absolute error loss functions, respectively. For optimization, we use the Adam optimizer (with a learning rate of 0.0001) with a batch size of 256, trained for 50 epochs. This fine-tuning process allows the model to learn and adapt to the specific characteristics of our depression detection dataset, potentially improving its performance and generalizability.

Evaluation.
To effectively evaluate our models, we adopt a 5-fold leave-subject-out cross-validation approach. This method ensures that all images associated with a single participant are exclusively used for training, validation, or testing the model but not mixed among the subsets. Furthermore, we use nested cross-validation on our training data for hyper-parameter tuning. The subject-independent splits and cross-validation ensure our results are more robust than those of a single train-test strategy. We evaluate classification performance using balanced accuracy (Equation 1) and Matthews Correlation Coefficient (MCC) (Equation 2), whereas regression performance is evaluated using MAE (Equation 3). We chose these metrics as they provide a comprehensive assessment. For example, MCC summarizes all four values in the confusion matrix, whereas balanced accuracy emphasizes both true positive and true negative detection. In fact, MCC is preferred over F1 score in many binary classification problems [13].
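\begin{equation}
\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \tag{1}
\end{equation}
\begin{equation}
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{2}
\end{equation}
where $TP$, $TN$, $FP$, and $FN$ are true positives, true negatives, false positives, and false negatives, respectively. Note that higher balanced accuracy and MCC values indicate better performance. MCC ranges from -1 to +1, where +1, 0 and -1 indicate perfect classification, random coin-toss classification, and perfect mis-classification, respectively. The regression models are evaluated using MAE defined as:
\begin{equation}
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{3}
\end{equation}
where $n$ is the number of samples, and $y_i$ and $\hat{y}_i$ are the true and predicted PHQ-8 scores, respectively. Note that lower MAE values indicate better performance.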
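The three evaluation metrics described above can be sketched in a few lines using their standard definitions. The function names and the toy labels and scores are made up for illustration; this is not the evaluation code used in the study.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, where 1 = depressed."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy classification labels (illustrative only).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(labels, preds)
bal_acc = balanced_accuracy(*counts)
mcc_score = mcc(*counts)

# Toy regression scores (illustrative only).
score_error = mae([100, 250, 400], [150, 200, 400])
```

Because balanced accuracy averages the per-class recall, a classifier that always predicts the majority class scores 0.5 regardless of class imbalance, which is why it is a more informative baseline than plain accuracy on skewed data.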

RESULTS
In this section, we present the outcomes of our analysis, which includes an examination of image characteristics, an evaluation of the predictive capabilities of our machine learning models, and an ablation study. In addition, we identify crucial features integral to our models' performance and explore potential biases within these models.

Image Characteristics
Our analysis using the VQA model reveals many insights into different features of real-world smartphone images. These images serve as glimpses into user interactions and surroundings. From Table 3 and Figure 4, we notice that the VQA obtained good accuracy ranging from 89% for lighting conditions to 97% for number of people in the image. Furthermore, κ is greater than 0.70 for all questions, indicating substantial inter-rater agreement. In terms of capture angle, the images predominantly favored a low angle, with approximately 96.08% falling into this category. Conversely, a mere 3.92% were captured from a high angle, suggesting a specific user posture or device interaction habit in the majority of instances. Dissecting the dominant colors present, we found that 'white' emerged as the prevailing color, characterizing roughly 67.51% of the dataset.
Other noticeable colors included 'black' at 8.70%, while a combined representation of 'brown', 'blue', 'gray', and 'yellow' accounted for approximately 18%. A diverse array of other hues constituted the remaining 5.75%, emphasizing the richness of user environments. Closer analysis during the annotation process revealed that the images' dominant white color mainly reflects environmental elements like white walls and ceilings, not participants' skin tones. Importantly, we noticed most images consisted of partial face images, an observation commonly found in other similar studies [36]. Hence, the dominant color is influenced by background objects. This is evidenced in Figure 4, where walls, ceilings, tiles and lights are frequently identified as background objects, ensuring our analysis focuses on environmental, not physiological, aspects. The lighting conditions under which these images were taken were also revealing. A vast majority (80.57%) were captured under well-lit conditions. The dimly lit and poorly lit categories followed with 10.35% and 9.08%, respectively, showcasing the varied ambient conditions in which users interact with their devices. Furthermore, in terms of photo location, 95.08% of the images were taken indoors, signifying the primary environment for user-device interaction. The outdoor segment, constituting 4.92%, provided insight into the more dynamic and mobile interactions users might experience. Notably, 95.81% of the captured images featured only one person. Regarding background objects, we discovered that walls, lights, pictures, and windows were the most common elements. The presence of terms such as "pillow" could imply individuals reclining, while words like "plant," "moon," "flower," and "cloud" might suggest outdoor settings. Overall, it appears that a significant number of images were captured indoors against plain backdrops, possibly within homes or offices. To visually represent these background objects, we have created a word cloud, which can be seen in Figure 4.

Predictive Analysis
In our analysis, we leveraged both machine learning and deep learning to assess MoodCapture's ability to detect depression in natural settings. The results are shown in Table 4. First, we observe that RF outperforms LR across all metrics, suggesting that decision trees with bagging are useful in modelling face features for depression. RF's ability to model non-linear dependencies and in-built feature selection makes it a good candidate for our problem. Second, we notice that manual feature selection, such as using 3D landmarks, offers better performance than using automatic feature selection methods with MI. This finding underscores the importance of conducting an ablation study to determine the most impactful features for our analysis.
In summary, an RF trained with 3D landmarks performs well across both classification and regression tasks, as indicated by well-balanced scores across balanced accuracy (0.60), MCC (0.14), and MAE (130.31). Moreover, RF offers better explainability compared to deep learning methods, making it an ideal choice for post-hoc analysis (Section 4.4). Our investigation into depression detection and PHQ-8 prediction using machine learning and deep learning methods provides important insights into the potential of different techniques when applied to MoodCapture data in naturalistic conditions. The results emphasize the importance of considering a range of methods, from deep learning models capable of learning complex features to traditional machine learning techniques that offer interpretability and simplicity. By carefully selecting and fine-tuning these models, we can improve the overall performance and applicability of depression detection systems in real-world scenarios.

Ablation Study
In this analysis, we aimed to determine if specific OpenFace feature sets are more useful for depression detection by evaluating the performance across the seven groups (Facial action units, Gaze, Eye landmarks, Pose, Rigidity Parameters, 2D and 3D landmarks). From Table 5, we make several interesting observations that provide insights into the utility of individual feature sets.
First, we notice that many feature sets perform better than the automatic feature selection using MI, indicating that only some specific features in the image are useful for depression detection. This finding suggests that a more focused approach to feature extraction and selection may improve overall performance. Second, we observe that facial action units are less discriminative than other features. This result may be attributed to the presence of partial face images, which are common in front-facing cameras, thus hindering the effectiveness of action units in detecting depression. Third, we find that gaze features outperform eye landmarks, suggesting that gaze direction and angle are useful. These observations highlight the importance of capturing subtle facial changes when developing depression detection systems.
Table 5 also indicates that 3D Landmarks is the best performing feature set for classification and PHQ-8 prediction across all metrics (balanced accuracy = 0.60; MCC = 0.14; MAE = 130.31). These results suggest that an RF trained with 3D landmarks is more accurate, correlates better with the ground truth, and has lower PHQ-8 prediction errors than other methods. 3D landmarks (see Table 2) are coordinates of specific points on the face. For example, a point in the right eye is represented as landmark number 38. This location is represented using coordinates. All landmark numbers are described in [65,66]. Intuitively, different values of 3D landmarks correspond to changes in facial expressions over time.
In conclusion, the ablation study provides valuable insights into the utility of specific feature sets for depression detection. By understanding the strengths and limitations of individual features, researchers and practitioners can make informed decisions when designing and implementing depression detection systems, ultimately improving overall performance and applicability in real-world scenarios.

Machine Learning Feature Importance
It is crucial to understand important face features that are correlated with depression. Therefore, we employ a post-hoc explainability approach, namely SHapley Additive exPlanations (SHAP) [49], to investigate our best performing (Table 4) Random Forest model. SHAP explains the model outputs using notions from game theory. It assigns each feature an importance value for a particular prediction, offering insights into how and why a model makes its decisions.
The top ten important features for depression classification and regression are shown in Figure 5. Here, we observe that lips and face contour position are useful for both depression classification and score prediction. For instance, we notice that larger values of face contour near the left cheek (X_14, X_13, X_11) influence the model towards predicting depression and push the raw PHQ-8 score higher. Interestingly, we find that important eye and lip features occur on the right side of the face (Y_48, Y_36, Y_17, Y_41), and higher values are associated with higher depression scores. This indicates that the ML model captures asymmetry associated with front-facing camera pictures, i.e., the right side of the face could be more visible. We discuss this further in Section 6.
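To illustrate the game-theoretic idea behind the feature attributions discussed above, the sketch below computes exact Shapley values for a toy three-feature scoring function by averaging each feature's marginal contribution over all feature orderings. This brute-force computation is not the SHAP library's algorithm and not the model analysed in this paper; the scoring function and its numbers are invented, and only the feature names are borrowed from the text.

```python
from itertools import permutations

def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            totals[f] += value_fn(frozenset(present)) - before
    return {f: totals[f] / len(orderings) for f in features}

def toy_model(subset):
    """Made-up score given which features are 'known' (illustrative only)."""
    score = 0.0
    if "X_14" in subset:
        score += 2.0
    if "Y_48" in subset:
        score += 1.0
    if "X_14" in subset and "Y_48" in subset:
        score += 1.0  # interaction term, split evenly between the two features
    return score

phi = shapley_values(toy_model, ["X_14", "Y_48", "Y_36"])
```

The attributions sum to the difference between the score with all features and the score with none, and a feature that never changes the score (here Y_36) receives an attribution of zero.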

Investigating Bias in Machine Learning
Our dataset predominantly consists of white females, highlighting the need to assess biases in our machine learning models related to gender and race. Thus, we categorize the test dataset into two groups for gender: females, and a combined group of males and non-binary individuals. We made the decision to combine the groups due to the notably smaller representation of non-binary individuals and males in our study. This decision aimed to address the imbalance and ensure a more meaningful analysis, acknowledging the constraints posed by the limited sample sizes of these specific demographic groups. Similarly, we classify the data into white and non-white categories for race. Again, this binary grouping strategy is designed to increase group sizes, thereby improving the statistical power of our analysis. We use our best performing RF model for these evaluations.

Figure 5: SHAP plots describing the top 10 features for the classification and regression tasks. The best performing random forest trained using 3D landmark features is evaluated using SHAP. The features are x and y axis with the numbers (0-indexed) corresponding to facial landmarks [65,66].
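The grouped evaluation described above amounts to computing the same metric separately on each demographic subset of the test data. A minimal sketch follows, assuming each test record carries a group attribute, a true label, and a prediction; the record layout, function names, and the records themselves are hypothetical.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = depressed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def per_group_balanced_accuracy(records, attribute):
    """Split records by a demographic attribute and score each group."""
    groups = defaultdict(lambda: ([], []))
    for record in records:
        true_labels, pred_labels = groups[record[attribute]]
        true_labels.append(record["label"])
        pred_labels.append(record["pred"])
    return {g: balanced_accuracy(t, p) for g, (t, p) in groups.items()}

# Made-up test records (illustrative only).
records = [
    {"gender": "female", "label": 1, "pred": 1},
    {"gender": "female", "label": 1, "pred": 0},
    {"gender": "female", "label": 0, "pred": 0},
    {"gender": "female", "label": 0, "pred": 0},
    {"gender": "non-female", "label": 1, "pred": 1},
    {"gender": "non-female", "label": 1, "pred": 0},
    {"gender": "non-female", "label": 0, "pred": 1},
    {"gender": "non-female", "label": 0, "pred": 0},
]
by_gender = per_group_balanced_accuracy(records, "gender")
```

Note that this sketch requires every group to contain both classes; with very small groups the per-class recall is undefined or unstable, which is one motivation for merging small groups as described above.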
Figure 6 displays the performance results of our models, revealing several notable observations. Firstly, as indicated by MCC scores, we observe that the classifier predictions show some correlation with the ground truth at varying levels across different genders and races. Secondly, as shown in Figure 6a, the results for depression classification and regression varied by gender. Specifically, we found that depression classification was more accurate for females, whereas PHQ-8 score predictions were more precise for non-females, as indicated by their lower MAE. Thirdly, the race-based performance analysis in Figure 6b demonstrated that the model was more effective for white participants in both the classification and regression tasks. These biases likely stem from the predominance of white females in our dataset, a limitation that we discuss in Section 7. Our analysis of these biases is intended to enhance transparency in machine learning models, providing insights for future research in this area.

ETHICAL CONSIDERATIONS AND USER ACCEPTANCE
In studies involving sensitive mental health data, it is paramount to address the ethical implications to safeguard participants' privacy, confidentiality, and well-being. Our primary goal was to prioritize the security and confidentiality of the data. We securely stored all collected data and granted access only to specific team members. We took great care in removing all personally identifiable information by implementing a thorough anonymization process. To respect privacy, any image that unintentionally captured subjects or nudity was identified during a review by two team members and subsequently deleted. We understand the sensitive nature of mental health and made sure to maintain transparency with our participants. They were informed about the study's purpose, methodology, and expected outcomes. This approach not only sought their permission but also ensured they felt comfortable and safe throughout the process. We further clarified that their compensation was unrelated to their photos.
At the end of the study, we asked participants about their comfort levels with automated front-facing photo capture during surveys. This was optional, so we have responses from only 172 out of the 181 participants that were recruited. Approximately 45% of participants were comfortable, while 38% felt it was intrusive or uneasy, and the remaining 17% were neutral. If participants were uncomfortable, we further asked them about the specific reasons for their feelings, which can be summarized into a few key themes, as shown below. While we acknowledge these concerns, it is important to note that the study followed strict privacy and data protection guidelines.
(1) Privacy and Surveillance: Participants felt uncomfortable with the idea of being watched or monitored, as it evoked a sense of intrusion into their personal space. One participant mentioned, "I don't like being watched. I'm already paranoid when it comes to cameras."
(2) Appearance and Self-Esteem: Several participants mentioned their discomfort with having their photos taken due to concerns about their appearance. One participant stated, "I don't want people to see photos of me", while another said, "I am very uncomfortable with my appearance when I'm depressed."
(3) Inappropriate Situations: Participants worried about the possibility of photo bursts being taken during inconvenient or inappropriate moments. One participant shared, "If I was comfortable and at home, during some of them I may not have been completely covered."
(4) Data Security: Although participants were aware of the study's data protection measures, some still expressed concerns about the safety and storage of their images. One participant expressed, "The idea of my picture being out there ...although I know it was to be analyzed with AI."
(5) Lack of Control: Participants felt uneasy about not being able to review, approve, or delete the photos taken during the bursts, as well as not knowing when the camera was active. A participant shared, "Having pictures taken and not knowing what they looked like or if they were embarrassing is an uncomfortable thing to think about."
In summary, participants' concerns mainly revolved around privacy, self-esteem, potential inappropriate situations, data security, and control over the images. It is essential to consider these concerns when designing and implementing studies involving photo bursts or similar data collection methods to ensure participants' comfort and trust in the research process. Acknowledging the sensitive nature of our research, we offered participants the option to delete their photos at the end of the study if they felt uncomfortable. Interestingly, no participants chose this option, highlighting the trust they placed in our research process and commitment to ethical conduct. We remain keenly aware of the potential for technology misuse, especially in unauthorized surveillance or data mining scenarios. We have taken measures to minimize such risks, emphasizing that our technological developments are primarily intended as health aids, not tools for unwarranted monitoring. Further, to address participants' concerns regarding privacy and data security, one possible solution could be leveraging the capabilities of AI chips on smartphones. By conducting all image classification and processing on the device itself, no images would need to be transmitted or stored externally. This approach could significantly alleviate users' concerns about their images being stored or accessed by unauthorized parties. As AI technology continues to advance, incorporating on-device processing capabilities into our research methodology may not only increase user trust and comfort but also pave the way for a new generation of privacy-focused health aids. In line with our commitment to ethical conduct, we will continue to explore and implement such technological advancements to ensure the protection of participants' data and privacy in our research.

DISCUSSION
In this section, we provide a summary of our findings and engage in a thorough discussion, exploring the implications and uncovering the potential opportunities highlighted by our results.

Summary of results
Our study investigated the potential of using in-the-wild smartphone images and deep learning models for detecting depression and predicting PHQ-8 scores, aiming to contribute to the development of user-centered and unobtrusive mental health assessment tools. The results of our analysis provided valuable insights into the characteristics of in-the-wild images, the performance of machine learning and deep learning models, and user acceptance of such approaches. The image characteristics analysis revealed that most images were captured from a low angle, indoors, and under well-lit conditions. These findings highlighted the participants' natural behavior with their smartphones, emphasizing the importance of considering real-world HCI dynamics in designing mental health assessment tools.
Our predictive analysis demonstrates that a random forest model trained by manually selecting 3D landmark features obtains the best overall classification (balanced accuracy of 0.60, MCC of 0.14) and regression performance (MAE of 130.31). Interestingly, the EffNet deep learning model only narrowly beat this score on the classification task, by 0.01. It correctly identified depressed and non-depressed participants with a balanced accuracy of 0.61. Given additional high quality data, the deep learning models could improve over existing methods. To summarize, these scores are promising. They are even more noteworthy considering that the facial images were captured using a diverse range of smartphone devices: 87 different models from 9 distinct brands. As the camera quality of these devices varies significantly, it is important to note that the results may be influenced by factors such as image clarity and auto-focus capabilities. Despite these potential limitations, our findings support the ecological validity of the study and emphasize the potential of machine learning and deep learning methods in analyzing depression from facial images, even when captured in less-than-ideal conditions.
During post-hoc analysis, we gained several interesting insights. Firstly, our ablation study indicates that smaller domain-specific feature sets perform better in both our tasks. Specifically, we notice that 3D landmarks, gaze, and pose offer good performance across all metrics. By focusing on these features, researchers can potentially improve the overall performance of mental health assessment tools. Secondly, our explainability analysis revealed that larger values on the right side of the face have an impact on both depression detection and PHQ-8 score prediction. This finding suggests that people hold phones in a way that emphasizes the asymmetry of front-facing face images. Thirdly, our investigation into biases within machine learning models offers crucial insights for future research, particularly in terms of improving generalization and guiding data collection strategies. In terms of user acceptance, we found diverse responses regarding participants' comfort levels with automated front-facing photo capture. While some participants were comfortable with the process, others felt uneasy due to concerns related to privacy, self-esteem, inappropriate situations, data security, and control over the images. These concerns highlight the need for careful consideration of ethical implications in designing and implementing studies involving photo bursts or similar data collection methods.
In conclusion, our research highlights the potential of using in-the-wild smartphone images, machine learning and deep learning models for depression analysis, offering a more objective, unobtrusive, and continuous approach to mental health assessment. By carefully considering the insights gained from our analysis and addressing the ethical implications, researchers and practitioners can work towards developing user-centered, effective, and ethically sound tools for mental health assessment and intervention.

Implications
The findings from our study hold significant implications for various stakeholders, including researchers, practitioners, and policymakers in the fields of mental health, digital health, human-computer interaction (HCI), and public health.
Our research highlights the potential of utilizing smartphone images and machine learning models as a supplementary method for mental health assessment. This innovative approach encourages the exploration of alternative ways to assess mental health that can complement traditional tools such as self-report questionnaires and clinical interviews. While our data was collected from participants who had major depressive disorder, the results pave the way for future research to investigate the broader applicability of these methods, potentially leading to a better understanding of depression and improved mental health support over time. This, in turn, could promote timely access to appropriate interventions and support systems.
From an HCI perspective, our study underscores the importance of considering user acceptance when developing mental health assessment tools that utilize smartphone images and machine learning. Recently, there has been a growing interest among researchers to integrate user acceptance into the training phase of machine learning models, as proposed in studies like [10,47]. In a related observation, our feature importance analysis indicated that the right side of the face is more useful in depression detection. This phenomenon could be linked to the dominance of right-handed individuals, often resulting in partial face images that capture more of the right side. Various studies support the idea that handedness influences user interaction with smartphones and user experience (UX) [2,27,42,52]. Therefore, future research in HCI could benefit from focusing on developing tools that facilitate the capture of the entire face more effectively. For instance, the work by Nelavelli and Ploetz [52] explores adaptive app design tailored to the user's handedness, which could be a promising direction for enhancing face image capture in smartphone applications. In summary, understanding users' concerns and preferences is crucial for creating tools that are more likely to be adopted and used by those in need of support. This focus on user acceptance can inspire the HCI community to design mental health assessment tools that balance effectiveness, privacy, and user engagement, leading to the development of more accessible and inclusive digital mental health solutions.
In the broader context of public health, the study's findings emphasize the importance of leveraging technology and innovative methods to address mental health challenges. As mental health disorders continue to impact individuals and communities worldwide, adopting novel approaches like the one presented in our study can contribute to more effective prevention strategies, early intervention, and resource allocation. This could ultimately lead to better mental health outcomes and overall well-being for individuals across various demographic and cultural contexts. In summary, the implications of our study extend well beyond the immediate findings, offering valuable insights for a range of stakeholders working at the intersection of mental health, digital health, and human-computer interaction. By considering user acceptance, exploring the potential of smartphone images for mental health assessment, and recognizing the broader public health context, our study contributes to the development of more effective, user-friendly, and contextually appropriate mental health assessment tools with the potential to improve the lives of individuals affected by depression.

LIMITATIONS
Our study, while providing valuable insights into the use of in-the-wild smartphone images and deep learning models for depression detection, has some limitations that should be acknowledged. First, our study's dataset may be limited in size and diversity, as it consists of a relatively small number of participants. Additionally, it is important to remember that our dataset is primarily composed of white females. Although our models currently show better performance for females in classification tasks and for non-females in regression tasks, expanding our dataset to include more diverse samples is necessary. By incorporating additional data that represents a broader spectrum of the population, we can ensure a more comprehensive representation. This expansion will not only enhance the robustness of our findings but also significantly improve the generalizability of our results across different demographic groups. Furthermore, the study relies on self-reported data, such as depression scores, which may be subject to biases, including social desirability and recall bias. Future research could be significantly enhanced by including more objective measures of mental health, such as clinical evaluations or physiological indicators. In our study, we adjusted each item's score on the PHQ-8 from its original 0-3 range to a broader 0-100 scale. As mentioned earlier, the practice of re-scaling psychometric scales is not uncommon and has been applied to the PHQ in various past studies [29,48,50,54]. However, one limitation of adapting the PHQ-8 to a 0-100 scale is the potential for inconsistencies when correlating these scores with established levels of depression severity. To mitigate this, we proportionally scaled the original scores to derive our depression categorization, striving to preserve the original scoring system's integrity. Additionally, our prediction models consider both the raw PHQ scores and the adjusted class scales, an approach that aims to balance detailed granularity with traditional scoring validity.
It is also important to highlight that all participants in our study had received clinical diagnoses for MDD. However, we relied on self-reported data for tracking daily depression levels, which facilitated more consistent monitoring. Our study also focused exclusively on a clinically depressed cohort. Including healthy individuals in the dataset would have been beneficial for developing a more comprehensive and accurate prediction model. A randomized controlled trial (RCT) with healthy controls or incorporating a diverse cohort of individuals not experiencing depression could provide valuable insights into the differences between depressed and non-depressed individuals and improve the model's ability to distinguish between them. Future research should consider expanding the dataset to include both depressed and healthy individuals, which can contribute to the development of more effective and precise mental health assessment tools.
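The proportional re-scaling discussed above can be made concrete with a small arithmetic sketch. It rests on two assumptions that this section does not state explicitly: that the raw PHQ-8 score is the sum of the eight re-scaled items (giving a 0-800 range), and that the categorization threshold is the conventional PHQ-8 cutoff of 10 on the original 0-24 scale. The function and constant names are hypothetical.

```python
ITEMS = 8                # number of PHQ-8 items
ORIGINAL_ITEM_MAX = 3    # each item is originally scored 0-3
RESCALED_ITEM_MAX = 100  # each item is re-scaled to 0-100 in this study

def rescale_threshold(original_cutoff):
    """Map a cutoff on the original total scale onto the re-scaled total."""
    original_total_max = ITEMS * ORIGINAL_ITEM_MAX  # 24
    rescaled_total_max = ITEMS * RESCALED_ITEM_MAX  # 800 (assumed sum of items)
    return original_cutoff * rescaled_total_max / original_total_max

# Assumption: conventional PHQ-8 cutoff of 10, not confirmed by this section.
rescaled_cutoff = rescale_threshold(10)

# Reported MAE of 130.31 expressed as a fraction of the assumed 0-800 range.
mae_fraction = 130.31 / (ITEMS * RESCALED_ITEM_MAX)
```

Under these assumptions, a cutoff of 10 maps to roughly 333 on the re-scaled total, and the reported MAE corresponds to about 16% of the full score range.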
Another limitation is that the study primarily focuses on the analysis of in-the-wild smartphone images and their relationship with depression. However, there may be other factors, such as social interactions, physical activity, and environmental context, that could provide additional insights into depression detection. Integrating these factors into future research may help to develop more holistic and accurate prediction models. Deep learning models, while powerful and effective, can often be considered as "black-box" models with limited interpretability. This may make it difficult to understand the specific features or patterns that the model has identified as being related to depression. Future research could explore the use of more interpretable models or techniques to provide insights into the underlying mechanisms linking visual cues and depression. Lastly, the use of in-the-wild smartphone images for mental health assessment raises ethical and privacy concerns, which need to be carefully considered when designing and implementing such tools. Ensuring user consent, data security, and transparency in the use of personal data is crucial for maintaining trust and fostering the adoption of these tools. Addressing these limitations in future research can help to further advance our understanding of the relationship between smartphone images, deep learning models, and depression detection, contributing to the development of more effective, user-centered, and ethically sound mental health assessment tools.

CONCLUSION AND FUTURE WORK
Through this study, we have demonstrated the potential of using in-the-wild smartphone images and machine learning to detect depression, offering valuable insights for mental health assessment, HCI and digital health.With this, we aim to pave the way for more effective and user-centered mental health assessment tools.Addressing the limitations of our study and building upon its findings, future research can contribute to the development of more robust, accurate, and ethically sound mental health assessment tools that have the potential to improve the lives of individuals affected by depression.
When we embarked on designing our MoodCapture study to investigate whether high-resolution face capture from phones could assess mood, we were acutely aware of the ethical issues surrounding our research and the potential privacy concerns of a population that included individuals diagnosed with depression. As discussed in the section on Ethical Considerations and User Acceptance, our study was meticulously designed to safeguard user privacy throughout, and we sought participants' evaluations of the MoodCapture app post-study. This invaluable feedback forms the foundation for future work in image-based mood detection, which we believe is a promising technology.
One direction we plan to pursue as our next step involves utilizing on-phone AI chips, now available on top-end smartphones, to run deep learning models directly on the device, ensuring that images never leave the phone. Additionally, we intend to explore combining this on-device prediction approach with federated deep learning, where models are trained without sharing raw data over a network with a central entity such as a server or cloud. This approach could effectively address the security concerns associated with centralized data collection and the privacy issues our participants raised during the acceptance study.
Finally, we recognize that the performance of the models we considered for face-based depression detection, particularly deep learning models, would benefit significantly from a larger face dataset. In the MoodCapture study, we collected over 125,000 images from 177 individuals living with depression over a period of 90 days, representing a well-sized dataset to demonstrate the potential of this idea. If future face-based depression studies have access to larger pools of naturalistic images collected in the wild (e.g., on the scale of VGGFace2, which contains over 3 million face images), we anticipate that the accuracy and capabilities of the models would improve significantly.
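To make the federated learning idea mentioned above more concrete, the following is a minimal sketch of federated averaging, not the system we plan to build. It assumes a toy logistic regression model and synthetic per-device data; the function names (`local_update`, `federated_average`, `federated_round`, `make_client`) and all hyperparameters are hypothetical and chosen only for illustration. The point it shows is that each device trains on its own data and only model weights, never raw samples, are sent for aggregation.

```python
import math
import random


def local_update(weights, features, labels, lr=0.1, epochs=5):
    """Run a few epochs of logistic-regression gradient descent on one device's data."""
    w = list(weights)
    n = len(features)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighting each by its local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def federated_round(global_weights, clients):
    """One round: every client trains locally and returns weights only."""
    updates = [local_update(global_weights, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    return federated_average(updates, sizes)


def make_client(rng, n):
    """Synthetic stand-in for one device's data: a bias term plus one feature."""
    X, y = [], []
    for _ in range(n):
        v = rng.uniform(-1.0, 1.0)
        X.append([1.0, v])
        y.append(1 if v > 0 else 0)
    return X, y


rng = random.Random(0)
clients = [make_client(rng, n) for n in (20, 35, 50)]  # three simulated devices
weights = [0.0, 0.0]
for _ in range(30):
    weights = federated_round(weights, clients)
```

In this sketch the aggregator only ever sees the weight vectors returned by `local_update`. A real deployment would additionally need secure aggregation and protection against information leaking through the weights themselves, which this example does not address.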

Figure 2: PHQ-8 application screens for each item: Images are always captured while users respond to the PHQ-8 depression survey question (highlighted in cyan): "I have felt down, depressed, or hopeless". While users consent to have photos taken using the front-facing camera during the operation of the MoodCapture app, they are not informed exactly when these photos are captured, in order to promote in-the-moment, naturalistic, and authentic images.

Figure 3: PHQ-8 score statistics: Figure (a) depicts the distribution of the PHQ-8 score reported by the participant and the corresponding label (i.e., Depression or No Depression). Figure (b) shows the variability of PHQ-8 scores among participants over the duration of the study (Cronbach's α = 0.85).

Figure 4: Background objects: Word cloud showing the range of objects detected in the background of the images captured.
(Acc=91.72; κ=0.70) conditions, indicating optimal settings for smartphone interaction. The dimly lit and poorly lit categories followed with 10.35% and 9.08%, respectively, showcasing the varied ambient conditions in which users interact with their devices. Furthermore, in terms of photo location, 95.08% of the images were taken indoors, signifying the primary environment for user-device interaction. The outdoor segment, constituting 4.92%, provided insight into the more dynamic and mobile interactions users might experience. Notably, 95.81% of the captured images featured only one person. Regarding background objects, we discovered that walls, lights, pictures, and windows were the most common elements. The presence of terms such as "pillow" could imply individuals reclining, while words like "plant," "moon," "flower," and "cloud" might suggest outdoor settings. Overall, it appears that a significant number of images were captured indoors against plain backdrops, possibly within homes or offices. To visually represent these background objects, we have created a word cloud, which can be seen in Figure 4.
(a) Important features for depression classification. (b) Important features for predicting raw depression score.

Figure 5: SHAP plots describing the top 10 features for the classification and regression tasks. The best-performing random forest trained using 3D landmark features is evaluated using SHAP. Each feature name combines an axis (x or y) with the 0-indexed number of the corresponding facial landmark [65, 66].
(a) Gender-wise performance comparison. (b) Race-wise performance comparison.

Figure 6: Random forest performance on sub-populations divided by gender and race. Note that balanced accuracy and MCC are multiplied by 100 for better visualization.

Figure 7: Comfort Level: Participants' comfort with the automated capture of their photos.

Table 1: Demographics, smartphones, and image composition in our study.

Table 3: Image Characteristics: Different characteristics of the images captured, such as image angle, dominant colors, lighting conditions, photo location, and number of people present. The accuracy and Cohen's kappa are presented in braces next to the categories.
These results indicate substan-

Table 4: Performance: Depression detection using machine learning and deep learning methods. Standard deviation is given in braces. 'LR + EN' refers to logistic regression for depression classification and elastic net for regression, i.e., raw PHQ-8 score prediction. R² values are presented in Appendix B.

Table 5: Ablation Study: Investigating depression detection of OpenFace feature sets using a random forest. The standard deviation is presented in braces. R² values are presented in Appendix B.