Psychophysiological Arousal in Young Children Who Stutter: An Interpretable AI Approach

This first-of-its-kind study identifies and visualizes second-by-second differences in the physiological arousal patterns of preschool-age children who stutter (CWS) and children who do not stutter (CWNS) while speaking perceptually fluently in two challenging conditions: speaking in stressful situations and narration. The first condition may affect children's speech through high arousal; the latter imposes linguistic, cognitive, and communicative demands on speakers. We collected physiological data from 70 children in the two target conditions. First, we adopt a novel modality-wise multiple-instance-learning (MI-MIL) approach to classify CWS vs. CWNS in different conditions effectively. The evaluation of this classifier addresses four critical research questions that align with the interests of state-of-the-art speech science studies. We then leverage SHAP classifier interpretations to visualize the salient, fine-grain, temporal physiological parameters unique to CWS at both the population/group level and the personalized level. While group-level identification of distinct patterns would enhance our understanding of stuttering etiology and development, personalized-level identification would enable remote, continuous, and real-time assessment of stuttering children's physiological arousal, which may lead to personalized, just-in-time interventions that improve speech fluency. The presented MI-MIL approach is novel, generalizable to different domains, and real-time executable. Finally, comprehensive evaluations of the presented framework and several baselines on multiple datasets yield notable insights into CWS's physiological arousal during speech production.


INTRODUCTION
Recent advancements in technology have resulted in state-of-the-art sensors that provide accurate readings of various physiological signals with minimal intrusion and only minor limitations on a person's mobility. The physiological data collected by these sensors give insight into human affective states and allow us to examine how emotion influences human thought and behavior. The goal of this study is to develop automated machine learning (ML) classifiers that can identify subtle differences in affective states between young children who stutter (CWS) and children who do not stutter (CWNS) during a stressful scripted speech task and a narration task.
Emotions (i.e., affective states) are temporary, short-lived, and complex psychophysiological constructs composed of two underlying dimensions: valence and arousal [46,49]. Valence is defined as the positive-to-negative evaluation of the subjectively experienced state [31]. Arousal measures the intensity of the affective state, ranging from calm to highly excited or alert [13,47].
Psychological arousal of an individual is observed as spontaneous responses in physiology due to an external (e.g., seeing a scary picture) or an internal (e.g., one's own thoughts) stressor [50]. These responses are spontaneous and can manifest as changes in heart rate [52], electrodermal activity (skin conductance) [54], etc. The autonomic nervous system (ANS) is responsible for directing these physiological responses. The sympathetic nervous system (SNS), one branch of the ANS, directs the 'fight or flight' response. It stimulates the body to respond to a stressful situation by elevating physiological parameters such as heart rate, respiration rate, and blood glucose levels. The other branch of the ANS, the parasympathetic nervous system (PNS), directs the 'rest and digest' response. It conserves the body's natural activity and relaxes the individual once a stressful situation has passed. The PNS leads to decreased arousal by reducing the heart rate and respiration rate.
The two branches of the ANS interact to coordinate our physiological responses. Their interaction is illustrated by children's cardiovascular response to a stressful situation in Figure 1(a): a descriptive statistical comparison of the 'mean heart rate (HR) change-score Euclidean-distance feature' of the CWS and CWNS participants before, during, and after viewing a negatively-valenced picture. For CWS participants, while viewing the picture, the SNS causes the mean HR to increase (marked by a green arrow). Once the stressful situation is over (i.e., the picture is removed), the PNS reduces the mean HR back to the resting state (marked by a red arrow). However, the mean response of the CWNS population follows a different trend: while exposed to an external stressor (i.e., negatively-valenced picture viewing), the CWNS's PNS reduces the mean HR even lower (marked by a red arrow). In the remainder of the introduction, we review the role of physiological arousal in speaking and the development of stuttering (section 1.1). Further, we discuss deep learning frameworks relevant to understanding children's physiological responses during speech (sections 1.2 and 1.3). We end the introduction with the study challenges (section 1.4) and our research questions, problem statement, and contributions (section 1.5).

Physiological Arousal in Children who Stutter
Stuttering is believed to be a multifactorial condition in which multiple factors interact and contribute to stuttering onset in early childhood and its later development into its chronic form [70]. Research shows that speech production leads to increased autonomic arousal in both adults [33,41,83] and children [7,18,36,42,72,89,90]. In this study, we were interested in examining whether speaking tasks that vary in linguistic complexity are inherently more stressful and associated with higher physiological arousal for preschool-age children who stutter. We were also interested in examining any patterns in physiological reactivity that distinguish children who stutter from their typically fluent peers. Psychophysiological research to date has offered mixed findings regarding whether preschool-age children who stutter differ from their peers who do not stutter in their autonomic arousal during speech production.
Literature from speech science generally indicates that preschool-age children who stutter do not have elevated autonomic arousal during speaking tasks such as picture naming, picture description, and non-word repetition [7,18,36,37,42,72,78,79,89,90]. However, differences in autonomic arousal between children who do and do not stutter based on the children's age [89] and the complexity of the speaking task [72] have been reported. Further, among children who stutter, differences related to stuttering chronicity [90] and speech fluency [79] have also been observed. Importantly, studies published to date relied on traditional statistical approaches to examine potential differences in physiological arousal between children who do and do not stutter. Identifying fine-grain, second-by-second, and personalized differences in physiological parameters between the CWS and CWNS groups during speaking tasks has yet to be addressed.

Deep Learning in Physiological Arousal and Stress Detection
Various ML approaches (e.g., SVM, CNN, RNN, RCNN) have been developed to detect arousal/stress from the physiological sensor data of non-stuttering individuals [4,34,53]. For example, one study [61] evaluated a convolutional neural network (CNN) and long short-term memory (LSTM) network, taking a combination of ECG signal features, vehicle data, and contextual data as input to predict driver stress with an accuracy of 92.8%. Other studies [4,53] used CNNs and recurrent CNNs to detect stress from physiological parameters such as electrodermal activity (EDA), achieving an accuracy of 67.50% and an F1-score of 0.71, respectively. Table 10 in Appendix A.6 summarizes the recent literature on physiological sensing-based stress assessment.

Deep Learning and Physiological Sensing for Stuttering Speech Disfluency detection
To the best of our knowledge, no study has developed ML classification approaches to differentiate the physiological responses of CWS vs. CWNS during perceptually fluent speaking tasks such as scripted phrase repetition under an external stressor and spontaneous narration. However, a study [77] on adults who stutter (AWS) developed a multi-layered perceptron (MLP) neural network-based disfluency classification system. It used respiration signals as input and achieved 82.6% accuracy in differentiating the physiological response of 'AWS during disfluent speech production' vs. 'AWNS during fluent speech production.' In that study [77], the participants were asked to read aloud a 677-syllable text extracted from a Spanish story. Speech-language pathologists observed the experiments and provided fine-grain annotations of the AWS's stuttering disfluency events, such as sound/syllable repetitions, sound prolongations, or silent tense pauses. Notably, in contrast, in our data (section 2) children speak perceptually fluently for approximately 97% of the duration. Hence, the presented study focuses on differentiating the physiology of CWS vs. CWNS during both perceptually fluent and disfluent speech production.
Though statistics-based studies (discussed in section 1.1) have shown group-wise differences in the physiological response of CWS vs. CWNS during perceptually fluent speech production, no study to our knowledge has examined or identified the second-by-second salient and distinguishable patterns present in CWS's vs. CWNS's physiology during their fluent speech. Since such patterns are unknown, it is not possible to annotate our data at the fine-grain level, which makes our classification a challenging task [30,85].

Study Challenges
This paper aims to develop automated ML classifiers that can identify the subtle differences in affective states between CWS and CWNS during the stressful scripted speech and narration tasks. This section discusses the challenges of developing such a classifier effectively.

Weakly Labeled Data:
One challenge is that, although prior statistical studies [72,89] established the existence of physiological parameter differences between CWS and CWNS, none of them identified CWS's distinctive physiological signal patterns. Moreover, preschool-age children cannot self-assess and report on their physiological state due to their young age [32]. Also, changes in arousal are not always accompanied by observable changes in behavior. Hence, our data is 'weakly labeled,' meaning we do not have precise labels of CWS's distinctive physiological signal patterns. For example, Figure 1(b)-(g) shows six children's 20s cardiovascular response (i.e., HR change-score) windows. The dotted rectangles show the children's class-indicative (CWS or CWNS) distinctive patterns. These patterns are 2-5s in duration and may appear at any timestamp in the 20s window. Our data annotations only indicate whether the data belongs to a CWS or CWNS individual; the distinctive patterns and their timestamps within the detection window are unknown. Additionally, our datasets are limited in size: there were 180 and 200 picture-viewing events for CWS and CWNS, respectively, during the scripted speech experiment. Supervised learning classifiers fail to learn the subtle, sparse, and independent physiological sensing patterns discussed above from 'weakly' labeled data, specifically when the data is limited in size.

Modality-wise Distinctive Patterns:
This paper evaluates multiple physiological modalities (e.g., HR, EDA, and respiration activities) to measure the participants' (i.e., CWS's and CWNS's) physiological response. A notable observation is that arousal-indicative sparse patterns do not emerge simultaneously in each modality. Figure 2 shows the EDA and HR signals of a CWS participant during a negatively-valenced picture viewing. As shown in Figure 2a, a low-to-high arousal transition (i.e., region B) appears in the EDA during the 8s to 13s timestamps, while the participant is waiting for a picture to be viewed (anticipation effect [28]), and the participant experiences high arousal during the negatively-valenced picture viewing (i.e., region C). However, according to Figure 2b, a low-to-high arousal transition (i.e., region B) appears in the HR during picture viewing (12s-15s timestamps), where the HR decreases rapidly (freezing response [5]). High arousal (i.e., region C) in HR appears as the subsequent increase in HR after the freezing response. Notably, our 'weakly' labeled data do not have fine-grain annotations of such arousal-indicative sparse patterns in any of the modalities. This observation implies that our classifier needs to extract CWS's distinctive sparse patterns independently from each modality, without any annotations of such patterns available during training.

Capturing Cross-modality Dependency:
Previous studies [10,40] have shown that the correlations between physiological parameters such as HR, EDA and Respiration activities are effective attributes in stress [4,34,53] or emotional valence [68] detection. Moreover, the HR-EDA synchronization is positively associated with variability of arousal responses [40]. Hence, our solution approach must capture the cross-modality relationships as well.
In conclusion, our solution needs to identify CWS's distinctive sparse patterns independently from each physiological modality without any available annotations of such patterns, and capture and leverage the cross-relationships of the identified modality-specific sparse patterns for effective CWS vs. CWNS classification.

Problem Statement and Contributions
In particular, the study aims at investigating the following key questions. To our knowledge, no previous study has investigated these questions through machine learning approaches.

3.
Do the CWS and CWNS show differences in physiology under rest or baseline condition?
To investigate, we developed and evaluated classifiers to differentiate CWS vs. CWNS from the 20s baseline/neutral-condition physiological signals of the scripted dataset (discussed in section 4.1.3).

4.
Do the CWS and CWNS show different physiological responses during spontaneous narration?
The narration task is linguistically and cognitively demanding, since the children develop new context or storylines and articulate them in speech. Therefore, this research question investigates whether CWS show different physiological responses than CWNS while performing such a narration task. To investigate, we developed and evaluated classifiers to differentiate CWS vs. CWNS from the 20s windows of the narration task condition of the free-speech dataset (discussed in section 4.1.4).

Contribution in Classification:
To investigate the above-mentioned questions, we develop and evaluate a novel Modality Invariant-MIL (MI-MIL) (section 3.2) classifier.

1.
To address the weakly labeled data challenge and identify modality-wise distinct patterns (section 1.4), MI-MIL applies the modality-wise multiple-instance-learning (MIL) paradigm to each physiological modality independently. The MIL paradigm is designed to extract sparse and subtle patterns from weakly labeled data (i.e., without any fine-grain annotations of the region, timestamps, or duration of the patterns in the data).

2.
To capture the cross-modality-relations (section 1.4), MI-MIL presents a novel modality-fusion network that identifies the cross-relations of each modality's CWS indicative sparse patterns.

3.
Our evaluations, discussed in section 4.1, show that the presented approach significantly outperforms supervised learning classifiers and the recent state-of-the-art MIL approach (attention-based MIL).

4.
Our evaluation demonstrates that MI-MIL is real-time executable on scalable, resource-constrained devices: an NVIDIA Jetson Nano and a Google Pixel 6 smartphone.
The developed MI-MIL models' high efficacy in addressing each of the four questions (q1-q4) indicates the existence of physiological signal patterns that differentiate the binary categories, and that our classifier can identify them.

1.
We collected two datasets (section 2) containing CWS's and CWNS's physiological responses (i.e., physiological sensing parameters: electrodermal activity (EDA), heart rate (HR), respiratory rate (RSP-rate), and respiratory amplitude (RSP-amp)) while speaking in two challenging conditions: speaking under stressful conditions (after experiencing external stressors), and a linguistically and cognitively demanding narration task in which the children spontaneously develop new context and concepts.

2.
Motivated by state-of-the-art behavioral science studies, we extracted a novel vector-distance-based representation of physiological parameters' change-score features that captures the fluctuation of one's physiological response in a target condition relative to their neutral condition (section 2.3.2). The presented change-score representation significantly outperforms the representation used in the literature (Appendix A.1).

3.
We evaluated ML models on both raw and change-score physiological features. They give us insight into the differences in CWS's vs. CWNS's physiological patterns and fluctuations. Furthermore, we evaluated and ranked the raw features according to their discriminative capabilities (section 4.2). The feature importance ranking is in line with our observations from the Shapley visualization (section 5).

Classification Interpretation:
This study is the first of its kind to analyze, interpret, visualize, and discuss the fine-grain, second-by-second, temporal, and distinctive physiological response patterns that distinguish CWS from CWNS during speech production in different challenging conditions. The developed MI-MIL classifiers are black-box models; hence, we employ SHAP [16] ML model explainers (detailed in section 5) to extract and visualize the group-wise and personalized situational physiological arousal patterns. Identifying and visualizing group-wise patterns would enhance our understanding of stuttering etiology and development. Personalized pattern identification would enable remote, continuous, and real-time assessment of stuttering children's physiological arousal, which may be used clinically to develop personalized emotion regulation strategies (e.g., biofeedback, mindfulness intervention), resulting in improved speech fluency.
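The additive-attribution idea behind SHAP can be illustrated with an exact Shapley-value computation for a tiny hypothetical model (all function names and values below are illustrative assumptions; the study itself applies SHAP explainers to the trained MI-MIL classifiers):

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values over a few features: features outside the
    coalition S are fixed to their baseline value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in combinations(others, r):
                # Shapley weight for a coalition of size |S|
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                without = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += w * (predict(with_i) - predict(without))
    return phi

# Toy "arousal score" that is linear in two change-score features
predict = lambda v: 2.0 * v[0] + 0.5 * v[1]
phi = shapley_values(predict, x=[1.0, 1.0], baseline=[0.0, 0.0])
# For a linear model, each feature's Shapley value equals its own contribution.
```

Summing the attributions recovers the gap between the model's output at `x` and at the baseline, which is the efficiency property SHAP visualizations rely on.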

PROCEDURE
We collected two datasets comprising preschool-age children's (both CWS and CWNS) physiological response (i.e., physiological sensing parameters) while performing two different speaking tasks. Due to the nature of speaking tasks, we refer to the first dataset as "free speech dataset" and to the second dataset as "scripted dataset" consistently throughout the manuscript. Study participants and procedures are explained below.

Participants:
Participants in both datasets were preschool-age children (mean age: 50.3 months, std: 9.14). The study procedures were approved by the Syracuse University Institutional Review Board. All data collection procedures took place in our laboratory over two visits. During the first visit, participants were administered standardized tests of speech and language to ensure age-appropriate speech articulation and language scores, and they passed a pure-tone hearing screening. Participants' speech fluency was assessed by a licensed speech-language pathologist, and the diagnosis of stuttering was established using evidence-based diagnostic criteria [73,88]. All psychophysiological data were collected during the second visit. The free-speech dataset comprised data from 35 preschool-age children (age range: 36-67 months): 16 CWS (13 boys, 3 girls) and 19 CWNS (12 boys, 7 girls). The scripted dataset comprised data from 38 preschool-age children (age range: 38-69 months): 18 CWS (16 boys, 2 girls) and 20 CWNS (16 boys, 4 girls).

Data-Collection Experimental Procedure
We first explain the procedures common to data collection for both datasets, then describe the dataset-specific differences. Upon arrival at the lab, participants played and spoke with the examiner for about 15 minutes to get acquainted with the lab. Then, they were seated in a child-sized chair in front of a computer screen. Hypoallergenic electrodes were attached to the skin at the suprasternal notch of the rib cage and at the 12th rib laterally to the left for acquisition of the electrocardiogram [75]. A strain gauge transducer designed to measure respiratory-induced changes in thoracic or abdominal circumference (model TSD201, Biopac Systems, Inc.) was used to record respiratory effort; the transducer was positioned around the participant's chest for acquisition of the respiration waveform. Electrodermal activity was recorded with electrodermal response transducers (model TSD203, Biopac Systems, Inc.), which comprised a set of two Ag-AgCl electrodes with molded housings designed for finger attachment. The response transducers were filled with an isotonic electrolyte gel and placed on the volar surfaces of the middle phalanges of two fingers of the participant's right hand. After the sensors were placed, the participants' baseline psychophysiological data were collected first, followed by the experimental conditions. The conditions are explained below.

Baseline Condition:
For both datasets, to establish a pre-experimental baseline for each participant's resting skin conductance level, breathing rate and heart rate, participants viewed an animated screensaver of a three-dimensional fish tank for four minutes. This procedure has been successfully implemented in prior studies to establish baseline psychophysiological levels in preschool-age children [36,72].

Experimental Condition -Free Speech Dataset:
The experimental condition in this dataset was a picture description task, which lasted about 10 minutes. During the task, participants were shown pictures from Frog Goes to Dinner [55], a wordless storybook by Mercer Mayer about a boy, a dog, and a frog. To keep the narrative elicitation procedure consistent across participants, the examiner was not allowed to ask specific questions about the picture but could only prompt the participant to describe what was happening by saying, "Let's look at this picture. Tell me what is happening here." The examiner was instructed to provide no more than three such elicitation prompts per picture. Participants who stuttered did show some stuttering events (3% of their total speech), such as sound repetitions and prolongations, during this experiment.

Experimental Condition -Scripted Dataset:
The experimental condition in this dataset also lasted approximately 10 minutes and involved negatively-valenced picture viewing and phrase repetition. Specifically, the participants were shown 10 negatively-valenced pictures from the International Affective Picture System [48] and were asked to repeat a target phrase, "Buy Bobby a puppy" (the BBAP phrase), after a pre-recorded prompt presented over the speakers. Picture presentations were interspersed with speaking such that for each picture shown, the participants were asked to repeat the target phrase 3 times. Figure 3 shows the chronological order of events for one negatively-valenced picture viewing in the scripted dataset. None of the participants showed any stuttering disfluency events during this experiment.

Data Acquisition: Preprocessing and Cleaning:
The respiratory, electrodermal, and cardiac activity were acquired simultaneously using the Biopac MP150 hardware system (Biopac Systems, Inc.) and recorded using AcqKnowledge software (ver. 4.3 for PC, Biopac). Respiratory effort (RSP), electrodermal activity (EDA), and electrocardiogram (ECG) signals were sampled at 1250 Hz during the baseline and experimental conditions. A Butterworth high-pass filter [62] was applied to the raw signals to remove noise and baseline drift.
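The filtering step can be sketched as follows; the cutoff frequency and filter order below are illustrative assumptions, since the text specifies only that a Butterworth high-pass filter was used:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 1250  # sampling rate (Hz), as reported for the Biopac MP150 recordings

def highpass(signal, cutoff_hz=0.05, order=2, fs=FS):
    """Zero-phase Butterworth high-pass filter to remove baseline drift.

    cutoff_hz and order are illustrative; the paper does not report them.
    """
    sos = butter(order, cutoff_hz, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)  # forward-backward filtering: no phase lag

# Example: a slow linear drift plus a faster physiological component
t = np.arange(0, 20, 1 / FS)
raw = 0.5 * t + np.sin(2 * np.pi * 1.2 * t)  # drift + ~1.2 Hz oscillation
clean = highpass(raw)                         # drift largely removed
```

Zero-phase (forward-backward) filtering is a common choice here because it preserves the timing of transient responses, which matters for second-by-second analysis.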

Extraction of Event Detection Windows:
This paper detects affective state differences from 20s windows, since in the scripted dataset children took on average 20s to utter the BBAP phrase 3 times following a picture viewing. For the free-speech dataset, baseline-level data collection and picture description sessions lasted approximately 4 min and 10 min, respectively. We segment the sessions into 20s windows with a 5s overlap (15s hop-length). Physiological sensory streams from each 20s window of the baseline and picture description sessions represent the participant's physiological response in the neutral and narration (i.e., linguistically and cognitively demanding) conditions, respectively. Similarly, in the scripted dataset, baseline-level data collection lasted approximately 4 min and was segmented into 20s windows; sensory signals from these windows represent participants' physiological response in the neutral condition. During the speaking task, for each negatively-valenced picture viewing, we extracted two 20s windows: (1) during picture viewing; and (2) after the picture viewing, while the participants were repeatedly saying the scripted phrase ("Buy Bobby a puppy"). As shown in Figure 3, the picture was flashed on the screen for 3 seconds (i.e., the red region), after which the computer prompted the sentence "Buy Bobby a puppy" (BBAP). After that, participants started to speak, which took approximately 20s. Window 1 starts at time step 0 (where each time step is a 2s window), and the image is shown at time step 13. Hence, window 1 of the scripted dataset captures the participants' physiological response while waiting for a picture, watching the picture, and preparing speech under stressful conditions; that is, it captures the participants' progression from a normal to a stressed state. Window 2 starts at time step 18 and ends at time step 38. Hence, window 2 of the scripted dataset captures the participants' physiological responses while speaking under stressful conditions.
(Note that windows 1 and 2 each consist of nineteen time steps, i.e., 2s instance windows with 1s overlap, resulting in 19 time steps per 20s window.) The data collection sessions were segmented into 20s windows with a 5s overlap; hence, for the free-speech dataset, we had a total of 1510 windows. For the scripted dataset, there were 752 windows in total: 376 window-1 and 376 window-2 instances.
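The two-level windowing described above (20s detection windows with a 15s hop over the session, then 2s instances with a 1s hop within each window) can be sketched as follows, using a synthetic signal in place of the real recordings:

```python
import numpy as np

def sliding_windows(x, fs, win_s, hop_s):
    """Segment a 1-D signal into windows of win_s seconds every hop_s seconds."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    starts = range(0, len(x) - win + 1, hop)
    return np.stack([x[s:s + win] for s in starts])

fs = 1250
session = np.random.randn(10 * 60 * fs)            # synthetic 10-minute session
detection = sliding_windows(session, fs, 20, 15)   # 20s windows, 15s hop (5s overlap)
instances = sliding_windows(detection[0], fs, 2, 1)  # 2s instances, 1s hop
```

With a 2s window and 1s hop, a 20s detection window yields (20 − 2)/1 + 1 = 19 instances, matching the 19 time steps per window used throughout the paper.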

Physiological Features Extraction
As discussed in the previous section, each 20s signal was divided into nineteen segments of 2s duration with 1s overlap, and features were extracted from each 2s segment. We evaluated the event detection models using two categories of features: (a) raw features and (b) change-score features.

Raw Physiological Features:
Psychophysiological features relevant to affective states and stuttering individuals' physiological responses are extracted from each 2s segment. We extracted heart rate (HR) from the raw ECG signal, since a recent study [74] of HR in relation to stressful situations indicates that children who stutter showed significantly higher HR than CWNS. Electrodermal activity (EDA) signals increase and show spontaneous fluctuations during arousal [9,58]. Hence, filtered EDA is one of the extracted features. Moreover, studies have shown that respiratory distributions vary between individuals who stutter and those who do not, and across speech-fluency levels [71]. To capture the children's respiratory patterns, we extract the respiratory rate (RSP-rate) and respiratory amplitude (RSP-amp) [66] as raw physiological features from the raw RSP signal.
Following previous studies [63,69], two levels of physiological features (low-level and high-level descriptors) are extracted, allowing the ML models to capture signal characteristics at different granularity levels. Four low-level descriptor (LLD) features: heart rate (HR), electrodermal activity (EDA), respiratory rate (RSP-rate), and respiratory amplitude (RSP-amp), were extracted at 0.8-millisecond intervals from each 2s segment. Six high-level descriptor (HLD) functionals: min, max, std, var, mean, and median, are applied to each LLD, yielding 24 HLD features per 2s segment. In total, we extract 19 × 24 raw physiological features from each 20s event detection window.
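A minimal sketch of the LLD-to-HLD extraction (function and variable names are ours, and the segment data is synthetic):

```python
import numpy as np

# The six HLD functionals named in the text
HLD_FUNCS = (np.min, np.max, np.std, np.var, np.mean, np.median)

def hld_features(segment_llds):
    """segment_llds: dict mapping LLD name (HR, EDA, RSP-rate, RSP-amp) to its
    samples within one 2s segment. Returns the 24-dim HLD vector
    (4 LLDs x 6 functionals)."""
    return np.array([f(v) for v in segment_llds.values() for f in HLD_FUNCS])

# One synthetic 2s segment at 1250 Hz (2500 samples per LLD)
segment = {name: np.random.randn(2500)
           for name in ("HR", "EDA", "RSP-rate", "RSP-amp")}
feats = hld_features(segment)

# Stacking the 19 instance segments of a detection window gives the
# 19 x 24 raw-feature representation described above.
window = np.stack([hld_features(segment) for _ in range(19)])
```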

Change-score Features:
State-of-the-art behavioral science studies evaluate change-score features [17,19,37] to understand psychophysiological changes in individuals in response to different affective states (e.g., arousal). A change-score is the difference between the value of a variable/feature measured at one point in time (Y_t) and the average value of the variable for the same unit in the baseline-level condition (Y_b). Y_t is called the 'post-score,' Y_b is the 'baseline-score,' and the difference between Y_t and Y_b is the 'change-score.' This study extracts change-scores of the HR, EDA, RSP-amp, and RSP-rate LLD features from each 2s segment. The post-, baseline-, and change-scores of these physiological features are represented as vectors.
Post-scores: are calculated from each 2s segment in the different non-baseline scenarios. Each LLD feature is represented as a 6-dimensional vector (one dimension per HLD functional) quantifying an individual's physiological response. These 6-dimensional vectors are the respective LLD features' post-scores.
Baseline-scores: are calculated from all of the 2s segments in the individual's baseline condition. For each LLD feature (HR, EDA, RSP-amp, and RSP-rate), we take the mean of its 6-dimensional HLD vectors over all baseline-condition 2s segments as its 'baseline-score' vector. That is, we extract four 6-dimensional baseline-score vectors for each individual, representing their average LLD feature values in the neutral condition.
Change-scores: are the vector differences between the post-scores and the baseline-scores. For each 2s non-baseline segment, the change-score quantifies the difference in an individual's physiological response relative to their baseline (i.e., neutral) response.
In this study, the vector difference between the post-score and baseline-score is measured by two metrics: cosine similarity and Euclidean distance. They are two of the most common metrics for measuring vector differences in machine learning [26,44,76,80].
For each of the four LLD features (HR, EDA, RSP-amp, and RSP-rate), we calculate two change-score values (Euclidean distance and cosine similarity), totaling eight change-score features extracted from each 2s segment. In total, we extract 19 × 8 change-score features from each 20s event detection window.
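A minimal sketch of the change-score computation for one 2s segment, using synthetic 6-dimensional HLD vectors in place of real post- and baseline-scores:

```python
import numpy as np

def change_scores(post, baseline):
    """post, baseline: 6-dim HLD vectors for one LLD feature.
    Returns the (Euclidean distance, cosine similarity) change-scores."""
    euclid = np.linalg.norm(post - baseline)
    cosine = post @ baseline / (np.linalg.norm(post) * np.linalg.norm(baseline))
    return euclid, cosine

rng = np.random.default_rng(0)
llds = ("HR", "EDA", "RSP-amp", "RSP-rate")
baseline = {lld: rng.normal(size=6) for lld in llds}  # per-individual baseline
post = {lld: rng.normal(size=6) for lld in llds}      # one 2s non-baseline segment

# Eight change-score features per 2s segment: 4 LLDs x 2 metrics
features = np.array([v for lld in llds
                     for v in change_scores(post[lld], baseline[lld])])
```

Stacking these 8-dimensional vectors over a window's 19 instance segments yields the 19 × 8 change-score representation.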

Comparison of Features:
Raw physiological features (HLDs) capture the signal amplitudes and patterns of individuals in different conditions. Hence, affective state detection classifiers learn the physiological parameter values and time progressions relevant to different target classes (e.g., arousal). However, a limitation is that individuals' physiological responses in the neutral condition can be dissimilar; a classifier trained on raw physiological signals may misclassify an individual's neutral state as an aroused state if their baseline-level physiological signal attributes differ from the population average. Change-score features eliminate such bias, since they capture the difference in an individual's physiological response in different conditions relative to their neutral state. In doing so, however, change-score features lose fine-grain information about the signals (e.g., signal amplitudes, std, etc.). Hence, this paper trains and interprets classifiers using both raw and change-score physiological features to understand how their 'attributes (amplitudes and patterns)' and 'fluctuations relative to the neutral condition' indicate children's mental states during phrase repetition under aroused conditions and during narration tasks.

Approach Design Choices
This section discusses how our presented MI-MIL approach addresses the challenges (discussed in section 1.4) in CWS vs. CWNS affective states difference classification from physiological sensing signals.

Weakly Labeled Data and Multiple Instance Learning (MIL)
We employ a multiple instance learning (MIL) paradigm [64,86,87] to address the absence of fine-grain annotations in our data. In MIL, each input of a classifier is considered a bag of instances B = {x_1, x_2, …, x_K}. Each bag B has an associated single binary label Y ∈ {0, 1} known during training. However, the labels of the instances within a bag, i.e., y_1, …, y_K with y_k ∈ {0, 1} for k = 1, …, K, are unknown. Per the conventional instance-based MIL assumption [24], a positive bag has label Y = 0 and a negative bag has label Y = 1. A negative bag contains at least one negative instance (i.e., ∃x_j ∈ B, y_j = 1) and may also contain positive instances, whereas a positive bag contains positive instances only (i.e., ∀x_j ∈ B, y_j = 0). Thus, the relationship between the bag label Y and the instance labels is: Y = max_{i=1,…,K} y_i.
In this paper, a negative bag is the extracted feature set from a CWS's 20s physiological data, whereas a positive bag is from a CWNS's. The features extracted from each 2s segment discussed in section 2.3 constitute an instance, and the collection of all instances of a 20s physiological sensing data window constitutes a bag. According to the instance-based MIL assumption, if the MIL classifier identifies that at least one instance is negative (i.e., a 2s segment conveying CWSs' distinctive physiological response pattern), the 20s data is detected as a CWS's response. In contrast, if none of the instances is identified as negative, the 20s data is detected as a CWNS's response.
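Under this instance-based assumption, the bag label is simply the maximum over the (unobserved) instance labels. A toy sketch, with illustrative instance labels (in the real data these are unknown during training):

```python
def bag_label(instance_labels):
    """Instance-based MIL assumption used here: a 20s bag is labeled
    CWS (the 'negative' class, Y = 1) iff at least one 2s-segment
    instance carries a CWS-indicative pattern (y_k = 1)."""
    return max(instance_labels)

# Toy bags of k = 19 instances.
assert bag_label([0] * 19) == 0                 # CWNS: no indicative instance
assert bag_label([0] * 16 + [1, 0, 1]) == 1     # CWS: sparse indicative instances
```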

Temporal Dependency and Attention-MIL
Previous studies [8,79] have shown that the physiological response in arousal comprises temporal patterns. A limitation of instance-based MIL [24] is that it assumes the distinctive patterns (indicative of CWS, the negative class) are sparse and independent. Moreover, it classifies a bag (i.e., one input) based only on the instance with the highest likelihood of being negative. Hence, a large portion of the data remains unutilized, negatively affecting classification performance.
To address this challenge, this paper adopts an attention-based MIL approach named attention-MIL [35,81]. Attention-MIL is capable of identifying sparse distinctive patterns from weakly labeled data, captures sequential traits, makes inferences from the aggregation of all instances in a bag, and has been shown to achieve better classification performance [81].
In contrast to instance-based MIL, it generates a score, or attention weight, for each instance x_i in a bag, indicating the likelihood of the instance x_i conveying CWS-indicative distinctive patterns. The weighted instances (preserving their temporal patterns) are aggregated through an attention-based pooling function (section 3.2.2) to generate a bag representation, from which the classifier makes its inference.

Modality-wise Distinctive Pattern Extraction:
This paper evaluates multiple modalities, HR, EDA, RSP-amp, and RSP-rate (section 2.3), to measure the participants' physiological responses. The observation in section 1.4 demonstrates that arousal-indicative sparse patterns do not emerge simultaneously in each modality. This motivates extracting CWSs' distinctive sparse patterns independently from each modality. Hence, this paper applies the attention-MIL approach to each modality separately and generates modality-specific bag representations.

Capturing Cross-modality Dependency:
As discussed in section 1.4, cross-modality relationships can be effective attributes in differentiating CWS's situational physiological responses from CWNS's. Hence, the presented approach must capture the cross-modality relationships.
As discussed above, the modality-wise attention-MIL mechanism generates an independent representation for each modality. To capture the cross-modality relations, the presented approach uses a novel modality fusion mechanism which learns the pair-wise and unary relationships between each modality-embedding.

Modality Invariant-MIL (MI-MIL) Approach
This section discusses the MI-MIL approach, which takes the modality-specific bag representations (B_m = {x_1m, x_2m, …, x_km}, k = 19, m ∈ {EDA, HR, RSP-amp, RSP-rate}) of a 20s physiological sensing data window as input. As shown in figure 4, MI-MIL has four components: (1) modality-specific embedding block, (2) modality-specific self-attention pooling block, (3) modality fusion block, and (4) classification block. While the first two blocks are applied to each modality m independently, the latter two combine the cross-modality information to generate the inference. The components are discussed in detail below.

Modality Specific Embedding Block:
For each modality m, MI-MIL utilizes a modality-specific embedding block f_m that takes each 2s segment (i.e., modality-specific instance x_im) of the respective modality as a separate input and transforms it into a lower, p-dimensional embedding vector e_im [25]. Each block f_m comprises multiple linear layers with ReLU activation functions. The generated embeddings e_im, i = 1, 2, …, k convey the modality-m-specific CWS vs. CWNS differentiating information from their respective instances x_im.

Modality Specific Self-Attention Pooling Block:
As discussed in section 3.1, CWS vs. CWNS differentiating patterns can be present in different, asynchronous portions (i.e., different timestamps) of the different modality signals. Hence, the different modality-specific pooling blocks may learn different attention weights for each 2s segment (i.e., instance), enabling CWS-indicative pattern extraction independently from each modality. The attention weight of each embedding e_im is computed by equation 1, which follows the attention pooling mechanism presented in the attention-MIL paper [35]:

a_im = exp{w_m^T tanh(V_m e_im^T)} / Σ_{j=1}^{k} exp{w_m^T tanh(V_m e_jm^T)}    (1)

Here, w_m ∈ R^{L×1} and V_m ∈ R^{L×M} are the network parameters of the modality-m-specific self-attention pooling block. The hyperbolic tangent tanh(·), an element-wise non-linearity, is utilized to ensure proper gradient flow [35]. The generated weight a_im represents the likelihood that the embedding e_im conveys CWS vs. CWNS differentiating pattern information, and the weights a_im, i = 1, 2, …, k sum to 1, making the pooling invariant to the number of instances in a bag. The modality-specific bag embedding is then the attention-weighted sum of the instance embeddings. Hence, this block captures the temporal patterns present in each modality, aggregates CWS vs. CWNS differentiating modality-specific information from each 2s segment (i.e., instance), and is trainable through backpropagation.
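A minimal NumPy sketch of this modality-specific attention pooling; the dimensions k, M, L and the random inputs are illustrative assumptions:

```python
import numpy as np

def attention_pool(E, V, w):
    """One modality-specific self-attention pooling block:
    a_i = softmax_i(w^T tanh(V e_i^T)); bag embedding = sum_i a_i e_i."""
    scores = w @ np.tanh(V @ E.T)            # (1, k) unnormalized scores
    a = np.exp(scores - scores.max())
    a = a / a.sum()                          # weights sum to 1 -> bag-size invariant
    return (a @ E).ravel(), a.ravel()        # (M,) bag embedding, (k,) weights

k, M, L = 19, 32, 64                         # k instances per 20s bag; M, L assumed
rng = np.random.default_rng(0)
E = rng.normal(size=(k, M))                  # modality-specific instance embeddings
V = rng.normal(size=(L, M))                  # V_m in R^{L x M}
w = rng.normal(size=(1, L))                  # w_m^T in R^{1 x L}
bag, weights = attention_pool(E, V, w)
assert bag.shape == (M,) and np.isclose(weights.sum(), 1.0)
```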

Modality Fusion Block:
The modality fusion block captures the cross-modality relationships. It receives four independently generated modality-specific bag embeddings from the four modality-specific self-attention pooling blocks. Each embedding is an N-dimensional vector. The block concatenates the four embeddings into a [1, 4N]-dimensional vector X = <x_1, x_2, …, x_4N> and generates a [1, 4N]-dimensional vector Z = <z_1, z_2, …, z_4N> that encodes the pairwise relations among all possible vector-dimension pairs (x_i, x_j) of X, as well as the unary relations, i.e., how one dimension x_i of X may have its own independent impact. Hence, Z essentially encodes pair-wise and unary relationships between the dimensions of each of the modalities. Each z_i encodes the relations corresponding to x_i and is computed using equation 2:

z_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j)    (2)
Here, the z_i computation enumerates all possible dimensions j. f(x_i, x_j) represents the pairwise relation between dimensions i and j of X. We use the Embedded Gaussian function, f(x_i, x_j) = e^{θ(x_i)^T φ(x_j)}, where θ(x_i) = W_θ x_i and φ(x_j) = W_φ x_j are embeddings of x_i and x_j, and W_θ and W_φ are learnable network parameters. In our implementation, W_θ and W_φ are single convolutional layers with a kernel size of 1. θ(x_i)^T φ(x_j) is the dot-product similarity. The normalization factor is set as C(x) = Σ_{∀j} f(x_i, x_j), so that (1/C(x)) f(x_i, x_j) becomes a softmax operation along dimension j.
The function g(x_j) generates a unary embedding of x_j. It is a simple linear embedding: g(x_j) = W_g x_j, where W_g is a learnable network parameter. In our implementation, W_g is a single convolutional layer with a kernel size of 1. Hence, according to equation 2, the modality fusion block generates a modality-invariant representation Z, encoding pair-wise relations between the modalities while preserving each modality's unary information.
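The fusion computation can be sketched as follows. This is a simplified NumPy illustration of the embedded-Gaussian form, with the 1×1-convolution weights W_θ, W_φ, and W_g reduced to scalars (an assumption of this sketch, not the paper's exact implementation):

```python
import numpy as np

def softmax(S, axis=-1):
    E = np.exp(S - S.max(axis=axis, keepdims=True))
    return E / E.sum(axis=axis, keepdims=True)

def modality_fusion(X, w_theta, w_phi, w_g):
    """z_i = (1/C(x)) * sum_j exp(theta(x_i)^T phi(x_j)) g(x_j);
    the softmax over j realizes the 1/C(x) normalization."""
    theta, phi, g = w_theta * X, w_phi * X, w_g * X   # per-dimension embeddings
    scores = np.outer(theta, phi)                     # theta(x_i)^T phi(x_j)
    return softmax(scores, axis=1) @ g                # Z, same shape as X

rng = np.random.default_rng(0)
X = rng.normal(size=4 * 8)        # four concatenated N = 8 modality embeddings
Z = modality_fusion(X, 0.5, -0.3, 1.1)
assert Z.shape == X.shape
```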

Classification Block:
The classification block predicts the bag label (CWS vs. CWNS from the 20s data), taking the modality-invariant representation Z as input. Our implementation uses two fully connected linear layers followed by a sigmoid activation as the classification block.

EXPERIMENTS AND EVALUATION OF MI-MIL
This section evaluates the performance of our MI-MIL approach and its different components. We compared MI-MIL's performance with the state-of-the-art attention-based-MIL approach [35] and with DNN, CNN, LSTM, and LSTM-with-attention approaches. The architectural information for MI-MIL, attention-based MIL, and the mentioned baseline models is discussed in Appendices A.5.1, A.5.2 & A.5.3, respectively. The presented network parameter configurations were optimized by performing a grid search over the possible network parameters.
In the following, we present the dataset splits and evaluation metrics used for our evaluations. Then, section 4.1 discusses the models' performance on the research-question-wise tasks discussed in section 1.5 and the evaluation conclusions. Finally, section 4.2 evaluates the raw features' importance in addressing the research questions and the real-time executability of MI-MIL.

Dataset Split and Evaluation Metrics:
For each of the evaluations, we followed the person-disjoint hold-out method [15]. We divided each dataset into three person-disjoint train, validation, and test sets, randomly selecting an equal number of participants from the CWS and CWNS groups. The distribution was as follows: test set (all data from 3 CWS and 3 CWNS), validation set (all data from 3 CWS and 3 CWNS), and training set (the rest of the data). The same training, validation, and test set distributions were used for all evaluations of a dataset. To reduce contingency and avoid overfitting, classifiers were trained (on the training+validation set) and evaluated (on the test set) three times with different seed values, and the average results are reported in this paper. Evaluation results are presented with the metrics: recall, precision, accuracy, F1-score, and specificity. The MI-MIL approach achieves F1-scores of 0.90 and 0.80 (for raw and change-score features, respectively), establishing that CWS exhibit identifiable unique physiological 'attribute' (raw features) and 'fluctuation-from-neutral-condition' (change-score features) patterns compared to CWNS while exposed to external stressors.
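The person-disjoint split described above can be sketched as follows; the participant IDs and the total count of 35 per group are illustrative, while the 3+3 test and validation participants match the stated protocol:

```python
import random

def person_disjoint_split(cws_ids, cwns_ids, n_test=3, n_val=3, seed=0):
    """Person-disjoint hold-out: whole participants (never individual
    windows) are assigned to splits, with equal CWS/CWNS counts in the
    test and validation sets."""
    rng = random.Random(seed)
    cws = rng.sample(cws_ids, len(cws_ids))
    cwns = rng.sample(cwns_ids, len(cwns_ids))
    test = cws[:n_test] + cwns[:n_test]
    val = cws[n_test:n_test + n_val] + cwns[n_test:n_test + n_val]
    train = cws[n_test + n_val:] + cwns[n_test + n_val:]
    return train, val, test

cws_ids = [f"CWS{i}" for i in range(35)]      # illustrative participant IDs
cwns_ids = [f"CWNS{i}" for i in range(35)]
train_set, val_set, test_set = person_disjoint_split(cws_ids, cwns_ids)
assert len(test_set) == 6 and len(val_set) == 6
assert not (set(test_set) | set(val_set)) & set(train_set)
```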

Investigating the Research Questions through Classification Evaluation
Notably, MI-MIL outperforms all the baseline models. Specifically, in table 1b, the lower performance of attention-based-MIL indicates that the CWS-distinctive sparse situational fluctuation-from-neutral-condition patterns are subtle and more disjoint across different modalities compared to the physiological attribute patterns. However, MI-MIL's modality-specific embedding blocks identify the patterns effectively, resulting in higher performance. Supervised learning baseline approaches perform relatively poorly since they fail to optimize in the absence of fine-grain data annotations, confirming the need for a weakly supervised MIL paradigm for effective classification.

Evaluation of Q2: Differentiating the CWS vs. CWNS from Scripted 'window-2'
Signal: To gain insights into research question 2, we evaluated the models on differentiating CWS vs. CWNS from 'window-2' of the scripted dataset. In 'window-2' of the scripted dataset, participants repeat the predetermined BBAP phrase. Hence, this window comprises the participants' physiological responses while talking under stressful conditions.
Our evaluation results using raw and change-score features are shown in tables 2a & 2b.
The MI-MIL approach achieves F1-scores of 0.89 and 0.83 for raw and change-score features, respectively, establishing that CWS exhibit unique physiological attributes (i.e., amplitudes) and fluctuation-from-neutral-condition patterns compared to CWNS while talking in stressful conditions. Comparing tables 1b & 2b demonstrates that CWS's unique physiological parameter fluctuation patterns are more explicit and identifiable while talking in stressful conditions. MI-MIL significantly outperforms the supervised learning baselines (i.e., CNN, DNN, LSTM, LSTM with attention) and the MIL baseline (attention-based-MIL), demonstrating the presence of sparse and modality-specific disjoint physiological response patterns in CWS while they speak under stressful conditions.

Evaluation of Q3: Differentiating the CWS vs. CWNS from Scripted Dataset Baseline Signal:
This evaluation addresses research question 3. It is important to understand whether CWS and CWNS show differences in physiology during their neutral affective state, or baseline condition. If so, the use of raw physiological features for classification may provide erroneous insights. For example, if CWS have higher physiological parameter values than CWNS in the baseline/neutral condition, identifying that CWS and CWNS show similar parameter values (i.e., low classification accuracy) in a challenging situation would not demonstrate that they have similar physiological responses. Instead, it may indicate that the CWNSs' parameter fluctuation is higher (i.e., a stronger physiological response) than the CWSs'.
Hence, we developed and evaluated models to differentiate CWS vs. CWNS from the 20s baseline signals of the scripted dataset. The evaluation results are shown in table 3. All models take raw physiological features as input and achieve low F1-scores (0.56-0.69). These results demonstrate that the CWS and CWNS exhibit similar physiological parameters during neutral/baseline conditions, though some subtle differentiating patterns exist.
Such findings justify our evaluation of the models using raw and change-score features separately. Models with raw features give us insights into the physiological parameters value differences. Models with change-score features give us insights into the fluctuations in physiology that represent stronger or weaker responses (i.e., higher or lower fluctuations).

Evaluation of Q4: Differentiating the CWS vs. CWNS from Free-Speech
Signal: Though no external stressors were imposed during the free-speech experiment, the narration task is linguistically and cognitively demanding. It may elicit different physiological responses in CWS vs. CWNS. To investigate such differences, we evaluated the models on differentiating CWS vs. CWNS from the free-speech dataset. Our evaluation results using raw and change-score features are shown in tables 4a & 4b.
The MI-MIL approach achieves F1-scores of 0.73 and 0.67 for raw and change-score features, respectively, establishing that CWS exhibit identifiable unique physiological responses compared to CWNS while performing a linguistically and cognitively demanding task (i.e., narration). The MI-MIL approach outperforms all the baseline models, demonstrating its higher efficacy in identifying subtle, sparse, and modality-disjoint patterns.
Notably, no predetermined phrases or sequences of phonemes were uttered during the spontaneous narration task. Hence, differentiating CWS vs. CWNS was more challenging than in the scripted dataset's 'BBAP' phrase repetition task. Therefore, this section's lower evaluation accuracy (table 4) compared to tables 1 & 2 does not indicate that CWS exhibit a more explicit or stronger physiological response difference from CWNS in the stressful talking task than in the narration task.

Feature Selection:
This section evaluates the discriminative capability of the raw physiological features for the CWNS vs. CWS classification task. Notably, recent literature [9,58,66,71,74] has established that HR, EDA, RSP-rate, and RSP-amp are effective features for detecting affective states and stuttering individuals' physiological arousal. Hence, this section aims to identify the features' relative importance rather than perform feature selection. Following prior studies [22,29,82,93,94], we employ ridge regression. Ridge regression adds the 'squared magnitude' of the coefficients as a penalty term to the loss function, thereby strongly shrinking the coefficients of less important features. Evaluations are done on the scripted dataset for a fair comparison, where all participants experienced a similar condition. Table 5a presents the feature rankings. EDA features are the most important, followed by RSP-rate, HR, and RSP-amp features. These results are in line with our ML interpretability evaluations.
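A minimal sketch of this ranking procedure, using the closed-form ridge solution on synthetic data (the feature count of 24 matches the paper; the data and the dominant feature are illustrative):

```python
import numpy as np

def ridge_feature_ranking(X, y, alpha=1.0):
    """Rank standardized features by the magnitude of their ridge
    coefficients: w = (X^T X + alpha*I)^{-1} X^T y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize for comparability
    d = Xs.shape[1]
    w = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(d), Xs.T @ y)
    return np.argsort(-np.abs(w)), w            # most -> least important

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 24))                  # 24 features, as in the paper
y = 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)  # synthetic: feature 3 dominates
rank, w = ridge_feature_ranking(X, y)
assert rank[0] == 3
```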
We also evaluate the performance of our baseline attention-MIL model utilizing the top-k features from the ranking, and the results are presented in table 5b. Notably, since only a few modalities were represented in the top-5 and top-10 feature sets, MI-MIL was not utilized in this evaluation. According to our evaluation, adding more features results in higher performance. These evaluations indicate that our classifiers were not overfitting due to redundant features, which is expected since we have only 24 features, all of which the literature has shown to be effective.

Execution Time and Resource Usage on Scalable Devices:
We evaluated MI-MIL's real-time executability and resource usage on scalable mobile devices: an Nvidia Jetson Nano and a smartphone. We ran the MI-MIL models on consecutive 20s windows for 10 minutes and recorded the running time and resource usage. According to table 6, MI-MIL takes 0.019 sec and 0.005 sec on the Jetson Nano and the smartphone, respectively, to process each 20s physiological data window. The average CPU and GPU usages are also low. The results suggest that MI-MIL can perform real-time analysis on resource-constrained devices.

INTERPRETATION OF MI-MIL'S INFERENCES
MI-MIL's inferences can be utilized to understand stuttering children's physiological arousal during speaking from group-wise and personalized perspectives. The binary ML classifiers developed in section 4.1 are black-box models. Understanding which physiological features (raw or change-score) are important for a model's inference is critical, since those features contribute most to differentiating CWSs' physiological responses from others'. We employed KernelSHAP, a model-agnostic interpretation framework that determines each physiological feature's (raw or change-score) impact, in terms of its Shapley value [67], on the respective model's inferences. Notably, distinct Shapley feature importance values are generated for each input, indicating each feature's impact on generating a class inference for that particular input (i.e., the personalized perspective).
Moreover, Shapley values are additive [51]; we therefore average the Shapley values per feature across the data to obtain the global importance. Global feature importance indicates how much a model relies on each feature at each timestamp overall (i.e., the group-wise perspective). We calculate the global importance for a model's true predictions by computing the mean of the generated Shapley values corresponding to the test set's 20s windows.
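The group-wise aggregation can be sketched as follows; the array sizes and the synthetic Shapley values are illustrative assumptions:

```python
import numpy as np

def global_importance(shap_values, correct_mask):
    """Mean per-feature Shapley value over the model's true predictions
    only. Since Shapley values are additive, averaging the signed values
    preserves direction (positive pushes toward one class, negative
    toward the other)."""
    return shap_values[correct_mask].mean(axis=0)   # (timesteps, features) grid

rng = np.random.default_rng(1)
shap_values = rng.normal(size=(50, 19, 24))  # windows x timesteps x features
correct = rng.random(50) > 0.2               # mask of true predictions
grid = global_importance(shap_values, correct)
assert grid.shape == (19, 24)
```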
We present the global feature importance for a research-question-specific MI-MIL model's inference through a grid-heatmap (with a cell for each feature at each timestamp, e.g., figure 5). In contrast, we present the personalized feature importance of each 20s physiological response data window (from a CWS or CWNS) through a time-series representation of the features with a heatmap over each 2s segment of the time-series (section 5.1.2).
The heatmaps use blue to show positive Shapley values (responsible for pushing the model toward CWNS) and red for negative Shapley values (responsible for pushing the model decision toward CWS); the darker the red/blue intensity, the higher the magnitude of the Shapley value and the greater the feature's importance in pushing the model toward the CWS/CWNS class.
Speech science studies' interest lies in understanding the second-by-second effect of children's physiological arousal on their speech production; hence, we are particularly interested in visualizing and understanding the importance of physiological features in differentiating CWS vs. CWNS while repeating the target phrase (BBAP) after viewing negatively-valenced (stress-provoking) pictures (discussed in section 5.1), and while describing pictures spontaneously during the free-speech condition (discussed in Appendix A.4).

Q2 Interpretation: MI-MIL Model to Differentiate CWS vs. CWNS While Speaking in Stress Condition (Scripted Dataset)
This section discusses and demonstrates how MI-MIL's inferences can be utilized and visualized to understand stuttering children's distinctive physiological responses while speaking under stressful conditions, at the group-wise (section 5.1.1) and personalized (section 5.1.2) levels. The ML interpretations are discussed in the following.

5.1.1 Group-wise Global Feature Importance: Figures 5a and 5b show the global feature importance through Shapley interpretation graphs of the MI-MIL models (section 4.1.2) evaluated on window-2. Each figure comprises two SHAP plots: a feature importance plot for true negative classifications (i.e., detecting CWS) and one for true positive classifications (i.e., detecting CWNS). Window-2 comprises children's physiology during the BBAP phrase repetition under the stressful condition, and the evaluation in section 4.1.2 indicates that CWS and CWNS show significant differences. This section visualizes and discusses the MI-MIL-identified group-wise differences between CWS and CWNS through the respective model's interpretation.

EDA Features Importance:
As shown in figure 5a, CWS experienced higher raw EDA feature values than CWNS, indicating that CWS experienced higher physiological arousal [9,58]. These raw EDA features are important in the CWS vs. CWNS classification, indicating that they are distinctive patterns. Psychophysiological research examining speaking-related sympathetic nervous system activity in CWS is limited to just a few studies. However, speaking-task- and age-related differences in skin conductance level have been reported [18,36,37,72,74,89,90]. This study is the first to evidence that CWS experience distinctively higher arousal while talking in stressful conditions than their non-stuttering peers. When interpreting our raw EDA findings, readers are reminded that this evaluation used raw EDA data (phasic skin conductance responses were not removed from the signal; similarly, the baseline EDA was not considered in this model).

HR Features Importance:
According to figure 5a, for CWS participants, the HR variance during the later part of the 20s window (timesteps 12-18) shows a sudden increase and has high importance toward CWS classification (indicated by the dark red cells). Research indicates that social-emotional challenges or feelings of anxiety elicit increased heart rate attributes in children and adults [45]. Hence, we interpret the discussed HR-variance pattern as a sign that, while talking under stressful conditions, CWS's arousal increases with time, and that this is a distinctive attribute between CWS and CWNS. This interpretation is in line with findings from the stuttering research literature. A recent statistical-analysis-based study [74] showed that CWS experienced on average higher HR attributes, hence higher emotional arousal, than CWNS. In contrast, our approach allows finding second-by-second temporal patterns in specific modalities, like HR, that are distinctive in CWS.

RSP Features Importance:
Regarding the raw respiratory effort data (per figure 5b), RSP-rate mean and variance contributed significantly to differentiating CWS vs. CWNS. Although the raw respiratory effort values were higher in CWS than in CWNS, the data for both groups are in line with speech breathing rates for preschool-age children [12]. Thus, our finding indicates the relatively faster speech breathing rate of CWS (compared to CWNS) as a distinctive respiratory effort pattern. Interestingly, CWSs' RSP feature values slightly decrease with time progression (along the Y-axis), indicating that, as time progresses, CWS's speech breathing rate decreases even under stressful conditions. Though it has been proposed that stuttering is associated with various airflow irregularities (e.g., [11]), to our knowledge this is the first study to examine respiratory effort in preschool-age children who stutter.

Change-score Features Importance:
We also evaluated the change-scores of the four physiological features (HR, EDA, RSP-amp, and RSP-rate). According to figure 5b, CWS showed higher fluctuations from baseline in the HR, EDA, RSP-amplitude, and RSP-rate features compared to CWNS. The HR and EDA change-score features have higher Shapley importance. Hence, CWSs' higher fluctuations in EDA and HR while talking under stressful conditions, compared to CWNS, are distinctive patterns, demonstrating that CWS experience higher arousal than their peers while speaking in stressful conditions. From this section's evaluation and interpretation, we can conclude that CWS showed distinctive temporal and overall-window-wide physiological parameter differences from their non-stuttering peers. This paper's presented approach can identify these patterns at a fine-grain, second-by-second level.

Personalized Interpretation:
On average, across CWS participants, the HR and EDA change-scores from window-2 are distinctive. A major contribution of this paper is its ability to extract and visualize personalized, fine-grain, second-by-second temporal physiological response patterns. Figures 6 and 7 demonstrate the visualization of personalized EDA and HR feature importance in 20s window-2 data from five different individuals. According to figure 6, participant 'CWS 1' has higher EDA, meaning they experience higher arousal [9,58]. This section's discussion demonstrates that our approach can effectively identify fine-grain, personalized, distinctive temporal patterns from CWS and CWNS individuals. Identifying such patterns would enable a personalized understanding of stuttering development and potential just-in-time personalized interventions to mitigate the physiological responses that may affect a child's speech.

STUDY OBSERVATIONS, IMPACT, AND LIMITATIONS
We developed a novel MI-MIL approach (section 3.2) that addresses the challenges present in differentiating CWS vs. CWNS's situational physiological arousal (section 1.4) from 'weakly labeled' data. MI-MIL's high efficacy in addressing all research questions indicates its effectiveness.
Our evaluation and visualization in section 5.1 demonstrate that the presented approach can effectively identify fine-grain, personalized, distinctive temporal patterns from CWS and CWNS, both group-wise and personalized for each 20s window. Notably, many of the patterns in CWS's physiological responses are investigated for the first time. Many of them conform to the existing speech science literature, demonstrating the reliability of our approach's visualizations. The following discusses the presented approach's impact, generalizability, and limitations.

Impact and Application of the Study:
Approximately 5 percent of all children go through a period of stuttering, and 1 percent suffer from long-term stuttering [2]. Speech production is a complex process that requires fast and precise coordination of respiration, voice production, lip, tongue, and jaw movements (among other speech articulators) while simultaneously processing cognitive-linguistic information. Social engagement, including regulating own emotions and responding appropriately to one's communicative partner, is also inherent to spoken communication. Naturally, speech production can be affected by the speaker's physiological arousal. Young children who have speech disorders, such as stuttering, are especially vulnerable to these influences.
The fast nature of speech production calls for fine-grain, second-by-second assessment of any physiological response parameter of interest that can influence speech characteristics. To attain this goal, the presented study offers a new way to examine CWS's physiological arousal data and has both group-level and personalized-level impacts on stuttering and individuals with stuttering.

Group-wise Impact:
This study's presented ML-based group-wise examination (discussed in section 5.1.1) of attributes and fluctuations in physiological arousal during speaking could inform our understanding of the role of physiological arousal in the development of stuttering and explain the origins of its situational variability, one of the key and unexplained features of this condition.

Personalized-level Impact:
Recent literature [6,23,39,43] suggests that CWS (compared with fluent peers) have increased difficulty in the regulation and/or adaptation of their behavioral and emotional responses to everyday scenarios, which leads to increased emotional reactivity to stressful stimuli. This paper's evaluation not only confirms this; the presented approach can also identify the fine-grain, second-by-second, temporal, and personalized distinctive physiological response patterns of each CWS. It is important to note that the physiological sensing parameters evaluated in this study are present in recent wearables (for example, the same Biopac sensors utilized in this study (section 2) are present in the Biopac wireless wearable system [1]); hence, the MI-MIL models can be implemented and evaluated on wearables. According to our evaluation in section 4.2.2, the developed MI-MIL models are real-time executable on smart devices (e.g., smartphones). Hence, the presented approach has the potential to be leveraged for remote, continuous, automated, and real-time assessment of stuttering children's physiological responses. Such assessments can be used clinically to facilitate providing just-in-time emotion regulation strategies (e.g., biofeedback, mindfulness interventions [59]) that may lead to improvements in speech fluency.

Generalizability of MI-MIL:
It is important to note that the MI-MIL approach is not limited to CWS's physiological responses. To demonstrate its generalizability, we evaluated the MI-MIL model on the benchmark WESAD dataset [65], which contains physiological response parameters (RSP, ECG, EDA) from 15 participants for baseline and stress conditions. Notably, the WESAD dataset is not weakly labeled; it has fine-grain annotations. Still, the MI-MIL approach outperformed all the baselines (DNN, CNN, LSTM, and attention-based MIL models) by achieving an F1-score of 0.92 (detailed in Appendix A.3), which shows its generalizability, robustness, and applicability in different domains.

Study Limitations:
The study's limitation is that we analyzed data from only two conditions. Future work would benefit from sampling data from a broader range of situations to determine the models' predictive validity boundaries. Additionally, in the future, our approach can be implemented and evaluated on wearables for the longitudinal understanding of CWS's situational physiological arousal. Notably, this study's scope does not include an evaluation of CWS's speech. Multi-modal analysis of speech acoustics and physiological parameters can be a future research direction.

CONCLUSION
The presented first-of-its-kind study effectively identifies and visualizes the second-by-second temporal pattern differences in the physiological arousal of preschool-age CWS and CWNS while speaking perceptually fluently in two challenging conditions: speaking in stressful situations and narration. We collected physiological parameters data from 70 children in the two target conditions. However, our dataset, and differentiating CWS from CWNS by leveraging their physiological responses, poses several challenges (section 1.4), which we address by developing a novel MI-MIL approach. MI-MIL applies a multiple-instance-learning paradigm to each modality independently, while, through a cross-modality-fusion network, it effectively combines each modality's sparse, latent attributes. MI-MIL is real-time executable and generalizable to other domains. The evaluation of this classifier addresses four critical research questions that align with state-of-the-art speech science studies' interests. We further leverage SHAP classifier interpretations to visualize the salient, fine-grain physiological parameters unique to CWS. Finally, comprehensive evaluations of the presented framework and several baselines on multiple datasets identified notable insights on CWSs' physiological arousal during speech production.

ACKNOWLEDGMENTS
This work was supported in part by NSF IIS SCH # 2124285, NIH NIDCD # R21DC018103, and NIH NIDCD # R01DC017476 S2. We extend our sincere gratitude to the children and their caregivers whose participation made this project possible.

A.1 Presented Change-score Feature's Efficacy
This section compares the efficacy of our vector-distance-based change-score features (cosine similarity and Euclidean distance) with the conventional subtraction-based change-score features used in the literature [17,19,37]. The conventional change-score for a variable (Δf) at timestamp t is calculated as the difference between the variable's value at that timestamp (f_t) and its average in the baseline condition: Δf = f_t − f_baselineMean. Evaluation results using the conventional change-score features (Δf) of HR, EDA, RSP-amp, and RSP-rate are shown in Table 7, where performance is relatively poor. Compared to the conventional subtraction-based change-score, cosine similarity and Euclidean distance are effective measures of vector differences [26,44,76,80] (our features at each timestamp form vectors), capable of assessing subtle differences in multi-dimensional feature vectors and resulting in higher performance (Tables 1, 2, and 4).
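To make the distinction concrete, the sketch below (in Python, with hypothetical function names) computes the conventional subtraction-based change-score alongside the Euclidean and cosine vector change-scores. It is illustrative only, not the study's implementation.

```python
import math

def subtraction_change_score(f_t, f_baseline_mean):
    """Conventional change-score: per-variable scalar difference from the baseline mean."""
    return f_t - f_baseline_mean

def euclidean_change_score(v_t, v_baseline):
    """Magnitude of change between a timestamp's feature vector and the baseline vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_t, v_baseline)))

def cosine_change_score(v_t, v_baseline):
    """Directional similarity between the two feature vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(v_t, v_baseline))
    norm = math.sqrt(sum(a * a for a in v_t)) * math.sqrt(sum(b * b for b in v_baseline))
    return dot / norm
```

Whereas Δf collapses each variable to one scalar, the two vector measures compare the full multi-dimensional feature vector at each timestamp against the baseline vector, capturing both magnitude and direction of change.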

A.2 Feature Indexes Used in Shapley Interpretability Visualization
Table 8 lists the features and their respective indices.

A.3 Extension of MI-MIL to Other Physiological Datasets
The WESAD dataset [65] contains physiological parameter data (RSP, ECG, EDA) from 15 participants (mean age: 27.5 years) in baseline and stress conditions, collected using a chest-worn RespiBAN device [65]. During the stress condition, the participants performed the Trier Social Stress Test (TSST) [41]. HR and respiration features (RSP-rate and RSP-amp) were extracted from the raw ECG and respiration effort signals. The same set of 24 statistical quantifiers (similar to the ones discussed in section 2.3.1) was extracted from 20s overlapping windows (15s overlap). Table 9 shows the performance of the various models on the WESAD dataset. The MI-MIL approach outperformed the baseline DNN, CNN, LSTM, and Attention-based MIL models, achieving an F1 score of 0.92. Among the baseline models, Attention-MIL achieves the second-highest F1 score of 0.91; the other baseline models also achieve high accuracy (in the range of 0.85-0.90). These results also support the effectiveness of weakly supervised approaches, as both Attention-MIL and MI-MIL perform the classification task with high accuracy.
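As an illustration of this windowing scheme, a 20s window with 15s overlap corresponds to a 5s stride. The sketch below (hypothetical names; only four of the 24 statistical quantifiers are shown) extracts overlapping windows and per-window statistics.

```python
def sliding_windows(signal, fs, win_s=20, overlap_s=15):
    """Split a 1-D signal into overlapping windows (20 s long, 15 s overlap -> 5 s stride)."""
    win = int(win_s * fs)
    step = int((win_s - overlap_s) * fs)
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, step)]

def window_stats(window):
    """A few of the statistical quantifiers (mean, min, max, variance) per window."""
    n = len(window)
    mean = sum(window) / n
    var = sum((x - mean) ** 2 for x in window) / n
    return {"mean": mean, "min": min(window), "max": max(window), "var": var}
```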

HR Features Importance:
According to figure 8a, the HR features that differentiated CWS and CWNS during this task were HR minimum and variance. HR variance was significantly higher in CWS than in CWNS. Both were important features in differentiating CWS (red grids at X-axis indices 1 and 4) vs. CWNS (blue grids at X-axis indices 1 and 4). The other HR features did not differ between the groups according to the ML model and pushed the classification towards the non-stuttering class. Thus, we interpret these data to suggest that there were certain time windows in which CWS showed higher variation in heart rate, which the model recognized as important for the classification (dark red cells, indices 0-5).
RSP Features Importance: CWNS showed slightly lower RSP-amplitude maximum and significantly higher RSP-rate maximum (feature index 20) compared to CWS. The data for both groups are in line with speech breathing rates for preschool-age children [12]. The graph shows that the RSP-rate maximum contributes strongly to classification of the CWS class (red grids for feature 20). The difference in the magnitudes of the RSP-rate and RSP-amp features helps the classifier in CWNS vs. CWS classification (CWNS graphs have blue grids for features 12, 13, and 15, while CWS graphs have red grids for features 13, 14, 15, and 20). These results suggest that the RSP features contribute strongly to the classification task for the spontaneous free speech task.
Change-score Features Importance: Figure 8b shows the change-score features' importance plots. According to the figures, CWS showed smaller increases from baseline in almost all physiologic features, namely HR, EDA, RSP-amplitude, and RSP-rate, compared to CWNS. This suggests that CWNS experienced higher arousal during the free speech task than CWS. The change-score features were curated so the model could capture changes in the HR, EDA, RSP-amplitude, and RSP-rate feature vectors from baseline in terms of magnitude and direction. The Euclidean distance and cosine similarity values of the EDA, HR, and RSP-rate feature vectors were important for identifying CWS during the free speech task, but these features contributed differently at different time steps, as shown by the heatmap; these contributions are captured by the modality-specific embedding and cross-modality-fusion architecture of the MI-MIL model.
According to this section's MI-MIL model interpretations for the free speech dataset, we observe that, unlike the scripted dataset, where CWS experienced higher arousal than CWNS during the stress-provoking speech task, CWNS showed higher physiologic values than CWS during the spontaneous speaking task. Lastly, both CWS and CWNS experienced higher arousal during the scripted stress-provoking speech task than during the free speech task.

A.5 MIL
This section discusses the permutation-invariance property: the order of the instances in a bag does not affect the bag's predicted probability. This assumption further motivates the use of MIL, since the salient CWS affective-state-indicative signal patterns can be independent and sparse.

MIL Decomposition:
The MIL model predicts the bag label using equation 3:
θ(X) = g(σ({f(x) : x ∈ X}))
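This decomposition and its permutation invariance can be illustrated with a toy sketch; f, σ, and g below are deliberately simple stand-ins (a linear score, max pooling, and a threshold), not the trained networks of the paper.

```python
def f(x):
    """Instance-level transformation (toy: a scalar score per instance)."""
    return 2.0 * x + 1.0

def sigma(scores):
    """Symmetric MIL pooling (max): the order of instances does not matter."""
    return max(scores)

def g(pooled):
    """Bag-level classifier (toy threshold on the pooled score)."""
    return 1 if pooled > 0.5 else 0

def theta(bag):
    """theta(X) = g(sigma({f(x) : x in X})), as in equation 3."""
    return g(sigma([f(x) for x in bag]))
```

Because σ is symmetric, shuffling the instances in a bag leaves θ(X) unchanged, which is exactly the permutation-invariance property discussed above.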

Instance-Based MIL Approach:
This is the most common form of MIL evaluated in the literature. A transformation function f takes each instance x_j as input and returns a one-dimensional instance-level score (i.e., instance label e_j = y_j). The individual instance-level scores are then aggregated (through pooling) to obtain the bag label Y. Max or mean operators are generally used as the pooling function.
These two functions are symmetric and do not violate the permutation-invariance assumption of the MIL approach. Studies have also used other functions, such as convex maximum [38,60], as the pooling function σ.

Embedding-Based MIL Approach:
In this approach, a transformation function f maps the instances x_j, j = 1, …, k, to lower m-dimensional embeddings e_j, where m is a hyper-parameter. A MIL pooling σ takes the embeddings and generates a bag representation z that is independent of the number of instances in the bag. A classifier function g further processes the bag representation to infer the bag label Y. This approach is arbitrarily flexible and can be trained by backpropagation; the only constraint is that the function σ must be differentiable [35,81].
Previous studies have shown that embedding-based MIL approaches achieve better classification performance [81]. Since the individual instance labels are unknown, the transformation function f may be trained insufficiently and introduce error into the bag-level class prediction. The embedding-based approaches generate a joint representation of a bag from its instances; hence they do not introduce any additional bias into the bag-level classification [35,81]. The classifier block consists of linear layers [64, 1] with ReLU activation between the first two layers and a sigmoid activation function for the last layer. Figure 9 illustrates the layer-wise structure of the different blocks.
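As a minimal illustration of the embedding-based pipeline (f → σ → g), the sketch below uses toy stand-ins: a hand-crafted embedding, mean pooling, and a fixed linear classifier. None of these correspond to the paper's trained layers.

```python
def embed(x, m=3):
    """f: map an instance (a scalar here, for illustration) to an m-dimensional embedding."""
    return [x, x ** 2, 1.0][:m]

def mean_pool(embeddings):
    """sigma: symmetric pooling yielding a bag representation z independent of bag size."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def classify(z, w=(1.0, 0.5, -0.2)):
    """g: toy linear classifier on the bag representation (fixed weights)."""
    score = sum(wi * zi for wi, zi in zip(w, z))
    return 1 if score > 0 else 0

def bag_label(bag):
    """Full embedding-based MIL pipeline: Y = g(sigma({f(x) : x in X}))."""
    return classify(mean_pool([embed(x) for x in bag]))
```

Note that the bag representation has the same dimensionality regardless of how many instances the bag contains, which is what lets g be trained jointly with f by backpropagation.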
We extract 19 × 8 change-score physiological features from each 20s signal: there are 19 instances, and each instance (a 2s segment) is represented by a 1 × 8-dimensional change-score vector. The architecture of the change-score MI-MIL model is the same as for the scripted dataset, apart from the dropout layer (20% dropout) in the HR and RSP-rate modality embedding blocks.

A.5.2 Attention-MIL Implementation:
The attention-MIL architecture for the raw physiological features evaluated on the scripted dataset is as follows: the embedding block consists of a linear layer [200], a 1-D convolution layer with 20 kernels of size 2×1, and another linear layer [64], with a ReLU activation function in each layer. A flatten layer makes the output two-dimensional, followed by a batch normalization layer and a linear layer [128] with ReLU activation. The output of the embedding block is fed into the attention block, which consists of two linear layers [64, 1] separated by a tanh function. The classification block consists of two linear layers [64, 1] with ReLU and sigmoid activation functions.
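The attention block described above can be sketched as two linear maps separated by a tanh, followed by a softmax over instances; the weight matrices in this sketch are illustrative placeholders, not trained parameters.

```python
import math

def attention_weights(embeddings, w_hidden, w_out):
    """Score each instance with two linear layers separated by tanh, then softmax."""
    scores = []
    for e in embeddings:
        hidden = [math.tanh(sum(wi * ei for wi, ei in zip(row, e))) for row in w_hidden]
        scores.append(sum(wo * h for wo, h in zip(w_out, hidden)))
    # Numerically stable softmax over the instance scores.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attention_pool(embeddings, weights):
    """Weighted sum of instance embeddings -> bag representation."""
    dim = len(embeddings[0])
    return [sum(a * e[i] for a, e in zip(weights, embeddings)) for i in range(dim)]
```

Because the softmax weights sum to one, the pooled representation is a convex combination of the instance embeddings, and the learned weights indicate which instances the model attends to.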

[Figure: Layer-wise architectural information for MI-MIL.]
The attention-MIL architecture for the raw physiological features evaluated on the free-speech dataset is as follows: since detecting affective-state differences between CWS and CWNS from spontaneous narration is a harder task, and the free-speech data sample size is larger, the transformation/embedding block is more complex in this network. The remaining blocks are the same as in the raw-features model evaluated on the scripted dataset discussed above.

A.5.3 Baseline Model Implementation:
In each of the evaluations, we considered four supervised learning classifiers as baseline models: long short-term memory (LSTM), convolutional neural network (CNN), deep neural network (DNN), and LSTM with self-attention architectures. The baseline model implementations are discussed below. The LSTM model was evaluated to explore the sequential dependence among the physiological data; its implementation consists of a single-layered LSTM and a linear output layer with sigmoid activation.
The CNN model can extract complex sequential and global information from the data. It comprises 1-D convolution layers with kernel size 2×1 and a padding size of 1. A 20% dropout is used between the convolution layers, and ReLU is used as the activation function. The output is then flattened and passed through linear layers followed by 1-D batch normalization layers. Each intermediate layer uses ReLU activation, with sigmoid activation in the output layer.
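As a minimal pure-Python sketch of the 1-D convolution with kernel size 2 and padding 1 used in this baseline (deep-learning-style cross-correlation; the actual models would be built in a framework such as PyTorch):

```python
def conv1d(signal, kernel, padding=1):
    """1-D convolution (cross-correlation) with zero padding, kernel size 2, padding 1."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    k = len(kernel)
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(padded) - k + 1)]

def relu(xs):
    """ReLU activation applied element-wise."""
    return [max(0.0, x) for x in xs]
```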
The DNN model architecture consists of linear layers whose output is flattened and passed through a 1-D batch normalization layer. After batch normalization, the output is passed through a dense layer with a sigmoid activation function.
To assess the performance of supervised learning approaches within an attention framework, we evaluated an LSTM model with self-attention. The model architecture consists of a four-layer LSTM whose output is fed to a self-attention block (similar to the one discussed in 3.2.2) consisting of two linear layers separated by a tanh activation.
The attention block generates an attention weight for each instance from the embedding produced by the LSTM layers. The attention weights are then multiplied with the LSTM embeddings (self-attention), and the result is passed through a linear layer [128] and an output layer.

A.6 Overview of the Existing Deep Learning Literature on Stress Detection
Table 10 lists the existing deep learning work focused on stress detection.
[Figure: EDA and HR signals during negatively valenced picture viewing for a CWS participant. The yellow regions (region A) of the signals do not convey any arousal-indicative pattern; the orange regions (region B) show signal patterns indicative of a low-to-high arousal transition; and the red regions (regions C) show the high-arousal response.]
[Figure: Chronological order of events for one negatively-valenced picture viewing in the scripted dataset.]
[Figure: Shapley global feature-importance heatmaps for MI-MIL models evaluated on the (Window-2) scripted dataset. In each plot, the X-axis represents the features (24 raw physiological and 8 change-score features), and the Y-axis represents the 2s segments inside the 20s event detection window. Each cell in the heatmap contains the mean Shapley value of the respective feature in that 2s segment. The color bars to the right of the figures show the color scheme according to the Shapley value's magnitude. The raw and change-score features, with their corresponding indexes on the generated Shapley feature-importance plots, are shown in Appendix A.2, table 8.]
[Figure: Participant-wise Shapley importance plots for EDA change-score Euclidean distance features (from one window-2).]
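The global feature-importance heatmaps described above place the mean Shapley value of each feature in each 2s segment into one heatmap cell. The sketch below shows only that aggregation step (hypothetical function name; the actual per-sample Shapley values come from the SHAP library).

```python
def mean_shap_heatmap(shap_values):
    """Average per-sample Shapley values cell-wise into one (segments x features) heatmap.

    shap_values: list of per-sample matrices, each shaped [n_segments][n_features].
    """
    n = len(shap_values)
    segs = len(shap_values[0])
    feats = len(shap_values[0][0])
    return [[sum(s[i][j] for s in shap_values) / n for j in range(feats)]
            for i in range(segs)]
```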