Computational Modeling of Collaborative Discourse to Enable Feedback and Reflection in Middle School Classrooms

Collaboration analytics has the potential to empower teachers and students with valuable insights to facilitate more meaningful and engaging collaborative learning experiences. Towards this end, we developed computational models of student speech during small group work, identifying instances of uplifting behavior related to three Community Agreements: community building, moving thinking forward, and being respectful. Pre-trained RoBERTa language models were fine-tuned and evaluated on human annotated data (N = 9,607 student utterances from 100 unique 5-minute classroom recordings). The models achieved moderate accuracies (AUROCs between 0.67-0.84) and were robust to speech recognition errors. Preliminary generalizability studies indicated that the models generalized well to two other domains (transfer ratios between 0.46-0.85; with 1.0 indicating perfect transfer). We also developed four approaches to provide qualitative feedback in the form of noticings (i.e., specific exemplars) of positive instances of the Community Agreements, finding moderate alignment with human ratings. This research contributes to the computational modeling of the relationship dimension of collaboration from noisy classroom data, selection of positive examples for qualitative feedback, and towards the empowerment of teachers to support diverse learners during collaborative learning.


INTRODUCTION
Small group collaborative learning is becoming a hallmark of 21st century K-12 pedagogy, owing to its effectiveness in nurturing skills such as critical thinking, problem-solving, and social interaction in addition to domain knowledge and disciplinary practices [16,24,35]. However, managing effective collaborative learning experiences can be challenging for teachers as they also need to monitor progress, provide guidance on learning activities, and support groups in productive, knowledge-building conversations [38,50]. Past research has highlighted a persistent deficit in students' collaboration skills [14,16,18,51], and students often report demotivation and frustration with collaborative learning activities when they feel their peers' social conduct during activities is disruptive [5]. These issues have been exacerbated by the long period of remote learning in the wake of the COVID-19 pandemic, where students were required to engage in independent study, leading to challenges related to self-regulation, including low motivation and ineffective communication with peers, ultimately leaving students less prepared for collaborative learning [13,36].
The fields of Computer-Supported Collaborative Learning and Collaborative Learning Analytics have proposed several technological solutions to serve as resources for teachers and learners to address some of these issues through capturing, analyzing, and visualizing insights from group interactions [24,29,40]. Yet, a recent review has highlighted several issues with existing approaches to collaborative analytics relating to features, scope, and presentation [26]. "Collaborative learning dashboards," they posit, "should be designed not to simply show a group's interactive behavior, but rather to inform and motivate future decisions" (p. 177).
Taking a step in this direction, the present paper focuses on the development of computational models of student discourse to enable collaborative analytics technologies for teachers, students, and researchers. Our focus on discourse was motivated by the fact that sharing ideas and building off of others' is the hallmark of collaboration [37,53]. Accordingly, language emerges as a significant indicator of many aspects of collaborative learning [2,16]. Prior research has leveraged natural language processing (NLP) techniques to assess group communication to identify indicators of Collaborative Problem Solving (CPS) proficiency, showcasing their ability to accurately detect such skills [20,32,33,39,44,46]. NLP models can then be used to provide feedback, scaffolding skill development during collaboration [9].
Whereas prior work has largely focused on the task dimension of collaboration, emphasizing behaviors that promote success on the task at hand, we were interested in measuring and supporting the relationship dimension of collaboration as a valued measure in its own right [14,16,22]. Accordingly, a goal of our work was to support students in developing skills to have more accountable and uplifting interactions with one another. Based on extensive co-design with youth [7], we are developing technology that automatically identifies and visualizes expressions of socially uplifting discourse in student group speech across three dimensions, which we call Community Agreements (CAs): community building, moving thinking forward, and being respectful. A major component was the analytics needed to model CAs from real-world classroom speech, as our system is intended for use in situations where multiple groups are simultaneously interacting. Another important aspect was model generalization, which poses a substantial challenge in the deployment of models within educational environments. Collaboration inherently spans many contexts, and the availability of pre-existing data in new contexts is uncertain. Finally, integral to the system was the extraction of noticings (i.e., exemplars) of CAs from student speech. We envisioned noticings to be presented as feedback to teachers and students, providing non-evaluative, qualitative instances of affirmative discourse to inspire discussion and bolster transparency of the underlying analytics. By exploring this novel type of feedback, we sought to help students understand how their own community-building talk manifests in collaborative learning settings.

Related Work
Collaboration entails telling and doing, implicating verbal, paraverbal, and nonverbal modalities [30,44]. While others have investigated the use of nonverbal signals like eye gaze, facial expression, body movement, and acoustic-prosodic features of speech to model aspects of collaboration [45,52], we focus on linguistic approaches. Considerable research has analyzed collaborative discourse via the application of advanced NLP techniques. This has been operationalized by obtaining language data (e.g., from text chats or transcribed spoken interactions) to model CPS skills such as negotiation, information sharing, regulation, and argumentation [12,20,39,44,46]. Insights have been used to understand emerging social networks and collaborative patterns [31] and to predict outcomes like learning improvement [41] and task performance [15]. Prevalent NLP methodologies have involved using words, phrases, or part-of-speech tags as features in classification; however, recently there has been a growing trend in utilizing pre-trained neural networks, with several studies showing their effectiveness with collaborative discourse [25,33].
We focus on two key computational issues: model generalization and speech recognition in real-world settings. While generalization, or the ability of models to transfer knowledge from domains in which they were trained to new ones, is one of the primary desiderata of NLP, few collaboration analytics studies have addressed this [8,32]. Many focus on a single domain or classroom curriculum, leading to models with highly specific knowledge, often lacking the high-level representations needed for broader learning and applications. Extending the scope of collaborative analytics research to encompass diverse contexts and real-world educational settings would enhance their relevance. However, analyzing student speech engenders significant computational challenges. Automated Speech Recognition (ASR) accuracy is substantially lower for children's speech than adults' [42], and typical signal-to-noise ratios in the classroom can range from −7 dB to +5 dB [23], further impeding data quality. Thus, pertinent questions are the extent to which ASR errors cascade to affect downstream classification and how to increase robustness to noisy speech [4].
Beyond computational modeling, studies have investigated tools for providing feedback for classroom collaboration skills. One example, CPSCoach, employed ASR and NLP to offer college students personalized feedback on their CPS skills during video conferencing sessions, supplemented by learning materials for skill enhancement [47]. A similar system, IneqDetect, visualizes individual speaking time to help students reflect on team communication [27]. Feedback solely rooted in speaking time may overlook contributions that advance the team's goals in different ways. In a study conducted to evaluate the impact of feedback on students' collaboration skills, findings indicated that although there was no significant improvement in collaboration skills, it was recommended that feedback be prompt, include important metrics, and offer explanations, along with personalized guidance on enhancing collaboration [10]. We found feedback in the form of model noticings to be an underrepresented approach. AI explainability techniques such as saliency maps, attention mechanisms, and tools like LIME or SHAP [28] are often employed to visualize and explain the model's decision-making process to end users; however, in the context of a classroom, providing concrete examples may allow students to better grasp and trust the model's behavior.

Current Study
We focused on the development of CA coding schemes, training and evaluation of NLP models, and selection of noticings for feedback. Specifically, we sought to answer the following research questions: (1) What are effective indicators of Community Agreements in small group collaborative learning discourse? (2) How can ASR challenges in noisy classroom environments be addressed for classifying Community Agreements? (3) To what extent do the models generalize to new curricula without incorporating data from the target contexts? (4) What are the most effective ways to identify noticings of student discourse for qualitative feedback? To address (1), we employed a multi-faceted approach to developing a robust CA coding scheme, aligning indicators from a validated CPS framework to CAs and applying this mapping to classroom discourse. For (2), we fine-tuned RoBERTa language models on noisy classroom data, involving both ASR and human transcripts. For (3), we evaluated our models on labeled data from small group work from an educational physics game and a block programming game. Finally, for (4), we implemented four computational strategies for identifying and ranking examples of positive instances of CAs in student speech. We collected human ratings on the usefulness and validity of a subset of these noticings. Together, we make progress in advancing collaborative discourse analytics to facilitate formative feedback in real-world classrooms.
Key novel aspects of our work in light of prior research include fully-automated modeling of Community Agreements in child speech from noisy classroom environments, and the development of methods for selecting noticings to promote reflection and transparency.

DATA
Approval for all procedures was obtained from the designated Institutional Research Boards, and analyzed data involved students who provided their assent and whose parents or legal guardians provided consent.

Data Collection
Data was collected from urban, rural, and suburban public middle school classrooms in Colorado, USA during the preceding two years. Students participated in small group work during a curriculum unit called "Sensor Immersion", where they programmed and wired environmental sensors to collect data about their surroundings. The curriculum revolved around an interactive display called the Data Sensor Hub (DaSH), which was a central point of exploration for the students [6]. Students explored the system, constructed scientific models, and acquired the skills to replicate its functionality in the scope of their own investigations, involving authentic debugging and engineering practices as they relate to sensor technology and pair programming within group settings [11].
A tabletop omnidirectional microphone (Yeti Blue) was placed at each group table during recorded Sensor Immersion sessions. This microphone was selected after considering audio quality, affordability, power source, form factor, and ease of use. Due to the reliance on a single omnidirectional microphone to capture the conversations of multiple students, the collected data was inherently noisy [4,42]. Video data was also collected with an iPad camera.
Within each recording, we identified five 5-minute segments from the group work portion of the lesson, typically confined to the middle of the recording, as the initial and final portions tended to have less relevant discussion. We systematically listened to each segment. If it met a 20-word threshold, it was included as a sample; otherwise, the next segment was evaluated, and so on. If none of the segments met the criteria, the recording was excluded from analyses.
The dataset consisted of 100 5-minute excerpts of small group work collected from 164 unique students (73 dyads, 7 triads, and 6 tetrads) under the guidance of 14 teachers. Demographic information from individual students was not collected; however, the demographic composition of the school districts indicated that the sample was diverse.

Human and Automated Speech Recognition Transcription
Recordings were transcribed manually by three humans, resulting in a total of 16,515 transcribed utterances. In cases where speech was too noisy, transcribers denoted some or all of the utterance as "[inaudible]". Human transcriptions included notes such as laughter, singing, or crosstalk (e.g., "[laughter]", "[singing]", "[crosstalk]"), as well as who the student was addressing (e.g., "[addressing group]", "[addressing other]"). Individual utterances were automatically extracted from recordings using the human annotated timestamps and transcribed with Whisper, a state-of-the-art open-access ASR model trained on a substantial dataset of 680,000 hours of speech [34].
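We computed the word error rate (WER) of each utterance, defined as $\mathrm{WER} = (S + D + I) / N$, where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words in the ASR transcript relative to the human transcript, and $N$ is the number of words in the human transcript. The mean student WER was 68.9%, highlighting challenges of working with real classroom speech. There was high variability in WERs; however, analyses suggested that this can be improved by filtering on ASR confidence value. For example, focusing only on utterances with confidence values greater than the 40th percentile reduced the student WER to 32.6%.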
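As an illustration of the WER computation described above, the following minimal sketch computes a word-level edit distance between a human (reference) transcript and an ASR (hypothesis) transcript and divides it by the reference length. The function name and the whitespace tokenization are assumptions made for this example; the tooling actually used in the study is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only: tokenization is a plain whitespace split.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the WER of a short utterance can exceed 100% when the ASR output contains more words than the reference.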

Human Coding of CA Labels
Utterance-level CA Annotations. To assess CAs within student group conversations, we devised a novel mapping from CPS indicators (derived by [48,49]). This generalizable CPS coding framework consists of three facets of CPS established by literature (constructing shared knowledge, negotiation/coordination, and maintaining team functions) and 18 indicators observable in group conversations. The scheme was validated through empirical studies of triads across contexts including differences in participant age, co-locality and virtuality, and task type [48], and was further validated as a predictor of CPS performance both as individual indicators [49] and as temporal clusters of indicators [54].
We adapted the original CPS scheme to the present study wherein four coders, including one expert coder who helped develop the original scheme, iteratively annotated a subset of Sensor Immersion transcripts, noted points of disagreement, added clarifications and examples to the coding scheme, and discussed inconsistencies until consensus. After multiple training sessions, transcripts were divided among coders who individually coded each utterance for the presence of indicators (alongside the video for context). To ensure reliability, the expert coder reviewed each coded transcript. We then created a novel mapping of CPS indicators to CAs (Table 1), using definitions from OpenSciEd (a free curriculum from which the CA framework was derived) and by consulting with experts in OpenSciEd. As shown in the table, approximately 9% to 15% of the utterances contained a CA, and the labels are not mutually exclusive.
Recording-level Expert CA Ratings. For further validation, we adopted a high-level approach to CA rating, where two experts (education and language researchers with experience in observing classroom activity) applied subjective ratings of CAs at the recording-level. They were asked to "rate each video from 1-5 for each [CA] (or indicate if it was not scorable)." Inter-rater reliability as assessed by quadratic kappa was high for being respectful (0.90) and community building (0.72) but lower for moving thinking forward (0.34). Mean scores were 3.58, 3.14, and 3.47, respectively.
Utterance-level CPS mapped human annotations were averaged to the recording-level, and Spearman correlations with the expert ratings were ρ = 0.44 (moving thinking forward), ρ = 0.25 (community building), and ρ = 0.42 (being respectful). Together, this is a theoretical and methodological extension of the CPS framework to a novel construct.

Data Processing
Teacher and non-consenting student utterances, and those directed at teachers or students from other groups, were excluded. Human and ASR transcripts were normalized to ensure consistency and accuracy for future classroom use. Normalization involved the replacement of hyphens with spaces, removal of transcriber notes (e.g., "[inaudible]") and punctuation, and conversion to lowercase. After preprocessing, the final filtered dataset comprised 9,607 utterances. On average there were 2.3 students per recording (SD = 0.6, range = 2-4), 1.4 recordings per student (SD = 1.0, range = 1-5), 67.
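The normalization steps listed above can be sketched as follows. This is a minimal illustration, not the study's preprocessing code: the function name is hypothetical, transcriber notes are assumed to be any text enclosed in square brackets, "punctuation" is assumed to mean ASCII punctuation (so apostrophes are also dropped), and the final whitespace collapse is an added assumption.

```python
import re
import string


def normalize_transcript(text: str) -> str:
    """Normalize a human or ASR transcript (illustrative sketch)."""
    # Remove bracketed transcriber notes such as "[inaudible]" or "[laughter]".
    # This must happen before punctuation removal strips the brackets.
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Replace hyphens with spaces so "re-wired" becomes "re wired".
    text = text.replace("-", " ")
    # Remove remaining ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Convert to lowercase.
    text = text.lower()
    # Assumption: collapse the extra whitespace left behind by the removals.
    return " ".join(text.split())
```

Applying the same function to both transcript types keeps the human and ASR versions of an utterance directly comparable, for instance when computing WER.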

METHODS

Deep Transfer Learning with RoBERTa
We developed three computational models: one each for community building, moving thinking forward, and being respectful. The models were pre-trained RoBERTa language models, a variant of the BERT language model with a multilayer bidirectional transformer architecture. The RoBERTa models were individually fine-tuned on the filtered Sensor Immersion dataset annotated with binary labels for each CA. Fine-tuning pre-trained large language models is an NLP technique that allows adaptation of powerful domain-agnostic models to specific tasks. Utterances were tokenized with the RoBERTa tokenizer, which incorporates padding and truncation to ensure that all text sequences are of uniform length. The fine-tuning process comprised a batch size of 32, 50 training epochs, a learning rate of 5e-06, and 50 warmup steps. Hyperparameters were based on previous research on fine-tuning RoBERTa models for CPS prediction on a different dataset [32]. Minor adjustments were made, but we did not perform extensive hyperparameter tuning.
We used a stratified recording-level 10-fold cross-validation framework for model training and evaluation. The dataset was divided into 10 folds, where the proportions of positive samples from each CA class were approximately equivalent and utterances from single recordings were not split between folds. All experiments were conducted based on these initial cross-validation splits, ensuring consistency and reproducibility in our analyses. For each round of cross validation, train (8 folds), validate (1 fold), and test (1 fold) sets were created. The train set was used to fine-tune the models, generating checkpoints throughout the process. The best model checkpoint for a round was determined by testing the checkpoints on the validate data. The checkpoint with the best performance on the validate data was then used to test the held out test set and generate final predictions. Our primary evaluation metric was the area under the precision-recall curve (AUPRC), chosen for its robustness in handling class imbalances compared to metrics such as the area under the receiver operating characteristic curve (AUROC) [19]. The reported percent above chance represents AUPRCs adjusted to account for baseline occurrence variations. We present results testing on both human and ASR transcripts.
We also determined the extent that utterance-level human labels and model predictions, aggregated to the recording-level, agree with the subjective expert perception of CA usage per recording, which goes beyond language and incorporates video context. Accordingly, we correlated aggregated utterance-to-recording level human annotations and model predictions of the CAs with the recording-level expert rating (1-5 scale). Recording-level aggregations leverage the principle of aggregation to reduce noise for an overall estimate of CA prevalence per session. These correlations were performed on a subset of recordings (N = 31) that had recording-level expert judgments.
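One way to build recording-level folds of this kind is sketched below. It is a simplified illustration under stated assumptions, not the procedure used in the study: it balances positives for a single binary label with a greedy heuristic (the study balanced all three CAs), the function names are hypothetical, and choosing the fold after the test fold as the validation fold is an assumed rotation.

```python
from collections import defaultdict


def assign_recording_folds(utterances, n_folds=10):
    """Map each recording to one fold, roughly balancing positive labels.

    `utterances` is an iterable of (recording_id, label) pairs with label
    in {0, 1}. All utterances of a recording share the recording's fold.
    """
    positives = defaultdict(int)
    totals = defaultdict(int)
    for rec_id, label in utterances:
        positives[rec_id] += label
        totals[rec_id] += 1

    fold_pos = [0] * n_folds
    fold_tot = [0] * n_folds
    assignment = {}
    # Greedy heuristic: place recordings with the most positives first, each
    # into the fold that currently holds the fewest positives.
    ordered = sorted(totals, key=lambda r: (-positives[r], -totals[r], str(r)))
    for rec_id in ordered:
        fold = min(range(n_folds), key=lambda f: (fold_pos[f], fold_tot[f], f))
        assignment[rec_id] = fold
        fold_pos[fold] += positives[rec_id]
        fold_tot[fold] += totals[rec_id]
    return assignment


def cv_rounds(n_folds=10):
    """Yield (train_folds, validate_fold, test_fold) for each round."""
    for test_fold in range(n_folds):
        validate_fold = (test_fold + 1) % n_folds  # assumed rotation
        train_folds = [f for f in range(n_folds)
                       if f not in (test_fold, validate_fold)]
        yield train_folds, validate_fold, test_fold
```

Assigning whole recordings to folds is what prevents utterances from the same conversation from appearing in both the training and test sets of a round.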
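The aggregation and correlation step can be illustrated with the dependency-free sketch below. The function names are hypothetical, and Spearman's rank correlation is computed here as the Pearson correlation of average ranks (ties share the mean of their positions); in practice a statistics library would typically be used instead.

```python
from collections import defaultdict
from statistics import mean


def recording_level_means(utterance_scores):
    """Average utterance-level labels or predictions within each recording.

    `utterance_scores` is an iterable of (recording_id, score) pairs.
    """
    grouped = defaultdict(list)
    for rec_id, score in utterance_scores:
        grouped[rec_id].append(score)
    return {rec_id: mean(scores) for rec_id, scores in grouped.items()}


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

To compare with expert ratings, the per-recording means and the 1-5 ratings would be aligned by recording identifier before calling `spearman_rho`.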

Human and ASR-Augmented Training
While training and evaluating NLP models on human-generated transcripts provides a "gold-standard", it remains critical to consider how resilient models are to data collected in realistic conditions. Thus, in addition to training solely on human transcripts, following [4], we also incorporated ASR-augmentation. In this setting, for each human-transcribed utterance, we added its ASR-transcribed counterpart for training and testing. This new, effectively doubled, ASR-augmented dataset was shuffled within recording such that the human and ASR transcripts of a single student utterance were not consecutive in model training. The same stratified recording-level 10-fold cross-validation folds were retained between the approaches, as our emphasis was in evaluating the relative performance shift with the utilization of ASR-augmented training.
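A minimal sketch of the augmentation and within-recording shuffle is given below. The field names (`id`, `human`, `asr`, `label`), the function name, and the reshuffle-until-valid strategy are assumptions for illustration; the source does not describe how the non-consecutive constraint was enforced.

```python
import random


def augment_recording(utterances, seed=0, max_tries=1000):
    """Pair each human transcript with its ASR counterpart and shuffle.

    `utterances` is a list of dicts with keys 'id', 'human', 'asr', 'label'
    for a single recording. The returned list contains two examples per
    utterance, ordered so that the two versions of one utterance are never
    adjacent (when the recording has at least two utterances).
    """
    examples = []
    for utt in utterances:
        for source in ("human", "asr"):
            examples.append({
                "utt_id": utt["id"],
                "source": source,
                "text": utt[source],
                "label": utt["label"],  # both versions share the human label
            })
    if len(utterances) < 2:
        return examples  # the constraint cannot be met with one utterance

    rng = random.Random(seed)
    for _ in range(max_tries):
        rng.shuffle(examples)
        adjacent_pairs = zip(examples, examples[1:])
        if all(a["utt_id"] != b["utt_id"] for a, b in adjacent_pairs):
            return examples
    raise RuntimeError("no valid ordering found; increase max_tries")
```

Because both versions of an utterance stay within the same recording, the recording-level cross-validation folds remain valid for the doubled dataset.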

Generalization
We sought to determine the extent that the models fine-tuned on Sensor Immersion data transferred to new tasks and curricula, without the necessity of further training with data from the target context. Specifically, we evaluated the models on two additional data sets: (1) Physics Playground, an educational physics game, and (2) Minecraft Hour of Code, a block programming game. These datasets were collected as part of previous research on remote CPS [43]. These datasets involved remote collaboration among N = 288 university level students (average age = 22). The Physics dataset contained 46,679 utterances from 74 unique groups and the Minecraft dataset contained 10,976 utterances from 32 unique groups. Participants from both datasets self-reported gender as follows: 54% female, 41% male, 1% non-binary/third gender, and 4% did not report. Participants self-reported race as follows: 48% Caucasian, 25% Hispanic/Latino, 17% Asian, 3% Black or African American, 1% American Indian or Alaska Native, 3% Other, and 3% did not report. Both datasets were transcribed with IBM Watson, which provided an additional source of variability.
Following [32], the primary metric we used to evaluate the ability of the models to accurately predict CAs in new domains was the Transfer Ratio (TR): $\mathrm{TR} = (\mathrm{AUROC}_{\text{across}} - 0.50) / (\mathrm{AUROC}_{\text{within}} - 0.50)$. The TR measures the relative decline in a model's performance when training and evaluating on data from different tasks (across task evaluation), as compared to within the same task (within task evaluation). A TR value of 1 would signify perfect generalizability, with no decline in performance due to across-task evaluation. Both the numerator and denominator of the TR equation are adjusted by subtracting 0.50 to quantify performance difference over chance.
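As a small worked illustration of the TR, the sketch below divides the chance-adjusted across-task AUROC by the chance-adjusted within-task AUROC. The function name and the example values are hypothetical and are not results from this study.

```python
def transfer_ratio(across_task_auroc, within_task_auroc, chance=0.50):
    """Chance-adjusted across-task AUROC over chance-adjusted within-task AUROC."""
    if within_task_auroc <= chance:
        raise ValueError("within-task AUROC must exceed chance for TR to be meaningful")
    return (across_task_auroc - chance) / (within_task_auroc - chance)


# Hypothetical example: a model scoring 0.70 AUROC on a new task, where a
# model trained and tested on that task scores 0.80, retains about two
# thirds of the above-chance performance.
example_tr = transfer_ratio(0.70, 0.80)
```

Subtracting 0.50 from both terms means that a model performing at chance on the new task receives a TR of 0 rather than a misleadingly high ratio of raw AUROCs.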

CA Noticings
Selecting Noticings. Our models returned a probability between 0 and 1 for each utterance, with 0.5 as a threshold for positive predictions. Probabilities closer to the threshold were low confidence and those closer to 1.0 were high confidence positive predictions. Preliminary examinations indicated that it was insufficient to provide highly confident positive predictions as noticings, as they often lacked context and viability to serve as learning examples. For instance, versions of the phrase "yeah" tended to comprise the high confidence predictions for being respectful, as this was a highly typical exemplar of the indicator Responds to others' questions or ideas, but they do not serve as good reflection opportunities. More context and varied examples are needed for model transparency and to provide qualitative feedback.
As such, we explored four computational strategies for identifying and ranking student speech examples that demonstrate community building behaviors. These strategies included (1) rule-based, (2) semantic similarity to student co-negotiated classroom agreements, (3) semantic similarity to CA expert definitions, and (4) topic modeling. The co-negotiated agreements were collected from students in two middle school classrooms where teachers assisted students in categorizing their ideas for definitions of the CAs. A subset of the phrases used in both semantic similarity approaches, as well as pseudocode for the rule-based approach and the topics (with corresponding representative words) chosen from the topic modeling approach, are detailed in Table 2.
For the rule-based approach, initially all positive predictions for utterances comprising more than one word and an ASR confidence score greater than 0.5 were identified (the Spearman correlation between ASR confidence and WER was ρ = -0.59). The set of two word or larger positive predictions with sufficient ASR confidence was then divided into three separate lists based on their word count (utterances with more than five words, those with three to five words, and those with three words or fewer). Following this categorization, each list was individually sorted by prediction probability. Finally, these three sorted lists were concatenated together, arranged in descending word count order. This prioritized longer utterances, which inherently contain more contextual information.
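The rule-based ranking can be sketched as below. This is an illustration rather than the pseudocode from Table 2: the field names (`text`, `prob`, `asr_conf`) and the function name are hypothetical, the thresholds are treated as strict inequalities, and the shortest bin is assumed to hold utterances with fewer than three words so that the three bins do not overlap.

```python
def rank_rule_based(predictions, prob_threshold=0.5, asr_conf_threshold=0.5):
    """Filter positive predictions and rank them, longer utterances first.

    `predictions` is a list of dicts with keys 'text', 'prob' (model
    probability), and 'asr_conf' (ASR confidence score).
    """
    def word_count(pred):
        return len(pred["text"].split())

    # Keep multi-word positive predictions with sufficient ASR confidence.
    candidates = [
        p for p in predictions
        if p["prob"] > prob_threshold
        and word_count(p) > 1
        and p["asr_conf"] > asr_conf_threshold
    ]

    def length_bin(pred):
        n = word_count(pred)
        if n > 5:
            return 0  # more than five words
        if n >= 3:
            return 1  # three to five words
        return 2      # assumed: fewer than three words

    # Sorting by (bin, descending probability) is equivalent to sorting each
    # of the three lists by probability and concatenating them.
    return sorted(candidates, key=lambda p: (length_bin(p), -p["prob"]))
```

With this ordering, a one-word "yeah" is never surfaced regardless of its probability, and a longer utterance with a moderate probability outranks a short one with a high probability.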
Both semantic similarity techniques follow the same process, but differ in the set of phrases used to sort by. Starting with the same list of utterances as the rule-based approach, the utterances were projected into an embedding space using the BERT language model. Embeddings adhere to the distributional hypothesis [21], which posits that texts sharing similar meanings also possess similar representations, and are positioned in closer proximity within the embedding space. Consequently, the greater the cosine similarity between two phrases, the more semantically related they are. One version of this approach sorted noticings by semantic proximity to real co-negotiated classroom agreements, whereas the other sorted by proximity to expert definitions of the CAs. A cosine similarity matrix of student utterances to phrases of interest was created. Noticings were iteratively chosen from the matrix in order of highest cosine similarity to a phrase. A threshold of 0.80 was chosen such that utterances with cosine similarity to all phrases below 0.80 would not be considered. The algorithm started by considering all possible phrases, and as utterances were selected, the corresponding phrase was no longer considered in that iteration. We considered only phrases that had not yet been chosen so as to diversify the noticings rather than overselect from a single phrase. This continued until all utterances with cosine similarity greater than 0.80 to a phrase were chosen.
The topic modeling approach began with the utterances filtered by the rule-based approach. We then harnessed topic models to choose utterances that represent explicit topics of interest. Topic modeling is a classic NLP technique that identifies latent topics or themes within a collection of documents, enabling the discovery of underlying patterns and structures [3]. We trained three BERTopic models [17] on the set of all positive instances of each CA from the training data and empirically chose a number of topics per model that carefully balanced over-simplification with over-segmentation, ultimately aiming for one that best captured structure in the data. Once the topic models were created, we identified the topics that aligned with expert definitions and perceptions of the CAs, excluding those that were too broad or irrelevant. The selection and ranking of noticings was performed by filtering the utterances that were labeled as a topic of interest by a topic model, and this set was then sorted by NLP model prediction probability.
Human Ratings to Validate Noticings. The effectiveness and accuracy of these approaches were evaluated with human ratings. We recruited a total of 33 raters who completed secondary education or above through the decentralized Prolific survey platform [1] to read noticings and judge whether they are examples of community building, moving thinking forward, and being respectful. Each noticing received a total of 3 human ratings. Raters (Female = 17, Male = 16) were geographically diverse with an age range of 20-72 and indicated English as their primary and first language. Raters were compensated US$6.00, with a median completion time of 14 minutes and 40 seconds.
We randomly sampled 200 positive predictions (from human transcripts only to avoid the confound of ASR errors) for each CA and subsequently filtered and ranked them with each approach. We first computed correlations between the rankings given by each approach in order to verify that they were not associated with one another. There was considerable overlap in the selections of the two semantic similarity methods (Spearman correlation ρ = 0.75), so we proceeded only with the semantic similarity to classroom agreements. Correlations among the other methods were between ρ = -0.25 and ρ = 0.18, suggesting considerable variability.
The survey was split into three sections, one for each CA. Each category contained 20 utterances noticed by the methods, 5 random utterances that were not noticed by any method, as well as one attention check, totaling 78 items per participant. Items were individually presented with the following instructions: "Please indicate the extent to which the phrase below is an example of [Respect or Thinking or Community]" using a scale ranging from 1 (not at all) to 5 (extremely). The definition of each CA was accessible for the raters to view on each question. To control for quality, raters completed two different tests before they were considered eligible to participate: (1) a screener validation, consisting of questions replicated from the Prolific internal screening system in order to disqualify those with inconsistent responses and (2) a comprehension check, involving 12 items that tested comprehension of the CA definitions.
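One reading of the iterative selection described above is sketched below with toy vectors standing in for BERT embeddings. The function names are hypothetical, and the round structure (every phrase may contribute at most one utterance per round, after which all phrases become available again) is an interpretation of the description rather than a confirmed detail of the implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_noticings(utterance_vecs, phrase_vecs, threshold=0.80):
    """Return (utterance_index, phrase_index, similarity) in selection order."""
    if not phrase_vecs:
        return []
    sims = [[cosine(u, p) for p in phrase_vecs] for u in utterance_vecs]
    # Utterances below the threshold for every phrase are never considered.
    remaining = {i for i, row in enumerate(sims) if max(row) > threshold}
    selected = []
    while remaining:
        open_phrases = set(range(len(phrase_vecs)))  # start of a new round
        while open_phrases and remaining:
            candidates = [
                (sims[i][j], i, j)
                for i in remaining for j in open_phrases
                if sims[i][j] > threshold
            ]
            if not candidates:
                break
            # Highest similarity wins; ties go to the lower indices.
            sim, i, j = max(candidates, key=lambda c: (c[0], -c[1], -c[2]))
            selected.append((i, j, sim))
            remaining.remove(i)
            open_phrases.remove(j)  # phrase is closed for the rest of the round
    return selected
```

In this reading, an utterance that is the second-best match for a popular phrase can be ranked below the best match for a less popular phrase, which is the diversification effect the selection is meant to achieve.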

Accuracy of RoBERTa Models
Utterance-level Model Accuracy.The results of fine-tuning the RoBERTa models on human and ASR-augmented data are given in Table 3. Overall, models outperformed chance as evidenced by AUROC scores greater than 0.5 (mean = 0.72, SD = 0.07) and AUPRC scores exceeding the base rate (mean percent above base rate = 149%, SD = 87%).In general, testing on human transcripts provided an upper bound, and surpassed random chance by a substantial margin.As expected, we observed performance degradation when testing the models with the noisier ASR transcripts.Specifically, when training on human transcripts only, we found AUPRC decreases of -14.81%, -6.25%, and -38.71% between testing on human vs ASR transcripts for moving thinking forward, community building, and being respectful, respectively.
The incorporation of ASR-augmentation in fine-tuning yielded significant enhancements in testing with both human and ASR transcripts for all CAs.We found a notable improvement in the comparison between testing on human vs ASR transcripts.Specifically, we found AUPRC changes of -3.33%, +9.37%, and -7.69% between the test sets for moving thinking forward, community building, and being respectful, respectively.In addition to reducing the performance gap between testing on human vs ASR transcripts, we found that overall predictions were more accurate for both transcript types.The percent improvements of the AUPRC metric on human transcripts was 11.11%, 0.0%, and 25.8% and the change was greater for the Whisper transcripts, with AUPRC improvements of 26.09%, 16.67%, and 89.47%, for moving thinking forward, community building, and being respectful, respectively.ASR-augmentation proved to be the most beneficial for being respectful.
Focusing on the improved ASR-augmentation approach, we found fairly consistent results between the CA models. Testing on human transcripts yielded AUROCs between 0.77 and 0.84, and between 0.67 and 0.71 for ASR transcripts. ASR-augmentation introduced a degree of diversity into the training data, encompassing variations in speech patterns and errors often encountered in real-world, noisy classroom environments. We hypothesized that this diversity allowed our models to adapt better to the intricacies of student discourse, ultimately resulting in improved generalization to both transcript types. These combined effects highlighted the effectiveness of ASR-augmentation in refining the language model's capabilities for this task. While comparing our results with similar research (e.g., [33]) is complex due to different labels, base rates, and datasets, we found our results to fall within previously cited accuracy ranges.
Relationships with Expert Ratings. The correlations (computed with Spearman's rho) between recording-level expert ratings (on a 1-5 scale) of the CAs and three recording-level measures (human annotations, predictions from the model trained on human transcripts, and predictions from the ASR-augmented model, with utterance-level labels and predictions averaged to the recording level) were generally moderate, as shown in Table 4. We expected this, as individual perceptions of CAs will inevitably differ from a version mapped from CPS indicators. The ASR-augmented model correlated more strongly with the expert ratings than the model trained on human transcripts across the board, presumably because this model was more accurate. Surprisingly, it was also more strongly correlated than the ground-truth human annotations for moving thinking forward; correlations were on par for community building, and lower for being respectful. Overall, the correlations from the ASR-augmented model provide some confidence in the automated measurements.
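The aggregation and correlation steps described above can be sketched as follows: utterance-level scores are averaged within each recording, and the recording-level means are then rank-correlated with the expert ratings. This is a minimal illustration with toy values and hypothetical recording identifiers, not the authors' analysis code.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_recording(rows):
    """rows: iterable of (recording_id, utterance_score) -> {recording_id: mean score}."""
    groups = defaultdict(list)
    for recording_id, score in rows:
        groups[recording_id].append(score)
    return {rec: mean(vals) for rec, vals in groups.items()}


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy utterance-level predictions and recording-level expert ratings (1-5 scale).
utterance_preds = [("r1", 0.2), ("r1", 0.4), ("r2", 0.9), ("r2", 0.7),
                   ("r3", 0.5), ("r4", 0.1), ("r4", 0.3)]
expert_ratings = {"r1": 2, "r2": 5, "r3": 3, "r4": 3}

recording_scores = aggregate_by_recording(utterance_preds)
recordings = sorted(expert_ratings)
rho = spearman_rho([recording_scores[r] for r in recordings],
                   [expert_ratings[r] for r in recordings])
print(f"Spearman's rho = {rho:.3f}")
```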

Generalizability of Sensor Immersion Models
We found that the models trained on Sensor Immersion data generalize well to the Physics Playground and Minecraft domains. TR comprises the across-task AUROC (the AUROC of the model trained on Sensor Immersion and tested on the transfer datasets) and the within-task AUROC (the AUROC of the model trained and tested on the transfer dataset itself). The TRs suggest better generalization overall to Physics (mean TR = 0.75) and less so for Minecraft (mean TR = 0.58). We found good generalization for community building across both transfer tasks (TRs of .73 and .74), whereas being respectful generalized well for Physics (TR of .85) but not Minecraft (TR of .56). TRs for moving thinking forward (.56 and .67) were intermediate. It appears that the type of words commonly used in positive instances of community building had more overlap among the three datasets than in the other CAs, as community building is less related to the content of group work. With that said, the sensor-related content words in the training data, as opposed to the Physics and Minecraft-related content words in the transfer data, caused a lack of transfer, most noticeably for moving thinking forward. Error analysis confirmed that the models specifically suffered in instances with domain-specific verbiage. An example of a false negative due to domain shift (underlined) is, "okay so next time you want to start from the top so that it swings you can hit control right click and it will delete". Further work is necessary to investigate these shortcomings as well as to build robust models that can transfer to new domains with little to no human annotated data.
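To make the transfer ratio concrete, the sketch below computes a TR from an across-task and a within-task AUROC. The exact formula is not recoverable from this section, so both variants shown are assumptions: a plain ratio (`chance=0.0`) and a ratio of improvements over chance (`chance=0.5`). Either variant yields 1.0 when the transferred model matches the within-task model, consistent with 1.0 indicating perfect transfer. The AUROC values are hypothetical.

```python
def transfer_ratio(across_auroc, within_auroc, chance=0.0):
    """Ratio of across-task to within-task AUROC.

    chance=0.0 gives the plain ratio; chance=0.5 gives the ratio of
    improvements over a chance-level AUROC. Both forms are assumptions.
    """
    denominator = within_auroc - chance
    if denominator <= 0:
        raise ValueError("within-task AUROC must exceed the chance level")
    return (across_auroc - chance) / denominator


# Hypothetical AUROCs: a model trained on the source task and tested on the
# transfer task (across), versus one trained and tested on the transfer task (within).
across, within = 0.60, 0.80

plain_tr = transfer_ratio(across, within)
chance_adjusted_tr = transfer_ratio(across, within, chance=0.5)
print(f"plain TR = {plain_tr:.2f}, chance-adjusted TR = {chance_adjusted_tr:.2f}")
```

The two variants can differ considerably for the same pair of AUROCs, so any comparison of TRs across studies should first confirm which definition was used.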

Noticings User Study
Our analyses focused on the 423 highest-ranked utterances (rank of 4 or 5) selected by the rule-based (n = 211), semantic similarity (n = 202), and topic modeling (n = 103) approaches. Of these, 336 were selected by a single method, 81 by two, and 6 by all three. The ratings were averaged across raters, and this mean rating was the main dependent variable in our analyses. On average, the ratings fell somewhat below the midpoint of the 1-5 scale (mean = 2.27, SD = 0.84), suggesting that the identified noticings were perceived as being reasonable examples of the CA categories, as illustrated in Table 5.
[Table 5 excerpt] "Should we go with a smile face right here?" (rating 3.7)
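As a quick consistency check on the counts reported above, the per-method totals should equal the number of unique utterances weighted by how many methods selected each one. The short sketch below uses only the figures given in the text; the variable names are illustrative.

```python
# Number of utterances ranked 4 or 5 by each approach (from the text).
per_method = {"rule-based": 211, "semantic similarity": 202, "topic modeling": 103}

# Number of unique utterances selected by exactly k approaches (from the text).
selected_by_k_methods = {1: 336, 2: 81, 3: 6}

unique_utterances = sum(selected_by_k_methods.values())
total_selections = sum(k * n for k, n in selected_by_k_methods.items())

print(unique_utterances)   # 423 unique utterances
print(total_selections)    # 516 selections across the three approaches
print(total_selections == sum(per_method.values()))  # True
```

The 93 extra selections beyond the 423 unique utterances are accounted for by the 81 utterances chosen twice and the 6 chosen three times.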
For the main analysis, we regressed the mean rating on the CA (with community building as the reference group) × Method (with rule-based as the reference group) interaction, with the recording as a random intercept (more complex random effects structures resulted in convergence errors). There was no significant interaction (p = .39), so we re-ran the model with main effects only. Results indicated a significant main effect for CA (F(2) = 17, p < .001). Post hoc comparisons with false discovery rate corrections for multiple comparisons indicated that ratings were significantly lower (p < .01) for community building (M = 2.07) compared to being respectful (M = 2.33) and moving thinking forward (M = 2.42); the latter two were on par (p = .33). There was also a main effect of method (F(2) = 7.6, p = .02), with semantic similarity (M = 2.16) being rated significantly (p < .01) lower than rule-based (M = 2.38), but not significantly different (p = .32) from topic modeling (M = 2.29), which was on par with rule-based (p = .34).
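The post hoc comparisons above rely on a false discovery rate correction. The text does not name the procedure, so the sketch below assumes the Benjamini-Hochberg method, which is the most common choice. The raw p-values are hypothetical and serve only to show how the adjustment behaves.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is capped by the adjusted values of all larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for three pairwise comparisons.
raw_p = [0.004, 0.03, 0.33]
adjusted_p = benjamini_hochberg(raw_p)
for raw, adj in zip(raw_p, adjusted_p):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f}")
```

With three comparisons, the smallest p-value is multiplied by three, the middle one by 1.5, and the largest is left unchanged, which is less conservative than a Bonferroni correction.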

DISCUSSION
Our overall focus was on the computational modeling of the relationship dimension of collaboration in classroom environments. We utilized prior research on Collaborative Problem Solving (CPS) to define three dimensions of collaboration, referred to as Community Agreements (CAs). We investigated the accuracy of three fine-tuned RoBERTa language model classifiers, one for each CA, in noisy classroom data. The classifiers far exceeded chance, though overall accuracy was modest. The use of ASR-augmentation in fine-tuning the models made them more resilient to ASR errors and increased overall robustness, as demonstrated by improved accuracy when testing on human and ASR transcripts. We found that models trained strictly within the Sensor Immersion dataset could modestly generalize to new domains, even those in very different settings. Finally, we found that rule-based and topic modeling approaches to filtering and ranking noticings better aligned with human perception of the CAs than a semantic similarity approach.
A major application of this research involves the practical use of these collaboration analytics models within real educational environments. We are in the process of integrating the models developed here into an AI technology that provides automated formative assessments of CAs via visual representations of CA prevalence and model noticings during small group collaborative learning. Teachers are then able to facilitate a discussion around the predictions, both in regard to the successful instances of collaboration that were noticed in the classroom and also by interrogating current limitations of AI systems. Another application is to provide automated assessments of collaboration as a variable for future research studies on collaborative learning, where manual annotation is a bottleneck.
Like all studies, ours has limitations. With respect to the coding of CAs, the low inter-rater reliability kappa for moving thinking forward and the low correlation with expert rating for community building are indeed limitations; however, the overall convergence across raters and approaches encourages us that this is a productive first step for the robust validation of the CA measure. Next, only one type of NLP model was considered: fine-tuned RoBERTa language models. The absence of a comparison to other NLP architectures restricts our ability to assess relative performance and effectiveness when compared to alternative approaches. We also did not collect demographic data from individual students, which precluded an analysis of bias/fairness of the models. Another limitation pertains to the models' focus on speech only, whereas CAs may also be expressed nonverbally. Our generalizability assessment was preliminary with mixed results. As such, the applicability to different domains or populations may vary and should be considered with caution. With respect to the noticings user study, we only collected feedback from adult raters with limited exposure to the problem space. This choice, while deliberate for certain research objectives, restricts the breadth of insights we can draw from their perspectives. Lastly, whereas the present paper focused on validating the models and selecting noticings, we have yet to investigate how these models perform when integrated into future interventions.
Future work includes developing improved approaches to modeling student speech, improving generalization to new domains, investigating and mitigating potential biases, moving towards a multimodal approach, soliciting feedback of noticings from users, and assessing overall impact, fairness, and equity. For improving the language models, we plan to investigate other pre-trained language model architectures. The generalization of NLP models to new domains, called the cold start problem, is a fast-evolving research area in NLP. This problem can be addressed with techniques such as zero (or few) shot learning. Model bias will be assessed and potentially addressed with techniques such as adversarial debiasing. For further validation and improvement of our noticing selections, we will seek feedback from users (e.g., students and teachers), potentially using human ratings for training supervised NLP models that learn to select noticings. Finally, while student language is an important indicator of collaborative learning, we plan to incorporate aspects of nonverbal signals such as eye gaze, acoustic-prosodic features of speech, facial expressions, and body movements in order to create a more thorough and robust model of collaboration.

CONCLUSIONS
This study leveraged noisy classroom data, a context often underrepresented in research, to explore and model the relationship dimension of collaboration in the form of three Community Agreements, shedding light on the dynamics of collaborative discourse within real-world classroom environments. We successfully modeled the Community Agreements with real-world speech, investigated their generalizability to two other datasets, and placed special emphasis on fostering deeper insights into the dynamics of collaboration in the form of noticings of student discourse to enhance the overall learning experience. By providing concrete illustrations of model predictions, we promoted deeper understanding of the model's decision-making process, thereby fostering transparency in AI-augmented educational settings.

Table 1: Collaborative Problem Solving (CPS) indicator mappings to Community Agreements and their mean frequency in the final filtered dataset. Corresponding examples from the Sensor Immersion dataset are given.
"So now I think we push download and see what happens."
Asks others for suggestions: "How do I get rid of this?"
Provides reasons to support a solution: "[...] we have to make that lower because its always gonna play if its 50."
Questions/corrects others' mistakes: "No no. We already said Hello. [...]"

Table 2: Overview of the four computational approaches for selecting and ranking noticings.

Table 3: Performance metrics from the three models (Agreement) tested on Human and Whisper ASR transcripts, averaged over 10 folds of stratified recording-level cross validation. Human and Whisper base rates differ due to missing Whisper transcripts in the presence of noisy or unintelligible utterances.

Table 5: Example highly ranked noticings. Average human rating of each utterance is given in parentheses.