HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer

Accurately modeling affect dynamics, which refers to the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. However, modeling affect dynamics is challenging due to contextual factors, such as the complex and nuanced nature of intra- and inter-personal dependencies. Intrapersonal dependencies refer to the influences and dynamics within an individual, including their affective states and how they evolve over time. Interpersonal dependencies, on the other hand, involve the interactions and dynamics between individuals, encompassing how affective displays are influenced by and influence others during conversations. To address these challenges, we propose a Cross-person Memory Transformer (CPM-T) framework which explicitly models intra- and inter-personal dependencies in multi-modal non-verbal cues. The CPM-T framework maintains memory modules to store and update dependencies between earlier and later parts of a conversation. Additionally, our framework employs cross-modal attention to effectively align information from multiple modalities and leverages cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and robustness of our approach on three publicly available datasets for joint engagement, rapport, and human belief prediction tasks. Our framework outperforms baseline models in average F1-scores by up to 22.6%, 15.1%, and 10.0%, respectively, on these three tasks. Finally, we demonstrate the importance of each component in the framework via ablation studies with respect to multimodal temporal behavior.


INTRODUCTION
In social interactions, individuals rely on a combination of cues to perceive and comprehend the affective states of others, enabling them to gain insights into the contextual aspects of the interaction [11,14,24,26]. This process of understanding is influenced by multiple factors, such as the specific situation at hand, the nature of the relationship between the individuals involved, and the observer's own emotions, experiences, and expectations. Additionally, in the context of multi-party interactions, challenges arise from both inter-personal influences, which involve dynamics among participants, and intra-personal influences, which pertain to individual contributions within the group [8,17]. These challenges highlight the need for a comprehensive approach that considers the lasting impact of affects, the influences of individual and group dynamics, and the contextual nuances within social interactions.
To address the challenges posed by affect dynamics in interactive conversations, we propose the Cross-person Memory Transformer (CPM-T). Our model incorporates a cross-modal transformer [29] to obtain fused representations of multiple modality features extracted by modality-specific backbones. Additionally, we leverage cross-person attention [17] to capture the influences of intrapersonal and interpersonal factors by encoding verbal and nonverbal cue features. The model also includes a memory network that allows for the retention of past interactions and utilizes the reasoning capabilities of a large language model to guide the interpretation of verbal cues. Given intrapersonal and interpersonal inputs from multiple modalities, CPM-T applies cross-modal and cross-person attention to encode nonverbal representations. This encoding process, guided by verbal reasoning from a large language model and supported by the memory modules, enables the model to autoregressively output an embedding that encapsulates contextualized information about the affective dynamics and interactions in the ongoing conversation. By capturing the momentum of affective states and the complex dependencies between individuals and their historical context, CPM-T enables a deeper understanding of the interplay between verbal and nonverbal cues in social interactions.
To evaluate the effectiveness of our proposed approach, we selected three complex social and affective dynamics tasks: joint engagement, rapport, and human belief prediction, drawn from the DAMI-P2C [6], MPIIGroupInteraction [22], and BOSS [9] datasets, which involve long-term dependencies influenced by various intra- and inter-personal dynamics. These tasks share commonalities that involve interpreting nonverbal cues, understanding social dynamics and context, possessing empathy and theory of mind, aligning communication, demonstrating cognitive flexibility, and engaging in collaborative problem-solving. By addressing these aspects, our proposed approach aims to enhance social cognition, communication skills, and interpersonal understanding in human interactions.
To summarize, the main contributions of our work are as follows: (1) We propose the Cross-person Memory Transformer (CPM-T), a novel transformer-based model which combines the concepts of Cross-person Attention (CPA) and Memory (Slot) Attention to capture the intra- and inter-personal relationships between pairs of people that lie in long-term dependencies within multi-modal streams. (2) We utilize a Large Language Model's (LLM) reasoning as verbal context, which provides guidance for nonverbal cues through the memory network to improve the model's performance. (3) We successfully apply the proposed model to the joint engagement, rapport, and belief dynamics prediction tasks on three publicly available datasets. Experiments, ablation studies, and qualitative analysis support the effectiveness of our model and open up new possibilities for improving social human-robot interaction in various settings.

RELATED WORKS

Memory Networks
Memory networks have gained considerable attention due to their ability to capture and leverage contextual information for understanding and modeling affective experiences. One specific challenge involves effectively capturing the temporal dynamics inherent in affective experiences. In response, researchers have investigated the use of memory networks for modeling interactive conversational memory in tasks such as emotion recognition [15,18,27], sentiment analysis [31], and emotion flip reasoning [16]. These models significantly enhance the comprehension and prediction of affective states within real-world interactions. Despite the promising potential of memory networks to address these challenges, further research is necessary to enhance their scalability, interpretability, and generalization capabilities in the context of affective communication tasks.

Modeling Interactive Conversations
The field of conversational modeling has increasingly recognized the importance of incorporating affect dynamics into understanding human interactions. Specifically, [7] proposes DyadFormer, a multimodal transformer architecture that models individual and interpersonal features in dyadic interactions for personality prediction. [17] presents MultiPar-T, a transformer-based model that captures contingent behavior in a multi-party setting, evaluated on an engagement prediction task. [23] models interactional communication in dyadic interactions by autoregressively outputting multiple possibilities of corresponding listener motion. [12] proposes a multimodal emotion detection framework that extracts multimodal features from conversational videos and hierarchically models the self- and inter-speaker emotional influences in global memories. Among these prior works, only a few studies have addressed the challenges of modeling affect dynamics in more complex conversational tasks: joint engagement, rapport, and belief dynamics prediction. Previous models have been limited in their ability to capture the nuances of human interactions, recognizing only a limited range of affective states and contextual cues. Furthermore, they have not fully accounted for the complexities of multimodal features. By addressing these limitations, our proposed model represents a comprehensive approach to modeling affect dynamics in interactive conversations, building on prior work on affect dynamics, multimodal features, and complex conversational tasks. By recognizing a broader range of affective states and contextual cues, our model can capture the nuances of human interactions and enable more accurate modeling of affective dynamics.

Language Models as Multimodal Guides
Language Models (LMs) have proven to be powerful tools in various domains, including affective computing. They have been successfully applied in guiding other modalities for video segmentation [20], context-aware prompting [25], and image classification [32]. These studies highlight the potential of combining verbal context with nonverbal context to enhance the understanding and generation of nonverbal behaviors in affective computing. By leveraging the capabilities of LMs, we can effectively bridge the gap between verbal and nonverbal cues, enabling a more comprehensive and nuanced understanding of affective dynamics in human interactions. This integration allows us to capture the interplay between verbal and nonverbal expressions, fusing nonverbal behaviors that align with the given verbal context.
In our work, we extend the application of LMs to the modeling of affect dynamics in interactive conversations. By utilizing a large language model as a guiding source for nonverbal cues, we aim to enhance the performance of our proposed Cross-person Memory Transformer (CPM-T) framework in capturing the intricate relationship between verbal and nonverbal aspects of affective communication. Through the integration of verbal context provided by LMs, CPM-T can generate more contextually relevant and emotionally expressive nonverbal behaviors, thereby improving the overall fidelity and naturalness of affective communication modeling. The utilization of LMs in the affective computing domain not only enriches our understanding of human interactions but also opens up new possibilities for applications in social robotics, virtual agents, and human-computer interaction. By effectively combining verbal and nonverbal cues, we can create more engaging and empathetic systems that can better understand and respond to users' affective states and needs.

METHODS
In this section, we describe our proposed Cross-person Memory Transformer (CPM-T) (Figure 2). At a high level, CPM-T takes a fused multi-modal representation of each person obtained from a Cross-modal Transformer and utilizes Cross-person Attention (CPA) to discover the self and interpersonal influences. Next, we utilize Memory (Slot) Attention modules to incorporate an external dynamic memory that encodes and retrieves past information. In Sections 3.3 and 3.4, we present in detail the ingredients of the CPM-T architecture (see Figure 2) and explain the importance of each component.

Problem Statement
Consider a set of video-audio pairs D = {(x_i, y_i)}, where x_i is the audio and video input and y_i ∈ Y is the label from a set of classes. We extract the features of all audio and video clips in D as F_a = E_a(x[a]) ∈ R^(T_a×d_a) and F_v = E_v(x[v]) ∈ R^(T_v×d_v), respectively (for certain models, we add extra modalities such as pose p and text t along with audio and video). Given the task and the task-specific concepts C = {c_1, c_2, ..., c_k}, we generate a set of reasoning sentences S = LLM(C) and feed these sentences to the memory encoder E_m to generate the verbal memory M = E_m(S). Combined with the Cross-person Memory Transformer model, this produces a prediction ŷ = g(f(X, M)), in which f is the CPM-T model, g is the MLP layer, X is the sequence tokens, and M is the memory tokens.
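To make the pipeline concrete, the following is a minimal PyTorch sketch of the prediction ŷ = g(f(X, M)); the generic TransformerEncoder standing in for CPM-T, the module sizes, and the pooling choice are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class AffectPredictor(nn.Module):
    """Sketch of y_hat = g(f(X, M)); a generic TransformerEncoder stands in for
    the CPM-T model f, and all sizes are illustrative assumptions."""
    def __init__(self, d_model=256, num_classes=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.cpm_t = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for f
        self.head = nn.Linear(d_model, num_classes)              # MLP layer g

    def forward(self, X, M):
        # X: (B, T, d) fused sequence tokens; M: (B, n_mem, d) verbal memory tokens.
        h = self.cpm_t(torch.cat([M, X], dim=1))  # memory-conditioned encoding
        return self.head(h.mean(dim=1))           # pooled embedding -> logits

model = AffectPredictor()
X, M = torch.randn(2, 100, 256), torch.randn(2, 8, 256)
logits = model(X, M)   # (2, 3) class logits
```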

Intrapersonal Input Separation
In Figure 2, we show how individuals are separated from the original videos; this process is done by 1) video inpainting (Figure 3) and 2) speaker diarization (Figure 4).
-Video Inpainting. For video inpainting, we use a flow-guided video inpainting model, E²FGVI [19], which can handle videos with arbitrary resolution. This model exhibits a strong ability to generalize to higher resolutions, as demonstrated by experimental results and validated performance metrics such as PSNR and SSIM. The video inpainting process can be divided into three interconnected stages. First, flow completion estimates the missing optical flow fields in corrupted regions, as the absence of flow information in those areas can impact subsequent processes. Second, pixel propagation fills the holes in corrupted videos by bi-directionally propagating pixels from visible areas, leveraging the completed optical flow as a guide. Finally, content hallucination generates the remaining missing regions through a pre-trained image inpainting network.
-Speaker Diarization. For speaker diarization, we use the model presented in [2], which uses attention to localize and group sound sources, and optical flow to aggregate information over time. The performance of the model has been validated through four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection, which showed the effectiveness of the model's learned audio-visual object embeddings.

Verbal Memory from LLM Reasoning
In this paper, we harness the power of LLMs to facilitate the extraction and integration of nonverbal cues within the realm of affective dynamics in interactive conversations. Specifically, we use OpenAI's ChatGPT, which possesses advanced capabilities to understand and generate human-like text based on given prompts. By leveraging the reasoning abilities of LLMs, we can tap into their deep understanding of language and utilize their contextual comprehension to enhance our understanding of affective states in conversational settings. We prompt the model to generate a set of reasoning sentences about the joint affective states, taking into account the context c and the conversation history within the designated window w. The choice of window size determines the amount of context we consider for analysis. For example, for the DAMI-P2C dataset, we provide the following contexts to compose the final formatted prompt:
• the type of relationship: parent-child
• the type of activity they are doing: story reading
• the conversation history within window size w
• the label we want to predict: joint engagement
• the entities: parent, child, and both
The prompt is obtained by filling the contextual information into the pre-defined format (see Figure 2), and we collect responses for 1) the parent, 2) the child, and 3) the dyad by feeding the prompt to the large language model. We then use the encoder part of Memformer [30], which includes the memory reading and writing operations, to generate the verbal context. This verbal context is used to initialize the memory of the cross-person memory network, which takes the nonverbal cue segments as input and updates the nonverbal context guided by the verbal context.
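For illustration, here is a hypothetical sketch of how such a prompt could be assembled for DAMI-P2C; the template wording, field names, and example utterances are assumptions, not the exact pre-defined format shown in Figure 2.

```python
# Hypothetical prompt assembly for DAMI-P2C; template wording and field names
# are illustrative, not the exact pre-defined format in Figure 2.
PROMPT_TEMPLATE = (
    "A {relationship} pair is engaged in {activity}. "
    "Given the last {window} utterances of their conversation:\n{history}\n"
    "Reason step by step about the {label} of the {entity}."
)

def build_prompts(history, window=5):
    context = {
        "relationship": "parent-child",
        "activity": "story reading",
        "window": window,
        "history": "\n".join(history[-window:]),
        "label": "joint engagement",
    }
    # One reasoning request per entity: parent, child, and the dyad ("both").
    return [PROMPT_TEMPLATE.format(entity=entity, **context)
            for entity in ("parent", "child", "both")]

prompts = build_prompts(["Parent: Look, a big brown bear!",
                         "Child: He only comes out at night?"])
# Each prompt is sent to the LLM; the returned reasoning sentences S are then
# encoded by the memory encoder E_m into the verbal memory M.
```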
By incorporating LLMs into our methodology, we open up possibilities for more nuanced and insightful analysis of affective dynamics.The integration of nonverbal cues with verbal memory allows us to capture a more holistic view of affective experiences in social interactions.This approach offers valuable insights into the complex interplay between verbal and nonverbal communication, contributing to a deeper understanding of affective inertia and its temporal aspects.

Crossperson Memory Network (CPM-T)
In order to successfully address the complex affect dynamics taking place in interactive conversations, we must properly represent each person's individual nonverbal cues, address self and interpersonal influences, and then take into account the long-range dependencies that persist over time.

Affect Dynamics Encoding: Cross-person Attention (CPA).
To explicitly model the self- and interpersonal influences between pairs of people, we utilize Cross-person Attention (CPA), proposed in [17]. This method states that, given a pair of people, the target person P_s's behavior is contingent on person P_h's behavior if person P_s's behavior was likely to be influenced by person P_h's behavior (h → s). Hence, for the target person P_s and another person P_h, we utilize the multi-modal representations Z_s, Z_h ∈ R^(T×2d) obtained from the Cross-modal Transformer, where T is the sequence length and d is the projected feature dimension. Following how the Multimodal Transformer [29] computes cross-modal attention, cross-person attention can be calculated as:

CPA_{h→s}(Z_h, Z_s) = softmax((Z_s W_Q)(Z_h W_K)^T / √d_k)(Z_h W_V)    (1)

where CPA_{h→s} at the m-th layer denotes the multi-headed cross-person attention from the other person to the target person. CPA_{h→s}(Z_h, Z_s) outputs an embedding that captures person P_s's behavior contingent on person P_h's behaviors. Note that, depending on the task, we concatenate different outputs from the Cross-person Transformers (e.g., for DAMI-P2C (child-coordinated joint engagement), we concatenate the outputs only from s → h and h, whereas we concatenate the outputs from h → s, s → h, and s for the remaining datasets (rapport, human belief dynamics)).
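A minimal sketch of this operation in PyTorch follows; the layer width, head count, and the use of a single nn.MultiheadAttention block are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossPersonAttention(nn.Module):
    """Sketch of CPA_{h->s} (Eq. 1): the target person's representation Z_s
    supplies the queries, the other person's Z_h the keys and values."""
    def __init__(self, d_model=80, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, Z_h, Z_s):
        # Output captures person s's behavior contingent on person h's behavior.
        out, _ = self.attn(query=Z_s, key=Z_h, value=Z_h)
        return out

cpa = CrossPersonAttention()
Z_h, Z_s = torch.randn(2, 50, 80), torch.randn(2, 50, 80)
contingent = cpa(Z_h, Z_s)   # (2, 50, 80): h -> s contingent-behavior embedding
```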
Dynamic Memory Update: Memory Slot Attention. To encode and retain important past context, we utilize the external dynamic memory method presented in [30]. The model interactively encodes and retrieves information from memory in a recurrent way by conducting memory read and write operations. At each timestep t, we have a memory M_t = [m_t^0, m_t^1, ..., m_t^k]. Each slot in the batch keeps a separate memory representation and is updated individually. For each segment sequence, the model first reads the memory to retrieve important past information using cross-attention.
Here, we project the memory slot vectors into keys and values and the input sequences into queries, and use these queries to attend to all key-value pairs in the memory slots, producing the final hidden states. This enables the model to learn complex associations within the memory. Next, memory writing occurs through a slot attention module that updates the memory information and a forgetting mechanism that cleans up unimportant memory information. Memory writing only occurs at the last layer of the encoder, allowing high-level contextual representations to be stored in memory.
Concretely, each slot is separately projected into queries, and the segment token representations are projected into keys and values. Under slot attention, each memory slot can attend only to itself and the token representations; thus, a memory slot cannot write its own information to other slots directly, as memory slots should not interfere with each other. Finally, after the attention scores are calculated, the raw attention weights are divided by a temperature τ, and the next timestep's memory is collected with attention. For the details of how the memory read and write operations work, we encourage readers to refer to Appendix D.
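The following is a minimal, single-head sketch of these read and write operations, assuming shared projection layers and a simplified self-term in the write step; the Memformer-style implementation [30] is multi-headed and more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemorySlotReadWrite(nn.Module):
    """Single-head sketch of the memory read/write described above (after
    Memformer [30]); shared q/k/v projections are a simplifying assumption."""
    def __init__(self, d=256, tau=1.0):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.tau = tau

    def read(self, X, M):
        # Read: sequence tokens X (queries) attend over memory slots M (keys/values).
        scores = self.q(X) @ self.k(M).transpose(-2, -1) / self.tau
        return F.softmax(scores, dim=-1) @ self.v(M)   # (B, T, d) hidden states

    def write(self, X, M):
        # Write (slot attention): each slot attends only to itself and to the
        # token representations, never to other slots.
        q = self.q(M)                                        # (B, S, d)
        self_score = (q * self.k(M)).sum(-1, keepdim=True)   # slot -> itself
        tok_score = q @ self.k(X).transpose(-2, -1)          # slot -> tokens
        attn = F.softmax(torch.cat([self_score, tok_score], dim=-1) / self.tau,
                         dim=-1)
        # Simplification: the slot's own value is taken to be M itself.
        return attn[..., :1] * M + attn[..., 1:] @ self.v(X)  # next-step memory

mem = MemorySlotReadWrite()
X, M = torch.randn(2, 50, 256), torch.randn(2, 8, 256)
hidden, M_next = mem.read(X, M), mem.write(X, M)   # (2, 50, 256), (2, 8, 256)
```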

EXPERIMENTS
In this section, we empirically evaluate the Cross-person Memory Transformer (CPM-T) on three datasets that are frequently used to benchmark human affect communication tasks in prior works [6,9,22]. Our goal is to compare CPM-T with prior competitive approaches on both word-aligned (which almost all prior works employ) and unaligned (which is more challenging, and which CPM-T is generically designed for) multimodal language sequences.

Table 1: Results and standard deviations for the proposed and baseline models on the DAMI-P2C dataset using 3 seeds. In the "Modality" column, m_i + m_j stands for modality fusion of modalities m_i and m_j. Eng stands for Engagement. m_k → m_i + m_j means m_i + m_j was guided by modality m_k using the memory encoder. For MulT, we provided the sentences from the original transcription as t, and for Ours, we used the reasoning output sentences from the LLM as t (see Section 3.3 and Appendix B for details). † represents statistical significance over state-of-the-art scores under the paired bootstrap test (p < 0.05) with Bonferroni correction.

Datasets and Evaluation Metrics
We utilize DAMI-P2C [6], MPIIGroupInteraction [22], and BOSS [9] (see Appendix A for more details) as benchmarks to measure the performance of our proposed method against other baselines. Each task requires understanding the verbal and nonverbal cues of each person and modeling the affect dynamics to predict the joint labels between pairs of people.

DAMI-P2C. DAMI-P2C is a corpus of multimodal, multiparty conversational interactions in which participants followed a collaborative parent-child interaction protocol to elicit joint engagement. The dataset was collected in a study of 34 families, where a parent and a child (3-7 years old) engage in reading storybooks together. From the original five-point ordinal scale [-2, 2], we modified the labels into three discrete categories for the classification task: Low, Mid, and High joint engagement. A ten-second window was selected as the fragment interval of the target audio-visual recordings for annotation, in order to capture the long-range context of affect dynamics within each dyad. When annotating the recordings, annotators were instructed to judge whether a given fragment contained story-related dyadic interaction and to filter out those that did not. In total, 16,593 fragments were utilized, with 488.03 ± 123.25 fragments from each family on average.

MPIIGroupInteraction.
MPIIGroupInteraction is a dataset of audio-visual non-verbal behavior data and rapport ratings collected during small group interactions. It consists of 22 group discussions in German, each involving either three or four participants and each lasting about 20 minutes, resulting in a total of more than 440 minutes of audio-visual data. 78 German-speaking participants were recruited from a German university campus, resulting in 12 group interactions with four participants and 10 interactions with three participants. Since rapport is a subjective feeling that is hard to gauge with any existing equipment, rapport was self-reported by the participants. Responses were recorded on seven-point Likert scales, and we modified the labels into three categories (Low, Mid, and High Rapport) to conduct the classification task with the same model structure across different tasks. Each participant rated each item for the other individuals in the group, yielding two rapport scores for each dyad inside the larger group; we evaluate the model's performance by averaging the results across the two directions.
BOSS. BOSS is a 3D video dataset compiled from a sequence of social interactions between two individuals in an object-context scenario. The two participants are required to accomplish a collaborative task by inferring and interpreting each other's beliefs through nonverbal communication. Individuals' latent mental belief states, for which ground-truth labels are extremely challenging to obtain, were annotated. Ten pairs of participants (five pairs of friends and five pairs of strangers) were recruited in 15 distinct contexts to compile the dataset. 900 videos were gathered from both the egocentric and third-person perspectives, totaling 347,490 frames. However, the focus on object matching alone does not capture the rich nonverbal communication that occurs during social interactions. To capture these nonverbal cues and enable a more comprehensive analysis of social interactions, the annotation in the BOSS dataset has been modified in this work to include information on participants' joint attention, attention following, and communication. In detail, we define joint attention, attention following, and no communication using a threshold-based approach, inspired by [10], as sketched below. Specifically, we considered an instance of joint attention when the number of matched objects exceeded a threshold of 30. For attention following, we set the threshold to a value between 0 and 30. Finally, we defined no communication as an instance where there were no matched objects.
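A minimal sketch of this threshold-based relabeling is shown below; the function name and input format are illustrative, and treating a count of exactly 30 as attention following is an assumption, since the boundary case is not spelled out above.

```python
def communication_label(num_matched_objects: int) -> str:
    """Map a segment's matched-object count to a communication label."""
    if num_matched_objects > 30:
        return "joint_attention"       # count exceeds the threshold of 30
    if num_matched_objects > 0:
        return "attention_following"   # count between 0 and 30 (30 inclusive, assumed)
    return "no_communication"          # no matched objects

assert communication_label(42) == "joint_attention"
assert communication_label(7) == "attention_following"
assert communication_label(0) == "no_communication"
```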

Baselines
We compare our proposed model with a family of baselines from emotion recognition, action recognition, and personality recognition. We run the latest versions of these models and report their scores on a unified benchmark. For affect recognition models, we compare CPM-T to MulT [29] and MultiPar-T [17]. For the action recognition model, we compare our method with I3D [5]. Finally, for the personality recognition model, we compare CPM-T to DyadFormer [7].

Implementation Details
We train our models on 4 NVIDIA GeForce RTX 2080 Ti GPUs with different training settings, which are described in Appendix A. For all three datasets, we conduct cross-validation by iterating with 10% of the groups' data as the test set, 20% of the remaining groups' data as the validation set, and the rest of the groups' data as the training set, for 3 seeds (a group-wise split is sketched below). Our code can be found at [Anonymous] and will be shared with the dataset access link through the GitHub repository with the camera-ready version.
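A hedged sketch of such a group-wise split using scikit-learn follows; the exact grouping logic and seed handling in our released code may differ.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_group(X, y, groups, seed):
    # Hold out ~10% of the groups for test, then ~20% of the rest for validation.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=seed)
    train_val_idx, test_idx = next(outer.split(X, y, groups))
    inner = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    tr, va = next(inner.split(X[train_val_idx], y[train_val_idx],
                              groups[train_val_idx]))
    return train_val_idx[tr], train_val_idx[va], test_idx

X, y = np.zeros((100, 4)), np.zeros(100)
groups = np.repeat(np.arange(10), 10)        # e.g., 10 families/groups
for seed in range(3):                        # 3 seeds, as in our experiments
    train_idx, val_idx, test_idx = split_by_group(X, y, groups, seed)
```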

RESULTS & DISCUSSION
In this section, we discuss the quantitative results of our experiments. We compare our approach, CPM-T, with state-of-the-art baselines. Then, we discuss the importance of each component in the framework and of the modalities used to train the model, through ablation studies. Finally, we leave the qualitative analysis to Appendix E, which shows the different types of memory slots obtained from the memory writer and the crossmodal attention weights that reveal the correlations learned from audio-visual inputs. Drawing on prior research [17], we report the macro F1-score, the weighted F1-score, and the F1-score for every class, as well as an accuracy metric. The macro F1-score, calculated as the unweighted average of per-class F1-scores, holds significant value in our study as it reflects the model's performance across all classes, irrespective of their representation in the dataset.
Comparison against baseline models. In Tables 1 and 2, we evaluate the performance of the proposed model along with baseline models for the tasks of predicting joint engagement (DAMI-P2C), rapport (MPIIGroupInteraction), and human belief dynamics (BOSS) between dyads. The datasets are highly imbalanced (see Figure 5), which makes predicting low engagement, low rapport, and joint attention challenging.
For DAMI-P2C, our proposed model, which used audio and video modalities guided by verbal context, achieved the highest performance across all evaluation measures. Our model achieved a weighted F1-score of 0.677, an improvement of 8.8%, and the highest macro F1-score of 0.490, an improvement of 7.3%, over the next best-performing model. In particular, our model outperformed all baselines in predicting the low engagement class, achieving an F1-score of 0.286. This is particularly important given the imbalanced nature of the dataset, with only 493 instances of the low engagement class. Our model's ability to accurately predict low engagement instances could help parents and clinicians identify areas of potential concern and intervene early. In contrast, the I3D model, which only used the video modality, achieved the lowest performance across all evaluation measures. This suggests that the inclusion of other modalities, such as audio and text, can improve the model's ability to predict joint engagement.
For MPIIGroupInteraction and BOSS, our proposed model achieved the highest performance in accuracy and macro F1-score using the audio and video modalities (the text modality was not provided in the original datasets). As stated earlier, since the datasets are highly imbalanced, achieving a high macro F1-score is important. Our model's ability to accurately predict low rapport and joint attention could help teachers or professors identify the cohesion between students and their mental states toward one another. It is also interesting that MultiPar-T performed worse than on the other two datasets, which might be due to the information loss caused by the blurred faces (see Figure 5). In contrast, I3D, which only used the video modality, achieved the second-best performance in macro F1-score; we assume this is due to the less meaningful information in the audio modality (people kept repeating the objects they wanted to insist on).

Table 3: Ablation. Effect of ablating key components of our method (CPM-T). We encourage the readers to see Figure 2. w/o LLM refers to the ablation of verbal memory initialization. w/o Memory refers to the ablation of the memory modules in CPM-T, instead passing through a sequence model for prediction. w/o Individuals refers to the ablation of the video inpainting and speaker diarization used to separate individuals in in-person interaction videos. Results with different combinations of components are displayed, where Ours (All) performs well in general. † represents statistical significance over state-of-the-art scores under the paired bootstrap test (p < 0.05) with Bonferroni correction.

Table 4: Ablation. Effect of different combinations of modalities for our method (CPM-T). Again, m_i + m_j stands for modality fusion of modalities m_i and m_j, and m_k → m_i + m_j means m_i + m_j was guided by modality m_k using the memory encoder. † represents statistical significance over state-of-the-art scores under the paired bootstrap test (p < 0.05) with Bonferroni correction. Excerpt:

Modality | Acc | Weighted F1 | Macro F1 | F1 (Low) | F1 (Mid) | F1 (High)
t → v    | 0.611 ± 0.00 | 0.597 ± 0.01 | 0.431 ± 0.01 | 0.226 ± 0.05 | 0.323 ± 0.06 | 0.743 ± 0.00
t → a+v  | 0.634 ± 0.00† | 0.677 ± 0.01† | 0.490 ± 0.00 | 0.286 ± 0.01 | 0.430 ± 0.01 | 0.755 ± 0.00†

Ablation studies. To examine the contribution of each component of the framework and of the modalities used to train the model with the DAMI-P2C dataset, we performed two ablation studies (see Tables 3 and 4).
We first systematically removed three components from the full model and compared the resulting performance to the setting where all components were present. The three components that we removed were the LLM component, the Memory modules, and the separation of individuals from the original videos. We measured the accuracy, weighted F1-score, macro F1-score, and F1-scores for each class. Our main interest was the macro F1-score, which provides a better indication of the overall performance of the model when the class distribution is imbalanced. Our results showed that removing any of the three components resulted in a decrease in macro F1-score compared to the full model. Specifically, removing the LLM component resulted in a decrease of 0.014 in macro F1-score, while removing the Memory modules and the component that separates individuals from in-person interaction videos resulted in decreases of 0.012 and 0.062, respectively. Notably, the full model achieved the highest macro F1-score of 0.490, a statistically significant improvement over the baselines. These results suggest that all three components are important for achieving high performance in joint engagement prediction.
In addition to the first ablation study, we conducted another study to investigate the impact of different modality inputs on our model's performance. Specifically, we tested three different input modalities: video only, audio and video, and video guided by verbal memory. We also tested the same input modalities with the addition of verbal memory guidance. Our results show that using both audio and video inputs significantly improved our model's performance, as indicated by a macro F1-score of 0.476, which was significantly higher than using the video input only (F1-score of 0.385). Guiding the video input using verbal memory also improved the performance slightly (F1-score of 0.431), but not significantly so. Our findings suggest that using both audio and video inputs is crucial for accurate joint engagement prediction, and that verbal memory guidance can further enhance the video modality's performance.

CONCLUSION
In this paper, we presented the Cross-person Memory Transformer (CPM-T), a multi-modal, multi-party framework for modeling affect dynamics in interactive conversations. Our model capitalizes on modeling contextual information that incorporates self and inter-speaker influences. We accomplish this by using memory modules and a cross-person transformer. Experiments show that our model outperforms state-of-the-art models on three benchmark datasets. Extensive evaluations and case studies demonstrate the effectiveness of our proposed model. Additionally, the ability to visualize the attention weights brings a sense of interpretability to the model, as it allows us to investigate which utterances in the conversational history provide important emotional cues for the current emotional state of the speaker. In the future, we plan to test our model on other relevant affective communication tasks and to further explore the property of affective momentum that arises from second derivatives.
Limitations & Future Works. This paper proposes a novel approach, the Cross-person Memory Transformer (CPM-T), which leverages long-range contextual information to predict affect communication tasks between individuals based on verbal and nonverbal cues. However, it is important to note that while the DAMI-P2C dataset used in this study relied on human-generated context, in real-world applications the agent will need to autonomously capture and reason about the context in order to produce appropriate verbal and nonverbal responses. Furthermore, affective momentum, a second-order derivative property that arises in the context of affect dynamics, was not explicitly considered in this study. As such, future research should focus on developing models that take into account this property and other higher-order affective phenomena.

A DATASETS A.1 DAMI-P2C
The DAMI-P2C dataset was collected to capture natural story-reading interactions between a parent and their child in a lab setting. The dataset consists of five major categories of content (audio and video, sociodemographic profiles, reading behavior features, affect annotations, and person identification and body tracking) necessary to understand the social-emotional behaviors and affective states of parent-child dyads in the co-reading context. The dataset focuses on the parent-child co-reading activity, a practice positively associated with both children's later reading and language outcomes and their interest and enjoyment in reading later in childhood. To capture the parent-child engagement quality, the Joint Engagement Rating Inventory (JERI) [1] was selected, as it quantitatively and qualitatively assesses the caregiver-child interaction during a joint activity in which verbal and nonverbal behaviors related to engagement are observed and rated. Specifically, we selected Child Coordinated Engagement (CCE) for this work, which involves the child's engagement with the parent instead of their engagement with the activity. A child's CCE is rated low if the child is engaging in story listening or reading without attending to the parent and acknowledging their presence.

A.2 MPIIGroupInteraction
The data recording took place in a quiet office in which a larger area was cleared of existing furniture. To capture rich visual information and allow for natural bodily expressions, they used a 4DV camera system to record frame-synchronized video from eight ambient cameras. Specifically, two cameras were placed behind each participant, positioned slightly higher than the participant's head. During the group forming process, experimenters ensured that participants in the same group did not know each other prior to the study. To prevent learning effects, every participant took part in only one interaction. To increase engagement, experimenters prepared a list of potential discussion topics and asked each group to choose the topic that was most controversial among group members. Afterward, the experimenter left the room and came back about 20 minutes later to end the discussion. Participants were then asked to complete several questionnaires about the other group members.

A.3 BOSS
Participants were instructed to form pairs and stand in front of a table. One table contained a list of contextual objects, and the other table contained a collection of objects that could be selected based on the presented context. Each contextual object had at least two and no more than three possible combinations of object table selections. The set of contextual objects is defined as O_context = {Chips, Magazine, Chocolate, Crackers, Sugar, Apple, Wine, Potato, Lemon, Orange, Sardines, TomatoCan, Walnut, Nail, Plant}, and the set of objects selected to match these contextual objects is defined as O_selection = {Wine Opener, Knife, Mug, Peeler, Bowl, Scissors, Chips Cap, Marker, Water Spray, Hammer, Can Opener}. This experimental design allowed for the investigation of participants' ability to match objects with contextual information, and the BOSS dataset contains the data collected from this task.

C FEATURES
The features for each modality are extracted using the following tools (a minimal visual-feature extraction sketch follows this list): -Audio. We use VGGish [13] for extracting low-level acoustic features. The VGGish model was pre-trained on AudioSet. The extracted features are taken from the pre-classification layer after activation. The dimension of the feature tensor is 128.
-Vision. We use R(2+1)D [28], which was pre-trained on Kinetics-400. The model expects as input a stack of 16 RGB frames (112×112), and the dimension of the feature tensor is 512.
-Pose. We use OpenFace [3], which provides normalized eye gaze direction, head location, 3D landmark locations, and facial action units as a 128-dimensional vector. For the BOSS dataset, the face regions were blurred due to privacy issues. However, we could utilize the provided OpenPose [4] features, which support 25-keypoint body/foot estimation, including 6 foot keypoints.
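For reference, a sketch of the visual feature extraction using torchvision's Kinetics-400-pretrained r2plus1d_18 is shown below; whether our pipeline used this exact checkpoint and preprocessing is an assumption, and the audio (VGGish) path is analogous.

```python
import torch
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

# Kinetics-400-pretrained R(2+1)D-18; dropping the classifier head yields the
# 512-d pre-classification features used as visual descriptors.
model = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1).eval()
model.fc = torch.nn.Identity()

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, 16 frames, 112, 112)
with torch.no_grad():
    feat = model(clip)                   # (1, 512) visual feature vector
```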

D MEMORY MODULES
In Figure 6 (a), the input sequence undergoes an attention mechanism that encompasses all the memory slots, allowing it to retrieve historical information. In (b), each memory slot attends over itself and the representations of the input sequence to generate the subsequent memory slot at the next timestep. This approach assumes that each memory slot independently stores information and introduces a specific form of sparse attention pattern, in which each slot in the memory can attend solely to itself and the outputs of the encoder. The primary objective is to maintain the information within each slot for an extended period throughout the time horizon. By limiting the attention to the slot itself during the writing process, the information contained within that slot can remain unchanged in the subsequent timestep.
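One way to realize this sparse pattern is with a boolean attention mask; the sketch below assumes a key layout of [slot keys | token keys], which is an illustrative choice rather than the exact layout used in [30].

```python
import torch

def slot_write_mask(n_slots: int, n_tokens: int) -> torch.Tensor:
    # Key layout assumed: [n_slots slot keys | n_tokens encoder-output keys].
    allowed = torch.zeros(n_slots, n_slots + n_tokens, dtype=torch.bool)
    allowed[:, :n_slots] = torch.eye(n_slots, dtype=torch.bool)  # slot -> itself only
    allowed[:, n_slots:] = True                                  # slot -> all tokens
    return allowed   # True = attention permitted

mask = slot_write_mask(n_slots=8, n_tokens=50)   # shape (8, 58)
assert mask[0, 0] and not mask[0, 1] and mask[0, 8]
```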
In addition, forgetting is an essential aspect of learning, since it enables the filtering out of trivial and temporary information, allowing for the retention of more significant and valuable knowledge. In [30], the authors propose Biased Memory Normalization (BMN), a forgetting mechanism designed specifically for slot memory representations. BMN involves the normalization of memory slots at each step, preventing the memory weights from growing infinitely and ensuring gradient stability over extended periods.
To facilitate the forgetting of previous information, they introduce a learnable bias vector v_b. The initial state is naturally incorporated after normalization.
The vector v_b serves as a control mechanism for the rate and direction of forgetting. Adding v_b to a memory slot induces movement along the sphere, resulting in the forgetting of a portion of the stored information. If a memory slot remains unchanged for an extended period, it will eventually reach the terminal state T, unless new information is injected. The terminal state also serves as the initial state and is itself learnable. The speed of forgetting is determined by the magnitude of v_b and the cosine distance between m'_{t+1} (the updated memory slot) and v_b. For instance, if m'_{t+1} is nearly opposite to the terminal state, it is hard to forget its information; on the other hand, if m'_{t+1} is closer to the terminal state, its information is forgotten more easily.
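A minimal sketch of BMN as described above follows; the exact placement of the bias relative to the slot update in [30] is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasedMemoryNormalization(nn.Module):
    """Sketch of BMN: add a learnable forgetting bias v_b to the updated slots,
    then project back onto the unit sphere."""
    def __init__(self, n_slots=8, d=256):
        super().__init__()
        self.v_b = nn.Parameter(torch.randn(n_slots, d) * 0.02)

    def forward(self, M_updated):
        # Adding v_b nudges each slot along the sphere (forgetting part of its
        # information); normalization keeps slot norms bounded and gradients stable.
        return F.normalize(M_updated + self.v_b, dim=-1)

    @property
    def terminal_state(self):
        # The state an untouched slot drifts toward; also serves as the
        # learnable initial state.
        return F.normalize(self.v_b, dim=-1)

bmn = BiasedMemoryNormalization()
M_next = bmn(torch.randn(2, 8, 256))   # (2, 8, 256), each slot unit-norm
```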

E ATTENTION ANALYSIS
To demonstrate how the memory network could be used to explain the need for long-range dependencies, we analyzed the attention outputs from the memory writer following [30]. We empirically categorized the memory slots into three different types and visualized three examples with normalized attention values in Figure 7 (a). We particularly selected memory slots with indexes 0, 17, and 22, which represent the three types of memories. In the first type of memory slot, like m_22, attention focused on the slot itself, meaning that it was not updating at the current timestep. This suggests that memory slots can carry information from the distant past. For the second type, the memory slot m_17 had partial attention over itself and the rest of its attention over other tokens. This type of memory slot is transformed from the first type, and at the current timestep it aggregates information from other tokens. The third type of memory slot looks like m_0: it completely attended to the input tokens. In the beginning, nearly all memory slots belong to this type, but later only 5% to 10% of the total memory slots account for this type. We also found that the forgetting vector's bias for m_0 had a larger magnitude compared to some other slots, suggesting that the information was changing rapidly in this memory slot.
In addition, to see how crossmodal attention learned the correlation between different modalities [29], we visualize the attention activations in Figure 7 (b), which shows an example section of the crossmodal attention matrix on layer 3 of the audio-vision crossmodal network (the original matrix has dimension T_a × T_v; the figure shows the attention corresponding to approximately a 10-second window of that matrix). We observe that crossmodal attention has learned to attend to meaningful signals across the two modalities. For example, stronger attention is given to the intersections of story-related utterances that tend to trigger engagement (e.g., "Only Comes", "Big Brown") and drastic facial expression and body gesture changes in the video. This observation advocates the advantage of the crossmodal transformer over conventional alignment: crossmodal attention enables the direct capture of potentially long-range signals, including those off the diagonals of the attention matrix [29].

Figure 1: Best viewed zoomed in and in color. Interactive conversation scenarios from the (a) DAMI-P2C, (b) MPII, and (c) BOSS datasets. For each conversation, intrapersonal influence and interpersonal influence are evident in the affective displays, and the overall affective momentum with the self and interpersonal influence dynamics is depicted in (d).

Figure 2: Schematic visualization of the proposed method. (a) While there exist different combinations of modality inputs, here we exemplarily consider the case of joint engagement prediction between parent and child using audio, video, and conversation history inputs. The original input is separated into intrapersonal and interpersonal inputs. (b) We utilize verbal cues to guide nonverbal cues by initializing the memory bank with the verbal context and feeding nonverbal segments with this memory to the network (for the memory encoder, we borrow the encoder part of the Memformer network [30] and utilize the outputs as verbal contexts). (c) Nonverbal cues are split into segments along the temporal axis and iteratively processed through the memory network to update the memory and output the encoded representation in the last layer. Crossperson Attention (CPA) is used to explicitly model the affect dynamics; the mechanism is described in Section 3.4.

Figure 3: Video Inpainting. The corresponding modules work in an end-to-end manner. A qualitative visualization of the approach is shown in the figure (image borrowed from [19]).

Figure 4: Speaker Diarization. The model learns through self-supervision to represent a video as a set of discrete audio-visual objects. This model groups a scene into object instances and represents each one with a feature embedding (image borrowed from [2]).

Figure 6: (a) Memory Reading and (b) Memory Writing in the memory modules. Image borrowed from [30].

Figure 7: Best viewed zoomed in and in color. (a) Visualization of three different memory slots. We analyze the attention outputs from the memory writer and show the representative memory slots of index 0, 17, and 22 from each category. In the case of the first type (e.g., m_22) of memory slots, attention focused on themselves, meaning that they were not updating at the current timestep. In contrast, in the case of the third type (e.g., m_0) of memory slots, they completely attended to the input tokens. (b) Visualization of sample crossmodal attention weights from layer 3 of the audio-vision crossmodal transformer on DAMI-P2C. The attention weights show that the model was able to learn the correlation between the audio and vision modalities (especially where people's utterances triggered their facial expressions and bodily gestures).

Table 2: Results and standard deviations for the proposed and baseline models on the MPIIGroupInteraction and BOSS datasets using 3 seeds. In the "Modality" column, m_i + m_j stands for modality fusion of modalities m_i and m_j. Rap, Comm, Follw, and Attn stand for Rapport, Communication, Following, and Attention, respectively. For the MPII and BOSS datasets, experiments on the t modality are not reported since it was not provided in the original datasets. † represents statistical significance over state-of-the-art scores under the paired bootstrap test (p < 0.05) with Bonferroni correction.

Table 5: Hyperparameters of the CPM-T model for best performance in various tasks.