
DIPS: A Dyadic Impression Prediction System for Group Interaction Videos

Published: 23 January 2023


Abstract

We consider the problem of predicting the impression that one subject has of another in a video clip showing a group of interacting people. Our novel Dyadic Impression Prediction System (DIPS) contains two major innovations. First, we develop a novel method to align the facial expressions of subjects \(p_i\) and \(p_j\) as well as account for the temporal delay that might be involved in \(p_i\) reacting to \(p_j\)’s facial expressions. Second, we propose the concept of a multilayered stochastic network for impression prediction on top of which we build a novel Temporal Delayed Network graph neural network architecture. Our overall DIPS architecture predicts six dependent variables relating to the impression \(p_i\) has of \(p_j\). Our experiments show that DIPS beats eight baselines from the literature, yielding statistically significant improvements of 19.9% to 30.8% in AUC and 12.6% to 47.2% in F1-score. We further conduct ablation studies showing that our novel features contribute to the overall quality of the predictions made by DIPS.


1 INTRODUCTION

There are many situations in group settings where we wish to understand the impressions that a person \(p_i\) has of another person \(p_j\). For instance, in a diplomatic negotiation, it might be critical for one side to understand the mutual feelings of people on the other side toward one another as this can provide important leverage. A person called in to a business meeting with a group of people she doesn’t know might wish to understand the like/dislike relationships between the people she is meeting with.

We therefore consider the problem of dyadic impression prediction using nonverbal cues such as facial action units [4], facial emotions [37], gaze relationships [3], and more in order to predict subject \(p_i\)’s impression about subject \(p_j\)’s likability. Likability is captured through six survey questions designed to elicit \(p_i\)’s impression of \(p_j\).

In order to achieve this, we build upon a dataset of a popular party game called “The Resistance” [2, 3], which is similar to popular games such as Mafia and Werewolf, played by tens of thousands of people worldwide. Each game lasts for 30 to 60 minutes and involves five to eight players. We conducted a survey of the impressions of players toward other players in which we asked six questions at the end of the game. Each subject filled out these questions on a 7-point scale. Examples of questions included “Did you find \(p_j\) to be cold or warm?” and “Did you find \(p_j\) unpleasant or pleasant?” The answers to these six questions formed the dependent variables that we wish to predict. Our dataset includes a total of 48 games, with no player present in more than one game.

Our DIPS (Dyadic Impression Prediction System) predicts the answers to these six questions and has several novel components:

(1)

Emotion, Facial Action Units, and Temporal Alignment. Social science theory [13] posits that \(p_i\)’s impression of \(p_j\) and \(p_j\)’s impression of \(p_i\) are not independent. Of course, we see this in our daily lives—if a person doesn’t like you, you may not like them back. We might therefore get clues about \(p_i\)’s impression of \(p_j\) by looking at \(p_j\)’s facial emotions. We define a novel class of alignment vectors that capture the alignment—with possible temporal delays in order to account for subjects’ response times—between the facial emotions and action units of subjects \(p_i\) and \(p_j\).

(2)

Temporal Delayed Network. We introduce the novel concept of a Temporal Delayed Network, which is a multi-layer network [14, 33] where each layer represents a particular time point. Within a single layer, nodes correspond to players and edges correspond to different interactions between players (e.g., look at, talk to, listen to). Within a layer, edges are labeled with the probability that the stated interaction occurs. Across layers, edges represent identity information by linking the same individuals in different layers, as well as delayed interaction information. To our knowledge, this is the first time that multi-layer networks have been used in predicting impressions of subjects. Using this multi-layer network as an underlying graph, we build a Graph Convolution Network [32] with an attention mechanism [56] to learn representations and predict dyadic impressions of \(p_i\) toward \(p_j\).

(3)

Social Network Features. We defined two classes of social network features. Emotion ranks consider the emotions of a dyad \((p_i,p_j)\) simultaneously and define an emotion score as the intensity of a given emotion on \(p_i\)’s face times the probability that that emotion is directed from \(p_i\) toward \(p_j\). Emotion rank takes these dyadic emotion scores as input and uses gaze networks to build a PageRank-style metric. We also leveraged social balance theory in a signed network [15, 19] in which edges can be positive or negative (whether \(\pm 1\) or weighted amounts). We developed novel degree-of-imbalance features capturing \(p_i\)’s impressions of \(p_j\) vis-à-vis third parties.

We implemented DIPS and baselines encompassing past work [2, 5, 41]. We show that DIPS is able to generate AUCs ranging from 73% to 77% for the six dependent variables capturing impressions of a person \(p_i\) w.r.t. person \(p_j\) and improves upon the performance of eight competing baselines from the literature by 19.9% to 30.8% in AUC and 12.6% to 47.2% in F1-score. We further conduct ablation tests to show that the three novel components mentioned above contribute to this increase in predictive performance.


2 RELATED WORK

2.1 Social and Psychological Science Efforts

Social scientists have studied likability for decades. Reysen [51] developed a likability scale asking subjects to rate other people on 11 variables, such as attractiveness, friendliness, similarity to the subject, and likelihood of being friends with the subject. He observed that genuine laughter is a strong predictor of likability [51, 52]. One of the most commonly studied personality trait groups comprises the Big Five [22]: extraversion, agreeableness, conscientiousness, emotional stability, and culture. Burgoon’s [10] Expectancy Violation Theory linked impressions to the fulfillment of expectations (e.g., X liking Y is linked to whether Y reciprocates).

Davydenko et al. [13] observed that the way people act (e.g., extroverted or introverted) affects how others judge them. Such behaviors also form part of a feedback loop: a person interacting with an extroverted partner is perceived as nicer and as displaying more positive social behavior. Seiter et al. [54] showed that even background nonverbal behavior can alter the impression of a person: a nonspeaking debater is less liked if they express negative facial expressions rather than a neutral expression. Smiling while looking at a person was shown to be more important than verbal expressions [21]. Hareli and Hess [23] showed that knowledge about a person’s emotional response can be used by others in forming impressions of that person; e.g., when someone is angry or unpleasant, others find the person less likable.

Interpersonal Adaptation Theory (IAT) [11] states that since people have preset expectations when interacting with another person, the impression toward her can be predicted by comparing actual behaviors with expectations. For instance, if person B seems unexpectedly aggressive to person A, IAT suggests that A will have a negative impression of B.

Another important quality of human interaction is rapport. Tickle-Degnen and Rosenthal [55] identify mutual attentiveness, positivity, and coordination between participants as essential components of rapport. Nonverbal behavior is seen as a quintessential component of identifying rapport; e.g., if A’s facial expressions mirror B’s, then they like each other more. Murata [42] suggests that people will emulate someone’s grin if they like that person: smiling means more smiling. Overall, a person’s gaze or attention seems to drive this mimicry. Murata [42] also notes that disliked people are mimicked less.

Initial impressions also seem to matter. Bruce and McDonald [9] suggest that a person will continue to like a stranger if there was an initial positive reaction. Interestingly, an initial negative reaction does not lead to one disliking another for long periods of time. Instead, people tend to forget unlikable faces, and thus the possibility of an impression changing from dislike to like appears to be higher than constant displeasure. Various features have been built to reflect social science theories about this change from dislike to like. For instance, Nisbett and Smith [45] find that sentiment is correlated with the subject’s popularity in the group: even if someone initially has a negative opinion of another person, that person’s popularity within the group can soften the negative sentiment.

Interestingly, Floyd and Burgoon [21] find that people’s expression of dislike is more “uncontrollable” and external, while acceptance is generally internally expressed and more controlled in gesture. Based on the findings that relate both personal traits and group interactions to like/dislike impressions, we develop the DIPS framework to predict such dyadic impressions.

2.2 Computational Efforts

The past decade has seen heightened interest in automated analysis and modeling of human-to-human interaction in dyadic or group settings as well as computer-mediated interactions.

Datasets. Several datasets have been proposed as benchmarks for a variety of human interaction tasks. SEMAINE [40] contains videos of people conversing with human-driven, semi-automatic, and automatic virtual agents and annotations of perceived personality traits. ELEA [53] captures collaborative group interaction, the Big Five traits, and leadership behavior annotated by external observers as well as perceived leadership and likability reported by group participants. MATRICS [44] uses several modalities (motion capture, gaze tracking, head acceleration, video, audio, Kinect sensor) to record a small group of people participating in a task-oriented discussion. The ChaLearn First Impressions dataset [49] is a set of YouTube videos with human-annotated Big Five scores. VLOG [7] is another YouTube dataset with crowdsourced personality impressions. Recent work has also focused on job interviews and predicting hiring decisions [27, 43].

Personality Traits and Leadership Prediction. The majority of work in this area focuses on predicting the Big Five and leadership traits. Joshi et al. [29] used a Pyramid of Histogram of Gradients to predict perceived traits in videos of humans interacting with virtual agents expressing various personalities. Chávez-Martínez et al. [12] use multimodal features for multi-label prediction of moods and traits in the VLOG dataset. Fang et al. [20] use a variety of semantic audio and visual features as well as dyadic and group features for personality trait prediction in small groups. They also identify key features for such predictions. Çeliktutan and Gunes [62, 63] used visual features (e.g., Histogram of Gradients, Histogram of Optical Flow) and audio features such as MFCC to predict perceived traits in videos of humans interacting with virtual agents. Kindiroglu et al. [31] used multi-task and transfer learning for extraversion and leadership prediction on the ELEA and VLOG datasets with the same set of high-level audio-visual features. Kampman et al. [30] proposed an end-to-end deep model for multimodal impression prediction. Beyan et al. [5] used DNN-based features for leadership and high/low extraversion prediction. In another effort [6], they used dynamic images and activity-based information to infer personality traits. Mawalim et al. [39] investigated the effectiveness of multimodal features such as acoustic, head motion, and linguistic features in predicting personality traits. Anselmi et al. [1] used deep neural networks (DNNs) to infer self-reported personality traits from highly constrained still face images. Zhang et al. [61] proposed DNNs for joint prediction of apparent personality and emotions, showing that the joint task improves upon separately predicting traits and/or emotions. Jia et al. [28] exploited low-rank label correlations for facial emotion distribution learning, improving the quality of facial expression recognition. Muller et al. [41] proposed a framework for detecting low rapport in a small group setting using speech, facial, and body movement features. Bai et al. [2] suggested a framework for detecting the most dominant person in a group and also relative dominance between two people by exploiting interaction dynamics within the group. Wang et al. [57] proposed a diffusion graph convolutional network to predict several social behavioral traits from social dynamic interaction networks. Okada et al. [46] proposed co-occurrence pattern mining in multimodal features such as speech, head movement, body movement, and gaze for leadership and Big Five prediction. Zhang et al. [60] also considered co-occurring visual events for predicting Big Five traits. Some recent efforts focus on explainability and interpretability of impression prediction systems [18].

Significant effort has been devoted to analyzing group and dyadic emotional states. Otsuka et al. [48] proposed a method to quantify the amount of influence in interpersonal communication based on the amount of attention paid during that conversation estimated from gazes and speech patterns. Kumano et al. [35] improved upon that work by incorporating facial expressions and pose estimation in the process, allowing them to predict not just the amount of influence but also attitudes of participants towards each other. Another effort by Kumano et al. [34] models the perceived dyadic impressions between two people using time lag in participants’ facial expression congruency.

Predicting Activity Outcomes. Some research has focused on assessing communication skills in group settings [44, 47]. Lin and Lee [38] built a conversational Graph Convolution Network using acoustic and lexical features in order to predict group performance outcomes in the ELEA dataset. Eloy et al. [17] used a combination of facial analysis and upper movement multi-dimensional recurrence quantification in order to model team level dynamics and predict the outcome on collaborative tasks. Yan et al. [58] proposed methods to improve group activity recognition based on individual participation.

How We Differ. Our work differs from past research in several key aspects. First, we predict dyadic impressions as opposed to group impressions; i.e., we predict how person A likes person B, not how the group as a whole perceives person B, or how a group of external observers perceives that person. Second, we are the first to look at how interactions between individuals within a group shape A’s impressions of B. Third, we propose an important method by which we can align facial action units and emotions over time. Fourth, we propose a novel Temporal Delayed Network framework that uses multi-layer graphs to capture temporal effects and also includes an important attention mechanism to make predictions that beat past baselines.


3 THE RESISTANCE DATASET

3.1 Dataset Description

We use the Resistance social game video dataset [2] to test and validate our approach. In every game, five to eight players are secretly divided into two teams called “villagers” and “spies.” Two to three players are spies. Spies know who the other spies are, but villagers do not know the roles of other players. Figure 1 shows an example of the game layout with players’ roles.

Fig. 1.

Fig. 1. Sample of the game. Players sit in a circle; roles (“spy” and “villager”) are assigned randomly. All players have tablets in front of them recording frontal videos of players.

The game proceeds in rounds (every round is called a “mission”) comprising three stages:

(1)

First, players nominate and elect a mission leader. Votes are cast iteratively. The process repeats until a nominated player receives a majority of player votes.

(2)

The mission leader then nominates several players for the mission. The number of players to go on a mission is announced before each round and increases with the progress of the game. Players vote to approve or reject the proposed mission team. If the proposed team does not receive majority support, the mission leader has to nominate another group of players. If proposed mission teams are rejected three times in a row, the round ends with a point for the spies, so spies are motivated to become both mission leaders and members.

(3)

Finally, the selected players “go on a mission” by secretly voting to succeed or fail in the mission. If a specified number of “fail” votes is achieved, the mission fails and spies get a point. Otherwise, the mission is successful and villagers get a point.

The team with the highest score at the end of the game wins. Spies and villagers have different incentives throughout the game: Spies want to stay stealthy as long as possible and get elected on as many missions as possible so they can fail the missions and earn points for their team. On the other hand, villagers want to identify spies as early as possible in order to prevent them from getting on the missions.

Post-game Survey. The survey was designed by a multidisciplinary team of distinguished social scientists with a long history of studying human behavior, facial expressions, and non-verbal communication [16]. Players were asked to fill out an end-game survey that included six impression-related questions (see Table 1), each asking players to rate each other on a 7-point scale. The questions use a modified version of the Reysen likability scale [51], a widely cited tool in psychology for measuring the likability of a person.

Table 1. Survey Questions

Question # | Variables in the Survey
Q1 | Very cold : Very warm
Q2 | Very negative : Very positive
Q3 | Very unpleasant : Very pleasant
Q4 | Very unfriendly : Very friendly
Q5 | Very unlikable : Very likable
Q6 | Very unsociable : Very sociable

  • All players had to participate in a 6-question survey in which they rated every other player \(p_j\) on a 7-point scale. All questions had the same form but different variables to rate: “Was Player \(p_j\) friendly, likable, and pleasant or cold, negative, and unfriendly?”

The dataset contains frontal videos captured using tablets placed in front of the players as shown in Figure 1(b). The set of videos includes games conducted in eight locations with different cultures: three locations in the United States and one each in Zambia, Israel, Singapore, Hong Kong, and Fiji. In total, the dataset consists of 48 games with 348 players, each player appearing in exactly one game. Games typically last two to eight rounds or 30–65 minutes.

3.2 Dataset Analysis

Our dataset contains video and survey data for 348 players (135 spies, 213 villagers). In terms of gender, 44% of players were male and 56% were female. Participants were recruited from college student populations with a median age of 21.

First, we test the hypothesis that the gender of players affects the impression, i.e., that females rate players differently than males, and that players rate female and male players differently based on their own gender. Table 2 shows the means and standard deviations of the variables depending on the gender of rating and rated players. We used the Mann-Whitney U-test and found no significant difference between these groups.
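Such a comparison can be sketched in a few lines. The ratings below are hypothetical, not taken from the dataset, and in practice one would also compute a p-value from the U distribution (e.g., via a statistics package):

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic: the number of pairs (x, y) with x > y,
    counting ties as 0.5. A pure-Python sketch for small samples."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical Q1 ratings (7-point scale) split by the rating player's gender:
female_ratings = [5, 6, 4, 5, 7, 5]
male_ratings = [5, 5, 4, 6, 5, 4]
u = mann_whitney_u(female_ratings, male_ratings)
# Under the null hypothesis, U is near len(xs) * len(ys) / 2 (here 18).
```

A U value close to its null expectation, as here, is consistent with no gender difference.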

Table 2. Gender-based Distribution of Scores: Mean (Std)

Rating Player | Rated Player | Q1          | Q2          | Q3          | Q4          | Q5          | Q6
Female        | Female       | 4.94 (1.59) | 4.87 (1.58) | 5.17 (1.44) | 4.87 (1.67) | 5.21 (1.45) | 4.78 (1.77)
Female        | Male         | 5.05 (1.55) | 4.93 (1.56) | 5.21 (1.40) | 4.99 (1.65) | 5.23 (1.41) | 4.86 (1.70)
Male          | Female       | 5.01 (1.59) | 4.99 (1.49) | 5.23 (1.32) | 5.10 (1.52) | 5.32 (1.34) | 4.82 (1.61)
Male          | Male         | 4.95 (1.57) | 4.79 (1.54) | 4.99 (1.45) | 4.83 (1.64) | 5.04 (1.43) | 4.76 (1.65)

  • We found no statistically significant difference depending on the genders of players.

Second, we checked a similar hypothesis about impression differences based on players’ roles in the game. Table 3 shows the corresponding means and standard deviations. We did not find any significant differences in this case either.

Table 3. Role-based Distribution of Scores: Mean (Std)

Rating Player | Rated Player | Q1          | Q2          | Q3          | Q4          | Q5          | Q6
Villager      | Villager     | 5.12 (1.53) | 5.05 (1.53) | 5.25 (1.38) | 5.08 (1.63) | 5.32 (1.37) | 5.00 (1.62)
Villager      | Spy          | 4.80 (1.57) | 4.78 (1.52) | 5.05 (1.40) | 4.83 (1.61) | 5.10 (1.38) | 4.58 (1.74)
Spy           | Villager     | 4.93 (1.64) | 4.75 (1.59) | 5.10 (1.47) | 4.88 (1.65) | 5.14 (1.49) | 4.74 (1.71)
Spy           | Spy          | 5.14 (1.54) | 5.04 (1.54) | 5.23 (1.36) | 4.95 (1.59) | 5.28 (1.43) | 4.89 (1.68)

  • We found no statistically significant difference depending on the roles of players.

Table 4 shows that correlations between different ratings of questions are relatively high, which means players who score other players high on one variable tend to score the same players high on the other variables. The highest correlation is for a question about Pleasant–Unpleasant and a question about Likable–Unlikable. The lowest correlation is for questions Warm–Cold and Sociable–Unsociable.

Table 4. Mutual Spearman Correlation between Different Variables

   | Q2   | Q3   | Q4   | Q5   | Q6
Q1 | 0.63 | 0.71 | 0.56 | 0.68 | 0.49
Q2 |      | 0.70 | 0.68 | 0.63 | 0.59
Q3 |      |      | 0.64 | 0.78 | 0.56
Q4 |      |      |      | 0.67 | 0.64
Q5 |      |      |      |      | 0.56

  • All correlations are statistically significant with \(p\le 0.05\).

3.3 Problem Description

Given the past social science findings that negative impressions are expressed via facial expressions [21] (as opposed to positive impressions that may be “internalized” and not facially expressed), we study the problem of predicting if player \(p_i\) will have a negative impression of player \(p_j\) according to each of six variables in the survey (Table 1). As responses to the six variables can range from 1 to 7, we define the impression on a given variable to be positive if player \(p_i\) rates player \(p_j\) as 4 or above on the 7-point scale, and negative for ratings of 3 and below. Depending on the six dependent variables considered, we see that 11% to 23% of ratings provided across our entire dataset are negative, which is expected, as by default people tend to have a neutral or positive impression of strangers. We observed this in our own data and it has also been noted in earlier social science research [9]. Yet interactions and observations of a person over time (i.e., during the game) can change the impression to negative. Therefore, we consider a binary classification problem of predicting negative impressions between people according to each of the six variables (in other words, negative impression rating is the positive class in our problem), with six tasks in total.
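The label construction described above amounts to the following (the function name is ours):

```python
def binarize_impression(rating):
    """Binarize a 7-point impression rating as in Section 3.3: ratings of
    3 and below mark a negative impression (the positive class, label 1);
    ratings of 4 and above are non-negative (label 0)."""
    if not 1 <= rating <= 7:
        raise ValueError("rating must be on the 7-point scale")
    return 1 if rating <= 3 else 0

# Example: five hypothetical ratings of p_j by different raters.
labels = [binarize_impression(r) for r in [2, 4, 7, 3, 5]]
```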


4 THE DIPS APPROACH

The psychological literature has shown that emotions and facial expressions play an important role in impressions a person has of another [13, 23]. We therefore use the emotions and facial action units extracted using off-the-shelf tools [4, 37] as inputs. After extracting these values for the whole video, we get the vector of values \(\mathbf {P}(p_i, e) = [v_1(p_i, e), v_2(p_i, e), \ldots , v_T(p_i, e)]\), where \(v_t(p_i,e)\) is either the probability of emotion \(e\) [37] for player \(p_i\) at time \(t\) or the intensity of a particular facial action unit for OpenFace [4]. As negative emotions are causes of negative impressions according to social theories [23], we split emotional expressions into two subsets: positive emotions \(\mathcal {E}^{+}\) (happy) and negative emotions \(\mathcal {E}^{-}\) (angry, disgusted, fearful, sad).

To capture the dynamics of group interactions, we also consider three dynamic interaction networks \(G_I = (V_I, E_I)\) derived from [2, 3, 36]: look-at, talk-to, and listen-to networks. Vertices in these networks are participants and edges are interactions among them evolving over time. These three networks are built based on what we can infer from the video. We know who is looking at whom [3]. We say person A is talking to person B if A is speaking and A is looking at B at the same time. We say A is listening to person B if B is speaking and A is looking at B. It is important to note that these are all probabilities [2]. We emphasize that the underlying talk-or-not signal is not a network in the same sense as the other networks considered, as it only represents the probability of each player talking at a given timestamp; when a player speaks, he may be speaking to a specific individual or to the group as a whole. Formally, a vertex \(p_{i,t} \in V_I\) in any of these networks represents player \(p_i\) in the game at time \(t\). Each directed edge \((p_{i,t}, p_{j,t}) \in E_I\) has an associated weight representing the probability of a particular interaction between players \(p_i\) and \(p_j\) at time \(t\): whether player \(p_i\) looks at, talks to, or listens to player \(p_j\). The probability of player \(p_i\) talking to player \(p_j\) is the product of the probability of player \(p_i\) speaking (estimated from facial movements) and the probability of player \(p_i\) looking at player \(p_j\) (estimated using the collective classification approach in [3]). Similarly, the probability of player \(p_i\) listening to player \(p_j\) is defined as the product of the probability of player \(p_i\) looking at player \(p_j\) and the probability of player \(p_j\) speaking.
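Under the assumption that the per-timestamp speaking and gaze probabilities are available as plain numbers, the edge weights can be sketched as:

```python
def talk_to_prob(p_i_speaks, p_i_looks_at_j):
    """P(p_i talks to p_j) = P(p_i speaks) * P(p_i looks at p_j)."""
    return p_i_speaks * p_i_looks_at_j

def listen_to_prob(p_i_looks_at_j, p_j_speaks):
    """P(p_i listens to p_j) = P(p_i looks at p_j) * P(p_j speaks)."""
    return p_i_looks_at_j * p_j_speaks

# Hypothetical probabilities for a dyad (p_i, p_j) at one timestamp:
w_talk = talk_to_prob(0.8, 0.5)      # p_i speaking w.p. 0.8, looking at p_j w.p. 0.5
w_listen = listen_to_prob(0.5, 0.2)  # p_j speaking w.p. 0.2
```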

Figure 2 shows our overall DIPS framework, an ensemble of four novel components. We extract facial expressions and Interaction Networks (look-at, listen-to, talk-to) from the frontal videos of the players. Extraction of the interaction networks also uses the layout of the players in space as described in [3]. We then extract novel Emotion Rank (Section 4.1) and Sign Imbalance (Section 4.2) features using the interaction networks and facial expressions (though our experiments will later show that these features are not very important for impression prediction). We also use facial expressions to calculate our novel alignment features (Section 4.3). Furthermore, we use Interaction Networks to build and train our novel Temporal Delayed Network (Section 4.4) to produce player embeddings that we use to predict impressions. Finally, we use individual predictions of all of these methods to build an ensemble with late fusion. The rest of this section describes each of these methods in detail.

Fig. 2.

Fig. 2. DIPS framework. We extract facial expressions and Interaction Networks (look-at, listen-to, talk-to) from frontal videos. We use facial expressions to calculate our novel FAU/EMO alignment features (Section 4.3). We use interaction networks and facial expressions to build Emotion Rank (Section 4.1) and Sign Imbalance (Section 4.2) features. Furthermore, we use Interaction Networks to build and train our novel Temporal Delayed Network (Section 4.4) algorithm to produce player embeddings that we use to predict impressions. Each of the feature classes is calculated using all three Interaction Networks (but for the sake of simplicity, only one is shown in the figure). Finally, we use individual predictions of all of these methods to build ensembles with late fusion.

4.1 Emotion Rank

We first define the notions of emotion scores and emotion vectors that capture the “amount” of emotions directed from one person to another in a period of time. Given a dynamic interaction network \(G_I = (V_I, E_I)\), an emotion \(e \in \mathcal {E}=\mathcal {E}^{+} \cup \mathcal {E}^{-}\), participants \(p_i,p_j \in V_I\), a time window \(\tau\), and a weight function \(w\) associated with every edge (probability of the given interaction), we define the emotion score (ES) as follows: (1) \(\begin{equation} ES(e,p_i,p_j,G_I,\tau) = \pm \frac{1}{|\tau |}\sum _{t \in \tau } v_t(p_i, e)\cdot w(p_{i,t},p_{j,t}), ~~(p_{i,t},p_{j,t}) \in E_I, \end{equation}\) where the summation goes over the time window \(\tau\) with length \(|\tau |\), and the sign depends on the emotion \(e\): positive if \(e \in \mathcal {E}^{+}\) and negative otherwise. We further define the emotion vector as a vector \(EV(p_i,p_j,G_I,\tau) = [ES(e,p_i,p_j,G_I,\tau)]_{e \in \mathcal {E}}\) of emotion scores for all emotions considered.
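A minimal sketch of Equation (1), assuming per-frame emotion values and edge probabilities are given as lists (the example numbers are hypothetical):

```python
POSITIVE_EMOTIONS = {"happy"}
NEGATIVE_EMOTIONS = {"angry", "disgusted", "fearful", "sad"}

def emotion_score(emotion, v, w):
    """Emotion score per Equation (1): signed time-average over the window of
    p_i's per-frame emotion value v[t], weighted by the probability w[t] of the
    interaction edge (p_i, p_j). Positive emotions get a + sign, negative a -."""
    assert len(v) == len(w), "v and w must cover the same time window"
    sign = 1.0 if emotion in POSITIVE_EMOTIONS else -1.0
    return sign * sum(vt * wt for vt, wt in zip(v, w)) / len(v)

def emotion_vector(v_by_emotion, w):
    """Emotion vector EV: the emotion scores for all emotions considered."""
    return {e: emotion_score(e, v, w) for e, v in v_by_emotion.items()}

# Hypothetical 4-frame window for a dyad (p_i, p_j):
ev = emotion_vector(
    {"happy": [0.9, 0.8, 0.0, 0.1], "angry": [0.0, 0.1, 0.7, 0.6]},
    w=[1.0, 1.0, 0.5, 0.5],  # probability that p_i looks at p_j per frame
)
```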

Next, we aggregate the emotion vector into a scalar in order to define the emotion rank. We combine the vector \(EV(p_i,p_j,G_I,\tau)\) into a single score using any one of five aggregation functions:

  • \(f(EV) = \mathbb {I}_{e}(EV) = ES(e,p_i,p_j,G_I,\tau),\)

  • \(f(EV) = \max (EV^{+}) + \min (EV^{-})\),

  • \(f(EV) = \operatorname{avg}(EV^{+}) + \operatorname{avg}(EV^{-})\),

  • \(f(EV) = sel(\max (EV^{+}),\min (EV^{-}))\),

  • \(f(EV) = sel(\operatorname{avg}(EV^{+}),\operatorname{avg}(EV^{-}))\),

where \(\begin{equation*} sel(x_1,x_2) = \left\lbrace \begin{array}{ll} x_1, & \text{if } x_1 \gt -x_2,\\ x_2, & \text{if } x_1 \lt -x_2,\\ 0, & \text{otherwise.} \end{array} \right. \end{equation*}\) Note that the \(sel\) function takes \(x_1 (x_1 \ge 0)\) and \(x_2 (x_2 \le 0)\) as inputs and either returns the one whose absolute value is larger or returns 0 if their absolute values are equal. In our case, the first function selects a single emotion component, the second and third aggregation functions sum up the attitudes from positive and negative emotions to get an overall attitude, and the last two try to select the valence (positive vs. negative) that is more prominent.
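Two of the five aggregation functions, together with \(sel\), can be sketched as follows (function names such as agg_max_min are ours):

```python
def sel(x1, x2):
    """sel takes x1 >= 0 and x2 <= 0 and returns whichever has the larger
    absolute value, or 0 when they are equal in magnitude."""
    if x1 > -x2:
        return x1
    if x1 < -x2:
        return x2
    return 0.0

def agg_max_min(ev, positive, negative):
    """f(EV) = max(EV+) + min(EV-): overall attitude from both valences."""
    return max(ev[e] for e in positive) + min(ev[e] for e in negative)

def agg_sel_max_min(ev, positive, negative):
    """f(EV) = sel(max(EV+), min(EV-)): keep only the dominant valence."""
    return sel(max(ev[e] for e in positive), min(ev[e] for e in negative))

# Hypothetical emotion vector:
ev = {"happy": 0.4, "angry": -0.3, "sad": -0.1}
overall = agg_max_min(ev, {"happy"}, {"angry", "sad"})       # 0.4 + (-0.3)
dominant = agg_sel_max_min(ev, {"happy"}, {"angry", "sad"})  # 0.4 wins over -0.3
```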

Next, we recursively define the Emotion Rank \(ER_{f}(p_i,p_j,G_I,\tau)\) as (2) \(\begin{align} ER_{f}(p_i,p_j,G_I,\tau) &= \alpha _0 + \alpha _1 f(EV(p_i,p_j,G_I,\tau)) \\ &\quad + \alpha _2 \sum _{k \ne i,j} \frac{ER_{f}(p_k,p_j,G_I,\tau)\cdot f(EV(p_k,p_j,G_I,\tau))}{out(p_k)} \\ &\quad + \alpha _3 \sum _{k \ne i,j} \frac{ER_{f}(p_i,p_k,G_I,\tau)\cdot f(EV(p_i,p_k,G_I,\tau))}{out(p_i)} \cdot \frac{ER_{f}(p_k,p_j,G_I,\tau)\cdot f(EV(p_k,p_j,G_I,\tau))}{out(p_k)}, \end{align}\) where \(f\) is one of the five aggregation functions defined above, \(\alpha _i \ge 0\), \(\sum _i \alpha _i = 1\), and \(out(p_i)\) is the out-degree of the vertex \(p_i\) in the graph \(G_I\).

Intuitively, the Emotion Rank from \(p_i\) to \(p_j\) depends on (1) the direct edge \((p_i, p_j)\in E_I\); (2) other people’s Emotion Rank toward \(p_j\); and (3) any path of length 2, \((p_i,p_k,p_j)\), where \((p_i,p_k), (p_k,p_j) \in E_I\).

As a result, for a given pair of players, we get the vector of values \([ER_{f}(p_i,p_j,G_I,\tau)]\) over a varying set of time intervals \(\tau\) spanning from 1 second to the length of the whole video \(T\). To make predictions for the whole video, we need to aggregate these values into a fixed-length vector to be able to apply standard classifiers. As in the case of [2], we calculate histograms of these values with a fixed number of bins and use these histograms as features for our classification task.
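One way to evaluate the recursively defined Emotion Rank is a fixed-point iteration starting from zero; this solver choice is our assumption, as the text does not prescribe one:

```python
def emotion_rank(f, out_deg, alphas, iters=50):
    """Fixed-point iteration for Equation (2). f[i][j] holds the aggregated
    emotion score f(EV(p_i, p_j, G_I, tau)); out_deg[i] is the out-degree of
    p_i in G_I; alphas = (a0, a1, a2, a3), nonnegative and summing to 1."""
    a0, a1, a2, a3 = alphas
    n = len(f)
    er = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        new = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                ks = [k for k in range(n) if k != i and k != j]
                direct = a0 + a1 * f[i][j]
                # Other players' Emotion Rank toward p_j.
                others = sum(er[k][j] * f[k][j] / out_deg[k] for k in ks)
                # Length-2 paths p_i -> p_k -> p_j.
                paths = sum((er[i][k] * f[i][k] / out_deg[i])
                            * (er[k][j] * f[k][j] / out_deg[k]) for k in ks)
                new[i][j] = direct + a2 * others + a3 * paths
        er = new
    return er

# Tiny 3-player example with hypothetical aggregated emotion scores:
f = [[0.0, 0.5, 0.2],
     [0.3, 0.0, 0.4],
     [0.1, 0.6, 0.0]]
er = emotion_rank(f, out_deg=[2, 2, 2], alphas=(0.1, 0.6, 0.2, 0.1))
```

With the coefficients bounded as above, the update is a contraction on this example, so the iteration converges quickly.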

4.2 Sign Imbalance

Social scientists have studied balance theory for many years [15, 24, 26]. A triangle in a graph is balanced if the number of negative edges is even (i.e., 0 or 2). In the case of weighted graphs, a triangle is balanced if the product of the edge weights is positive; otherwise, it is imbalanced.

We define a class of sign imbalance features as follows. For a given time window \(\tau\) and a given interaction graph \(G_I\), we build a weighted multi-layer graph \((V, E)\), where \(V\) is a set of participants, and for every ordered pair of vertices \(p_i, p_j \in V\) there are two edges in \(E\):

  • \((p_i, p_j)^{+} \in E\) with the associated weight \(w^{+}(p_i, p_j) = \max _{e \in \mathcal {E}^{+}}|ES(e, p_i, p_j, G_I, \tau)|\)

  • \((p_i, p_j)^{-} \in E\) with the associated weight \(w^{-}(p_i, p_j) = \max _{e \in \mathcal {E}^{-}}|ES(e, p_i, p_j, G_I, \tau)|\)

Note that \(w^{+}(p_i, p_j) \in [0,1]\) and \(w^{-}(p_i, p_j) \in [0,1]\) because of the way Emotion Scores are calculated (see Section 4.1).

Balance theory [15] suggests there are four possible balanced situations in any given triangle (Figure 3): a friend of my friend is my friend (Figure 3(a)), an enemy of my friend is my enemy (Figure 3(b)), a friend of my enemy is my enemy (Figure 3(c)), and an enemy of my enemy is my friend (Figure 3(d)). Since in our graph every edge \((p_i,p_j)\) has a weight \(w(p_i, p_j) \in [0,1]\) representing the intensity of emotions of a particular sign aligned with a given interaction, for any triangle to be balanced, balance theory suggests that the following equality will hold: (3) \(\begin{equation} w(p_i, p_j)\cdot w(p_j, p_k) = w(p_i, p_k), \end{equation}\) where \(w\) corresponds to \(w^{+}\) or \(w^{-}\) depending on the sign of the edge of the triangle (Figure 3).

Fig. 3.

Fig. 3. Balanced directed signed triads: for any triangle to be balanced the product of signs should be positive.

We define a sign imbalance feature for a participant \(p_i\) as the average discrepancy in balance (Equation (3)) over all triangles \(\lbrace (p_i,p_j), (p_j,p_k), (p_i,p_k)\rbrace\) involving \(p_i\) in the graph: (4) \(\begin{equation} SI(p_i, G_I, \tau) = \frac{1}{N}\sum _{p_j, p_k \in V, i \ne j \ne k}|w(p_i, p_j)\cdot w(p_j, p_k) - w(p_i, p_k)|, \end{equation}\) where the summation goes over all possible triangles in the graph containing vertex \(p_i\), \(N\) is the number of such triangles, and \(w\) corresponds to \(w^{+}\) or \(w^{-}\) depending on the sign of the edge of the triangle (Figure 3).
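Equation (4) for a single sign layer can be sketched as below, assuming a weight matrix `W` holding either the \(w^{+}\) or the \(w^{-}\) weights (variable names are ours):

```python
import numpy as np

def sign_imbalance(W, i):
    """Average balance discrepancy (Equation (4)) for vertex i, given a
    weight matrix W with W[a, b] in [0, 1] for one sign layer."""
    n = W.shape[0]
    diffs = [abs(W[i, j] * W[j, k] - W[i, k])
             for j in range(n) for k in range(n)
             if len({i, j, k}) == 3]  # all triangles containing vertex i
    return sum(diffs) / len(diffs)
```

A perfectly balanced layer (e.g., all weights equal to 1) yields a sign imbalance of 0.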

Similar to the Emotion Rank features, we aggregate the values over all possible time windows \(\tau\) by calculating histograms with a fixed number of bins. These histograms are used for the classification task at hand.

4.3 Emotion and Facial Action Unit Alignment

The social science literature draws a connection between mutual liking between two people and establishing rapport [55] by synchronizing body language and emotional states. A computational effort [46] also built on this idea and mined the co-occurrence patterns between the features of two people in order to successfully predict personality traits and behaviors.

Since we are considering a task concerning two people, we are interested in how well their emotions and facial expressions are aligned with each other and whether emotions expressed by one player cause the same or different emotions in the other player. We use cosine distance \(\cos (\mathbf {P}(p_i, e), \mathbf {P}(p_j, e))\) as a measure of alignment between two time series of emotions or facial action units, where \(\mathbf {P}(p_i, e)\) is a vector of emotion or facial action unit \(e\) intensities that player \(p_i\) shows, and \(\cos (\mathbf {x},\mathbf {y}) = \frac{\mathbf {x}^{T}\mathbf {y}}{||\mathbf {x}||\cdot ||\mathbf {y}||}\).

As it usually takes time for a person to see another person’s emotional state and react to it [50], we also consider the alignment between vector values shifted forward in time \(\mathbf {P}_{+\Delta t}(p_i, e) = [v_{\Delta t}(p_i, e), v_{\Delta t + 1}(p_i, e), \ldots , v_T(p_i, e)]\). In order to compute the cosine distance between \(\mathbf {P}_{+\Delta t}(p_i, e)\) and \(\mathbf {P}(p_j, e)\), we trim the latter to match the length of the shifted vector.

As we also do not know the direction of the effect, i.e., whether player \(p_i\) reacts to player \(p_j\) or the other way around, we also consider the alignment between vectors shifted backward in time by a factor \(\Delta t\) as follows: \(\begin{equation} \cos (\mathbf {P}_{-\Delta t}(p_i, e), \mathbf {P}(p_j, e)) = \cos (\mathbf {P}(p_i, e), \mathbf {P}_{+\Delta t}(p_j, e)). \end{equation}\)

Finally, we form a vector \(AL(p_i, p_j, e)\) of cosine distances for time shifts varying from \(-\Delta t\) to \(+\Delta t\): (5) \(\begin{equation} AL(p_i, p_j, e) = [\cos (\mathbf {P}_{-\Delta t}(p_i, e), \mathbf {P}(p_j, e)), \ldots , \cos (\mathbf {P}(p_i, e), \mathbf {P}(p_j, e)), \ldots , \cos (\mathbf {P}_{+\Delta t}(p_j, e), \mathbf {P}(p_i, e))]. \end{equation}\)
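Equation (5) can be sketched as follows; the trimming of the unshifted series follows the description above, and the array-handling details are ours:

```python
import numpy as np

def cosine(x, y):
    """Cosine alignment of two equal-length series."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def alignment_vector(p_i, p_j, max_shift):
    """AL(p_i, p_j, e) from Equation (5): alignment of two intensity time
    series under shifts -max_shift..+max_shift. Shifting p_i forward by dt
    and trimming p_j models p_i reacting to p_j with delay dt; negative
    shifts swap the roles."""
    out = []
    for dt in range(-max_shift, max_shift + 1):
        if dt >= 0:
            a, b = p_i[dt:], p_j[: len(p_i) - dt]
        else:
            a, b = p_i[: len(p_j) + dt], p_j[-dt:]
        out.append(cosine(a, b))
    return out
```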

We also extend the definition by considering possible pairs of facial expressions for a given pair of players, \(AL(p_i, p_j, e_l, e_k)\). For the prediction task at hand, for each pair of players we concatenate alignment vectors for different pairs of emotions or facial action units \(e_l\) and \(e_k\) to form a feature vector \(\begin{equation*} AL(p_i, p_j) = [AL(p_i, p_j, e_l, e_k)],~(l,k) \in [1,N]\times [1,N], \end{equation*}\) where \(N\) is the number of facial expressions considered.

4.4 Temporal Delayed Network

We leverage the concept of multi-layer networks [14, 33] as well as recent advances in non-Euclidean learning such as Graph Convolution Networks [32, 56], which have proven powerful for learning on social networks. We propose an approach to building graphs that captures the interaction between players as well as the dynamics of players’ behavior. We call this model a Temporal Delayed Network (TDN).

Given a dynamic interaction graph \(G_I = (V_I,E_I)\) (for instance, look-at graph), we build a multi-layer network [8] \(G = (V, E)\) in the following way (Figure 4):

Fig. 4.

Fig. 4. Example of a Temporal Delayed Network (TDN) for one of the games in the dataset (best viewed in color). Thick gray edges represent the interaction graph (in this example, look-at graph), thin orange edges represent identification edges (connecting the same player in different layers), and blue dotted edges represent the delayed influence edges. Color intensity represents the probability of the given interaction occurring at that time step (in other words, edge weights). Here we show a subset of players and a subset of edges for clarity.

  • Vertices \(p_{i,t} \in V\) represent player \(p_i\)’s state at time point \(t\).

  • We introduce three types of edges \(E=E_1 \cup E_2 \cup E_3\):

    (1)

    Interaction edges are derived from the interaction graph \(G_I\): \((p_{i,t}, p_{j,t}) \in E_1\) if and only if \((p_{i,t}, p_{j,t}) \in E_I\). Interaction edges carry the same weight as their counterparts in the interaction graph \(G_I\).

    (2)

    Identification edges connect the vertices corresponding to the same player at different points in time: \((p_{i,t}, p_{i,t^{\prime }}) \in E_2\) if \(t^{\prime } -t \le \Gamma\). Each identification edge has a weight that exponentially decays with difference in time steps: \(c(p_{i,t}, p_{i,t^{\prime }}) = \gamma ^{t^{\prime }-t}\). This allows propagation of the player’s inner state in time but restricts the effect of the past behavior on the present behavior.

    (3)

    Delayed influence edges build on the idea that interactions can have a delayed effect: for instance, player \(p_i\) seeing player \(p_j\)’s facial expression can affect player \(p_i\)’s impression only at the next time step. So, \((p_{i, t}, p_{j, t^{\prime }}) \in E_3\) if and only if \((p_{i, t}, p_{j, t}) \in E_1\) and \((p_{j, t}, p_{j, t^{\prime }}) \in E_2\). The associated weight is \(c(p_{i, t}, p_{j, t^{\prime }}) = c(p_{i, t}, p_{j, t}) \cdot c(p_{j, t}, p_{j, t^{\prime }})\).
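The construction of the three edge sets can be sketched as follows, assuming a hypothetical `interaction` structure giving per-time-step edge weights from \(G_I\); here the delayed influence weight is the product of the interaction edge weight at time \(t\) and the corresponding identification edge weight:

```python
def build_tdn_edges(interaction, n_players, T, Gamma=5, gamma=0.8):
    """Edge dictionaries for a Temporal Delayed Network.

    interaction[t][(i, j)] = weight of edge (p_{i,t}, p_{j,t}) in G_I."""
    E1, E2, E3 = {}, {}, {}
    # Interaction edges: copied per time step from G_I
    for t in range(T):
        for (i, j), w in interaction[t].items():
            E1[((i, t), (j, t))] = w
    # Identification edges: same player across time, exponentially decaying
    for i in range(n_players):
        for t in range(T):
            for t2 in range(t + 1, min(t + Gamma, T - 1) + 1):
                E2[((i, t), (i, t2))] = gamma ** (t2 - t)
    # Delayed influence edges: interaction at t, propagated along p_j's chain
    for ((i, t), (j, _)), w in E1.items():
        for t2 in range(t + 1, min(t + Gamma, T - 1) + 1):
            E3[((i, t), (j, t2))] = w * E2[((j, t), (j, t2))]
    return E1, E2, E3
```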

For any person \(p_i\) at time \(t\), we want to learn an embedding of the corresponding node \(p_{i,t}\) that contains the temporal visual information of the person, the influence from the person to others in the group, and conversely the influence from others in the group to the person. These representations are further grouped pairwise to learn the dyadic impression one person has of another.

For the sake of simplicity, we denote vertices of the network with letters \(u\) and \(v\) below. We use \(IN_k(v)=\lbrace u|(u,v) \in E_k\rbrace\) and \(OUT_k(v)=\lbrace u|(v,u) \in E_k\rbrace\) to denote the incoming and outgoing neighbor sets of a vertex \(v \in V\) for edge type \(E_k\), respectively. Inspired by [59], who build spatial temporal Graph Neural Networks to model the temporal dynamics of skeleton joints, we apply graph convolution over our three sets of directed edges to update the node embedding \(x_{v} \in \mathbf {R}^{m}\) of a vertex \(v\): (6) \(\begin{equation} \tilde{x}_{v} = \sum _{k=1}^3\left(\sum _{u \in IN_k(v)} c(u,v) w_k(u,v) f_k(x_{u}) + \sum _{u \in OUT_k(v)} c(v,u) w_k(v,u) f_k(x_{u})\right), \end{equation}\) where \(f_k(\cdot)\) is a fully connected layer, and \(w_k(u,v)\) denotes the learnable weight of the edge \((u,v) \in E_k\). We use the graph attention mechanism [56] to allow the model to attend to edge importance from the projected features: (7) \(\begin{equation} w_k(u,v) = attn(f_k(x_{u}), f_k(x_{v})), \end{equation}\) where \(attn: \mathbf {R}^n \times \mathbf {R}^n \rightarrow \mathbf {R}\) is the asymmetric attention block from [56]: (8) \(\begin{equation} attn(x_1, x_2) = \frac{exp(LeakyReLU(a^T [x_1 || x_2]))}{\sum _{x_1^{\prime }} exp(LeakyReLU(a^T [x_1^{\prime } || x_2]))}, \end{equation}\) where \(a \in \mathbf {R}^{2n}\) is a learnable attention vector, \(x_1, x_2 \in \mathbf {R}^{n}\), \(||\) denotes vector concatenation, and the denominator sums over the projected features \(x_1^{\prime }\) of the neighbors of the vertex being updated. Note that for any given \(v\), this normalization and the normalization of \(c(u,v)\) ensure that \(\sum _{v} c(u,v) = 1\) for all \(u\) and \(\sum _{u} w_k(u,v) = 1\) for all \(k\).

Finally, we update the node embedding \(x_v\) using the ReLU function: \(\begin{equation*} x_v = ReLU(\tilde{x}_v). \end{equation*}\)

After two layers of graph convolutions, we apply temporal average pooling for node embeddings of each person: (9) \(\begin{equation} \bar{x}_{p_i} = \frac{1}{T}\sum _{t=1}^T x_{p_{i,t}}. \end{equation}\) To predict whether person \(p_i\) has a negative impression of player \(p_j\), we apply the prediction layer below to output the probability (10) \(\begin{equation} P(p_i,p_j) = \sigma \left(o^T[\bar{x}_{p_i} || \bar{x}_{p_j}]\right)\!, \end{equation}\) where \(o\) is the trainable projection vector and \(\sigma\) is the sigmoid function.
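The attention normalization of Equation (8) and the pooling and prediction steps of Equations (9) and (10) can be sketched as below; this is a simplified single-head version with our own variable names, not the trained model:

```python
import numpy as np

def attention_weights(h_neighbors, h_v, a, slope=0.2):
    """Equation (8): softmax over LeakyReLU(a^T [h_u || h_v]) across
    the projected neighbor features h_u."""
    scores = np.array([np.concatenate([h_u, h_v]) @ a for h_u in h_neighbors])
    scores = np.where(scores > 0, scores, slope * scores)  # LeakyReLU
    e = np.exp(scores - scores.max())                      # numerically stable softmax
    return e / e.sum()

def predict_impression(node_embs_i, node_embs_j, o):
    """Equations (9)-(10): temporal average pooling of each player's node
    embeddings, then a sigmoid over the concatenated pair embedding."""
    x_i = np.mean(node_embs_i, axis=0)  # average over time steps
    x_j = np.mean(node_embs_j, axis=0)
    z = np.concatenate([x_i, x_j]) @ o
    return 1.0 / (1.0 + np.exp(-z))
```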

Initialization of node embeddings. We use the facial expression embeddings [37] to initialize our node embeddings. Specifically, we remove the last fully connected layer of their proposed CNN and use the extracted features as our initial node embeddings.


5 EXPERIMENTAL RESULTS

5.1 Experimental Setup

General Setup. We split the dataset into 10 folds by games. Since each player appears in only one game, we always make predictions about players never seen before. We use four standard classifiers for our predictions: k-Nearest Neighbor, Logistic Regression, Gaussian Naive Bayes, and Random Forest.

As our dataset is highly imbalanced (the percentage of positive samples varies from 11% to 23% across the six dependent variables being predicted), we use AUC as the performance metric. We test our proposed feature types with each of the aforementioned classifiers and report the best AUC for any given class of features among all classifiers.

In the case of Emotion Rank features, we initialize the values in Equation (2) to \(\frac{1}{n}\), where \(n\) is the number of players in a particular game. We then iteratively calculate the values until convergence with tolerance level \(10^{-5}\) for \(L_{\infty }\) distance between values at the end of consecutive iterations or for up to 100 iterations.

For facial expression alignment features, we perform iterative greedy feature selection: we first evaluate one feature \(AL(p_i, p_j, e_{l_1}, e_{k_1})\) at a time and select the best-performing one. We then grow the selected set one feature per iteration, adding the candidate that, when concatenated with the features selected so far, gives the best result. We repeat this process until adding new features no longer improves performance on a validation fold. We use this process instead of exhaustive search because the number of feature combinations grows exponentially with the size of the set.
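The greedy forward selection procedure can be sketched as follows; `evaluate` stands in for the classifier-plus-validation-fold pipeline and is an assumption of ours:

```python
def greedy_forward_selection(candidates, evaluate):
    """Greedy forward selection over alignment features.

    evaluate(subset) returns validation performance (e.g., AUC) for the
    concatenation of the features in `subset`."""
    selected, best = [], float("-inf")
    while True:
        gains = [(evaluate(selected + [c]), c)
                 for c in candidates if c not in selected]
        if not gains:
            return selected, best
        score, feat = max(gains, key=lambda p: p[0])
        if score <= best:          # stop when no candidate improves performance
            return selected, best
        selected.append(feat)
        best = score
```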

For Emotion Rank and Sign Imbalance features, we consider all possible combinations of features produced by different aggregation functions. We report the performance of the best combination and we further analyze the influence of aggregation functions on the performance.

Baselines. We adopted emotion and FAU histograms as described in Bai et al. [2]; speech acts and facial and multi-modal features (face and speech) as described in Muller et al. [41]; and speaking acts, the visual focus of attention (VFOA), and combined features as described in Beyan et al. [5]. In all three cases, we used the best-performing features that we could calculate on our dataset (for instance, we did not use features related to hand movements because not all videos contain a clear view of hands). Since all the aforementioned papers predict values for a single person and our tasks are dyadic, we form dyadic features by concatenating features of a pair of individuals just as we do with our proposed methods. We then applied the same battery of classifiers mentioned earlier.

Temporal Delayed Network ( TDN ) training. We split each video into 100 clips. For each clip, we sample \(\Gamma =5\) frames (1 frame per second) to build a Temporal Delayed Network. We set the decay rate \(\gamma =0.8\). We use two graph convolution layers with 128 dimensions of node embeddings for both layers. Each layer is followed by Batch Normalization and ReLU activation. We use the Adam optimizer with learning rate \(10^{-4}\) and weight decay \(10^{-4}\). The network is trained for 200 epochs with a batch size of 64.

5.2 Head-to-head Feature Comparisons

First, we compare the individual performance of the proposed feature classes. Table 5 shows the performance (AUC) of our proposed methods compared to each other and to the chosen set of baselines. We see that on all of the variables, at least one of our proposed methods outperforms the baselines:

Table 5. Performance (AUC) of Individual Features on Six Variables

| | Feature Class | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | FAU + hist. [2] | 0.612 | 0.581 | 0.594 | 0.567 | 0.586 | 0.609 |
| | Emotions + hist. [2] | 0.584 | 0.555 | 0.600 | 0.577 | 0.595 | 0.579 |
| | Speech features [41] | 0.589 | 0.514 | 0.563 | 0.562 | 0.577 | 0.608 |
| | Face features [41] | 0.589 | 0.519 | 0.574 | 0.535 | 0.569 | 0.566 |
| | Face and speech features [41] | 0.595 | 0.523 | 0.555 | 0.552 | 0.569 | 0.605 |
| | Speaking act [5] | 0.574 | 0.506 | 0.538 | 0.533 | 0.555 | 0.584 |
| | VFOA [5] | 0.501 | 0.522 | 0.547 | 0.511 | 0.554 | 0.545 |
| | VFOA-Spk-Act [5] | 0.587 | 0.513 | 0.506 | 0.525 | 0.545 | 0.614 |
| DIPS (ours) | FAU alignment | 0.676 | 0.597 | 0.685 | 0.623 | 0.727 | 0.627 |
| | EMO alignment | 0.597 | 0.565 | 0.627 | 0.598 | 0.626 | 0.583 |
| | Emotion Rank: look-at | 0.573 | 0.562 | 0.588 | 0.573 | 0.580 | 0.583 |
| | Emotion Rank: talk-to | 0.572 | 0.577 | 0.577 | 0.570 | 0.583 | 0.558 |
| | Emotion Rank: listen-to | 0.593 | 0.571 | 0.583 | 0.578 | 0.578 | 0.605 |
| | Sign Imbalance: look-at | 0.572 | 0.564 | 0.595 | 0.558 | 0.586 | 0.561 |
| | Sign Imbalance: talk-to | 0.592 | 0.560 | 0.598 | 0.578 | 0.609 | 0.563 |
| | Sign Imbalance: listen-to | 0.581 | 0.568 | 0.576 | 0.555 | 0.599 | 0.588 |
| | TDN: look-at | 0.635 | 0.574 | 0.606 | 0.577 | 0.617 | 0.633 |
| | TDN: talk-to | 0.615 | 0.638 | 0.626 | 0.610 | 0.612 | 0.573 |
| | TDN: listen-to | 0.649 | 0.611 | 0.605 | 0.610 | 0.633 | 0.630 |

  • The top eight rows show results for baseline features. The rest of the table shows the performance of our proposed features derived from different interaction graphs, as well as the performance of TDN built upon these interaction graphs. On all of the variables our proposed methods yield the best performance with a statistically significant difference over the best baseline results (\(p\lt 0.05\)).

  • Our TDN models yield the best performance on two variables out of six.

  • Our FAU alignment features with greedy feature selection perform best on the other four variables.

  • When it comes to particular classes of features, even though Emotion Rank features and Sign Imbalance features do not always yield the best performance, on all of the variables their results are either on par with baseline features or higher. If we consider F1 as a metric of interest (Table 6) rather than AUC, the same two of our proposed methods still yield the highest F1-score and the rest of the features stay competitive with the baselines.

    Table 6. Individual Features Performance (F1-score) on Six Variables

| | Feature Class | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | FAU + hist [2] | 0.306 | 0.318 | 0.239 | 0.318 | 0.207 | 0.350 |
| | Emotions + hist [2] | 0.266 | 0.301 | 0.225 | 0.341 | 0.212 | 0.365 |
| | Speech features [41] | 0.302 | 0.269 | 0.217 | 0.300 | 0.198 | 0.369 |
| | Face features [41] | 0.310 | 0.277 | 0.221 | 0.285 | 0.194 | 0.360 |
| | Face and speech features [41] | 0.307 | 0.269 | 0.190 | 0.313 | 0.184 | 0.363 |
| | Speaking act [5] | 0.291 | 0.245 | 0.198 | 0.291 | 0.201 | 0.360 |
| | VFOA [5] | 0.246 | 0.202 | 0.190 | 0.262 | 0.175 | 0.290 |
| | VFOA-Spk-Act [5] | 0.295 | 0.267 | 0.179 | 0.308 | 0.191 | 0.388 |
| DIPS (ours) | FAU alignment | 0.337 | 0.243 | 0.252 | 0.299 | 0.269 | 0.372 |
| | EMO alignment | 0.298 | 0.274 | 0.237 | 0.280 | 0.222 | 0.354 |
| | Emotion Rank: look-at | 0.264 | 0.293 | 0.212 | 0.292 | 0.216 | 0.358 |
| | Emotion Rank: talk-to | 0.277 | 0.294 | 0.198 | 0.330 | 0.196 | 0.344 |
| | Emotion Rank: listen-to | 0.290 | 0.304 | 0.214 | 0.320 | 0.185 | 0.349 |
| | Sign Imbalance: look-at | 0.290 | 0.302 | 0.203 | 0.303 | 0.175 | 0.343 |
| | Sign Imbalance: talk-to | 0.254 | 0.296 | 0.198 | 0.293 | 0.201 | 0.325 |
| | Sign Imbalance: listen-to | 0.286 | 0.309 | 0.212 | 0.330 | 0.210 | 0.365 |
| | TDN: look-at | 0.350 | 0.323 | 0.243 | 0.352 | 0.238 | 0.410 |
| | TDN: talk-to | 0.403 | 0.359 | 0.263 | 0.365 | 0.249 | 0.403 |
| | TDN: listen-to | 0.378 | 0.357 | 0.242 | 0.368 | 0.234 | 0.358 |

  • The top eight rows show results for baseline features. The rest of the table shows the performance of our proposed features derived from different interaction graphs, as well as the performance of TDN built upon these interaction graphs. Our proposed methods show statistically significant improvements (with \(p\lt 0.05\)) over the baseline approaches.

5.3 Late Fusion

Figure 2 shows how several individual classes of features provide predictions for our task. To further improve performance and take advantage of the diversity of individual approaches, these predictions are combined using late fusion. Given a predicted probability \(p_i\) from the \(i\)th individual predictor, we combine the predictions linearly as \(\Sigma _{i=1}^N w_i p_i\) (where each \(w_i\in [0,1]\) and \(\Sigma _{i=1}^N w_i = 1)\) to compute an overall probability. We use a grid search over the space of possible values to find the best \(w_i\) values. The best \(w_i\)s learned on the training and validation sets are used in the predictions on the test set (so in particular, the test set was never used when computing the \(w\)s).
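The grid search over convex combination weights can be sketched as below; for illustration we score the fused predictions with accuracy rather than the AUC/F1 metrics used in the paper, and all names are ours:

```python
import itertools
import numpy as np

def late_fusion_weights(probs, labels, step=0.1):
    """Grid search over simplex weights for late fusion.

    probs: (n_models, n_samples) predicted probabilities per model.
    Returns the convex weights maximizing accuracy (a stand-in metric)."""
    n_models = probs.shape[0]
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_w, best_score = None, -1.0
    for w in itertools.product(grid, repeat=n_models):
        if abs(sum(w) - 1.0) > 1e-6:
            continue  # keep only convex combinations (weights sum to 1)
        fused = np.asarray(w) @ probs
        score = np.mean((fused >= 0.5) == labels)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```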

Table 7 shows the best-performing ensembles for each predicted variable as well as the results of an ablation study for those ensembles. AUC numbers are shown by default, with F1-scores in parentheses. DIPS improves on the best baseline models by 19.9% to 30.8% in AUC and by 12.6% to 47.2% in F1-score.4 To assess the importance of each feature in the ensemble, we exclude features one at a time, measure the performance of the reduced ensemble, and compare it with the performance of the full ensemble. The Excl. columns show the reduced performance when excluding the specific features from late fusion. Comparing Table 7 with Tables 5 and 6, we see that ensembles significantly outperform the individual features. Of all the types of features considered, we observe that FAU alignment features (Section 4.3) and TDN models (Section 4.4) are the most important across all predicted variables. Emotion Rank and Sign Imbalance features are weaker than our TDN and alignment features when compared head to head. However, when they are dropped from the corresponding ensembles one at a time, AUC drops by 1% to 3%. When both are dropped at the same time (leaving only alignment + TDN features), performance drops by 1.5 to 4 points (statistically significant with \(p\lt 0.05\)).

Table 7.

Table 7. DIPS Results (AUC)

5.4 Ablation Study

In this section, we report experiment results from our ablation study.

5.4.1 Time Effect Analysis.

Our proposed features such as Alignment features, Emotion Rank, and Sign Imbalance features use all available video footage. We are interested in identifying which part of the video provides the most important information for the problem of impression prediction. In order to determine this, we ran our experiments on videos restricted to specific time windows defined by varying the length of the window and starting time of the window.

Figure 5 shows the relative performance change we observed for various time windows.

Fig. 5.

Fig. 5. Effect of time on feature performance. Heatmaps show how the performance of the corresponding features drops if we restrict the available video length and vary the starting time of that video, relative to the highest achieved performance (equal to 1 on the heatmaps). Numbers in heatmaps are averaged over all six variables.

Finding 1.

We found that considering only 20% of the video yields more than 86% of the classification performance achieved on the whole video. The longer the window we consider, the higher the performance we get. To achieve the best result, we need to consider the whole video. Given the same window length, we can achieve slightly better results if we consider the second half of the game rather than the first half, but the starting time is less important than video length.

5.4.2 Interaction Graph Effect.

To analyze which of the three interaction graphs provides the best performance, we build ensembles from features built using only one of the graphs. For each of the look-at, listen-to, and talk-to graphs we used Emotion Rank, Sign Imbalance, and TDN models. Table 8 shows the performance of corresponding ensembles for each of the predicted variables.

Table 8. Interaction Graph Importance: We Find Ensemble Performance for Features Obtained Using Only One of the Interaction Graphs

| Interaction Graph \(G_I\) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 |
| --- | --- | --- | --- | --- | --- | --- |
| look-at | 0.659 | 0.606 | 0.648 | 0.614 | 0.645 | 0.651 |
| listen-to | 0.616 | 0.654 | 0.643 | 0.633 | 0.677 | 0.629 |
| talk-to | 0.667 | 0.650 | 0.655 | 0.648 | 0.622 | 0.655 |

  • Features used in the ensemble: Emotion Rank (ER), Sign Imbalance (SI), and TDN.

Finding 2.

First, we see that different features in the ensemble are complementary to each other, as every ensemble improves on each individual feature’s performance (Table 5). We also see that look-at is the least important graph when taken individually: on every predicted variable, at least one of the ensembles based on the other two graphs outperforms it. From Table 7, however, we can see that all three graphs contribute to the best-performing ensembles on each of the variables.

5.4.3 Emotions/FAU Effect.

To get more insight into which of the facial expressions provide the most information to our models, we look at the combinations of FAU and emotion alignment features that yield the best performance in our models (Table 5). Table 9 shows the most and the least important facial action units defined by how often they occur among the best-performing expression pairs in alignment features (Section 4.3).

Table 9. Most Important Facial Expressions in the Alignment Features

| | FAU: Most Often | FAU: Least Often | Emotions: Most Often | Emotions: Least Often |
| --- | --- | --- | --- | --- |
| Scoring player \(p_i\) | AU02, AU05, AU15, AU20, AU25 | AU07, AU14 | Happy | Angry |
| Scored player \(p_j\) | AU23, AU14, AU01, AU17, AU25 | AU04, AU20 | Happy | Fearful |
| Overall | AU17, AU05, AU45, AU23, AU15, AU25 | AU07, AU12, AU04 | Happy | Angry |

  • FAUs are sorted in increasing order of their occurrence among the best-performing combinations (according to the greedy selection process).

Finding 3.

The most common pair of expressions was (AU05, AU23), suggesting that raising of the upper lid (AU05) and tightening of lips (AU23) are the most important FAUs. These findings are consistent with a similar analysis of Action Unit importance in a different case, namely low rapport detection [41].

5.4.4 Attention Weights of the Temporal Delayed Network.

In this experiment, we study the learned attention weights (Equation (7)) of the interaction edges and delayed influence edges defined for the Temporal Delayed Network in Section 4.4. For any given dependent variable and a given graph, we first collect all pairs of people \((p_i,p_j)\) whose impression labels are predicted correctly by the trained TDN model within the training set. Second, we compute the average attention weights of these pairs for the two types of edges separately. The larger the average weight, the more the trained TDN focuses on that type of edge in order to make correct predictions; larger numbers therefore indicate higher importance of edges in making predictions. Table 10 shows the results.

Table 10. Average Attention Weights of Edges for Correctly Predicted Dislike Pairs

| Graph \(G_I\) | Q1 \(E_1\) | Q1 \(E_3\) | Q2 \(E_1\) | Q2 \(E_3\) | Q3 \(E_1\) | Q3 \(E_3\) | Q4 \(E_1\) | Q4 \(E_3\) | Q5 \(E_1\) | Q5 \(E_3\) | Q6 \(E_1\) | Q6 \(E_3\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| look-at | 0.342 | 0.076 | 0.111 | 0.071 | 0.294 | 0.312 | 0.104 | 0.341 | 0.301 | 0.294 | 0.350 | 0.306 |
| talk-to | 0.207 | 0.353 | 0.244 | 0.354 | 0.335 | 0.284 | 0.235 | 0.336 | 0.263 | 0.225 | 0.135 | 0.313 |
| listen-to | 0.337 | 0.068 | 0.293 | 0.097 | 0.377 | 0.259 | 0.260 | 0.295 | 0.333 | 0.228 | 0.361 | 0.276 |

  • For each variable, the \(E_1\) column shows the average attention weights for interaction edges, while the \(E_3\) column shows the average attention weights for delayed influence edges.

Finding 4.

Comparing the two types of edges, we observe that interaction edges (\(E_1\)) get more attention from TDN models than delayed influence edges (\(E_3)\) on average.

Finding 5.

Among the three types of graphs, we find that the TDN focuses more on the interaction edges (\(E_1\)) of the listen-to graph, while it focuses more on the delayed influence edges (\(E_3\)) of the talk-to graph.

5.4.5 Variable Analysis.

We want to use our experimental results to answer the question: which of the dependent variables is the hardest to predict and which is the easiest? From the results on the performance of individual features (Table 5) and late fusion models (Table 7), we see that our models yield the highest performance for Question 5 of the survey (very unlikable to very likable scale). At the same time, our models perform the worst on Question 4 (very unfriendly to very friendly scale) and Question 2 (very negative to very positive scale). This effect could be partly attributed to higher imbalance in Question 5: only 12% of samples are positive for this variable, compared to 19% and 21% for Question 2 and Question 4, respectively. Another possible explanation is the nature of the questions: it could be easier to answer questions such as whether a person is likable or unlikable and whether that person is pleasant or unpleasant (Questions 5 and 3) as opposed to more vague questions such as whether a person is positive or negative and friendly or unfriendly (Questions 2 and 4, respectively).

5.4.6 Gender and Role Effect.

We have already discussed some preliminary statistical analysis in Section 3.2. We wanted to see if players’ roles and/or gender affect the binary labels in our classification tasks. We perform Barnard’s exact test for these possible confounding variables. We considered four groups: rating player is male vs. female, rated player is male vs. female, rating player is a spy vs. a villager, and rated player is a spy vs. a villager. With \(\alpha = 0.05\) and Bonferroni correction for multiple comparisons, we found only one statistically significant difference: gender of a rated player affects the probability of receiving a low score on Question 5 (unlikable vs. likable).

To address this effect, we also considered adding binary variables encoding gender and role to our feature vectors. We present the results in Table 11. In all cases we see a drop in performance of 3.1 to 13.9 AUC points. This result suggests that using the gender or role of the participants does not improve the predictive power of our proposed methods. However, a more detailed study focused on this specific issue could represent valuable future work.

Table 11. Classification Results (AUC) for the Best Alignment Feature (Baseline) with Added Binary Variables for Roles and Genders of Players

| Additional Variables | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.676 | 0.597 | 0.685 | 0.623 | 0.727 | 0.627 |
| Scoring player’s gender | 0.574 | 0.506 | 0.588 | 0.524 | 0.586 | 0.550 |
| Scored player’s gender | 0.578 | 0.502 | 0.590 | 0.541 | 0.582 | 0.552 |
| Both players’ genders | 0.574 | 0.496 | 0.587 | 0.526 | 0.580 | 0.567 |
| Scoring player’s role | 0.576 | 0.479 | 0.589 | 0.501 | 0.583 | 0.536 |
| Scored player’s role | 0.576 | 0.471 | 0.585 | 0.509 | 0.588 | 0.551 |
| Both players’ roles | 0.573 | 0.527 | 0.584 | 0.517 | 0.584 | 0.543 |
| Both players’ roles and genders | 0.568 | 0.511 | 0.585 | 0.538 | 0.576 | 0.581 |


6 CONCLUSION

There are many applications where it is important to understand the like/dislike relationships between people in a group. A particular case is a diplomatic or trade negotiation between countries where it might be useful for country C1 to understand the like/dislike relationships between people in the delegation for country C2.

In this article, we provide a framework called DIPS (Dyadic Impression Prediction System). DIPS has four major innovations. First, we develop the novel concept of emotion scores and emotion ranks that combine facial emotions with gaze networks. Second, we use social balance theory for the first time in order to propose sign imbalance features. Third, we develop a novel class of alignments of facial action units and emotions. Fourth, we develop a novel TDN framework that combines multi-layer networks with Graph Convolution Networks (unlike most past work in computer vision that focuses on GCNs alone). We investigated the value of all of these novel types of features in predicting six types of like/dislike impressions of one person toward another. Overall, our DIPS framework generates AUCs ranging from 71.9% to 77.8% for these six dependent variables. We found that in head-to-head comparisons, FAU/Emotion Alignments perform very well. However, in ablation testing, we found that all three types of new features (emotion ranks, sign imbalance, and FAU/Emotion Alignments) help improve prediction performance by a statistically significant amount.

We show that the DIPS framework beats out several existing baselines in predicting dyadic impressions by 19.9% to 30.8% for AUC and 12.6% to 47.2% when using the F1-score metric.

Footnotes

  1. Balance theory [25] suggests that a triad (three individuals in this case) is balanced when the product of its three signed edge weights is positive. Important phenomena explained by balance theory include the ideas that “a friend of my friend is my friend” and “an enemy of my enemy is my friend.”

  2. https://en.wikipedia.org/wiki/The_Resistance_(game). Variants of the Resistance game include Mafia and Werewolf.

  3. For notational simplicity, we drop the parameters for \(EV\) and use \(EV^{+} = [ES(e,p_i,p_j,G_I,\tau)]_{e \in \mathcal {E}^{+}}\) and \(EV^{-} = [ES(e,p_i,p_j,G_I,\tau)]_{e \in \mathcal {E}^{-}}\) to denote the positive and negative subvectors, respectively, of \(EV(p_i,p_j,G_I,\tau)\).

  4. The improvement provided by DIPS over a baseline algorithm \(Base\) w.r.t. a metric \(\mu\) (e.g., AUC or F1-score) is defined as \(impr=\frac{\mu (\textsf {DIPS})-\mu (Base)}{\mu (Base)}\). Thus, if \(\mu\) is AUC and DIPS and a baseline algorithm yield AUCs of 0.8 and 0.7, respectively, then the improvement ratio is \(\frac{0.8-0.7}{0.7}=14.29\%\).

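The two quantities defined in footnotes 1 and 4 are simple enough to sketch directly. The function names below are illustrative, not part of DIPS:

```python
def improvement(mu_dips: float, mu_base: float) -> float:
    """Relative improvement of DIPS over a baseline w.r.t. a metric mu
    (footnote 4): impr = (mu(DIPS) - mu(Base)) / mu(Base)."""
    return (mu_dips - mu_base) / mu_base

def is_balanced_triad(w_ij: float, w_jk: float, w_ik: float) -> bool:
    """A triad is balanced (footnote 1) iff the product of its three
    signed edge weights is positive."""
    return w_ij * w_jk * w_ik > 0

# AUCs of 0.8 (DIPS) vs. 0.7 (baseline) -> ~14.29% improvement
print(f"{100 * improvement(0.8, 0.7):.2f}%")  # 14.29%

# "An enemy of my enemy is my friend": two negative edges, one positive
print(is_balanced_triad(-1, -1, +1))  # True
```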

REFERENCES

[1] Anselmi Fabio, Noceti Nicoletta, Rosasco Lorenzo, and Ward Robert. 2019. Genuine personality recognition from highly constrained face images. In Image Analysis and Processing (ICIAP'19), Ricci Elisa, Bulò Samuel Rota, Snoek Cees, Lanz Oswald, Messelodi Stefano, and Sebe Nicu (Eds.). Springer International Publishing, Cham, 421–431.
[2] Bai Chongyang, Bolonkin Maksim, Kumar Srijan, Leskovec Jure, Burgoon Judee, Dunbar Norah, and Subrahmanian V. S. 2019. Predicting dominance in multi-person videos. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI'19). International Joint Conferences on Artificial Intelligence Organization, 4643–4650.
[3] Bai Chongyang, Kumar Srijan, Leskovec Jure, Metzger Miriam, Nunamaker Jay F., and Subrahmanian V. S. 2019. Predicting the visual focus of attention in multi-person discussion videos. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI'19). International Joint Conferences on Artificial Intelligence Organization, 4504–4510.
[4] Baltrusaitis Tadas, Zadeh Amirali Bagher, Lim Yao Chong, and Morency Louis-Philippe. 2018. OpenFace 2.0: Facial behavior analysis toolkit. In 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG'18). 59–66.
[5] Beyan Cigdem, Shahid Muhammad, and Murino Vittorio. 2018. Investigation of small group social interactions using deep visual activity-based nonverbal features. In Proceedings of the 26th ACM International Conference on Multimedia (MM'18). Association for Computing Machinery, New York, NY, 311–319.
[6] Beyan Cigdem, Zunino Andrea, Shahid Muhammad, and Murino Vittorio. 2021. Personality traits classification using deep visual activity-based nonverbal features of key-dynamic images. IEEE Transactions on Affective Computing 12, 4 (2021), 1084–1099.
[7] Biel Joan-Isaac and Gatica-Perez Daniel. 2012. The YouTube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs. IEEE Transactions on Multimedia 15, 1 (2012), 41–55.
[8] Boccaletti Stefano, Bianconi Ginestra, Criado Regino, Del Genio Charo I., Gómez-Gardenes Jesús, Romance Miguel, Sendina-Nadal Irene, Wang Zhen, and Zanin Massimiliano. 2014. The structure and dynamics of multilayer networks. Physics Reports 544, 1 (2014), 1–122.
[9] Bruce A. Jerry and McDonald Brian G. 1993. Face recognition as a function of judgments of likability/unlikability. Journal of General Psychology 120, 4 (1993), 451–462. PMID: 8189210.
[10] Burgoon Judee K. 2015. Expectancy violations theory. In The International Encyclopedia of Interpersonal Communication, 1–9.
[11] Burgoon Judee K., Stern Lesa A., and Dillman Leesa. 1995. Interpersonal Adaptation: Dyadic Interaction Patterns.
[12] Chávez-Martínez Gilberto, Ruiz-Correa Salvador, and Gatica-Perez Daniel. 2015. Happy and agreeable? Multi-label classification of impressions in social video. In Proceedings of the 14th International Conference on Mobile and Ubiquitous Multimedia (MUM'15). Association for Computing Machinery, New York, NY, 109–120.
[13] Davydenko Mariya, Zelenski John M., Gonzalez Ana, and Whelan Deanna. 2020. Does acting extraverted evoke positive social feedback? Personality and Individual Differences 159 (2020), 109883.
[14] De Domenico Manlio, Solé-Ribalta Albert, Cozzo Emanuele, Kivelä Mikko, Moreno Yamir, Porter Mason A., Gómez Sergio, and Arenas Alex. 2013. Mathematical formulation of multilayer networks. Physical Review X 3, 4 (2013), 041022.
[15] Doreian Patrick and Mrvar Andrej. 2009. Partitioning signed social networks. Social Networks 31, 1 (2009), 1–11.
[16] Dorn Bradley, Dunbar Norah E., Burgoon Judee K., Nunamaker Jay F., Giles Matt, Walls Brad, Chen Xunyu, Wang Xinran Rebecca, Ge Saiying Tina, and Subrahmanian V. S. 2021. A system for multi-person, multi-modal data collection in behavioral information systems. In Detecting Trust and Deception in Group Interaction. Springer, 57–73.
[17] Eloy Lucca, Stewart Angela E. B., Amon Mary Jean, Reinhardt Caroline, Michaels Amanda, Sun Chen, Shute Valerie, Duran Nicholas D., and D'Mello Sidney. 2019. Modeling team-level multimodal dynamics during multiparty collaboration. In 2019 International Conference on Multimodal Interaction (ICMI'19). Association for Computing Machinery, New York, NY, 244–258.
[18] Escalante Hugo Jair, Kaya Heysem, Salah Albert Ali, Escalera Sergio, Gucluturk Yagmur, Güçlü Umut, Baró Xavier, Guyon Isabelle, Junior Julio Jacques, Madadi Meysam, Ayache Stéphane, Viegas Evelyne, Gurpinar Furkan, Wicaksana Achmadnoer Sukma, Liem Cynthia C. S., van Gerven Marcel A. J., and van Lier Rob. 2022. Explaining first impressions: Modeling, recognizing, and explaining apparent personality from videos. IEEE Transactions on Affective Computing 13, 2 (2022), 894–911.
[19] Facchetti Giuseppe, Iacono Giovanni, and Altafini Claudio. 2011. Computing global structural balance in large-scale signed social networks. Proceedings of the National Academy of Sciences 108, 52 (2011), 20953–20958.
[20] Fang Sheng, Achard Catherine, and Dubuisson Séverine. 2016. Personality classification and behaviour interpretation: An approach based on feature categories. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI'16). Association for Computing Machinery, New York, NY, 225–232.
[21] Floyd Kory and Burgoon Judee K. 1999. Reacting to nonverbal expressions of liking: A test of interaction adaptation theory. Communication Monographs 66, 3 (1999), 219–239.
[22] Goldberg Lewis R. 1992. The development of markers for the Big-Five factor structure. Psychological Assessment 4, 1 (1992), 26.
[23] Hareli Shlomo and Hess Ursula. 2010. What emotional reactions can tell us about the nature of others: An appraisal perspective on person perception. Cognition and Emotion 24, 1 (2010), 128–140.
[24] Heider Fritz. 1946. Attitudes and cognitive organization. Journal of Psychology 21 (1946), 107–112.
[25] Heider Fritz. 1958. The Psychology of Interpersonal Relations. Wiley.
[26] Heider Fritz. 1958. The Psychology of Interpersonal Relations. John Wiley & Sons.
[27] Hoque Mohammed (Ehsan), Courgeon Matthieu, Martin Jean-Claude, Mutlu Bilge, and Picard Rosalind W. 2013. MACH: My automated conversation coach. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp'13). Association for Computing Machinery, New York, NY, 697–706.
[28] Jia Xiuyi, Zheng Xiang, Li Weiwei, Zhang Changqing, and Li Zechao. 2019. Facial emotion distribution learning by exploiting low-rank label correlations locally. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'19).
[29] Joshi Jyoti, Gunes Hatice, and Goecke Roland. 2014. Automatic prediction of perceived traits using visual cues under varied situational context. In 2014 22nd International Conference on Pattern Recognition. 2855–2860.
[30] Kampman Onno, Barezi Elham J., Bertero Dario, and Fung Pascale. 2018. Investigating audio, video, and text fusion methods for end-to-end automatic personality prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Melbourne, Australia, 606–611.
[31] Kindiroglu Ahmet Alp, Akarun Lale, and Aran Oya. 2017. Multi-domain and multi-task prediction of extraversion and leadership from meeting videos. EURASIP Journal on Image and Video Processing 2017, 1 (Nov. 2017), 77.
[32] Kipf Thomas N. and Welling Max. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[33] Kivelä Mikko, Arenas Alex, Barthelemy Marc, Gleeson James P., Moreno Yamir, and Porter Mason A. 2014. Multilayer networks. Journal of Complex Networks 2, 3 (2014), 203–271.
[34] Kumano Shiro, Otsuka Kazuhiro, Matsuda Masafumi, and Yamato Junji. 2014. Analyzing perceived empathy based on reaction time in behavioral mimicry. IEICE Transactions on Information and Systems E97.D, 8 (2014), 2008–2020.
[35] Kumano Shiro, Otsuka Kazuhiro, Mikami Dan, and Yamato Junji. 2009. Recognizing communicative facial expressions for discovering interpersonal emotions in group meetings. In ICMI-MLMI'09. Association for Computing Machinery, New York, NY, 99–106.
[36] Kumar Srijan, Bai Chongyang, Subrahmanian V. S., and Leskovec Jure. 2021. Deception detection in group video conversations using dynamic interaction networks. In Proceedings of the 15th International AAAI Conference on Web and Social Media (ICWSM'21), Budak Ceren, Cha Meeyoung, Quercia Daniele, and Xie Lexing (Eds.). AAAI Press, 339–350. https://ojs.aaai.org/index.php/ICWSM/article/view/18065.
[37] Li Shan, Deng Weihong, and Du JunPing. 2017. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'17). IEEE, 2584–2593.
[38] Lin Yun-Shao and Lee Chi-Chun. 2020. Predicting performance outcome with a conversational graph convolutional network for small group interactions. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'20). 8044–8048.
[39] Mawalim Candy Olivia, Okada Shogo, Nakano Yukiko I., and Unoki Masashi. 2019. Multimodal BigFive personality trait analysis using communication skill indices and multiple discussion types dataset. In Social Computing and Social Media. Design, Human Behavior and Analytics, Meiselwitz Gabriele (Ed.). Springer International Publishing, Cham, 370–383.
[40] McKeown Gary, Valstar Michel, Cowie Roddy, Pantic Maja, and Schroder Marc. 2012. The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing 3, 1 (2012), 5–17.
[41] Müller Philipp, Huang Michael Xuelin, and Bulling Andreas. 2018. Detecting low rapport during natural interactions in small groups from non-verbal behaviour. In 23rd International Conference on Intelligent User Interfaces. 153–164.
[42] Murata Aiko, Saito Hisamichi, Schug Joanna, Ogawa Kenji, and Kameda Tatsuya. 2016. Spontaneous facial mimicry is enhanced by the goal of inferring emotional states: Evidence for moderation of “automatic” mimicry by higher cognitive processes. PLoS One 11, 4 (2016), e0153128.
[43] Nguyen Laurent, Frauendorfer Denise, Mast Marianne, and Gatica-Perez Daniel. 2014. Hire me: Computational inference of hirability in employment interviews based on nonverbal behavior. IEEE Transactions on Multimedia 16 (2014), 1018–1031.
[44] Nihei Fumio, Nakano Yukiko I., Hayashi Yuki, Huang Hung-Hsuan, and Okada Shogo. 2014. Predicting influential statements in group discussions using speech and head motion information. In Proceedings of the 16th International Conference on Multimodal Interaction (ICMI'14). Association for Computing Machinery, New York, NY, 136–143.
[45] Nisbett Richard E. and Smith Michael. 1989. Predicting interpersonal attraction from small samples: A reanalysis of Newcomb's acquaintance study. Social Cognition 7, 1 (1989), 67–73.
[46] Okada Shogo, Nguyen Laurent Son, Aran Oya, and Gatica-Perez Daniel. 2019. Modeling dyadic and group impressions with intermodal and interperson features. ACM Transactions on Multimedia Computing, Communications, and Applications 15, 1s, Article 13 (Jan. 2019), 30 pages.
[47] Okada Shogo, Ohtake Yoshihiko, Nakano Yukiko I., Hayashi Yuki, Huang Hung-Hsuan, Takase Yutaka, and Nitta Katsumi. 2016. Estimating communication skills using dialogue acts and nonverbal features in multiple discussion datasets. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI'16). Association for Computing Machinery, New York, NY, 169–176.
[48] Otsuka Kazuhiro, Yamato Junji, Takemae Yoshinao, and Murase Hiroshi. 2006. Quantifying interpersonal influence in face-to-face conversations based on visual attention patterns. In CHI'06 Extended Abstracts on Human Factors in Computing Systems (CHI EA'06). Association for Computing Machinery, New York, NY, 1175–1180.
[49] Ponce-López Víctor, Chen Baiyu, Oliu Marc, Corneanu Ciprian, Clapés Albert, Guyon Isabelle, Baró Xavier, Escalante Hugo Jair, and Escalera Sergio. 2016. ChaLearn LAP 2016: First round challenge on first impressions - dataset and results. In Computer Vision – ECCV 2016 Workshops, Hua Gang and Jégou Hervé (Eds.). Springer International Publishing, Cham, 400–418.
[50] Rayner Keith. 2009. Eye movements and attention in reading, scene perception, and visual search. Quarterly Journal of Experimental Psychology 62, 8 (2009), 1457–1506.
[51] Reysen Stephen. 2005. Construction of a new scale: The Reysen likability scale. Social Behavior and Personality 33, 2 (2005), 201–208.
[52] Reysen Stephen. 2006. A new predictor of likeability: Laughter. North American Journal of Psychology 8, 2 (June 2006), 373–382.
[53] Sanchez-Cortes Dairazalia, Aran Oya, Mast Marianne Schmid, and Gatica-Perez Daniel. 2011. A nonverbal behavior approach to identify emergent leaders in small groups. IEEE Transactions on Multimedia 14, 3 (2011), 816–832.
[54] Seiter John S., Weger Harry Jr., Kinzer Harold J., and Jensen Andrea Sandry. 2009. Impression management in televised debates: The effect of background nonverbal behavior on audience perceptions of debaters' likeability. Communication Research Reports 26, 1 (2009), 1–11.
[55] Tickle-Degnen Linda and Rosenthal Robert. 1990. The nature of rapport and its nonverbal correlates. Psychological Inquiry 1, 4 (1990), 285–293. http://www.jstor.org/stable/1449345.
[56] Veličković Petar, Cucurull Guillem, Casanova Arantxa, Romero Adriana, Liò Pietro, and Bengio Yoshua. 2018. Graph attention networks. In International Conference on Learning Representations. https://openreview.net/forum?id=rJXMpikCZ.
[57] Wang Yanbang, Li Pan, Bai Chongyang, and Leskovec Jure. 2021. TEDIC: Neural modeling of behavioral patterns in dynamic social interaction networks. In Proceedings of the Web Conference 2021 (WWW'21). Association for Computing Machinery, New York, NY, 693–705.
[58] Yan Rui, Tang Jinhui, Shu Xiangbo, Li Zechao, and Tian Qi. 2018. Participation-contributed temporal dynamic model for group activity recognition. In Proceedings of the 26th ACM International Conference on Multimedia (MM'18). Association for Computing Machinery, New York, NY, 1292–1300.
[59] Yan Sijie, Xiong Yuanjun, and Lin Dahua. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition. In 32nd AAAI Conference on Artificial Intelligence. 7444–7452.
[60] Zhang Lingyu, Bhattacharya Indrani, Morgan Mallory, Foley Michael, Riedl Christoph, Welles Brooke, and Radke Richard. 2020. Multiparty visual co-occurrences for estimating personality traits in group meetings. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV'20).
[61] Zhang Le, Peng Songyou, and Winkler Stefan. 2022. PersEmoN: A deep network for joint analysis of apparent personality, emotion and their relationship. IEEE Transactions on Affective Computing 13, 1 (2022), 298–305.
[62] Çeliktutan Oya and Gunes Hatice. 2014. Continuous prediction of perceived traits and social dimensions in space and time. In 2014 IEEE International Conference on Image Processing (ICIP'14). 4196–4200.
[63] Çeliktutan Oya and Gunes Hatice. 2017. Automatic prediction of impressions in time and across varying context: Personality, attractiveness and likeability. IEEE Transactions on Affective Computing 8, 1 (2017), 29–42.


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1s (February 2023), 504 pages.
ISSN: 1551-6857. EISSN: 1551-6865. DOI: 10.1145/3572859.
Editor: Abdulmotaleb El Saddik.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

        Publication History

        • Published: 23 January 2023
        • Online AM: 6 May 2022
        • Accepted: 19 April 2022
        • Revised: 20 January 2022
        • Received: 5 June 2021
Published in TOMM Volume 19, Issue 1s
