Multimodal Fusion Interactions: A Study of Human and Automatic Quantification

In order to perform multimodal fusion of heterogeneous signals, we need to understand their interactions: how each modality individually provides information useful for a task and how this information changes in the presence of other modalities. In this paper, we perform a comparative study of how humans annotate two categorizations of multimodal interactions: (1) partial labels, where different annotators annotate the label given the first, second, and both modalities, and (2) counterfactual labels, where the same annotator annotates the label given the first modality before being asked to explicitly reason about how their answer changes when given the second. We further propose an alternative taxonomy based on (3) information decomposition, where annotators annotate the degrees of redundancy: the extent to which modalities individually and together give the same predictions, uniqueness: the extent to which one modality enables a prediction that the other does not, and synergy: the extent to which both modalities enable one to make a prediction that one would not otherwise make using individual modalities. Through experiments and annotations, we highlight several opportunities and limitations of each approach and propose a method to automatically convert annotations of partial and counterfactual labels to information decomposition, yielding an accurate and efficient method for quantifying multimodal interactions.


INTRODUCTION
A core challenge in multimodal machine learning lies in understanding the ways that different modalities interact with each other and
in combination for a given prediction task [35]. We define the study of multimodal fusion interactions as the categorization and measurement of how each modality individually provides information useful for a task and how this information changes in the presence of other modalities [14,42,46]. The learning of complex interactions is often cited as motivation for many successful multimodal modeling paradigms in the machine learning and multimodal interaction communities, such as contrastive learning [29,47], modality-specific representations [57,70], and higher-order interactions [26,33,71]. Despite progress in new models that seem to better capture various interactions from increasingly complex real-world multimodal datasets [26,71], formally quantifying and measuring the interactions that are necessary to solve a multimodal task remains a fundamental research question [24,34,35].
In this paper, we perform a comparative study of how reliably human annotators can be leveraged to quantify different interactions in real-world multimodal datasets (see Figure 1). We first start with a conventional method which we term partial labels, where different randomly assigned annotators annotate the task given only the first modality (y_1), only the second modality (y_2), and both modalities (y_12) [14,42,46,48]. Beyond partial labels, we extend this idea to counterfactual labels, where the same annotator is tasked to annotate the label given the first modality (y_1), before giving them the second modality and asking them to explicitly reason about how their answer changes (y_1+2), and vice versa (y_2 and y_2+1) [50]. Additionally, we propose an alternative taxonomy of multimodal interactions grounded in information theory [31,66], which we call information decomposition: decomposing the total information two modalities provide about a task into redundancy, the extent to which individual modalities and both in combination all give similar predictions on the task, uniqueness, the extent to which the prediction depends only on one of the modalities and not the other, and synergy, the extent to which task prediction changes with both modalities as compared to using either modality individually [31,66]. Information decomposition has an established history in understanding feature interactions in neuroscience [45,55,64,65], physics [16,22], and biology [10,12], since it exhibits desirable properties such as disentangling redundancy and synergy, normalization with respect to the total information two features provide towards a task, and established methods for automatic computation.
However, it remains a challenge to scale information decomposition to real-world high-dimensional and continuous modalities [7,31,32], which has hindered its application in machine learning and multimodal interaction, where complex video, audio, text, and other sensory modalities are prevalent. To quantify information decomposition for real-world multimodal tasks, we propose a new human annotation scheme where annotators provide estimates of redundancy, uniqueness, and synergy when presented with both modalities and the label. We find that this method works surprisingly well, with strong annotator agreement and self-reported annotator confidence. Finally, given the promises of information decomposition [15,26,31], we additionally propose a scheme to automatically convert annotations of partial and counterfactual labels to information decomposition using an information-theoretic method [7,66], which makes it compatible with existing methods of annotating interactions [14,42,46,48]. Through comprehensive experiments on multimodal analysis of sentiment, humor, sarcasm, and question-answering, we compare these methods of quantifying multimodal interactions and summarize our key findings. We release our data and code at https://github.com/pliang279/PID.

RELATED WORK
Multimodal fusion interactions have been studied based on the dimensions of response, information, and mechanics [35]. We define and highlight representative works in each category: Interaction response studies how the inferred response changes when two or more modalities are fused [35] (see Figure 2). For example, two modalities create a redundant response if the fused response is the same as responses from either modality, or an enhanced response if the fused response displays higher confidence. Non-redundant interactions such as modulation or emergence can also happen [43]. Many of these terms originated from research in human and animal communicative modalities [17,43,44,48] and multimedia [5,37]. Inspired by these ideas, a common measure of interaction response redundancy is defined as the distance between prediction logits using either feature [38]. This definition is also commonly used in minimum-redundancy feature selection [3,68,69]. Research in multimedia has also categorized interactions into divergent, parallel, and additive [5,30,73]. Finally, human annotations have been leveraged to identify redundant modalities via a proxy of cognitive load [48]. This paper primarily focuses on interaction response since it is the easiest to understand and annotate by humans, but coming up with formal definitions and measures of the other interactions remains a critical direction for future work.
Interaction information investigates the nature of information overlap between multiple modalities. The information important for a task can be shared in both modalities, unique to one modality, or emerge only when both are present [35]. Information-theoretic measures naturally provide a mathematical formalism in the study of interaction information, for example through the mutual information between two variables [54,56]. In the presence of two modalities and a label, extensions of mutual information to three variables, such as through total correlation [19,63], interaction information [39,53], or partial information decomposition [7,66], have been proposed, and recent work has explored their estimation on large-scale real-world multimodal datasets [31,32]. From a semantic perspective, research in multimedia has studied various relationships that can exist between images and text [37,41], which has also inspired work in representing shared information through contrastive learning [47]. While interaction information and response are naturally related, interaction response can be more fine-grained with respect to individual datapoints.
Finally, the study of interaction mechanics examines how mathematical operators can be used to capture interactions during multimodal fusion. For example, interaction mechanics can be expressed in additive [18], multiplicative [26], tensor [71], non-linear [40], and recurrent [33] forms, as well as logical, causal, or temporal operations [61]. By making assumptions on a specific functional form of interactions (e.g., additive vs non-additive), prior work has been able to quantify their presence or absence [51,59,60] in real-world multimodal datasets and models through studies of architecture-specific attention and parameter weights [], model-agnostic gradient-based visualizations [34,36,62], and projections into simpler models [24,67].
[Figure 3: Sample user interfaces for the annotation schemes. Annotators are instructed to predict the label given one modality only (e.g., muting the audio of videos), or given both modalities, and to score their confidence in each answer on a scale of 0 (no confidence) to 5 (high confidence); panel (c) shows the interface for annotating how the label changes after observing the video modality and then language by the same annotator. For information decomposition, annotators rate, on a scale of 0 (none at all) to 5 (large extent): (1) the extent to which both modalities enable them to make the same predictions about the task, (2) the extent to which modality 1 enables a prediction about the task that modality 2 does not, (3) the extent to which modality 2 enables a prediction that modality 1 does not, and (4) the extent to which both modalities together enable a prediction that they would not otherwise make using either modality individually. Label annotation scales: MOSEI sentiment in [-3, 3] (-3 highly negative, 0 neutral, +3 highly positive); VQA and CLEVR free-text answers; sarcasm in [0, 3] (0 no sarcasm, 3 very sarcastic); humor in [0, 3] (0 no humor, 3 very humorous).]

ANNOTATING MULTIMODAL INTERACTIONS
In order to study interaction response during multimodal fusion, we first review the estimation of partial labels via random assignment, before discussing an alternative approach through counterfactual labels. Finally, we motivate information decomposition into redundancy, uniqueness, and synergy, which offers a different perspective and new benefits for studying multimodal interactions.

Annotating partial labels
The standard approach involves tasking randomly assigned annotators to label their prediction of the label when presented with only the first modality (y_1), the label when presented with only the second modality (y_2), and the label when presented with both modalities (y_12) [14,42,46]. Annotators are typically randomly assigned to each modality so that their labeling process is not influenced by observing other modalities, resulting in independently annotated partial labels. In this setup, the instructions given are: (1) y_1: Show modality 1, and ask the annotator to predict the label. (2) y_2: To another annotator, show only modality 2, and ask the annotator to predict the label. (3) y_12: To yet another annotator, show both modalities, and ask the annotator to predict the label.
After reporting each partial label, the annotators are also asked to report confidence on a 0-5 scale (0: no confidence, 5: high confidence).We show a screenshot of a sample user interface in Figure 3 (top) and provide more annotation details in Appendix A.1.

Annotating counterfactual labels
As an alternative to random assignment, we draw insight from counterfactual estimation, where the same annotator annotates the label given a single modality, before being given the second modality and asked to reason about how their answer changes.
The instructions provided to the first annotator are: (1) y_1: Show modality 1, and ask them to predict the label. (2) y_1+2: Now show both modalities and ask whether their predicted label explicitly changes after seeing both modalities. To a separate annotator, we provide the following instructions: (1) y_2: Show modality 2, and ask them to predict the label. (2) y_2+1: Now show both modalities and ask whether their predicted label explicitly changes after seeing both modalities. The annotators also report confidence on a 0-5 scale (see sample user interface in Figure 3 (middle) and exact annotation procedures in Appendix A.2). While the first method by random assignment estimates the average effect of each modality on the label, as is commonly done in randomized control trials [6] (since estimates of partial labels for each modality are done separately in expectation over all users), this counterfactual approach measures the actual causal effect of seeing the second modality on the label for the same user [1,21,28].

Annotating information decomposition
Finally, we propose an alternative categorization of multimodal interactions based on information theory, which we call information decomposition: decomposing the total information two modalities provide about a task into redundancy, the extent to which individual modalities and both in combination all give similar predictions on the task, uniqueness, the extent to which the prediction depends only on one of the modalities and not the other, or synergy, the extent to which task prediction changes with both modalities as compared to using either modality individually [31,66].
This view of interactions is useful since it has a formal grounding in information theory [49] and information decomposition [66]. Information theory formalizes the amount of information that one variable (x_1) provides about another (x_2), quantified by Shannon's mutual information (MI):

I(x_1; x_2) = sum_{x_1, x_2} p(x_1, x_2) log [ p(x_1, x_2) / (p(x_1) p(x_2)) ],

which measures the amount of information (in bits) obtained about x_1 by observing x_2. By extension, conditional MI is the expected value of the MI of two random variables (e.g., x_1 and x_2) given the value of a third (e.g., y):

I(x_1; x_2 | y) = sum_{x_1, x_2, y} p(x_1, x_2, y) log [ p(x_1, x_2 | y) / (p(x_1 | y) p(x_2 | y)) ].

The most natural extension to three variables, interaction information [39,53], has often been indirectly used as a measure of redundancy in co-training [4,8,11] and multi-view learning [52,54,56,58]. It is defined for three variables as the difference between mutual information and conditional mutual information:

I(x_1; x_2; y) = I(x_1; x_2) - I(x_1; x_2 | y),

and can be defined inductively for more than three variables. However, interaction information has some significant shortcomings: I(x_1; x_2; y) can be both positive and negative, leading to considerable difficulty in its interpretation when redundancy as an information quantity is negative [25,31]. Furthermore, the total information only decomposes into redundancy and uniqueness terms, I(x_1, x_2; y) = I(x_1; x_2; y) + I(x_1; y | x_2) + I(x_2; y | x_1), and there is no measurement of synergy in this framework.
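To make the sign issue concrete, the following toy computation (a sketch assuming NumPy is available; the two distributions are our illustrative examples) evaluates interaction information I(x_1; x_2; y) = I(x_1; x_2) - I(x_1; x_2 | y) for a fully redundant case (x_1 = x_2 = y) and a fully synergistic XOR case (y = x_1 XOR x_2), where it comes out positive and negative, respectively:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) distribution."""
    p = p[p > 1e-12]
    return -np.sum(p * np.log2(p))

def interaction_information(p):
    """I(x1; x2; y) = I(x1; x2) - I(x1; x2 | y) for a joint array p[x1, x2, y]."""
    p1, p2, py = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))
    p12, p1y, p2y = p.sum(axis=2), p.sum(axis=1), p.sum(axis=0)
    i_12 = entropy(p1) + entropy(p2) - entropy(p12)
    # I(x1; x2 | y) = H(x1, y) + H(x2, y) - H(x1, x2, y) - H(y)
    i_12_given_y = entropy(p1y) + entropy(p2y) - entropy(p) - entropy(py)
    return i_12 - i_12_given_y

# Redundant case: a fair coin copied to every variable (x1 = x2 = y).
redundant = np.zeros((2, 2, 2))
redundant[0, 0, 0] = redundant[1, 1, 1] = 0.5

# Synergistic case: x1, x2 independent fair coins, y = XOR(x1, x2).
xor = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        xor[a, b, a ^ b] = 0.25

print(interaction_information(redundant))  # 1.0 bit (reads as "redundancy")
print(interaction_information(xor))        # -1.0 bit (negative!)
```

The negative value in the XOR case is exactly the interpretability problem described above: a single signed quantity cannot distinguish redundancy from synergy.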

Information decomposition.
Partial information decomposition (PID) [66] was designed to solve some of the issues with multivariate information theory. PID is a class of definitions for the redundancy R between x_1 and x_2, the unique information U_1 in x_1 and U_2 in x_2, and the synergy S when both x_1 and x_2 are present, such that the following consistency equations hold (see Figure 4 for a visual depiction):

I(x_1, x_2; y) = R + U_1 + U_2 + S,
I(x_1; y) = R + U_1,
I(x_2; y) = R + U_2,
R - S = I(x_1; x_2; y).

PID resolves the issue of negative I(x_1; x_2; y) in conventional information theory by separating R and S such that R - S = I(x_1; x_2; y), identifying that prior redundancy measures confound actual redundancy and synergy. Furthermore, if I(x_1; x_2; y) = 0, then existing frameworks are unable to distinguish between positive values of true R and S canceling each other out, while PID separates the two and can estimate non-zero (but equal) values of both R and S.

Annotating information decomposition.
While information decomposition has a formal definition and exhibits nice properties, it remains a challenge to scale information decomposition to real-world high-dimensional and continuous modalities [7,31].
To quantify information decomposition for real-world tasks, we investigate whether human judgment can be used as a reliable estimator. We propose a new annotation scheme where we show both modalities and the label and ask each annotator to annotate the degree of redundancy, uniqueness, and synergy on a scale of 0-5, using the following definitions inspired by the formal definitions in information decomposition: (1) R: the extent to which using the modalities individually and together gives the same predictions on the task, (2) U_1: the extent to which x_1 enables you to make a prediction about the task that you would not if using x_2, (3) U_2: the extent to which x_2 enables you to make a prediction about the task that you would not if using x_1, and (4) S: the extent to which only both modalities together enable you to make a prediction about the task that you would not otherwise make using either modality individually, alongside their confidence in their answers on a scale of 0-5. We show a sample user interface for the annotations in Figure 3 (bottom) and include exact annotation procedures in Appendix A.3.

CONVERTING PARTIAL LABELS TO PID
Finally, we propose a method to automatically convert partial labels, which are present in many existing multimodal datasets [14,42,46], into information decomposition interaction values. Define the multimodal label y as y_12 in the case of partial labels, and as the average of y_1+2 and y_2+1 in the case of counterfactual labels. Then, the partial and counterfactual labels are related to redundancy, uniqueness, and synergy in the following ways: (1) R is high when y_1, y_2, and y are all close to each other, (2) U_1 is high when y_1 is close to y but y_2 is far from y, (3) U_2 is high when y_2 is close to y but y_1 is far from y, and (4) S is high when y_1 and y_2 are both far from y.
While these partial labels are intuitively related to information decomposition, coming up with a concrete equation to convert y_1, y_2, and y into actual interaction values is surprisingly difficult and involves many design decisions. For example, what distance measure do we use to measure closeness in label space? Furthermore, computing R depends on three distances, U_1 and U_2 each depend on two distances but inversely on one distance, and S depends on two distances. How do we obtain interaction values that lie on comparable scales so that they can be compared reliably?
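To illustrate how arbitrary a direct conversion would be, here is one naive per-datapoint heuristic (purely a sketch of our own; the distance function, the max/min combination rules, and the scale normalization are all ad hoc choices with no principled justification, which is exactly the problem the information-theoretic conversion below is designed to avoid):

```python
def naive_interactions(y1, y2, y, d=lambda a, b: abs(a - b), scale=3.0):
    """Heuristic (R, U1, U2, S) scores for one datapoint, computed from
    distances between the partial labels y1, y2 and the multimodal label y.
    Every choice below (d, the max/min rules, scale) is arbitrary."""
    a, b, c = d(y1, y) / scale, d(y2, y) / scale, d(y1, y2) / scale
    r = 1.0 - max(a, b, c)   # R: high when y1, y2, and y all agree (3 distances)
    u1 = max(0.0, b - a)     # U1: y1 close to y but y2 far from y
    u2 = max(0.0, a - b)     # U2: y2 close to y but y1 far from y
    s = min(a, b)            # S: both y1 and y2 far from y
    return r, u1, u2, s

# On a [-3, 3] sentiment scale: both unimodal labels miss the true label,
# so the heuristic reads the datapoint as purely synergistic...
print(naive_interactions(0, 0, 3))   # (0.0, 0.0, 0.0, 1.0)
# ...while perfect agreement reads as purely redundant.
print(naive_interactions(2, 2, 2))   # (1.0, 0.0, 0.0, 0.0)
```

Changing the distance, the normalization, or the way distances are combined changes the resulting scores, and nothing guarantees that R, U_1, U_2, and S land on comparable scales across datapoints or datasets.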

Automatic conversion
Our key insight is that the aforementioned issues are exactly what inspired much of the research in information theory and decomposition in the first place: in information theory, the lack of a distance measure is solved by working with probability distributions, where information-theoretic distances like the KL-divergence are well-defined and standardized; the issue of normalization is solved using a standardized unit of measure (bits in log-base 2); and the issue of incomparable scales is solved by the consistency equations (4)-(6) relating PID values to each other and to the total task-relevant information in both modalities.
Armed with these formalisms of information theory and information decomposition, we propose a method to convert human-annotated partial predictions into redundancy, uniqueness, and synergy (see Figure 5 for an overview). To do so, we treat the dataset of partial predictions D = {(y_1, y_2, y)_{i=1..n}} as a joint distribution p, with y_1 and y_2 as 'multimodal inputs' sampled over the label support Y, and the target label y as the 'output', also over Y. Following this, we adopt the precise definitions of redundancy, uniqueness, and synergy used by Bertschinger et al. [7], where the interactions are defined as the solutions to the optimization problems:

R = max_{q in Delta_p} I_q(y_1; y_2; y),    (7)
U_1 = min_{q in Delta_p} I_q(y_1; y | y_2),  U_2 = min_{q in Delta_p} I_q(y_2; y | y_1),    (8)
S = I_p(y_1, y_2; y) - min_{q in Delta_p} I_q(y_1, y_2; y),    (9)

where Delta_p = {q : q(y_1, y) = p(y_1, y), q(y_2, y) = p(y_2, y)} is the set of joint distributions matching the unimodal marginals of p, I_q denotes mutual information computed under q, and only synergy in (9) depends on the full p distribution.

Estimating information decomposition
These optimization problems can be solved accurately and efficiently using convex programming. Importantly, the q* that solves (7)-(9) can be rewritten as the solution to the max-entropy optimization problem:

q* = arg max_{q in Delta_p} H_q(y | y_1, y_2).

Since the support of the label space Y is usually small and discrete for classification, or small and continuous for regression, we can represent all valid joint distributions q(y_1, y_2, y) as a set of tensors Q of shape |Y| x |Y| x |Y|, with each entry representing Q[i, j, k] = q(y_1 = i, y_2 = j, y = k).
The problem then boils down to optimizing over tensors Q that are valid joint distributions and that match the marginals over each modality and the label (i.e., making sure q is in Delta_p). Given a tensor parameter Q, our objective is H_q(y | y_1, y_2), which is concave in Q. This is therefore a convex optimization problem, and the marginal constraints can be written as linear constraints. Given a dataset D = {(y_1, y_2, y)_{i=1..n}}, p(y_1, y) and p(y_2, y) are first estimated, before enforcing q(y_1, y) = p(y_1, y) and q(y_2, y) = p(y_2, y) through linear constraints: the 3D tensor Q summed over the second dimension gives q(y_1, y), and summed over the first dimension gives q(y_2, y). Our final optimization problem is given by

q* = arg max_Q H_q(y | y_1, y_2)  such that  sum_{y_2} Q = p(y_1, y) and sum_{y_1} Q = p(y_2, y).

Since this is a convex optimization problem with linear constraints, CVXPY [13] returns the exact answer q* efficiently. Plugging the learned q* into equations (7)-(9) yields the desired estimates for redundancy, uniqueness, and synergy. Therefore, this estimator can automatically convert partial or counterfactual labels annotated by humans in existing multimodal datasets [14,42,46] into information decomposition interactions, yielding consistent, comparable, and standardized estimates.

EXPERIMENTS
In this section, we design experiments to compare the annotation of multimodal interactions via randomized partial labels, counterfactual labels, and information decomposition into redundancy, uniqueness, and synergy.

Experimental setup
5.1.1 Datasets and tasks. Our experiments involve a large collection of datasets spanning the language, visual, and audio modalities across affective computing and multimedia. We summarize the datasets used in Table 1 and provide more details here: 1. VQA 2.0 [20] is a balanced version of the popular VQA [2] dataset, constructed by collecting complementary images such that every question is associated with a pair of similar images that result in two different answers to the same question. This reduces the occurrence of spurious correlations in the dataset and enables the training of more robust models.
2. CLEVR [27] is a dataset for studying the ability of multimodal systems to perform visual reasoning. It contains 100,000 rendered images and about 853,000 unique automatically generated questions that test visual reasoning abilities such as counting, comparing, logical reasoning, and memory.
3. MOSEI [72] is a collection of 22,000 opinion video clips annotated with labels for subjectivity and sentiment intensity. The dataset includes per-frame and per-opinion annotated visual features, and per-millisecond annotated audio features. Sentiment intensity is annotated in the range [-3, +3]. Videos are collected from YouTube with a focus on video blogs, which reflect real-world speakers expressing their behaviors through monologue videos.
4. UR-FUNNY [23]: Humor is an inherently multimodal communicative tool involving the effective use of words (text), accompanying gestures (visual), and prosodic cues (acoustic). UR-FUNNY consists of more than 16,000 video samples from TED talks annotated for humor, and covers speakers from various backgrounds, ethnic groups, and cultures.
5. MUStARD [9] is a multimodal video corpus for research in sarcasm detection compiled from popular TV shows including Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous. MUStARD consists of 690 audiovisual utterances annotated with sarcasm labels. Sarcasm requires careful modeling of complementary information, particularly when the information from the modalities does not agree.
Overall, the datasets involved in our experiments cover diverse modalities such as images, video, audio, and text, with prediction tasks spanning humor, sarcasm, sentiment, emotions, and question-answering from affective computing and multimedia.

Annotation details.
Participation in all annotations was fully voluntary and we obtained consent from all participants prior to annotations.The authors manually took anonymous notes on all results and feedback in such a manner that the identities of annotators cannot readily be ascertained directly or through identifiers linked to the subjects.Participants were not the authors nor in the same research groups as the authors, but they all hold or are working towards a graduate degree in a STEM field and have knowledge of machine learning.None of the participants knew about this project before their session and each participant only interacted with the setting they were involved in.
We sample 50 datapoints from each of the 5 datasets in Table 1 and give them to a total of 12 different annotators:
• 3 annotators for direct annotation of interactions,
• 3 annotators for partial labeling of y_1, y_2, and y_12,
• 3 annotators for counterfactual labeling of y_1 first and then y_1+2,
• 3 annotators for counterfactual labeling of y_2 first and then y_2+1.
We summarize the results and key findings:

Annotating partial and counterfactual labels
We show the agreement scores of partial and counterfactual labels in Table 2 and note some observations below:
• Comparing partial with counterfactual labels: Counterfactual label agreement (0.70) is similar to randomized label agreement (0.72). In particular, annotating the video-only modality (y_1) for video datasets in the randomized setting appears to be confusing, with an agreement of only 0.51. We hypothesize that this is due to the challenge of detecting sentiment, sarcasm, and humor in videos without audio and when no obvious facial expression or body language is shown. Furthermore, we observe similar confidence in predicting the label when adding the second modality in the counterfactual setting versus showing both modalities upfront in the randomized setting: 4.42 vs 4.68.
• Agreement and confidence across datasets: We examined the agreement for each dataset in the randomized and counterfactual settings respectively. In both settings, we found MOSEI to be the easiest dataset, with the highest agreement of 0.75, 0.60, and 0.65 for annotating y_1, y_2, and y_12, and 0.88, 0.66, 0.83, and 0.91 for annotating y_1, y_1+2, y_2, and y_2+1. Meanwhile, MUStARD is the hardest, with agreement as low as -0.21, 0.04, and 0.17 in the randomized setting. The average confidence for annotating partial labels is actually high (above 3.5) for all datasets except unimodal predictions for VQA and CLEVR, which are as low as 0.43 and 0.33. This is understandable since these two image-based question-answering tasks are quite synergistic and cannot be performed using only one of the modalities, whereas annotator confidence when seeing both modalities is a perfect 5/5.
• Effect of counterfactual order: Apart from a slight decrease in agreement when labeling y_1 first and then y_1+2, and a slight increase in agreement for y_2 and then y_2+1, we do not observe a significant difference caused by the counterfactual order. This is confirmed by qualitative feedback from annotators: one responded that they found no difference between both orders and gave mostly similar responses to both.
Overall, we find that while both partial and counterfactual labels are reasonable choices for quantifying multimodal interactions, the annotation of counterfactual labels yields higher agreement and confidence than partial labels via random assignment.

Annotating information decomposition
We now turn our attention to annotating information decomposition. Referencing the average annotated interactions in Table 3 with agreement scores in Table 2, we explain our findings regarding annotation quality and consistency. We also note qualitative feedback from annotators regarding any challenges they faced.
• General observations on interactions, agreement, and confidence: The annotated interactions align with prior intuitions on these multimodal datasets and do indeed explain the interactions between modalities, such as VQA and CLEVR showing significantly high synergy, as well as language being the dominant modality in sentiment, humor, and sarcasm (high U_1 values). Overall, Krippendorff's alpha for inter-annotator agreement in directly annotating the interactions is quite high (roughly 0.5 for each interaction) and the average confidence scores are also quite high (above 4 for each interaction), indicating that the human-annotated results are reasonably reliable.
• Uniqueness vs synergy in video datasets: There was some confusion between uniqueness in the language modality and synergy in the video datasets, resulting in cases of low agreement in annotating U_1 and S: -0.09 and -0.07 for MOSEI, -0.14 and -0.03 for UR-FUNNY, and -0.08 and -0.04 for MUStARD, respectively. We believe this is due to subjectivity in interpreting whether sentiment, humor, and sarcasm are present in the language only, or present only when contextualizing both language and video.
• Information decomposition in non-video datasets: On non-video datasets, there are cases of disagreement due to the subjective definitions of information decomposition. For example, there was some confusion regarding VQA and CLEVR, where images are the primary source of information that must be selectively filtered by the question. This results in response synergy but information uniqueness. One annotator consistently annotated high visual uniqueness as the dominant interaction, while the other two recognized synergy as the dominant interaction, so the agreement when annotating synergy was low (-0.04).
• On presence vs absence of an attribute: We further investigated the difference between agreement and confidence in the presence or absence of an attribute (e.g., humor or sarcasm). Intuitively, the presence of an attribute is clearer: taking the example of synergy, humans can judge that there is no inference of sarcasm from text only and no inference of sarcasm from the visual modality only, but there is sarcasm when both modalities interact together [9]. Indeed, we examined videos that show and do not show an attribute separately, and found that, in general, humans reached higher agreement on annotating attribute-present videos. The agreement when annotating S is 0.13 when the attribute is present, compared to -0.10 when absent.
Overall, we find that while annotating information decomposition can perform well, there are some sources of confusion regarding certain interactions and during the absence of an attribute.

Converting partial and counterfactual labels to information decomposition
Finally, we present results on converting partial and counterfactual labels into interactions using our information-theoretic method (PID). We report these results in Table 3 in the rows called Partial+PID and Counterfactual+PID, and note the following:
• Partial+PID vs. counterfactual+PID: Comparing conversions from partial and counterfactual labels, we find that the final interactions are very consistent with each other: the highest interaction is always the same across datasets, and the relative order of interactions is also maintained.
• Comparing with directly annotated interactions: PID assigns the largest magnitude to the same interaction that human annotators rate as the highest (S for VQA 2.0 and CLEVR, U1 for UR-FUNNY and MUStARD), so there is strong agreement. For MOSEI there is a small difference: humans annotate both S and U1 as equally high, while PID estimates S as the highest.
• Normalized comparison scale: Observe that the converted results fall into a new scale and range, especially for the MOSEI, UR-FUNNY, and MUStARD video datasets. This is expected, since PID conversion inherits the properties of information theory, where R + U1 + U2 + S add up to the total information that the two modalities provide about a task, indicating that the three video datasets are more subjective and harder to predict.
• Propagation of subjectivity: On humor and sarcasm, the subjectivity in the initial human partial labeling can propagate when we subsequently apply the automatic conversion; after all, we do not expect the automatic conversion to change the relative order, only to estimate interactions in a principled way.
Therefore, we believe the proposed conversion method is a stable way to estimate information decomposition, combining human-in-the-loop labeling of partial labels (which shows high agreement and scales to high-dimensional data) with an information-theoretic conversion that enables comparable scales, normalized values, and well-defined distance metrics.
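The conversion can be sketched end-to-end. The following is a minimal illustration, not the authors' released implementation: it treats a table of partial labels (y1, y2, y) as a joint distribution and solves the marginal-constrained optimization behind PID with a generic SLSQP solver rather than a specialized convex solver; the function and variable names are our own.

```python
import numpy as np
from scipy.optimize import minimize

def _H(p, eps=1e-12):
    """Shannon entropy in bits of a probability array (any shape)."""
    p = p.ravel()
    p = p[p > eps]
    return float(-(p * np.log2(p)).sum())

def _mi(joint):
    """I(A; B) in bits for a 2-D joint distribution p(a, b)."""
    return _H(joint.sum(axis=1)) + _H(joint.sum(axis=0)) - _H(joint)

def partial_labels_to_pid(p):
    """Convert a joint distribution p(y1, y2, y) over partial labels into
    redundancy R, uniqueness U1/U2, and synergy S, by minimizing
    I_q(Y1,Y2; Y) over all q that preserve the pairwise marginals
    p(y1, y) and p(y2, y). p has shape (|Y1|, |Y2|, |Y|), e.g. normalized
    annotation counts."""
    p = np.asarray(p, dtype=float)
    shape = p.shape
    m1, m2 = p.sum(axis=1), p.sum(axis=0)       # p(y1, y), p(y2, y)
    py = p.sum(axis=(0, 1))                     # p(y)

    def i_q(qflat):                             # I_q(Y1,Y2; Y)
        q = qflat.reshape(shape)
        return _H(q.sum(axis=2)) + _H(q.sum(axis=(0, 1))) - _H(q)

    cons = [
        {"type": "eq", "fun": lambda qf: qf.sum() - 1.0},
        {"type": "eq", "fun": lambda qf: (qf.reshape(shape).sum(axis=1) - m1).ravel()},
        {"type": "eq", "fun": lambda qf: (qf.reshape(shape).sum(axis=0) - m2).ravel()},
    ]
    # Feasible start: y1 and y2 conditionally independent given y.
    q0 = m1[:, None, :] * m2[None, :, :] / np.maximum(py, 1e-12)[None, None, :]
    res = minimize(i_q, q0.ravel(), method="SLSQP",
                   bounds=[(0.0, 1.0)] * p.size, constraints=cons)
    q = res.x.reshape(shape)

    total = i_q(p.ravel())                      # I_p(Y1,Y2; Y)
    # U1 = I_{q*}(Y1; Y | Y2); R, U2, S follow from the PID identities.
    u1 = _H(q.sum(axis=2)) + _H(q.sum(axis=0)) - _H(q) - _H(q.sum(axis=(0, 2)))
    r = _mi(m1) - u1                            # R + U1 = I_p(Y1; Y)
    u2 = _mi(m2) - r                            # R + U2 = I_p(Y2; Y)
    s = total - (r + u1 + u2)                   # R + U1 + U2 + S = total
    return {"R": r, "U1": u1, "U2": u2, "S": s}
```

On a toy "redundant" distribution (y1 = y2 = y) this recovers R of roughly 1 bit with the other terms near zero, while an XOR-style label (y = y1 XOR y2, with each modality alone uninformative) yields S of roughly 1 bit, matching the intuitions behind the two interactions.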

An overall guideline
Given these findings, we summarize the following guidelines for quantifying multimodal fusion interactions:
• For modalities and tasks that are more objective (e.g., visual question answering), direct annotation of information decomposition is a reliable alternative to conventional methods of partial and counterfactual labeling to study multimodal interactions.
• For modalities and tasks that may be subjective (e.g., sarcasm, humor), it is useful to obtain counterfactual labels before using PID conversion to information decomposition values, since counterfactual labeling exhibits higher annotator agreement while PID conversion is a principled method to obtain interactions.

CONCLUSION
Our work aims to quantify various categorizations of multimodal interactions using human annotations. Through a comprehensive study of partial labels, counterfactual labels, and information decomposition, we elucidated several pros and cons of each approach and proposed a hybrid estimator that converts partial and counterfactual labels into information decomposition interaction estimates. On real-world multimodal fusion tasks, we show that we can estimate interaction values accurately and efficiently, which paves the way towards a deeper understanding of these multimodal datasets.
Limitations and future work: The annotation schemes in this work are limited by the subjectivity of the modalities and the task. Automatic conversion of partial labels to information decomposition requires the label space to be small and discrete (i.e., classification), and does not yet extend to regression or text answers unless approximate discretization is performed first. Future work can also scale up human annotations to more datapoints and fusion tasks, and ask annotators to explain ratings that have low agreement. Finally, we are aware of the challenges in evaluating interaction estimation and emphasize that estimates should be interpreted as a relative indication of which interaction is most important, and as a guideline to inspire model selection and design.
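The discretization step mentioned above can be as simple as binning. A minimal sketch (the bin count, range, and function name are illustrative choices of ours, not values from the paper):

```python
import numpy as np

def discretize_labels(y, low=-3.0, high=3.0, n_bins=3):
    """Map continuous annotations (e.g., sentiment scores in [-3, 3]) to
    discrete bins so a PID-style conversion can treat them as a
    classification label space."""
    edges = np.linspace(low, high, n_bins + 1)[1:-1]  # interior bin edges
    return np.digitize(np.clip(y, low, high), edges)

# e.g. 3 bins over [-3, 3]: negative / neutral / positive
# discretize_labels([-2.5, 0.1, 2.9])  ->  array([0, 1, 2])
```

Finer-grained binning trades off label-space size (and hence annotation sparsity per cell of the joint distribution) against fidelity to the original regression target.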

Figure 1 :
Figure 1: We study human annotation of multimodal fusion interactions via three categorizations: (1) partial labels, in which different randomly assigned annotators annotate the task given the first (y1), second (y2), and both modalities (y12); (2) counterfactual labels, where the same annotator is tasked to annotate the label given the first modality (y1) before being asked to reason about how their answer changes when given the second (y1+2), and vice versa (y2 and y2+1); and (3) information decomposition, where annotators annotate the degrees of modality redundancy, uniqueness, and synergy in predicting the task. Finally, we also propose a scheme based on PID [66] to automatically convert annotations of partial and counterfactual labels to information decomposition.

Figure 2 :
Figure 2: Categories of interaction response: redundancy occurs when either modality alone and both together give similar task predictions; uniqueness measures whether the prediction depends on one of the modalities and not the other; and synergy measures how the prediction changes with both modalities compared to using either modality individually.
Sample user interface for annotating partial labels of the video modality.
Sample annotation instructions: annotate humor on a scale of [0, 3], where 0 is no humor and 3 is very humorous, then score your confidence in your answer on a scale of 0 (no confidence) to 5 (high confidence). Now look at both input modalities 1 and 2: does your predicted label change after seeing both modalities? Please annotate the new label after seeing both, then score your confidence in your answer on a scale of 0 (no confidence) to 5 (high confidence).
Sample datapoints used for annotation:
mosei1 (video): "... false Christians into thinking they are already safe and secure and on their way to heaven."
mosei2 (video): "Currently, over four-hundred thousand Americans with disabilities are being paid less than the minimum wage, some of them mere pennies per hour."
mosei3 (video): "So once again, we call on the regime to cease these absolutely senseless attacks, which are, of course, violations of the cessation of hostilities."
mosei4 (video): "Hey hey hey, Sherri Brown here with a super sweet and awesome promotion for anybody looking to take their business to the next level."
mosei5 (video): "I am absolutely optimistic, I think that there is no other way to be."
mosei6 (video): "In 2008, there were very serious tribal warfare in Kenya following a disputed election."
mosei7 (video): "They're not going to give this thing any floor time at all because they know that Donald Trump is not going to divest from his corporations."
mosei8 (video): "And you saw some of the leading brands out there like McDonalds, like General Mills and Coca Cola, who are doing such a great job of tapping into those insights and bringing them to life in their advertising."
mosei9 (video): "Know who's in it and follow the B2B CFO® golden rule: Let the Finders find, Minders mind, and Grinders grind."
mosei10 (video): "You have been invited to a special event that includes an invitation, so let's take a look at how should we respond to it."
clevr1 (image): "The other small shiny thing that is the same shape as the tiny yellow shiny object is what color?"
clevr2 (image): "How many large things are either cyan metallic cylinders or yellow blocks?"
clevr3 (image): "There is a yellow thing to the right of the rubber thing on the left side of the gray rubber cylinder; what is its material?"
vqa1 (image): "How many pictures are there?"
vqa2 (image): "What color is the woman's shirt on the left?"
vqa3 (image): "Is this a hospital?"
vqa4 (image): "What is the green fruit called?"
vqa5 (image): "What are these animals?"
vqa6 (image): "Is the door closed?"
(d) Sample user interface for annotating information decomposition of redundancy, uniqueness, and synergy.

Figure 4 :
Figure 4: Partial information decomposition gives a principled way to estimate the interactions that are redundant between two modalities, unique to one modality, and synergistic, arising only when both modalities are present.

Figure 5 :
Figure 5: Overview of our proposed method to convert partial or counterfactual labels to information decomposition values. We treat the dataset of partial labels D = {(y1, y2, y)} as a joint distribution with y1 and y2 as 'multimodal inputs' and the target label y as the 'output'. Estimating response redundancy, uniqueness, and synergy then boils down to solving a convex optimization problem with marginal constraints, which can be done accurately and efficiently. This method is applicable to many annotated multimodal datasets and yields consistent, comparable, and standardized interaction estimates.

Table 1 :
Collection of datasets used for our study of multimodal fusion interactions, covering diverse modalities, tasks, and research areas in multimedia and affective computing.

i ∈ {1, 2}}, and the notation I_p(·) and I_q(·) disambiguates MI under the joint distributions p and q respectively. The key difference in this definition of PID lies in optimizing q ∈ Δ_p to satisfy the marginals q(y_i, y) = p(y_i, y), while relaxing the coupling between y1 and y2: q(y1, y2) need not equal p(y1, y2). The intuition behind this is that one should be able to infer redundancy and uniqueness given only p(y1, y) and p(y2, y), and therefore they should depend only on q ∈ Δ_p which match these marginals. Synergy, however, requires knowing the coupling p(y1, y2), and this is reflected in equation (
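Spelled out, the optimization this passage refers to takes roughly the following form (a hedged reconstruction consistent with the Δ_p definition above and the Bertschinger-style PID of [66]; the exact equations and their numbering are in the original):

```latex
\begin{align*}
\Delta_p &= \{\, q \in \Delta : q(y_i, y) = p(y_i, y) \;\; \forall\, i \in \{1,2\} \,\},\\
U_1 &= \min_{q \in \Delta_p} I_q(y_1; y \mid y_2), \qquad
U_2 = \min_{q \in \Delta_p} I_q(y_2; y \mid y_1),\\
R &= I_p(y_1; y) - U_1, \qquad
S = I_p(y_1, y_2; y) - \min_{q \in \Delta_p} I_q(y_1, y_2; y),
\end{align*}
```

so that R and the two uniqueness terms depend only on the pairwise marginals, while synergy is exactly the part of the total information that requires knowing the coupling p(y1, y2).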

Table 3 :
Comparing (1) direct human annotation of information decomposition, (2) converting human-annotated partial labels to interactions via PID, and (3) converting human-annotated counterfactual labels to interactions via PID.