research-article
Open Access

On Modality Bias Recognition and Reduction

Published: 25 February 2023


Abstract

Making each modality in multi-modal data contribute is of vital importance to learning a versatile multi-modal model. Existing methods, however, are often dominated by one or a few modalities during model training, resulting in sub-optimal performance. In this article, we refer to this problem as modality bias and attempt to study it in the context of multi-modal classification systematically and comprehensively. Through several empirical analyses, we recognize that one modality affects the model prediction more simply because this modality has a spurious correlation with instance labels. To facilitate the evaluation of the modality bias problem, we construct two datasets, for the colored digit recognition and video action recognition tasks, respectively, in line with the Out-of-Distribution (OoD) protocol. Together with the benchmarks in the visual question answering task, we empirically demonstrate the performance degradation of existing methods on these OoD datasets, which serves as evidence of modality bias learning. In addition, to overcome this problem, we propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned according to the training set statistics. Thereafter, we apply this method to 10 baselines in total to test its effectiveness. The results on four datasets regarding the above three tasks show that our method yields remarkable performance improvements over the baselines, demonstrating its superiority in reducing the modality bias problem.


1 INTRODUCTION

Real-world data often exhibit multiple modalities. Learning a versatile multi-modal model has thus attracted increasing research interest from academic and industrial practitioners. Existing cutting-edge technologies mostly resort to multi-modal machine learning or deep learning [28, 66], wherein considerable advancement has been brought by the computer vision and natural language processing communities. Despite the promising progress achieved, we recognize one detrimental problem in multi-modal learning, which severely degrades the decision-making integrity: current models prefer trivial solutions due to the shortcut between targets and certain modalities. This phenomenon is ubiquitous among various multi-modal learning tasks, including classification, generation, and clustering. In this work, we shed light on the context of multi-modal classification, which takes as input the mixed modality data and outputs a human-friendly semantic label.

The shortcut is due to the strong correlation between semantic labels and specific modalities in multi-modal classification. The language prior problem in Visual Question Answering (VQA) [1, 26, 50, 52] serves as one typical manifestation. It refers to blindly answering questions without performing visual reasoning over images, since there exists a spurious connection between question types and answers (Figure 1). In fact, this issue can be attributed to the modality bias influence, namely, one modality (language) dominates the class prediction more than the other (vision). Inspired by this, we extend the problem to a broader scope and comprehensively study it from a larger perspective: the modality bias problem in multi-modal learning.

Fig. 1.

Fig. 1. Modality bias problem illustration over three multi-modal tasks. Both training and testing sets are independently and identically distributed, which drives the model decision during testing toward the modality bias in the training set. For (a), (b), and (c), the biased modality is color, language, and object (frame), respectively.

The modality bias problem is noticeably common in multi-modal learning. As illustrated in the two examples from Figure 1(a) and (c): for colored digit recognition, the color modality overwhelms the shape [22] when predicting digits, and for video action recognition, the motion modality is restrained by the frame. As a matter of fact, this poses several drawbacks to model learning: the robustness and explainability of the existing methods are largely limited, and generalization across datasets becomes impossible, to name a few. Nevertheless, to the best of our knowledge, this problem has remained largely unaddressed thus far. Toward this end, this work highlights the modality bias problem for the multimedia research community.

Evaluating such a problem is rather difficult, as a simple model can achieve satisfactory performance on current benchmarks due to the Independently and Identically Distributed (I.I.D) property of training and testing sets. Therefore, comparisons among methods reveal little about whether the modality bias problem is alleviated. Thanks to the recent progress on Out-of-Distribution (OoD) generalization, Agrawal et al. [1] proposed to re-split the VQA datasets to build their associated OoD counterparts. In the newly curated benchmarks, the training and testing sets have different prior distributions of answers for each question type. For instance, the answer 2 is the most frequent one for how many during training, while the answer 1 becomes the dominating answer in the testing set. This violates the I.I.D property that traditional machine learning algorithms take as standard. In this way, the modality bias is manually cut off, since a VQA model trained on such datasets cannot leverage the shortcut between questions and answers to perform well. Following this paradigm, we further construct OoD datasets for the colored digit recognition and video action recognition tasks. In particular, the label distribution with respect to the biased modality is made dissimilar between the training and testing sets. As a result, when evaluating some strong baselines on these datasets, drastic performance degradation can be observed (Section 3).

In addition to the OoD benchmark construction, we extend our previous work in Reference [26] to a novel and generic loss function, namely Multi-Modal De-Bias (MMDB), from the viewpoint of feature space learning. To implement this, we transform the feature space from Euclidean to Cosine, wherein the decision boundary is determined only by the angle between the multi-modal fused feature vector and the final classification weight matrix. Specifically, an adaptive margin is introduced to achieve the goal that frequent and sparse classes take broader and tighter spaces, respectively. To evaluate the effectiveness of the proposed MMDB, we apply it to 10 baselines in total across three typical multi-modal classification tasks. The experimental results show that our MMDB can enhance the baselines with significant performance gains, while introducing no inference overhead.

In summary, the contribution of this article is fourfold:

  • We systematically study the modality bias problem in the context of multi-modal classification. To the best of our knowledge, we are the first to present and investigate this problem for multi-modal learning from a comprehensive view.

  • To facilitate the evaluation of this problem, we construct several Out-of-Distribution datasets for benchmarking purposes.

  • A novel multi-modal de-bias loss function method based on feature space learning is devised to reduce the modality bias problem. Notably, the proposed method is model-agnostic, can be integrated into any existing approach, and demands zero incremental inference time.

  • We apply this loss function to various baselines over three multi-modal classification tasks. When equipped with our method, promising performance improvements over the baselines can be observed on the OoD datasets. As a by-product, we achieve a new state of the art on two publicly available VQA-CP benchmarks for the VQA task. The code has been released to facilitate further research along this line.1

Our prior work [26] presents a de-bias loss function for tackling the language prior problem in VQA, and this article extends it in the following aspects: (1) We formally define and comprehensively study the modality bias problem for multi-modal learning, while Reference [26] focuses on the VQA task only. (2) We apply the de-bias loss function to two more tasks, i.e., colored digit recognition and video action recognition. For the VQA task, another strong baseline, LXMERT [56], is explored with the equipment of our loss function and aids us in achieving a new state of the art on two VQA-CP benchmarks. (3) We provide an empirical explanation of the proposed de-bias loss function in Section 4.

The rest of this article is organized as follows. In Section 2, we briefly review the related literature. We then present the recognition and reduction of the modality bias problem in Sections 3 and 4, respectively. Next, the experimental settings are detailed in Section 5, followed by the results over the three tasks in Section 6. We summarize this article and discuss possible future work in Section 7.


2 RELATED WORK

2.1 Bias Identification and Mitigation

The bias problem has long been recognized as an issue of concern in AI algorithms [39, 44, 58, 60]. Though effectively leveraging the bias can achieve acceptable results, such methods become less reliable and less robust when generalizing across diverse datasets. In the following, we exemplify this problem in both the vision and language domains.

Existing studies often attribute the bias in images or videos to certain unbalanced attributes [14]. For instance, researchers have discovered the age bias [44] and texture bias [33, 58] in image classification. In addition, human face datasets exhibit strong racial bias [23] and gender bias [46], which seriously damage face recognition accuracy. Moreover, three-dimensional face images are generally accompanied by different poses and lighting conditions [39]. To tackle these challenges, studies have been devoted to transfer learning, domain adaptation [60], adversarial learning [61], and utilizing external knowledge.

The most ubiquitous bias in language is the semantic bias learned from large corpora. Language modeling serves as the foundation of natural language processing and has been extensively proven to introduce discrimination into its embeddings [15, 55]. These embeddings often involve unintended correlations and societal stereotypes (e.g., connecting medical doctors more frequently to males than females [10]). Perez et al. [49] studied the bias from language modeling in the context of offensive content detection. To address this problem, balancing the existing datasets from a statistical view is a popular option. For instance, one can augment the original data with external labeled data, oversample or downsample, weight samples [63], or swap identity terms [48]. Reference [18] appends non-toxic samples containing identity terms from Wikipedia articles to the training data.

Orthogonal to the above-mentioned methods, in this work, we explore the bias from the perspective of modalities. In fact, some modalities show stronger correlations with labels than others, and learning on such data often results in a severe over-fitting problem. To pinpoint the importance and influence of this issue, this article, for the first time, comprehensively studies the modality bias problem in multi-modal learning.

2.2 Language Prior Problem in VQA

Considerable efforts have been devoted to the language prior problem in VQA, as most VQA models blindly answer questions without performing visual reasoning on images [1, 25, 52, 57]. Current studies can be grouped into the following two categories.

Dataset Re-balancing. Crowd-sourcing with human annotators makes it difficult to circumvent biases in VQA datasets. Since the presentation of the first large-scale VQA dataset [4], the bias problem has impeded the development of more generally applicable methods. Even after re-balancing efforts, References [1, 25, 27] demonstrate that the bias still remains, which can potentially induce VQA models to learn language priors. In view of this, VQA-CP [1] was later curated through data re-splitting. Consequently, the answer distributions of the training and testing sets are distinct with respect to question types (e.g., the most frequent answers in the training and testing sets can be 2 and 1, respectively, for the question type how many). The performance of many VQA models drops significantly on the VQA-CP datasets. More recent studies construct brand-new datasets following the answer distribution balancing rule to avoid the language prior problem [35].

Model De-bias. Balancing datasets is often time-consuming and labor-intensive, and thus some methods make efforts to directly counter this problem. They mainly fall into two groups: single-branch and two-branch models. Specifically, single-branch methods are devised to enhance the visual feature learning in VQA [54, 62]. For example, HINT [54] and SCR [62] align the region importance with additionally collected human attention maps. VGQE [36] considers the visual and textual modalities equally when encoding the question, where the question features include information from both modalities. Differently, two-branch methods mostly introduce another question-only branch to deliberately capture the language priors, followed by the question-image branch to restrain it. For example, Q-Adv [52] trains the above two models in an adversarial way, minimizing the loss of the question-image model while maximizing that of the question-only one. More recent fusion-based methods [9, 13, 16] employ the late fusion strategy to combine the two predictors and guide the model to pay more attention to those answers that cannot be correctly addressed by the question-only branch.

2.3 Out-of-Distribution Generalization

The I.I.D property is taken as the de facto assumption in traditional machine learning algorithms. However, some studies point out that robustness and generalization are greatly limited by distribution shifts [7, 30]. To deal with this, Out-of-Distribution evaluation has been advocated, wherein the testing data come from a distribution different from that of the training data. For instance, Recht et al. [53] built the ImageNetV2 benchmark to capture the naturally occurring distribution shift, on which previously strong baselines exhibit a dramatic performance drop [20]. In addition, some methods approach this new challenge with domain-invariant learning [19], feature decomposition [6], and pre-training [31].

The OoD property offers a desirable criterion for testing a model's generalization capability on real-world data. In view of this, we, for the first time, curate Out-of-Distribution versions of datasets for colored digit recognition and video action recognition. These benchmarks are essential to diagnose the modality bias problem and support its evaluation.


3 MODALITY BIAS PROBLEM RECOGNITION

A good multi-modal model is expected to make predictions using informative features from all modalities. Existing methods have pushed the boundaries of various multi-modal benchmarks. Nevertheless, the improved performance is actually somewhat misleading, as both training and testing sets follow the I.I.D property. In this way, a model fitted on the training set may take a shortcut to perform well on its counterpart testing set. One undesirable shortcut recognized by this article is the bias inherent in modalities, which refers to making predictions based on the correlation between certain factors from one modality and the labels. Take the colored digit recognition task as an example: if all the 0 digits are colored blue, then it is effortless for current strong deep learning models to bias toward this color modality while ignoring the discriminative shape modality.

In the following, we illustrate this problem from two aspects: performance degradation on OoD datasets and prediction toward modality bias.

3.1 Performance Degradation on OoD Datasets

The past few years have witnessed an increasing interest in OoD generalization [7, 30]. It offers a strong test bed for evaluating the generalization capability of existing biased methods, as the label distributions of the training and testing sets are made distinct from each other. Agrawal et al. [1] curated the OoD versions of traditional VQA datasets [4, 24], in which the answer distributions of each question type are significantly different between the training and testing sets. We re-implemented three baselines, including two well-studied methods (i.e., Counter [64] and UpDn [2]) and a recently developed BERT-based one (LXMERT [56]), and tested their performance on both the in-domain and OoD versions of two VQA datasets. The results in Table 1 demonstrate that the performance of all these methods drops drastically on the OoD datasets (see the All category). For instance, the performance of Counter is almost halved on the two VQA-CP datasets with respect to the All category. This phenomenon is mainly attributed to blind model learning, i.e., answering questions without the visual information. On the in-domain datasets, the answer distribution under each question type is consistent between the training and testing sets, e.g., 2 is the most frequent answer for how many. When it comes to OoD, how many questions may frequently correspond to 1 in the testing set. Therefore, models leveraging such shortcuts suffer on these datasets.

Table 1.

(a) Accuracy comparison on the VQA v2 dataset and its OoD version VQA-CP v2.

|              | VQA v2 Val (In-Domain)        | VQA-CP v2 Test (OoD)          |
| Method       | Y/N   | Num.  | Other | All   | Y/N   | Num.  | Other | All   |
| Counter [64] | 81.63 | 47.12 | 56.54 | 64.73 | 41.01 | 12.98 | 42.69 | 37.67 |
| UpDown [2]   | 79.87 | 41.73 | 52.29 | 61.27 | 49.78 | 14.07 | 43.42 | 40.79 |
| LXMERT [56]  | 88.34 | 56.64 | 65.78 | 73.06 | 46.70 | 27.14 | 61.20 | 51.78 |

(b) Accuracy comparison on the VQA v1 dataset and its OoD version VQA-CP v1.

|              | VQA v1 Val (In-Domain)        | VQA-CP v1 Test (OoD)          |
| Method       | Y/N   | Num.  | Other | All   | Y/N   | Num.  | Other | All   |
| Counter [64] | 84.39 | 42.12 | 56.29 | 65.03 | 39.12 | 13.09 | 42.35 | 36.11 |
| UpDown [2]   | 82.58 | 37.81 | 51.59 | 61.46 | 43.76 | 12.49 | 42.57 | 38.02 |
| LXMERT [56]  | 79.79 | 40.59 | 62.00 | 65.97 | 54.08 | 25.05 | 62.72 | 52.82 |

Table 1. Accuracy Comparison from In-Domain and OoD Evaluations for the Visual Question Answering Task

As for the colored MNIST dataset, we constructed its OoD version based on the rule that the colors for each digit are distinct between the training and testing sets. We evaluated three methods and report the results in Table 2. Similar observations can be made for this task once the correlation between the color modality and the labels is cut off: the models cannot generalize well on this dataset due to modality bias learning.

Table 2.

|               | In-Domain                | OoD                        |
| Method        | ACC         | Loss       | ACC           | Loss       |
| MLPs          | 99.18 ± .17 | 0.05 ± .02 | 55.55 ± 1.36  | 1.31 ± .03 |
| LeNet [40]    | 99.36 ± .15 | 0.03 ± .01 | 57.39 ± 11.22 | 1.38 ± .12 |
| ResNet18 [29] | 99.02 ± .53 | 0.03 ± .01 | 40.19 ± 12.71 | 1.57 ± .16 |

Table 2. Accuracy Comparison from In-Domain and OoD Evaluations on the Colored MNIST Dataset
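To make the construction rule concrete, the following is a minimal sketch of how such a colored OoD split could be generated. The color indices, bias strength, and rotation-by-one mapping are illustrative assumptions, not the exact pipeline used for the dataset: each digit is paired with a dominant color during training, and the digit-to-color mapping is rotated for the test split so the spurious correlation no longer holds.

```python
import numpy as np

def assign_colors(labels, n_colors=10, shift=0, p_bias=0.9, seed=0):
    """Pair each digit label with a dominant color index; `shift` rotates
    the digit-to-color mapping so that train/test pairings differ (OoD)."""
    rng = np.random.default_rng(seed)
    dominant = (labels + shift) % n_colors           # biased color per digit
    use_dominant = rng.random(len(labels)) < p_bias  # keep the bias w.p. p_bias
    random_colors = rng.integers(0, n_colors, len(labels))
    return np.where(use_dominant, dominant, random_colors)

digits = np.arange(10).repeat(200)
train_colors = assign_colors(digits, shift=0, seed=0)  # most 0s get color 0
test_colors = assign_colors(digits, shift=1, seed=1)   # most 0s get color 1 (OoD)
```

A model that latches onto the color shortcut during training will then be systematically misled at test time, reproducing the degradation reported in Table 2.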

In addition, we also constructed OoD datasets for Kinetics-400 and Kinetics-700, two widely exploited benchmarks for video action recognition. Specifically, we made the action distribution with respect to the detected object different in the re-constructed dataset. For example, the most frequent action for the object apple is picking fruit in the training set, while other actions dominate in the testing set. In this way, the bias from the frame modality is manually removed to some extent. With this operation, we evaluated the I3D network [11] under two backbones and show the results in Table 3. The model performance also drops, though the degradation is not as severe as in the above two tasks. One reason might be that the actions are relatively balanced with respect to the most dominant object, human.

Table 3.

|                | In-Domain       | OoD             |
| Method         | Acc@1 | Acc@5   | Acc@1 | Acc@5   |
| I3D-ResNet50   | 55.92 | 80.53   | 52.20 | 80.47   |
| I3D-ResNet101  | 57.70 | 81.59   | 53.64 | 81.54   |

Table 3. Performance Comparison from In-Domain and OoD Evaluations on the Kinetics-400 Dataset

3.2 Prediction Toward Modality Bias

To further understand how the models predict toward the modality bias, we computed the Jensen-Shannon Divergence (JSD) values between the biased label distribution and the model output of wrongly predicted instances. The results are illustrated in Figure 2. Ideally, without any biases, the wrong labels predicted by models should be more diverse, roughly following the uniform distribution over all classes instead of being proportional to the label distribution conditioned on the biased modality in the training set. However, as shown in this figure, most JSD values are below 0.5, implying that the two distributions are very similar. This indicates that models tend to provide labels according to the patterns observed between the biased modality and labels in the training set rather than performing reasoning for the current instance. For example, a large portion of images with blue digits are mis-predicted as 0 for the colored digit recognition task, which corresponds to the most frequent digit with the blue color in the training set.

Fig. 2.

Fig. 2. Jensen–Shannon Divergence values computed from two distributions: The label distribution with respect to the biased modality (i.e., color for colored digit recognition, question type for VQA, and object for video action recognition) and the model outputted scores of incorrect instances.
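For reference, the JSD between two discrete distributions can be computed as follows. This is a generic sketch using base-2 logarithms (so values lie in [0, 1]); the exact preprocessing of model outputs used in our experiments is omitted.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon Divergence between two discrete distributions
    (base-2 logs, so the result lies in [0, 1])."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log2(a + eps) - np.log2(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions give 0; disjoint ones approach 1.
```

A JSD near 0 between the wrong-prediction distribution and the biased training label distribution is therefore evidence that the model is echoing the training-set bias rather than reasoning per instance.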


4 MODALITY BIAS REDUCTION

A typical multi-modal classification model can be abstracted into three consecutive stages: (a) multi-modal input representation, (b) multi-modal fusion, and (c) classification. The first stage takes as input the raw multi-modal data and outputs the embedded features. Thereafter, the features are fused in carefully designed manners, including but not limited to simple concatenation [47] and addition [66]. Finally, a cross-entropy loss function is employed to map the fused features to one or a few classes. It is worth noting that the modality bias problem can be triggered by any of the above three stages. In this work, from a generic view, we aim to tackle this problem via classifier balancing and design a novel loss function. To this end, our objective becomes letting the frequent and sparse classes take broader and tighter feature spans, respectively, in the final feature space.

4.1 Formulation Background

Following the prevalent formulation, we consider the multi-modal classification as a multi-class single-/multi-label classification problem. That is, for input data with multiple modalities \(M=\lbrace M_1, M_2, \ldots M_n\rbrace\), the objective function is given by (1) \(\begin{equation} \hat{y} = \mathop {\arg \max }_{y \in \Omega } p(y|M_1, M_2, \ldots M_n; \Theta), \end{equation}\) where \(\Omega\) and \(\Theta\) denote the available class set and the model parameters, respectively.

SoftMax. The most popular cross-entropy loss function (we tag it as SoftMax) is then formulated as (2) \(\begin{equation} \begin{aligned}L_{softmax} &= \sum _{i=1}^{|\Omega |} - y_i \log p_i \\ &= \sum _{i=1}^{|\Omega |} - y_i \log \frac{\exp (\mathbf {W}_i^T \mathbf {x})}{\sum _{j=1}^{|\Omega |} \exp (\mathbf {W}_j^T \mathbf {x})}, \end{aligned} \end{equation}\) where \(\mathbf {W}\) and \(\mathbf {x}\) denote the weight matrix and the feature vector directly adjacent to the class prediction, respectively. Note that for single-label classification, only one label \(y_i\) equals 1 while the others in the label vector \({\bf y}\) are kept as 0. In the multi-label scenario, the involved ground-truth can be smoothed within the range of \((0, 1]\). Besides, we remove the bias vector for simplicity, as we found it contributes little to the final model performance.

Normalized SoftMax loss (NSL). Recently, some studies have been dedicated to challenging the domination of the traditional SoftMax loss function for classification tasks [17, 59]. Among these efforts, switching from the Euclidean space to the Cosine space has proven to be an intriguing direction, which applies L2 normalization to both the final features and the weight vectors. By removing radial variations, it relieves the need for joint supervision of the norm and angle in the SoftMax loss. Following this idea, we first apply L2 normalization to the weight vector \(\mathbf {W}_i\) and the feature vector \(\mathbf {x}\) [51], i.e., \(||\mathbf {W}_i||_2=1\) and \(||\mathbf {x}||_2=1\), which ensures that the posterior probability is determined only by the angle \(\theta _i\) between \(\mathbf {W}_i\) and \(\mathbf {x}\), given that \(\mathbf {W}_i^T \mathbf {x} = ||\mathbf {W}_i||\, ||\mathbf {x}|| \cos \theta _i\). Accordingly, the feature space is converted from the Euclidean space to the Cosine one. We then provide the modified NSL [59] as follows: (3) \(\begin{equation} L_{nsl} = \sum _{i=1}^{|\Omega |} - y_i \log \frac{\exp {(s \times \cos {\theta _i})}}{\sum _{j=1}^{|\Omega |} \exp {(s \times \cos {\theta _j})}}, \end{equation}\) where s is a scale factor for more stable computation.
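As an illustration of the normalization step, NSL can be sketched in a few lines. This is a NumPy toy version for a single sample (real implementations operate on batches inside a deep learning framework), with an illustrative scale value:

```python
import numpy as np

def nsl_loss(x, W, label, s=16.0):
    """Normalized SoftMax Loss: L2-normalize the feature and each class
    weight so the logits become s * cos(theta_i), then apply cross entropy."""
    x = x / np.linalg.norm(x)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)  # one row per class
    logits = s * (W @ x)                              # s * cos(theta_i)
    logits = logits - logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label])

# Radial variations are removed: rescaling x leaves the loss unchanged.
x = np.array([1.0, 2.0, -0.5])
W = np.array([[0.3, 1.0, 0.2], [-1.0, 0.5, 0.8]])
```

Since only the angle survives the normalization, the loss depends on the feature direction alone, which is exactly the property the Cosine space provides.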

LMCL. To achieve a more discriminative classification boundary, LMCL [59] introduces a fixed cosine margin to NSL, (4) \(\begin{equation} L_{lmcl} = \sum _{i=1}^{|\Omega |} - y_i \log \frac{\exp {s (\cos {\theta _{i}} - m)}}{\sum _{j \ne i} \exp {s \times \cos {\theta _j}} + \exp {s (\cos {\theta _{i}} - m)}}, \end{equation}\) where m denotes the fixed cosine margin. Compared to the Euclidean space, the Cosine space is relatively easy to manipulate, as the margin range reduces from \((-\infty , +\infty)\) to \([-1, 1]\).

Regarding the implementation, we found that applying a fixed cosine margin cannot obtain satisfactory results. The key reason is that label distribution is highly skewed given the biased modality, resulting in the incapability of learning a sufficient representation with a fixed margin in the Cosine space. In the next subsection, we will introduce a more sophisticated adapted margin cosine loss to overcome this issue.

4.2 Proposed Method

The results in our experiments (see Section 6.2) explicitly demonstrate that a fixed cosine margin yields limited improvements or even degrades the model performance. Based upon this observation, we argue that an adaptive cosine margin is more favorable for tackling the bias problem in multi-modal classification. In view of this, a new loss function named MMDB is defined as (5) \(\begin{equation} \left\lbrace \begin{aligned}& L_{MMDB} = \sum _{i=1}^{|\Omega |} - y_i \log \frac{\exp {s (\cos {\theta _{i}} - m_i)}}{\sum _{j=1}^{|\Omega |} \exp {s (\cos {\theta _j} - m_j)}}, \\ & m_i = 1 - \bar{m}_i, \\ & \bar{m}_i = \frac{n_i^k + \epsilon }{\sum _{j=1}^{|\Omega |}n_j^k + \epsilon }, \end{aligned} \right. \end{equation}\) where \(m_i\) is the adapted margin for label i, which is estimated solely on the biased modality, \(n_i^k\) denotes the number of instances with label i under the biased modality value \(b_k\) (e.g., a color for colored digit recognition) in the training set, and \(\epsilon = 10^{-6}\) is a small constant for numerical stability. The underlying intuition is that, for the given biased modality, the frequent classes span broader regions in the Cosine space (smaller margins), while sparse classes span tighter ones (larger margins). In other words, frequent classes imply more training samples, which require a broader feature space to be sufficiently covered. In contrast, a tighter feature space is acceptable for sparse classes, as the number of training samples is much smaller. This setting enables the models to place better margins in the Cosine feature space.
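The margin computation of Eq. (5) and the resulting loss can be sketched as follows. This is a single-sample NumPy toy version; the variable names and the scale value are illustrative assumptions only:

```python
import numpy as np

def adaptive_margins(counts, eps=1e-6):
    """Eq. (5): bar_m_i is the label frequency under the current biased
    modality value; m_i = 1 - bar_m_i, so frequent classes get smaller
    margins (broader spans) and sparse classes get larger ones."""
    counts = np.asarray(counts, dtype=float)
    m_bar = (counts + eps) / (counts.sum() + eps)
    return 1.0 - m_bar

def mmdb_loss(x, W, label, margins, s=16.0):
    """MMDB for one sample: cross entropy over s * (cos(theta_i) - m_i)."""
    x = x / np.linalg.norm(x)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    logits = s * (W @ x - margins)        # s * (cos(theta_i) - m_i)
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label])

# 90 training samples of class 0 vs. 10 of class 1 under this modality value:
m = adaptive_margins([90, 10])            # smaller margin for frequent class 0
```

With counts [90, 10], the margins come out as roughly [0.1, 0.9]: the frequent class is penalized less and thus spans a broader cosine region, matching the intuition above.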

Partial derivatives. We further provide the partial derivatives of our loss function with respect to the weight vector \(\mathbf {W}_i\) and the feature vector \(\mathbf {x}\). Let \(p_i = s (\cos {\theta _{i}} - m_i) = s(\frac{\mathbf {W}_i^T}{||\mathbf {W}_i||_2} \cdot \frac{\mathbf {x}}{||\mathbf {x}||_2} - m_i)\) and \(\hat{p}_i = \frac{\exp {p_i}}{\sum _{j=1}^{|\Omega |}\exp {p_j}}\). The partial derivatives are obtained via (6) \(\begin{equation} \frac{\partial L_{MMDB}}{\partial \mathbf {x}} = \sum _{i=1}^{|\Omega |} \left(\sum _{j=1}^{|\Omega |} y_j \times \hat{p}_i - y_i\right) \times s \times \frac{||\mathbf {x}||_2^2 \mathbf {I} - \mathbf {x}\mathbf {x}^T}{||\mathbf {x}||_2^3} \frac{\mathbf {W}_i}{||\mathbf {W}_i||_2}, \end{equation}\) and (7) \(\begin{equation} \frac{\partial L_{MMDB}}{\partial \mathbf {W}_i} = \left(\sum _{j=1}^{|\Omega |} y_j \times \hat{p}_i - y_i\right) \times s \times \frac{||\mathbf {W}_i||_2^2 \mathbf {I} - \mathbf {W}_i \mathbf {W}_i^T}{||\mathbf {W}_i||_2^3} \frac{\mathbf {x}}{||\mathbf {x}||_2}. \end{equation}\)
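Such derivatives can be sanity-checked by comparing the analytic gradient against central finite differences. The sketch below does this for the gradient with respect to x under a one-hot label (the gradient with respect to W_i is analogous); the sizes and values are arbitrary test inputs:

```python
import numpy as np

def mmdb(x, W, y, m, s=4.0):
    """Return (loss, softmax probabilities) of the MMDB loss."""
    xn = x / np.linalg.norm(x)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    z = s * (Wn @ xn - m)
    p = np.exp(z - z.max()); p = p / p.sum()
    return -np.sum(y * np.log(p)), p

def grad_x(x, W, y, m, s=4.0):
    """Analytic dL/dx = s * J(x) @ Wn^T @ (p - y), where
    J(x) = (||x||^2 I - x x^T) / ||x||^3 is the Jacobian of x/||x||
    and (p - y) uses sum_j y_j = 1 (one-hot label)."""
    _, p = mmdb(x, W, y, m, s)
    n = np.linalg.norm(x)
    J = (n**2 * np.eye(len(x)) - np.outer(x, x)) / n**3
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return s * J @ Wn.T @ (p - y)

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W = rng.normal(size=(3, 5))
y = np.array([1.0, 0.0, 0.0])
m = np.array([0.2, 0.5, 0.8])

# Central finite differences along each coordinate of x.
h = 1e-6
num = np.array([(mmdb(x + h*e, W, y, m)[0] - mmdb(x - h*e, W, y, m)[0]) / (2*h)
                for e in np.eye(len(x))])
```

The numeric and analytic gradients agree to within finite-difference accuracy, confirming the form of the derivative.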

Lower bound for s. The scale factor s is critical for the final feature learning. A too-small s leads to insufficient convergence, as it limits the span of the feature space (indeed, we found in our experiments that the loss goes “nan” with a small s). In view of this, a lower bound for s should be prescribed. Without loss of generality, let \(P_{i}\) denote the expected minimum posterior probability of class i; the lower bound is then given by (8) \(\begin{equation} s \ge \frac{\ln ({1}/{P_{i}} - 1)}{m_{i} + {\sum _{j \ne i} m_j}/{(|\Omega | - 1)} - 2}. \end{equation}\) The detailed proof is provided in Appendix A.
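Plugged with concrete numbers, the bound of Eq. (8) can be evaluated directly. The sketch below uses illustrative values for \(P_i\) and the margins:

```python
import numpy as np

def s_lower_bound(P_i, margins, i):
    """Eq. (8): lower bound on the scale factor s given the expected
    minimum probability P_i for class i and the adapted margins."""
    m = np.asarray(margins, dtype=float)
    others = np.delete(m, i)
    denom = m[i] + others.sum() / (len(m) - 1) - 2.0
    return np.log(1.0 / P_i - 1.0) / denom

# e.g., with 3 classes, all margins 0.5, and P_i = 0.9:
bound = s_lower_bound(0.9, [0.5, 0.5, 0.5], 0)   # ln(9), roughly 2.2
```

Note that the denominator is non-positive for margins in [0, 1] and the numerator is negative once \(P_i > 0.5\), so the ratio yields a positive lower bound.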

4.3 Application over Specific Tasks

We apply our MMDB loss function to three tasks according to their distinctive characteristics.

Colored Digit Recognition. As discussed in Section 3, compared to the shape modality, the color serves as the key biased factor in this task. For instance, most 0 digits correspond to the blue color (Figure 1). In view of this, we compute the margin \(\bar{m}_i\) with the constraint of colors, namely \(n_i^k\) is the number of digit i under the color k.

Visual Question Answering. Some studies have been conducted on the language prior problem in VQA [1, 26, 52]. The language shortcut is deemed the bias factor, which manifests as the strong link between the question type (the first few words of a question) and the textual answer. Motivated by this observation, \(n_i^k\) in Equation (5) can be simply estimated by the number of occurrences of answer i under question type k. As a result, for a given question and its corresponding question type, the frequent answers span broader regions in the Cosine space (smaller margins) while sparse answers learn tighter feature spaces (larger margins).
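Estimating \(n_i^k\) here reduces to counting answers per question type over the training set. A minimal sketch with hypothetical training pairs:

```python
from collections import Counter, defaultdict

def answers_per_question_type(samples):
    """n_i^k: the number of occurrences of answer i under question type k,
    collected from (question_type, answer) training pairs."""
    counts = defaultdict(Counter)
    for qtype, answer in samples:
        counts[qtype][answer] += 1
    return counts

train_pairs = [("how many", "2"), ("how many", "2"), ("how many", "1"),
               ("what color", "blue")]
n = answers_per_question_type(train_pairs)
# n["how many"] holds {"2": 2, "1": 1}; the margins then follow Eq. (5).
```

These per-question-type counts are exactly the statistics fed into the adaptive margin of Eq. (5).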

Video Action Recognition. Though actions in videos are expected to be recognized with temporal information, we find that some of them are easily classified with the spatial modality alone, or more specifically, the objects in static frames. To this end, we compute \(n_i^k\) as the number of occurrences of action i under the detected object k, which is expected to alleviate the modality bias problem within this task.

4.4 Comparison with Different Loss Functions

We consider the binary-class scenario to intuitively illustrate the decision boundaries of different loss functions. As can be seen from Figure 3, the decision margin of the plain SoftMax loss can be negative, which the NSL loss [59] enlarges to zero. LMCL [59] defines a fixed margin for different classes, yet this is not suitable in our case for overcoming the modality bias problem. Regarding our MMDB, a sparser class (the green one) is mapped to a smaller feature space while a more frequent class (the orange one) engages a larger feature space.

Fig. 3.

Fig. 3. The visual comparison of decision boundary from different loss functions.

4.5 An Empirical Explanation

To understand why the proposed loss function works, recall that \(\mathbf {W}_i^T \mathbf {x} = ||\mathbf {W}_i|| \times ||\mathbf {x}|| \times \cos {\theta _i} = s \times \cos {\theta _i}\). We first unfold \(L_{MMDB}\) as (9) \(\begin{equation} \begin{aligned}L_{MMDB} &= \sum _{i=1}^{|\Omega |} - y_i \log \frac{\exp {s (\cos {\theta _{i}} - m_i)}}{\sum _{j=1}^{|\Omega |} \exp {s (\cos {\theta _j} - m_j)}} \\ &= \sum _{i=1}^{|\Omega |} - y_i \log \frac{\exp {(\mathbf {W}_i^T \mathbf {x} - s \times m_i)}}{\sum _{j=1}^{|\Omega |} \exp {(\mathbf {W}_j^T \mathbf {x} - s \times m_j)}} \\ &= \sum _{i=1}^{|\Omega |} - y_i \log \frac{{\exp {(\mathbf {W}_i^T \mathbf {x})}}/{\exp {(s \times m_i)}}}{\sum _{j=1}^{|\Omega |} {\exp {(\mathbf {W}_j^T \mathbf {x})}}/{\exp {(s \times m_j)}}}. \end{aligned} \end{equation}\) We then define (10) \(\begin{equation} T_i = \exp {(s \times m_i)}, \end{equation}\) where we name \(T_i\) the temperature for class i, which produces (11) \(\begin{equation} L_{MMDB} = \sum _{i=1}^{|\Omega |} - y_i \log \frac{{\exp {(\mathbf {W}_i^T \mathbf {x})}}/{T_i}}{\sum _{j=1}^{|\Omega |} {\exp {(\mathbf {W}_j^T \mathbf {x})}}/{T_j}}. \end{equation}\) Note that the logits are softened in the well-studied knowledge distillation (KD) domain [32, 42] through (12) \(\begin{equation} q_i = \frac{\exp ({\mathbf {W}_i^T \mathbf {x}}/{T})}{\sum _{j=1}^{|\Omega |} \exp ({\mathbf {W}_j^T \mathbf {x}}/{T})}, \end{equation}\) where T denotes the temperature during distilling and \(q_i\) represents the probability for label i. There are two main differences between our method and KD: (1) we move the temperature outside the scope of the exponential function, and (2) the temperature is distinct for different labels. A visual comparison of the produced probabilities is shown in Figure 4. We can observe that larger temperatures in KD lead to more balanced probabilities, which produce more “informative” labels as argued in Reference [32].
However, this may contradict the scenario studied in this article, as the most “informative” classes with respect to the current target are naturally the ones inducing the modality bias problem. In contrast, our method yields a sharp probability distribution (Figure 4), which helps reduce this problem to some extent.
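To make the equivalence between the margin form in Equation (9) and the temperature form in Equation (11) concrete, the following NumPy sketch computes the per-example loss both ways; the feature dimension, number of classes, scale, and margin values are illustrative, not the paper's settings:

```python
import numpy as np

def mmdb_loss(x, W, margins, y, s=16.0):
    """Margin form of the per-example MMDB loss (sketch).

    x: (d,) feature; W: (C, d) class weights; margins: (C,) per-class m_i;
    y: ground-truth label index; s: scale. W_i and x are L2-normalized so
    that W_i^T x = cos(theta_i).
    """
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    xn = x / np.linalg.norm(x)
    cos = Wn @ xn                        # cos(theta_i), shape (C,)
    logits = s * (cos - margins)         # s (cos(theta_i) - m_i)
    logits -= logits.max()               # numerical stability; p unchanged
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[y])

def mmdb_loss_temperature(x, W, margins, y, s=16.0):
    """Equivalent temperature form: exp(s cos) / T_i with T_i = exp(s m_i)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    xn = x / np.linalg.norm(x)
    cos = Wn @ xn
    T = np.exp(s * margins)              # per-class temperature
    q = np.exp(s * cos) / T
    p = q / q.sum()
    return -np.log(p[y])
```

Because softmax is invariant to subtracting the maximum logit, both functions return identical values; the temperature form merely re-groups \(\exp (s \times m_i)\) into the per-class divisor \(T_i\).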

Fig. 4.

Fig. 4. Probability comparison over 20 classes between KD and our method, with the probabilities sorted in ascending order. The left three panels respectively show KD with temperatures 0.1, 1.0 (i.e., SoftMax), and 10.0, and the rightmost panel illustrates the probabilities yielded by our method. The average probability over the 20 classes is outlined in red in all cases.


5 EXPERIMENTAL SETTING

We conducted extensive experiments on the aforementioned three multi-modal classification tasks to validate the effectiveness of the proposed method. In particular, the experiments are designed to answer the following research questions:

  • RQ1: Can the proposed multi-modal de-bias loss function method overcome the modality bias problem?

  • RQ2: How do fixed margins in Equation (4) and the scale in Equation (5) affect the final model performance?

  • RQ3: Why does the proposed method outperform the baselines?

To answer these questions, we first present the experimental setup for the three tasks.

5.1 Colored Digit Recognition

Dataset. Li et al. [43] first introduced the Colored MNIST dataset, where the colors of the 10 classes are made distinct from each other. We observed that its training and testing sets still satisfy the I.I.D. property, which is insufficient for evaluating the modality bias problem. We therefore perturbed the color assignment between these two sets. In our experiments, we mainly tested model performance on this newly curated OoD Colored MNIST dataset.
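A minimal sketch of such a perturbed color assignment is shown below. The palette, the `bias` ratio, and the `shift` mechanism are hypothetical choices for illustration; the actual dataset follows the construction of Li et al. [43] with a perturbed train/test color assignment:

```python
import numpy as np

# Ten distinct RGB colors (a hypothetical palette for illustration).
PALETTE = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1],
    [0, 1, 1], [1, .5, 0], [.5, 0, 1], [0, .5, .5], [.5, .5, .5],
], dtype=float)

def colorize(images, labels, bias=0.9, shift=0, rng=None):
    """Colorize grayscale digits: (N, 28, 28) in [0, 1] -> (N, 28, 28, 3).

    With probability `bias`, digit i receives color (i + shift) % 10;
    otherwise a random color. Using a different `shift` (i.e., a different
    color-label mapping) for the training and testing sets breaks the
    color shortcut and yields an OoD split.
    """
    rng = rng or np.random.default_rng(0)
    colors = (labels + shift) % 10
    flip = rng.random(len(images)) > bias        # instances with random color
    colors[flip] = rng.integers(0, 10, flip.sum())
    rgb = PALETTE[colors]                        # one color per image, (N, 3)
    return images[..., None] * rgb[:, None, None, :]

# train_set = colorize(x_train, y_train, bias=0.9, shift=0)
# test_set  = colorize(x_test,  y_test,  bias=0.9, shift=5)   # OoD colors
```

A model that latches onto the color shortcut during training then fails on the shifted test assignment, which is exactly the degradation the OoD protocol is meant to expose.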

Evaluation Metric. We adopted the standard accuracy metric for this experiment.

Tested Baselines. As the Colored MNIST dataset is relatively simple to address, we utilized three baselines: a two-layer MLP, LeNet [40], and the slightly heavier ResNet18 [29]. In addition, we compared with two approaches targeting the bias problem, i.e., Repair [43] and BiaSwap [38]. We ran each method five times and report the averaged accuracy.

5.2 Visual Question Answering

Datasets. We validated our proposed method on the two VQA-CP datasets, VQA-CP v2 and VQA-CP v1 [1], which are widely accepted benchmarks for evaluating a model's capability to overcome the language prior problem. VQA-CP v2 consists of \(\sim\)122 K images, \(\sim\)658 K questions, and \(\sim\)6.6 M answers; VQA-CP v1 consists of \(\sim\)122 K images, \(\sim\)370 K questions, and \(\sim\)3.7 M answers. Moreover, the answer distribution per question type differs significantly between the training and testing sets (the OoD property). For both datasets, the answers are divided into three categories: Y/N, Num, and Other.

Evaluation Metric. We adopted the standard metric in VQA for evaluation [4]. For each predicted answer a, the accuracy is computed as (13) \(\begin{equation} Acc = \text{min} \left(1, \frac{\#\text{humans that provide answer $a$}}{3}\right). \end{equation}\) Note that each question is answered by 10 annotators, and this metric takes the human disagreement into consideration [4, 24].
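Equation (13) can be evaluated directly. Note that the official VQA evaluation [4] additionally averages this score over the ten subsets of 9 annotators, which the simplified sketch below omits:

```python
def vqa_accuracy(pred, human_answers):
    """Soft VQA accuracy of Equation (13): an answer counts as fully
    correct if at least 3 of the 10 annotators gave it.

    `human_answers` is the list of the 10 annotator answers; partial
    credit of 1/3 is given per matching annotator below 3 matches.
    """
    return min(1.0, human_answers.count(pred) / 3.0)
```

For example, an answer matching exactly one of the ten annotators scores 1/3, reflecting human disagreement rather than treating the prediction as simply wrong.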

Tested Baselines. Building on previous attention-based VQA models, Counter [64] introduces a counting module to enable robust counting from object proposals. UpDn [2] first leverages pre-trained object detection frameworks to obtain salient object features for high-level reasoning, and then employs a simple attention network to focus on the objects most relevant to the given question. In addition, the strong baseline LXMERT [56] is built upon Transformer encoders to learn the connections between vision and language. It is pre-trained with diverse pre-training tasks on several large-scale datasets of image-sentence pairs and achieves significant performance improvements on downstream tasks including VQA.

5.3 Video Action Recognition

Dataset. For this multi-modal task, we performed experiments on the widely exploited Kinetics-400 and Kinetics-700 datasets [37], which contain 400 and 700 action classes, respectively, with 400–1,150 clips per action, each clip from a unique video. Each clip lasts around 10 s; in total, the two datasets comprise 306,245 and 647,907 videos, respectively.

Evaluation Metric. Following previous video action recognition methods [12], we adopted the clip-level accuracy, i.e., Acc@1 and Acc@5, as the evaluation metrics.
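For reference, clip-level top-k accuracy can be computed as in the following sketch (ties are broken by argsort order):

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Clip-level Acc@k: the fraction of clips whose ground-truth class
    is among the k highest-scoring predictions.

    logits: (N, C) class scores; labels: (N,) ground-truth indices.
    """
    topk = np.argsort(-logits, axis=1)[:, :k]          # k best classes per clip
    return float((topk == labels[:, None]).any(axis=1).mean())
```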

Tested Baselines. We employed our loss function over the I3D [11] network with two backbones, ResNet50 and ResNet101. As demonstrated in Reference [12], this simple method performs on par with or even outperforms many recent methods claimed to be better. Due to resource limitations, we reduced the number of frames per video and the batch size for both methods to 8 and 16, respectively, while keeping other settings the same as the released code.3 In addition, we also studied the effectiveness of our loss function on two recently developed strong baselines, TimeSformer [8] and ViViT [5].

It is worth noting that for all the baselines across the above three tasks, we simply replaced the cross-entropy loss with our MMDB and did NOT change any other settings, such as the embedding size, learning rate, optimizer, or batch size. Our method therefore introduces no extra cost to these baseline models at the testing stage, since the inference procedure of our method and the baselines is identical.


6 EXPERIMENTAL RESULTS

6.1 Overall Performance Comparison (RQ1)

6.1.1 Colored Digit Recognition.

We applied our method to three baselines: MLPs, LeNet, and ResNet18 and report the results in Figure 5. The observations are as follows:

Fig. 5.

Fig. 5. Performance comparison between our MMDB and three baselines on the colored digit recognition task. The error bars are also shown.

  • All three de-biasing methods achieve certain performance improvements over the plain baselines, and our MMDB outperforms Repair [43] and BiaSwap [38] by significant margins on all three baselines.

  • Our method significantly enhances the baselines by large margins: the accuracy improvements on MLPs, LeNet, and ResNet18 are around 14%, 12%, and 25%, respectively. This demonstrates that the modality bias problem is reduced by our MMDB.

  • Compared to the baselines, MMDB exhibits smaller error bars, showing that our method is more robust over multiple runs, which serves as another merit.

6.1.2 Visual Question Answering.

The experimental results on VQA-CP v2 and VQA-CP v1 are illustrated in Tables 4 and 5, respectively. The main observations from these two tables are listed below:

Table 4.
Method | Y/N | Num. | Other | All
NMN [3] | 38.94 | 11.92 | 25.72 | 27.47
MCB [21] | 41.01 | 11.96 | 40.57 | 36.33
Counter† [64] | 41.01 | 12.98 | 42.69 | 37.67
UpDn [2] | 42.27 | 11.93 | 46.05 | 39.74
UpDn† [2] | 49.78 | 14.07 | 43.42 | 40.79
LXMERT† [56] | 46.70 | 27.14 | 61.20 | 51.78
GVQA [1] | 57.99 | 13.68 | 22.14 | 31.30
AdvReg [52] | 65.49 | 15.48 | 35.48 | 41.17
Rubi [9] | 68.65 | 20.28 | 43.18 | 47.11
LMH [16] | – | – | – | 52.05
LMH† [16] | 70.29 | 44.10 | 44.86 | 52.15
SCR [62] | 72.36 | 10.93 | 48.02 | 49.45
VGQE [36] | 66.35 | 27.08 | 46.77 | 50.11
CSS [13] | 43.96 | 12.78 | 47.48 | 41.16
Decomp-LR [34] | 70.99 | 18.72 | 45.57 | 48.87
Counter+Ours | 61.00 | 53.22 | 43.17 | 49.90
UpDn+Ours | 72.47 | 53.81 | 45.58 | 54.67
LXMERT+Ours | 91.37 | 65.55 | 62.61 | 71.44
  • Regarding the method groups: the top group denotes plain approaches, the middle group represents methods directly applied on the UpDn baseline, and the approaches in the last group use our loss function. “\(-\)” and “\(\dagger\)” denote unavailable numbers and our implementation, respectively. The best performance in each split is highlighted in bold.

Table 4. Accuracy Comparisons with Respect to Different Answer Categories over the VQA-CP v2 Dataset


Table 5.
Method | Y/N | Num. | Other | All
NMN [3] | 38.85 | 11.23 | 27.88 | 29.64
MCB [21] | 37.96 | 11.80 | 39.90 | 34.39
Counter† [64] | 40.93 | 12.87 | 42.72 | 37.08
UpDn† [2] | 43.76 | 12.49 | 42.57 | 38.02
GVQA [1] | 64.72 | 11.87 | 24.86 | 39.23
AdvReg [52] | 74.16 | 12.44 | 25.32 | 43.43
LMH† [16] | 76.61 | 29.05 | 43.38 | 54.76
LXMERT [56] | 54.08 | 25.05 | 62.72 | 52.82
Counter+Ours | 72.01 | 49.28 | 42.60 | 55.92
UpDn+Ours | 91.17 | 41.34 | 39.38 | 61.20
LXMERT+Ours | 92.67 | 61.37 | 64.04 | 75.47
  • “\(\dagger\)” denotes our implementation. The best performance in each split is highlighted in bold.

Table 5. Accuracy Comparisons with Respect to Different Answer Categories over the VQA-CP v1 Dataset


  • In general, the methods in the middle group of Table 4 (specifically designed to overcome the language bias problem) often outperform the previous strong baselines (e.g., FiLM [50]). This result is intuitive, as conventional approaches may introduce biases into model learning.

  • Our method obtains the best results on the two benchmark datasets. With the help of the recent strong baseline LXMERT, it surprisingly achieves a new state-of-the-art on these two benchmarks.

  • For all three baselines, i.e., Counter, UpDn, and LXMERT, equipping our MMDB loss function brings a drastic performance improvement (15% on average). For example, on the VQA-CP v2 dataset, LXMERT+Ours achieves an absolute performance gain of 19.66% on the All answer category, and on the VQA-CP v1 dataset, UpDn+Ours outperforms the baseline UpDn by 23.18% on the All category.

  • Compared with other methods whose backbone model is also UpDn on the VQA-CP v2 dataset, our method (UpDn+Ours) still surpasses them by a large margin, especially for the three newly developed approaches VGQE, CSS, and Decomp-LR.

6.1.3 Video Action Recognition.

We tested the effectiveness of our method on the I3D network [11] with two backbones, ResNet50 and ResNet101, as well as on TimeSformer [8] and ViViT [5]. From the results in Table 6, our method boosts the backbones with significant improvements, especially on Acc@1: the gains are 2.37% and 2.02% for ResNet50 and ResNet101 on the Kinetics-400 dataset, respectively.

Table 6.
Dataset | Method | Baseline Acc@1 | Baseline Acc@5 | MMDB Acc@1 | MMDB Acc@5
Kinetics-400 | I3D-ResNet50 [11] | 52.20 | 80.47 | 54.57 | 81.10
Kinetics-400 | I3D-ResNet101 [11] | 53.64 | 81.54 | 55.66 | 81.82
Kinetics-400 | TimeSformer [8] | 72.01 | 90.75 | 73.65 | 91.36
Kinetics-400 | ViViT [5] | 75.24 | 93.25 | 75.39 | 93.50
Kinetics-700 | I3D-ResNet50 [11] | 37.21 | 66.05 | 38.95 | 67.47
Kinetics-700 | I3D-ResNet101 [11] | 30.52 | 59.20 | 34.60 | 62.25
Kinetics-700 | TimeSformer [8] | 61.33 | 83.90 | 65.65 | 86.28
Kinetics-700 | ViViT [5] | 72.00 | 91.41 | 74.83 | 92.01

Table 6. Performance Comparison on the OoD Version of the Kinetics-400 and Kinetics-700 Datasets

6.2 Ablation Study (RQ2)

For a deeper understanding of our MMDB, we further provided detailed ablation studies over these three tasks.

Fixed margin results. As mentioned in Section 4, models with a fixed margin perform unsatisfactorily compared with our adaptive one. The results are shown in Table 7, from which we make the following observations:

Table 7.
Method | Margin | MLPs | LeNet | ResNet | Counter | UpDn | LXMERT | I3D-ResNet50 | I3D-ResNet101
(Columns 3–5: Colored MNIST; columns 6–8: VQA-CP v2; columns 9–10: Kinetics-400.)
Baseline | – | 55.55 | 57.39 | 40.19 | 37.67 | 40.79 | 51.78 | 52.20 | 53.64
NSL | – | 56.48 | 55.93 | 48.70 | 49.08 | 40.97 | 58.06 | 53.95 | 55.45
Fixed Margin | 0.1 | 58.55 | 60.06 | 54.84 | 30.12 | 39.42 | 56.56 | 52.15 | 47.32
Fixed Margin | 0.3 | 19.92 | 58.18 | 52.73 | 27.70 | 37.75 | 55.05 | 46.54 | 46.78
Fixed Margin | 0.5 | 9.80 | 9.80 | 51.27 | 13.07 | 36.60 | 53.81 | 43.36 | 44.18
Fixed Margin | 0.7 | 9.80 | 9.80 | 53.80 | 11.41 | 35.83 | 53.19 | 42.71 | 43.31
Fixed Margin | 0.9 | 9.80 | 9.80 | 56.05 | 3.12 | 35.68 | 52.62 | 43.19 | 43.33
Adapted Margin | adaptive | 68.05 | 69.59 | 64.00 | 49.90 | 54.67 | 71.42 | 54.57 | 55.66

Table 7. Effectiveness Validation of the Proposed Loss Function on Three Multi-Modal Learning Tasks

  • When using L2 normalization on both the weight vector and the feature vector, i.e., the de facto NSL function, the results are inconsistent across methods. For example, Counter in VQA with NSL surpasses the baseline by 11.41%, while LeNet in colored digit recognition with NSL even slightly deteriorates performance.

  • We also tested fixed margins to induce feature discrimination, tuning the margin from 0.1 to 0.9 with a step size of 0.2. However, the results are not favorable (the improvement is limited), which confirms that a fixed margin is unsuitable for overcoming the modality bias problem. By contrast, replacing the fixed margin with our adaptive one yields significant performance improvements, further proving the superiority of MMDB.

Scale influence. We tuned the scale in Equation (5) over powers of two from 1 to 128 for all eight methods. From the results in Figure 6, we find that a too-small scale is insufficient for learning the feature space, resulting in unsatisfactory results; when the scale becomes large, the performance saturates or even drops to some degree.

Fig. 6.

Fig. 6. Performance change with respect to the scale s in Equation (5).

6.3 Case Study (RQ3)

In this subsection, we use case studies to analyze why the proposed method works from two aspects: feature embedding separation and better attention maps.

Feature manifold embedding. Since the key motivation of the proposed method is that frequent and sparse labels under a given biased modality should span broader and tighter regions of the final feature space, respectively, we visualized the learned features in Figures 7 and 8. In particular, we used two tasks, colored digit recognition and VQA, for illustration. For the former, the color modality is restricted to red; for the latter, the language modality is expressed through the how many question type. For both instances, the top two sub-figures show the feature embeddings of the baseline in the Euclidean and Cosine spaces, while the bottom two show the results from our method. We observe that (1) in the Euclidean space, the baseline features are often irregular and even entangled with each other, whereas our method separates the labels with clear boundaries; and (2) in the Cosine space, frequent labels (e.g., digit 2 for colored digit recognition and answer 2 for VQA) span much broader than sparse ones under our method, a property not exhibited by the baseline.

Fig. 7.

Fig. 7. Digit feature manifold embedding of red color for the colored digit recognition task.

Fig. 8.

Fig. 8. Answer feature manifold embedding of how many question type for the VQA task.

Attention maps in VQA. Finally, we showcase two successful cases of our method in Figure 9. In the first case, as the answer 2 accounts for a large proportion under the question type how many in the training set, the baseline model yields the answer 2 to this question. In contrast, our MMDB corrects this mistake, producing a more reasonable attention map (focusing solely on one car). In the second case, the baseline model wrongly predicts the answer tennis, mainly because tennis is more frequent under the question type what sport in the training set. Moreover, too much attention is paid to less relevant regions, another reason for the incorrect answer. By contrast, our MMDB guides the model to focus more on the target object, the glove, resulting in the correct answer baseball.

Fig. 9.

Fig. 9. Visualization of the baseline method (UpDn) with and without our proposed loss function MMDB. The answer distributions under each question's corresponding question type are illustrated in the leftmost column. The ground truth is displayed in the second column, followed by the attention maps produced by the baseline and ours in the last two columns.


7 CONCLUSION AND DISCUSSION

In this work, we have systematically studied the modality bias problem in the context of multi-modal classification. To begin with, we constructed several Out-of-Distribution datasets as test beds for its evaluation. Thereafter, a multi-modal de-bias loss was designed to discriminate the labels by properly characterizing the feature space. Concretely, for a given biased modality, the frequent and sparse labels with respect to the corresponding modality factors are learned to span broader and tighter regions of the Cosine space, respectively. Extensive experiments over four benchmarks covering three multi-modal tasks validate the effectiveness of the proposed loss function on 10 baselines.

Enlightened by recent studies on tackling the long-tail problem [41, 65], which rely on training set statistics, our method also leverages such a prior to approach the modality bias problem. Despite the significant performance improvements obtained, this reliance requires an explicit conjecture of the bias beforehand, limiting the generalization of MMDB across datasets. One potential solution is to first discover the biased modalities without presumptions or labels [45]; our method can then be seamlessly combined with such techniques, removing the need for this prior. It is worth mentioning that this work does not present an advanced model pursuing SOTA results in multi-modal learning. We instead expect that, with this de-bias loss function, future research can focus more on enhancing multi-modal understanding while being less affected by the modality bias.

With the recognition of this problem in multi-modal classification, exploring its counterpart in generation tasks opens an interesting door to increasing output diversity. In addition, tackling the modality bias problem at the other two stages, i.e., multi-modal representation and fusion, demands further research attention as well.

APPENDIX

A PROOF FOR THE SCALING FACTOR S

Without loss of generality, let \(P_{i}\) denote the expected minimum probability of class i. In the ideal case, the angle \(\theta _i\) between label i and the weight vector \(\mathbf {W}_i\) should be \(0^{\circ }\), while that for the other labels should be \(180^{\circ }\). We then have the following: \(\begin{equation*} \begin{aligned}\frac{\exp {s (1 - m_{i})}}{\sum _{j \ne i} \exp {s (-1 - m_{j})} + \exp {s (1 - m_{i})}} &\ge P_{i}, \\ \end{aligned} \end{equation*}\) \(\begin{equation*} \begin{aligned}1 + \frac{\sum _{j \ne i} \exp {s (-1 - m_{j})}}{\exp {s (1 - m_{i})}} &\le \frac{1}{P_{i}}, \\ 1 + \frac{\sum _{j \ne i} \exp {s (m_{j})}}{\exp {s (2 - m_{i})}} &\le \frac{1}{P_{i}}. \\ \end{aligned} \end{equation*}\) On the basis of Jensen’s inequality, \(\begin{equation*} \begin{aligned}\sum _{j \ne i} \exp {s (m_{j})} &= \frac{|\Omega | - 1}{|\Omega | - 1} \sum _{j \ne i} \exp {s (m_{j})} \\ &\ge (|\Omega | - 1) \exp {s\left(\frac{\sum _{j \ne i} m_j}{|\Omega | - 1}\right)}. \end{aligned} \end{equation*}\) Accordingly, we obtain \(\begin{equation*} \begin{aligned}1 + \frac{(|\Omega | - 1) \exp {s\left(\frac{\sum _{j \ne i} m_j}{|\Omega | - 1}\right)}}{\exp {s (2 - m_{i})}} &\le \frac{1}{P_{i}}, \\ \exp {s\left(\frac{\sum _{j \ne i} m_j}{|\Omega | - 1} + m_{i} - 2\right)} &\le \frac{1}{P_{i}} - 1.\\ \end{aligned} \end{equation*}\) Finally, we have (14) \(\begin{equation} s \ge \frac{\ln \left(\frac{1}{P_{i}} - 1\right)}{m_{i} + {\sum _{j \ne i} m_j}/{(|\Omega | - 1)} - 2}. \end{equation}\)
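Equation (14) can be checked numerically; the probability and margin values in the sketch below are illustrative, not values used in the paper:

```python
import math

def min_scale(P_i, m_i, other_margins):
    """Evaluate the lower bound on the scale s from Equation (14).

    P_i: expected minimum probability of class i; m_i: margin of class i;
    other_margins: the margins m_j of the remaining |Omega| - 1 classes.
    Since margins lie in (0, 1), the denominator is negative, and for
    P_i > 0.5 the numerator is negative as well, giving a positive bound.
    """
    m_bar = sum(other_margins) / len(other_margins)
    return math.log(1.0 / P_i - 1.0) / (m_i + m_bar - 2.0)

# e.g., requiring P_i = 0.99 with m_i = 0.3 and nine other margins of 0.5
# yields s >= ln(1/0.99 - 1) / (0.3 + 0.5 - 2), roughly 3.83.
```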

Footnotes

  1. https://github.com/guoyang9/AdaVQA.
  2. We utilize the individual loss as an illustration instead of the loss over all instances due to space limitations.
  3. https://github.com/IBM/action-recognition-pytorch.

REFERENCES

  [1] Agrawal Aishwarya, Batra Dhruv, Parikh Devi, and Kembhavi Aniruddha. 2018. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 4971–4980.
  [2] Anderson Peter, He Xiaodong, Buehler Chris, Teney Damien, Johnson Mark, Gould Stephen, and Zhang Lei. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 6077–6086.
  [3] Andreas Jacob, Rohrbach Marcus, Darrell Trevor, and Klein Dan. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 39–48.
  [4] Antol Stanislaw, Agrawal Aishwarya, Lu Jiasen, Mitchell Margaret, Batra Dhruv, Zitnick C. Lawrence, and Parikh Devi. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2425–2433.
  [5] Arnab Anurag, Dehghani Mostafa, Heigold Georg, Sun Chen, Lucic Mario, and Schmid Cordelia. 2021. ViViT: A video vision transformer. In Proceedings of the International Conference on Computer Vision. IEEE, 6816–6826.
  [6] Bai Haoyue, Sun Rui, Hong Lanqing, Zhou Fengwei, Ye Nanyang, Ye Han-Jia, Chan S.-H. Gary, and Li Zhenguo. 2021. DecAug: Out-of-distribution generalization via decomposed feature representation and semantic augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, 6705–6713.
  [7] Bai Haoyue, Zhou Fengwei, Hong Lanqing, Ye Nanyang, Chan S.-H. Gary, and Li Zhenguo. 2021. NAS-OoD: Neural architecture search for out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 8320–8329.
  [8] Bertasius Gedas, Wang Heng, and Torresani Lorenzo. 2021. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning. PMLR, 813–824.
  [9] Cadène Rémi, Dancette Corentin, Ben-younes Hedi, Cord Matthieu, and Parikh Devi. 2019. RUBi: Reducing unimodal biases for visual question answering. In Advances in Neural Information Processing Systems. MIT, 839–850.
  [10] Caliskan Aylin, Bryson Joanna J., and Narayanan Arvind. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356, 6334 (2017), 183–186.
  [11] Carreira João and Zisserman Andrew. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 4724–4733.
  [12] Chen Chun-Fu (Richard), Panda Rameswar, Ramakrishnan Kandan, Feris Rogério, Cohn John, Oliva Aude, and Fan Quanfu. 2021. Deep analysis of CNN-based spatio-temporal representations for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 6165–6175.
  [13] Chen Long, Yan Xin, Xiao Jun, Zhang Hanwang, Pu Shiliang, and Zhuang Yueting. 2020. Counterfactual samples synthesizing for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 10797–10806.
  [14] Chen Yunliang and Joo Jungseock. 2021. Understanding and mitigating annotation bias in facial expression recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 14980–14991.
  [15] Cheng Lu, Mosallanezhad Ahmadreza, Silva Yasin N., Hall Deborah L., and Liu Huan. 2021. Mitigating bias in session-based cyberbullying detection: A non-compromising approach. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing. ACL, 2158–2168.
  [16] Clark Christopher, Yatskar Mark, and Zettlemoyer Luke. 2019. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing. ACL, 4067–4080.
  [17] Deng Jiankang, Guo Jia, Xue Niannan, and Zafeiriou Stefanos. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 4690–4699.
  [18] Dixon Lucas, Li John, Sorensen Jeffrey, Thain Nithum, and Vasserman Lucy. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. ACM, 67–73.
  [19] Dou Qi, Castro Daniel Coelho de, Kamnitsas Konstantinos, and Glocker Ben. 2019. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems. MIT, 6447–6458.
  [20] Engstrom Logan, Ilyas Andrew, Santurkar Shibani, Tsipras Dimitris, Steinhardt Jacob, and Madry Aleksander. 2020. Identifying statistical bias in dataset replication. In Proceedings of the International Conference on Machine Learning. PMLR, 2922–2932.
  [21] Fukui Akira, Park Dong Huk, Yang Daylen, Rohrbach Anna, Darrell Trevor, and Rohrbach Marcus. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. ACL, 457–468.
  [22] Gat Itai, Schwartz Idan, Schwing Alexander G., and Hazan Tamir. 2020. Removing bias in multi-modal classifiers: Regularization by maximizing functional entropies. In Advances in Neural Information Processing Systems. MIT.
  [23] Gong Sixue, Liu Xiaoming, and Jain Anil K. 2021. Mitigating face recognition bias via group adaptive classifier. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3414–3424.
  [24] Goyal Yash, Khot Tejas, Summers-Stay Douglas, Batra Dhruv, and Parikh Devi. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 6325–6334.
  [25] Guo Yangyang, Cheng Zhiyong, Nie Liqiang, Liu Yibing, Wang Yinglong, and Kankanhalli Mohan S. 2019. Quantifying and alleviating the language prior problem in visual question answering. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 75–84.
  [26] Guo Yangyang, Nie Liqiang, Cheng Zhiyong, Ji Feng, Zhang Ji, and Bimbo Alberto Del. 2021. AdaVQA: Overcoming language priors with adapted margin cosine loss. In Proceedings of the International Joint Conference on Artificial Intelligence. 708–714.
  [27] Guo Yangyang, Nie Liqiang, Cheng Zhiyong, Tian Qi, and Zhang Min. 2021. Loss re-scaling VQA: Revisiting the language prior problem from a class-imbalance view. IEEE Trans. Image Process. 31, 227–238.
  [28] Hama Kenta, Matsubara Takashi, Uehara Kuniaki, and Cai Jianfei. 2021. Exploring uncertainty measures for image-caption embedding-and-retrieval task. ACM Trans. Multim. Comput. Commun. Appl. 17, 2 (2021), 46:1–46:19.
  [29] He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 770–778.
  [30] Hendrycks Dan, Basart Steven, Mu Norman, Kadavath Saurav, Wang Frank, Dorundo Evan, Desai Rahul, Zhu Tyler, Parajuli Samyak, Guo Mike, et al. 2021. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 8340–8349.
  [31] Hendrycks Dan, Lee Kimin, and Mazeika Mantas. 2019. Using pre-training can improve model robustness and uncertainty. In Proceedings of the International Conference on Machine Learning. PMLR, 2712–2721.
  [32] Hinton Geoffrey E., Vinyals Oriol, and Dean Jeffrey. 2015. Distilling the knowledge in a neural network. CoRR abs/1503.02531.
  [33] Huang Xun and Belongie Serge J. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 1510–1519.
  [34] Jing Chenchen, Wu Yuwei, Zhang Xiaoxun, Jia Yunde, and Wu Qi. 2020. Overcoming language priors in VQA via decomposed linguistic representations. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, 11181–11188.
  [35] Johnson Justin, Hariharan Bharath, Maaten Laurens van der, Fei-Fei Li, Zitnick C. Lawrence, and Girshick Ross B. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1988–1997.
  36. [36] Johnson Justin, Hariharan Bharath, Maaten Laurens van der, Fei-Fei Li, Zitnick C. Lawrence, and Girshick Ross B.. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 19881997.Google ScholarGoogle ScholarCross RefCross Ref
  37. [37] Kay Will, Carreira João, Simonyan Karen, Zhang Brian, Hillier Chloe, Vijayanarasimhan Sudheendra, Viola Fabio, Green Tim, Back Trevor, Natsev Paul, Suleyman Mustafa, and Zisserman Andrew. 2017. The kinetics human action video dataset. CoRR abs/1705.06950.Google ScholarGoogle Scholar
  38. [38] Kim Eungyeup, Lee Jihyeon, and Choo Jaegul. 2021. BiaSwap: Removing dataset bias with bias-tailored swapping augmentation. In Proceedings of the International Conference on Computer Vision. IEEE, 1497214981.Google ScholarGoogle ScholarCross RefCross Ref
  39. [39] Kortylewski Adam, Egger Bernhard, Schneider Andreas, Gerig Thomas, Morel-Forster Andreas, and Vetter Thomas. 2018. Empirically analyzing the effect of dataset biases on deep face recognition systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 20932102.Google ScholarGoogle ScholarCross RefCross Ref
  40. [40] LeCun Yann, Bottou Léon, Bengio Yoshua, and Haffner Patrick. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 22782324.Google ScholarGoogle ScholarCross RefCross Ref
  41. [41] Lee Hyuck, Shin Seungjae, and Kim Heeyoung. 2021. ABC: Auxiliary balanced classifier for class-imbalanced semi-supervised learning. In Advances in Neural Information Processing. 70827094.Google ScholarGoogle Scholar
  42. [42] Li Yanchun, Cao Jianglian, Li Zhetao, Oh Sangyoon, and Komuro Nobuyoshi. 2021. Lightweight single image super-resolution with dense connection distillation network. ACM Trans. Multim. Comput. Commun. Appl. 17, 1s (2021), 117.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. [43] Li Yi and Vasconcelos Nuno. 2019. REPAIR: Removing representation bias by dataset resampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 95729581.Google ScholarGoogle ScholarCross RefCross Ref
  44. [44] Li Zhiheng and Xu Chenliang. 2021. Discover the unknown biased attribute of an image classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE.Google ScholarGoogle ScholarCross RefCross Ref
  45. [45] Li Zhiheng and Xu Chenliang. 2021. Discover the unknown biased attribute of an image classifier. In Proceedings of the International Conference on Computer Vision. IEEE, 1495014959.Google ScholarGoogle ScholarCross RefCross Ref
  46. [46] Lu Boyu, Chen Jun-Cheng, Castillo Carlos Domingo, and Chellappa Rama. 2019. An experimental evaluation of covariates effects on unconstrained face verification. IEEE Trans. Biom. Behav. Identity Sci. 1, 1 (2019), 4255.Google ScholarGoogle ScholarCross RefCross Ref
  47. [47] Noori Farzan Majeed, Riegler Michael, Uddin Md. Zia, and Tørresen Jim. 2020. Human activity recognition from multiple sensors data using multi-fusion representations and CNNs. ACM Trans. Multim. Comput. Commun. Appl. 16, 2 (2020), 45:1–45:19.Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. [48] Park Ji Ho, Shin Jamin, and Fung Pascale. 2018. Reducing gender bias in abusive language detection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. ACL, 27992804.Google ScholarGoogle ScholarCross RefCross Ref
  49. [49] Perez Ethan, Huang Saffron, Song H. Francis, Cai Trevor, Ring Roman, Aslanides John, Glaese Amelia, McAleese Nat, and Irving Geoffrey. 2022. Red teaming language models with language models. CoRR abs/2202.03286.Google ScholarGoogle Scholar
  50. [50] Perez Ethan, Strub Florian, Vries Harm de, Dumoulin Vincent, and Courville Aaron C.. 2018. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. AAAI, 39423951.Google ScholarGoogle ScholarCross RefCross Ref
  51. [51] Pernici Federico, Bruni Matteo, Baecchi Claudio, and Bimbo Alberto Del. 2019. Maximally compact and separated features with regular polytope networks. In Proceedings of the Computer Vision and Pattern Recognition Workshops. IEEE, 4653.Google ScholarGoogle Scholar
  52. [52] Ramakrishnan Sainandan, Agrawal Aishwarya, and Lee Stefan. 2018. Overcoming language priors in visual question answering with adversarial regularization. In Advances in Neural Information Processing Systems. MIT, 15481558.Google ScholarGoogle Scholar
  53. [53] Recht Benjamin, Roelofs Rebecca, Schmidt Ludwig, and Shankar Vaishaal. 2019. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the International Conference on Machine Learning. PMLR, 53895400.Google ScholarGoogle Scholar
  54. [54] Selvaraju Ramprasaath Ramasamy, Lee Stefan, Shen Yilin, Jin Hongxia, Ghosh Shalini, Heck Larry P., Batra Dhruv, and Parikh Devi. 2019. Taking a HINT: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 25912600.Google ScholarGoogle ScholarCross RefCross Ref
  55. [55] Shah Deven, Schwartz H. Andrew, and Hovy Dirk. 2020. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. ACL, 52485264.Google ScholarGoogle ScholarCross RefCross Ref
  56. [56] Tan Hao and Bansal Mohit. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing. ACL, 50995110.Google ScholarGoogle ScholarCross RefCross Ref
  57. [57] Teney Damien, Abbasnejad Ehsan, and Hengel Anton van den. 2021. Unshuffling data for improved generalization in visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 14171427.Google ScholarGoogle ScholarCross RefCross Ref
  58. [58] Wang Haohan, He Zexue, Lipton Zachary C., and Xing Eric P.. 2019. Learning robust representations by projecting superficial statistics out. In Proceedings of the International Conference on Learning Representations.Google ScholarGoogle Scholar
  59. [59] Wang Hao, Wang Yitong, Zhou Zheng, Ji Xing, Gong Dihong, Zhou Jingchao, Li Zhifeng, and Liu Wei. 2018. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 52655274.Google ScholarGoogle ScholarCross RefCross Ref
  60. [60] Wang Mei, Deng Weihong, Hu Jiani, Tao Xunqiang, and Huang Yaohai. 2019. Racial faces in the wild: Reducing racial bias by information maximization adaptation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 692702.Google ScholarGoogle ScholarCross RefCross Ref
  61. [61] Wang Tianlu, Zhao Jieyu, Yatskar Mark, Chang Kai-Wei, and Ordonez Vicente. 2019. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 53095318.Google ScholarGoogle ScholarCross RefCross Ref
  62. [62] Wu Jialin and Mooney Raymond J.. 2019. Self-critical reasoning for robust visual question answering. In Advances in Neural Information Processing Systems. MIT, 86018611.Google ScholarGoogle Scholar
  63. [63] Zhang Guanhua, Bai Bing, Zhang Junqi, Bai Kun, Zhu Conghui, and Zhao Tiejun. 2020. Demographics should not be the reason of toxicity: Mitigating discrimination in text classifications with instance weighting. In Annual Meeting of the Association for Computational Linguistics. ACL, 41344145.Google ScholarGoogle Scholar
  64. [64] Zhang Yan, Hare Jonathon S., and Prügel-Bennett Adam. 2018. Learning to count objects in natural images for visual question answering. In Proceedings of the International Conference on Learning Representations.Google ScholarGoogle Scholar
  65. [65] Zhang Yongshun, Wei Xiu-Shen, Zhou Boyan, and Wu Jianxin. 2021. Bag of tricks for long-tailed visual recognition with deep convolutional neural networks. In Proceedings of the 35th AAAI Conference on Artificial Intelligence. AAAI, 34473455.Google ScholarGoogle ScholarCross RefCross Ref
  66. [66] Zhuang Yueting, Xu Dejing, Yan Xin, Cheng Wenzhuo, Zhao Zhou, Pu Shiliang, and Xiao Jun. 2020. Multichannel attention refinement for video question answering. ACM Trans. Multim. Comput. Commun. Appl. 16, 1s (2020), 123.Google ScholarGoogle ScholarDigital LibraryDigital Library

Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3 (May 2023), 514 pages. ISSN: 1551-6857. EISSN: 1551-6865. DOI: 10.1145/3582886. Editor: Abdulmotaleb El Saddik.
Publisher: Association for Computing Machinery, New York, NY, United States.

Publication History
• Received: 9 December 2021
• Revised: 9 June 2022
• Accepted: 26 September 2022
• Online AM: 29 September 2022
• Published: 25 February 2023
