Towards Balanced Active Learning for Multimodal Classification

Training multimodal networks requires a vast amount of data due to their larger parameter space compared to unimodal networks. Active learning is a widely used technique for reducing data annotation costs by selecting only those samples that could contribute to improving model performance. However, current active learning strategies are mostly designed for unimodal tasks, and when applied to multimodal data, they often result in biased sample selection from the dominant modality. This unfairness hinders balanced multimodal learning, which is crucial for achieving optimal performance. To address this issue, we propose three guidelines for designing a more balanced multimodal active learning strategy. Following these guidelines, we propose a novel approach that achieves fairer data selection by modulating the gradient embedding with the dominance degree among modalities. Our studies demonstrate that the proposed method achieves more balanced multimodal learning by avoiding greedy sample selection from the dominant modality. Our approach outperforms existing active learning strategies on a variety of multimodal classification tasks. Overall, our work highlights the importance of balancing sample selection in multimodal active learning and provides a practical solution for achieving more balanced active learning for multimodal classification.


INTRODUCTION
Multimodal classification, as one of the classical multimodal learning tasks, aims to exploit complementary information inherent in multimodal data to achieve better classification performance. To this end, deep learning strategies have been implemented to train large-scale multimodal deep neural networks [4,15]. However, such networks require an enormous amount of data to learn from, given their huge number of parameters. To reduce data cost, active learning (AL) is used to select a subset of more informative and distinctive unlabeled data samples for label assignment by oracles. Consequently, large networks can maintain performance while utilizing a smaller labeling budget. Most existing active learning algorithms are designed for unimodal tasks such as image classification [5,31], object detection [21,45] and language modeling [24,44]. The objective is to select samples that carry high uncertainty, contribute novel knowledge to model training, or have distinctive features. However, there has been significantly less research on the design of effective active learning strategies for multimodal learning [29].
In this paper, we initially examine the performance of existing active learning strategies in selecting multimodal data. Our experiments reveal that these strategies tend to focus more on the dominant modality rather than fairly considering all modalities. For instance, in an image-text classification task, if the text contributes more to model optimization, active learning strategies may exhibit a bias towards the more distinguishable text modality by selecting valuable text samples and disregarding the informativeness of image samples. As a result, the selected multimodal dataset could become unbalanced, with insufficient information from the image modality, potentially leading to a degraded image model backbone.
Recent works [18,28,40,43] point out that balancing the training and optimization of all modalities is a key factor for successful multimodal learning. Similarly, it is crucial to design active learning strategies that can select multimodal data with fairness among all modalities to assist balanced multimodal learning.
Based on our findings, we develop a Balanced Multimodal Active Learning (BMMAL) algorithm that selects multimodal data by fairly considering each modality present in the data. In our approach, we choose the gradient embedding of model parameters, as it reflects the impact on model training and captures the diversity of data samples. However, we examine how the previous gradient embedding method [3] fails to select balanced multimodal data. To ensure fairness, we individually assess the contribution of each modality feature by examining the Shapley value, which attributes each modality's contribution to the final multimodal prediction. We then apply modulation on the gradient embedding to penalize samples with dominant modalities. Lastly, a clustering seed initialization algorithm is employed to select diverse multimodal data with a significant influence on model training.
In summary, our main contributions are as follows:
• We empirically show that most existing active learning strategies fail to select a balanced multimodal dataset. We analyze how to improve the current gradient-embedding-based active learning strategy to rectify this.
• We propose a method to modulate the gradient embedding on the sample level to select more balanced multimodal candidates.
• We conduct experiments on three multimodal datasets to show that our proposed method treats multimodal data more equally and achieves better performance.

RELATED WORKS

Active Learning
Uncertainty-aware strategies attempt to utilize the data uncertainty or the model uncertainty as a criterion to locate unlabeled data points that the current model has less confidence about. One strategy is to utilize the posterior classification probability distribution by measuring its entropy [32,42], or the margin between the most confident class and the second most confident class [30]. In addition, uncertainty can be evaluated as the variance of predictions generated by an ensemble of models [5] or by multiple inferences with Monte-Carlo dropout as an alternative Bayesian approximation for static networks [12]. Moreover, ALFA-Mix [27] evaluates unlabeled samples by mixing their features with labeled samples and observing whether there is inconsistency among predictions from mixed features. DFAL [10] incorporates adversarial attack techniques [25] to select unlabeled data samples located close to the classification boundaries. Diversity-aware strategies tend to select unlabeled data points whose features are as diverse as possible to minimize data redundancy. [26] utilizes the K-medoid algorithm [19] to select representative data centroids that minimize the total distance from other data samples to the nearest centroids. CoreSet [31] greedily selects unlabeled data samples that have maximum distances from their nearest neighbors. [6] adopts the determinantal point process (DPP) to evaluate the diversity by calculating the determinant of the similarity matrix. Diversity-aware strategies can also be considered in the context of distribution matching, which aims to reduce the gap between the distributions of labeled and unlabeled samples in latent space or feature space. VAAL [35] trains a variational auto-encoder to construct the latent distribution of labeled samples and an adversarial network to distinguish labeled samples and unlabeled samples in the latent space. Moreover, the maximum mean discrepancy (MMD) [39], the H-divergence [36] and the Wasserstein distance [34] are used to measure the distribution gap.
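As a concrete illustration, the entropy and margin criteria described above can be sketched in a few lines of NumPy; the function names are ours, not drawn from any of the cited works.

```python
import numpy as np

def entropy_score(probs):
    """Predictive entropy of the class distribution; higher = more uncertain."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def margin_score(probs):
    """Negative margin between the two most confident classes;
    a score near zero means the top two classes are hard to separate."""
    s = np.sort(probs, axis=-1)
    return -(s[..., -1] - s[..., -2])

probs = np.array([[0.90, 0.05, 0.05],   # confident prediction
                  [0.40, 0.35, 0.25]])  # ambiguous prediction
# Both criteria rank the ambiguous sample as more informative.
```

An acquisition round would then send the top-scoring unlabeled samples under either criterion to the oracle for labeling.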
To achieve a better trade-off between informativeness and diversity, hybrid methods are developed with an awareness of both. Since diversity-aware strategies are orthogonal to most uncertainty-aware strategies [16], they can be easily combined. ALFA-Mix [27] adopts K-means clustering to further filter out samples to enhance diversity. BADGE [3] represents unlabeled data samples via the gradient embedding of the parameters of the last classifier layer and applies K-means++ [2] to form a diverse data selection which still carries high uncertainty.

Balanced Multimodal Learning
Our work considers joint multimodal learning for classifications.
Here, it has been found that the best unimodal networks could potentially outperform multimodal networks regardless of fusion mechanisms or regularization methods [40]. Recent works show that the degradation of multimodal learning could be due to unbalanced optimization among different modalities. In [18], the failure of multimodal learning is attributed to modality competition, where only dominant modalities are fully explored by joint training. Similarly, [43] demonstrates that multimodal learning greedily optimizes the dominant modalities and proposes balancing their training speeds. [40] proposes to blend gradients with weights that are disproportional to the overfitting and generalization ratio of each modality so that each modality can be optimized in a balanced manner. [28] finds that fusion mechanisms such as concatenation and summation encourage the dominant modality to learn faster and thus develops gradient modulation to adaptively balance the training speed of each modality.

METHODOLOGY

Multimodal Active Learning Framework
The general active learning process is shown in Figure 1.
Once the model is trained, the unlabeled data samples are evaluated using an acquisition function and filtered for labeling.

Analysis of Imbalance in AL
We introduce one of the state-of-the-art active learning algorithms, BADGE [3], and analyze its imbalanced data selection over multimodal data samples. BADGE was the first to propose replacing the feature embedding with the gradient of the weights of the last FC layer, which acts as the classifier. In our case, the last FC layer for multimodal classification is the multimodal classifier f_mm. The weight W of the classifier f_mm is a 2-dimensional matrix of size K × d_mm, where K is the number of classes and d_mm is the dimension of the concatenated multimodal feature z = [z_m1; z_m2], i.e., d_mm = d_m1 + d_m2 (other fusion mechanisms such as summation and NL-gate are implemented in our further experiments). The corresponding multimodal cross-entropy loss can be expanded as

L_CE = −log σ(W_m1 z_m1 + W_m2 z_m2)_ŷ,

where σ is the softmax function and f_mm(z)_i is the i-th element of the logits f_mm(z). The gradient embedding is defined as g = ∂L_CE/∂W; it is a 2-D matrix of size K × d_mm whose i-th row is

g_i = (σ(f_mm(z))_i − 1[i = ŷ]) z,

where ŷ = argmax_i [σ(f_mm(z))_i] is the pseudo label for unlabeled data samples. The gradient embedding is flattened into a vector for sampling. It not only carries the uncertainty of classification from the margin between the logits f_mm(z) and the pseudo label ŷ, but is also representative due to the information present in z. However, in multimodal learning settings, identifying the source of uncertainty can be challenging. Upon examining the calculation of the multimodal logits, where W is divided into two matrices W_m1 and W_m2, it is difficult to determine which modality carries more uncertainty and which carries less. To illustrate, for a visual event such as drawing, the visual modality contains more information and contributes more to the multimodal logits by generating a larger output. The multimodal uncertainty calculation is thus skewed towards the visual uncertainty instead of considering both visual and auditory uncertainties fairly. From Section 4.4, we find that BADGE does pay more attention to the dominant modality, which might potentially damage the performance of joint multimodal learning. Another limitation of BADGE is its inability to distinguish modality contributions. For instance, given two data samples with identical logits, we should prioritize the one with a more balanced contribution during data selection to facilitate balanced multimodal learning. However, the current BADGE algorithm cannot achieve this. Similarly, most conventional active learning algorithms lack this capability.
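A minimal sketch of this last-layer gradient embedding, assuming a concatenated feature z and raw logits; the function and variable names are ours.

```python
import numpy as np

def gradient_embedding(z, logits):
    """BADGE-style gradient of the CE loss w.r.t. the last-layer weights,
    evaluated at the pseudo label y_hat = argmax of the softmax output.
    z: (d,) concatenated multimodal feature, logits: (K,)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                       # softmax probabilities
    y_hat = int(np.argmax(p))          # pseudo label
    coeff = p.copy()
    coeff[y_hat] -= 1.0                # sigma(f(z))_i - 1[i == y_hat]
    return np.outer(coeff, z).ravel()  # flatten the K x d matrix

z = np.concatenate([np.ones(3), 2 * np.ones(2)])  # toy [z_m1; z_m2], d = 5
g = gradient_embedding(z, np.array([2.0, 0.5, 0.1]))
```

A confident prediction yields coefficients near zero, so its embedding has a small norm; an uncertain sample keeps a large norm and is therefore favored by the K-means++ sampler.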
Hence, we develop a balanced multimodal active learning method that could avoid biased data selection towards the dominant modality to mitigate modality competition and assure that the trained multimodal network would not easily degenerate to the dominant modality. While our designed method is encouraged to pay more attention to the weaker modality, it is essential to ensure that it does not overly lean towards the weaker modality, as this may also harm the multimodal classification performance.

Guidelines to Design Balanced MMAL
To make existing AL strategies more suitable for balanced multimodal learning, it is necessary to inspect the individual modality contributions and reduce the contribution gap among different modalities. We empirically propose three guidelines for designing active learning strategies that treat each modality more equally. Let Φ_m(x) represent the contribution of the m-th modality of data sample x to the final model outcome v, satisfying Φ_m(x) ≥ 0. We introduce the dominance degree d(x) to quantify how severely a data sample x is dominated by the strongest modality:

d(x) = Φ_m̂(x) − max_{m ≠ m̂} Φ_m(x), where m̂ = argmax_m Φ_m(x).

For the ease of discussion, we further partition the entire unlabeled dataset into multiple subsets U = {U_1, ..., U_M}, where in each subset U_k modality m_k contributes the most:

U_k = {x : argmax_m Φ_m(x) = m_k}.

Guideline 1: For two multimodal data samples x_i and x_j, if their acquisition scores under conventional active learning (CAL) strategies are equal, the one with more balanced unimodal contributions should receive a higher acquisition score under a balanced multimodal active learning strategy:

a_CAL(x_i) = a_CAL(x_j) and d(x_i) < d(x_j) ⟹ a_BMMAL(x_i) > a_BMMAL(x_j).

By following Guideline 1, data samples with more equal unimodal contributions are more likely to be selected. However, this does not guarantee that the stronger modality will be suppressed, nor does it ensure that the weaker modality will not be overly encouraged. Therefore, we introduce two additional guidelines.
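In the two-modality case the dominance degree reduces to the absolute gap between the two contributions. A small sketch under that assumption (function names are ours):

```python
import numpy as np

def dominance_degree(phi):
    """d(x): gap between the strongest modality's contribution and the
    runner-up; for two modalities this is simply |phi_1 - phi_2|."""
    s = np.sort(phi)[::-1]
    return float(s[0] - s[1])

def partition_by_dominant(phis):
    """Index of the dominant modality per sample, i.e. which subset U_k
    each sample falls into."""
    return [int(np.argmax(phi)) for phi in phis]

phis = [np.array([0.7, 0.3]),    # modality-0 dominated, d = 0.4
        np.array([0.45, 0.55])]  # modality-1 dominated, d = 0.1
```

Under Guideline 1, the second sample (smaller d) would be preferred when the two samples tie on a conventional acquisition score.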
Guideline 2: To avoid biased data selection favoring the stronger modality, the gap between the average acquisition scores of data samples dominated by the stronger modality and those dominated by the weaker modality should be reduced. In a two-modality case, m_1 is the weaker modality and m_2 is the stronger modality (i.e., the average contribution of m_1 over the entire dataset is less than that of m_2), and the average acquisition score over the subset dominated by m_2 should be suppressed relative to that over the subset dominated by m_1.

Guideline 3: Lastly, to prevent biased data selection towards the weaker modality, it is necessary to ensure that the contribution of each modality to the acquisition score function a_BMMAL remains proportional to its modality contribution to the model outcome on the sample-level. This ensures that the data samples are selected in a way that fairly represents the contributions of each modality to the actual model outcome.
In summary, Guideline 1 prioritizes the samples with more equal unimodal contributions. Guidelines 2 and 3 work together to punish the stronger modality on the dataset-level while maintaining the relationship between the strong and weak modalities on the sample-level, avoiding biases towards either the stronger or weaker modalities.

Estimate Modality Contribution
We show how we compute the modality contribution Φ. In the context of multimodal classification, balanced active learning should select data samples that fairly contribute to the performance of all modalities. To achieve this, it is essential to estimate the degree to which each modality of a given data sample contributes to the final multimodal prediction. One approach involves assessing modality importance by computing the disparity in model performance before and after the incorporation of a particular modality. Researchers have proposed various techniques to remove the information of one modality, such as masking [11], permutation [14], and empirical multimodally-additive projection (EMAP) [17]. Nonetheless, these attribution methods are ill-suited for active learning as they require ground truth labels to calculate model performance metrics, such as accuracy. As a result, these methods cannot be employed for estimating modality contribution for unlabeled data due to the absence of ground truth labels.
Therefore, we choose to use the Shapley value to estimate modality contribution without the need for true labels. The Shapley value [33] was proposed in game theory to fairly attribute payouts among a group of cooperative players based on their contributions to the total payout. In deep learning, the SHapley Additive exPlanations (SHAP) value [23] considers each feature as a player and the model prediction as the total payout to estimate feature contributions. Let M = {z_m1, ..., z_mM} represent the set of all modality features, S denote a subset, and v symbolize the model outcome. Here, we use features instead of raw data inputs since features are utilized in active learning. To estimate the Shapley value of the i-th modality feature z_mi, we compute its marginal contribution to each subset S and average over all possible subset selections:

φ_mi = Σ_{S ⊆ M \ {z_mi}} [|S|! (|M| − |S| − 1)! / |M|!] (v(S ∪ {z_mi}) − v(S)).

We use the largest predicted class probability p_ŷ provided by f_mm as the model outcome v, where ŷ is the pseudo class. For the most common two-modality case, the Shapley values of the modality features can be computed as follows (∅ represents a zero vector):

φ_m1 = ½ [(v({z_m1}) − v(∅)) + (v({z_m1, z_m2}) − v({z_m2}))],
φ_m2 = ½ [(v({z_m2}) − v(∅)) + (v({z_m1, z_m2}) − v({z_m1}))].

The Shapley value can be positive, negative or zero. While the sign indicates the direction in which each modality contributes, our primary interest lies in the extent of its contribution. Hence, we define the modality contribution as the absolute Shapley value, Φ_mi = |φ_mi|.
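The two-modality computation above amounts to four forward passes with zero-vector ablation. A hedged sketch, where the outcome function `v` stands in for the classifier's top predicted probability:

```python
import numpy as np

def two_modality_shapley(v, z1, z2):
    """Shapley values of two modality features under outcome function v.
    v takes (z1, z2) and returns a scalar; an absent feature is replaced
    by a zero vector, as in the ablation described above."""
    zero1, zero2 = np.zeros_like(z1), np.zeros_like(z2)
    v_empty = v(zero1, zero2)
    v_1, v_2, v_12 = v(z1, zero2), v(zero1, z2), v(z1, z2)
    phi1 = 0.5 * ((v_1 - v_empty) + (v_12 - v_2))
    phi2 = 0.5 * ((v_2 - v_empty) + (v_12 - v_1))
    return phi1, phi2

# Toy additive outcome: the Shapley values recover each modality's share,
# and phi1 + phi2 equals v(z1, z2) - v(0, 0) (the efficiency property).
v = lambda a, b: a.sum() + 2 * b.sum()
phi1, phi2 = two_modality_shapley(v, np.ones(3), np.ones(2))
```

In the actual pipeline, `v` would run the multimodal classifier on the (possibly zeroed) feature pair and return the probability of the pseudo class.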

Proposed Method
Following the proposed guidelines, we redesign BADGE for multimodal classification scenarios with two modalities, m_1 and m_2, to achieve more balanced data selection. The i-th row of the gradient embedding in Eq. 3 can be derived as the concatenation of two unimodal gradient embeddings:

g_i = [(σ(f_mm(z))_i − 1[i = ŷ]) z_m1, (σ(f_mm(z))_i − 1[i = ŷ]) z_m2] = [g_i^m1, g_i^m2].

We then design two weights w_m1 and w_m2, both decreasing in λ, and scale each unimodal gradient embedding by them respectively:

g̃_i = [w_m1 g_i^m1, w_m2 g_i^m2].

Here, λ = |Φ_m1 − Φ_m2| is the difference between the contributions of the two modalities. Note that gradient embeddings with larger l2 norm are more easily selected by the K-Means++ algorithm [3]. Therefore, by multiplying with these weights, the magnitude of the gradient embedding is suppressed more when the unimodal contributions are more unbalanced. This aligns with our Guideline 1, where we want to punish samples with unbalanced contributions. Moreover, we observe that the average λ of the subset in which the weaker modality dominates is smaller than that of the subset where the stronger modality dominates (see Figure 6 and our discussion in Sec 4.5). If m_1 is the weaker modality with respect to the entire dataset, the average weight over the subset U_1 dominated by m_1 is therefore larger than that over the subset U_2 dominated by m_2. It means that the subset where the stronger modality dominates is suppressed more, which follows our Guideline 2 of punishing the stronger modality on the dataset-level.
Finally, Guideline 3 is also adhered to. For each sample, the modality with a higher contribution to the model outcome is always assigned a greater weight, resulting in a higher magnitude of its unimodal gradient embedding. This ensures that the contribution to data selection is proportional to the contribution to the model outcome, and to model optimization if selected.
In the end, we perform K-Means++ over the scaled gradient embedding to select candidates for labeling. As a result, our BMMAL strategy could achieve more balanced active learning on multimodal classification than BADGE. It could prevent biased selection towards either the stronger or weaker modalities, thus benefiting multimodal learning.
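Putting the pieces together, a sketch of the selection step. The weight form w_mk = Φ_mk·(1 − λ) is our illustrative assumption, chosen only because it satisfies the three guidelines; the paper's exact formula may differ, and the K-Means++ seeding is simplified.

```python
import numpy as np

def modulated_embedding(g1, g2, phi1, phi2):
    """Scale each unimodal gradient embedding. The weight form
    w = phi * (1 - lam) is an illustrative assumption: it shrinks
    unbalanced samples (Guideline 1) while keeping the per-sample
    weight proportional to the modality contribution (Guideline 3)."""
    lam = abs(phi1 - phi2)
    return np.concatenate([phi1 * (1 - lam) * g1, phi2 * (1 - lam) * g2])

def kmeanspp_select(X, b, rng=None):
    """Simplified K-Means++ seeding over embeddings X: large-magnitude,
    mutually distant points are favored, as in BADGE."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = [int(np.argmax((X ** 2).sum(axis=1)))]  # start at largest norm
    while len(idx) < b:
        d2 = np.min(((X[:, None, :] - X[idx]) ** 2).sum(-1), axis=1)
        idx.append(int(rng.choice(len(X), p=d2 / d2.sum())))
    return idx

# A balanced sample keeps a larger embedding than an unbalanced one.
g = np.ones(2)
balanced = modulated_embedding(g, g, 0.5, 0.5)
unbalanced = modulated_embedding(g, g, 0.9, 0.1)
sel = kmeanspp_select(np.array([[0.0, 0.0], [10.0, 0.0],
                                [0.0, 10.0], [0.1, 0.0]]), 2)
```

Already-selected points have zero squared distance and thus zero sampling probability, so the seeding never repeats a candidate.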

EXPERIMENT

Dataset
Food101 [42] is a multi-class food recipe dataset with 101 kinds of food. Each recipe consists of a food image and a textual recipe description. The dataset consists of 45,719 samples for training and 15,294 samples for testing.
KineticsSound [1] is a sub-dataset containing 31 action classes selected from Kinetics-400 [20]. These action classes are considered to be correlated to both visual and auditory content. This dataset contains 14,739 clips for training and 2,594 clips for testing.
VGGSound [8] is a large-scale video dataset with 309 classes. Each video clip is 10 seconds long and captures the object making the sound. We were only able to download 180,911 clips for training and 14,843 clips for testing due to the unavailability of some YouTube videos.

Baseline
We consider seven existing active learning strategies as baselines. Random selects the data samples randomly from the unlabeled data pool. Entropy [32] selects data samples with the highest entropy of multimodal classification probabilities. CoreSet [31] filters out a subset of unlabeled data with representative multimodal features via the K-center greedy algorithm. BADGE [3] is a hybrid method that selects diverse data samples by a K-means++ sampler over their gradient embedding of the multimodal classifier. BALD [13] is a Bayesian method to evaluate the mutual information between model predictions and model parameters. Since our model is static, we run five rounds of model forwarding with enabled dropout to obtain the entropy of model parameters. DeepFool [10] adopts an adversarial-like approach that adds small perturbations over multimodal features and selects data whose predictions are flipped. GCNAL [7] learns an extra graph convolution network to distinguish labelled and unlabelled samples and selects unlabelled samples that are sufficiently different from labelled ones.

Experiment Setting
Image-text Classification: For the Food101 dataset, we adopt ResNet-101 pre-trained on ImageNet as the image backbone and the pre-trained Bert-base model [9] as the text backbone. All unimodal and multimodal classifiers are single FC layers. We use AdamW [22] as the optimizer, train the model for 15 epochs in each AL round, and adopt random crop, random horizontal flip and random greyscale for image augmentation.
Video Classification: For VGGSound and KineticsSound, we utilize ResNet2P1D-18 [37] as the visual backbone. The difference is that it is pre-trained on Kinetics-400 for VGGSound, while it is randomly initialized for KineticsSound. We use a randomly initialized ResNet-18 as the auditory backbone, whose input channel is modified from 3 to 1. The video is uniformly sampled into 10 frames at the rate of one frame per second. The audio clip is transformed into a spectrogram with a window length of 512 and an overlap length of 353. For video augmentation, we randomly sample 5 frames out of 10 frames and apply image augmentation techniques on each frame. For audio augmentation, we randomly extract a 5-second audio fragment from the whole audio clip. We use Adam as the optimizer and train the model for 45 epochs in each round.
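For concreteness, the audio preprocessing above (window length 512, overlap 353, hence a 159-sample hop) can be sketched with a minimal NumPy STFT; the 16 kHz sample rate is our assumption, and real pipelines would likely use a dedicated audio library.

```python
import numpy as np

win, overlap = 512, 353         # window and overlap lengths from the text
hop = win - overlap             # 159-sample hop
sr = 16000                      # assumed sample rate
audio = np.random.default_rng(0).standard_normal(5 * sr)  # 5-second clip

# Frame the waveform and take the magnitude FFT of each windowed frame
# (a minimal STFT; the Hann window and log compression are common choices).
n_frames = (len(audio) - win) // hop + 1
frames = np.stack([audio[i * hop: i * hop + win] for i in range(n_frames)])
spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1)).T  # (257, T)
log_spec = np.log(spec + 1e-10)  # input to the audio CNN backbone
```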
The experiment is repeated 5 times for image-text classification and 3 times for video classification to remove the randomness of the initial querying. For multimodal fusion, we apply concatenation, a widely used multimodal fusion mechanism, on all tasks. In addition, we implement summation and NL-gate [41], which is similar to multi-head attention [38], in further experiments.

AL Performance
A fair and good AL strategy ought to select important multimodal data that could contribute to multimodal tasks and, simultaneously, pay fair attention to both weaker and stronger modalities to prevent the trained multimodal network from degenerating into only a good unimodal network. We run conventional active learning strategies along with our proposed method BMMAL on several multimodal datasets, and compare their multimodal and unimodal classification accuracy.
We first plot the trend of modality contributions to the predicted probability of the ground-truth class on the test set across active learning iterations in Figures 2 and 3. As shown in the figures, the textual modality contributes more than the image modality on Food101 after the second iteration, and the auditory modality contributes more than the visual modality on KineticsSound. More importantly, the difference between the two unimodal contributions under BMMAL is overall smaller than under both BADGE and Random. This means that the two modalities contribute more equally in models trained on the data selected by BMMAL.
The performance comparison of each AL iteration on the Food101 dataset is shown in Figures 10a and 10c. Note that the textual modality is the stronger modality from iteration 2 onwards. Our method outperforms all baselines except BADGE in multimodal classification. In text classification, BMMAL, BADGE and CoreSet achieve good performance. In image classification, our method is superior to most of the baselines except Random. From the above comparison, we can tell that BADGE and CoreSet mainly focus on selecting valuable samples from the stronger text modality and ignore the weaker image modality. Although Random performs well in image classification because it uniformly selects multimodal data without any weighting, such uniform selection is unfair with respect to the text modality.
The performance comparison of each AL iteration on the KineticsSound dataset is shown in Figures 10b and 10d. Note that the auditory modality is the stronger modality. Our method outperforms all baselines in multimodal classification. BADGE performs the best on audio classification in many iterations. However, its performance declines on video classification, showing that BADGE tends to assign more importance to the audio modality during data selection and that such biased selection might negatively affect multimodal joint training.
The performance comparison of each AL iteration on the VGGSound dataset is shown in Figures 4c and 4f. Note that the auditory modality is the stronger modality. Our method outperforms BADGE not only in multimodal classification but also in both unimodal classifications by a clear margin.
Findings. Our first finding is that AL methods such as BADGE and BALD, which win at classification of the stronger modality, stand a good chance of failing at classification of the weaker modality. This may be due to biased data selection towards the stronger modality, and it is undesirable for balanced multimodal learning. Our second finding is that Random and CoreSet can perform better on the weaker modality, whereas they are inferior in multimodal classification: random selection treats every sample with absolute fairness, and CoreSet focuses too much on the weak modality, both of which are unfair with respect to the stronger modality. Finally, our method achieves a fairer multimodal data selection with a better trade-off between weak and strong modalities.

(Figure 5 caption: A strategy is considered better if its row-wise value is larger, indicating that it beats other strategies more often. Conversely, a strategy is better if its column-wise value is smaller, meaning it is rarely beaten by other strategies. The maximum value of each cell is 5, the total number of experimental settings. The bottom row displays the column-wise average values; lower is better.)

Ablation Study
Pairwise Comparison. We illustrate the results across various experimental settings in the comparison matrices in Figure 5 [3]. We compute the t-score for each repeated experiment and use the two-sided t-test to compare the performance of paired strategies on the test set at a 0.9 confidence level. If strategy i significantly outperforms strategy j, we add 1/T to the cell C_{i,j}, where T is the total number of AL iterations in a single experiment setting; C_{i,j} thus indicates the number of times strategy i significantly outperforms strategy j, and the maximum cell value equals the total number of experiment settings. We compute the matrix for multimodal classification and for unimodal classification on the stronger (text for Food101, audio for KineticsSound and VGGSound) and weaker modalities (image for Food101, video for KineticsSound and VGGSound). The three matrices demonstrate that our proposed method outperforms most baselines across settings. Specifically, BMMAL surpasses BADGE in multimodal classification and in unimodal classification on the weaker modalities, while performing comparably with BADGE in unimodal classification on the stronger modalities. This suggests that the performance improvement of BMMAL in multimodal classification mainly stems from enhancing the weaker modalities while maintaining stable performance on the stronger modalities.

Dominance Degree. As described in Eq. 6, we divide the entire unlabeled dataset into multiple sub-datasets in which modality m_k contributes the most. The Food101 dataset is divided into two subsets dominated by the text and image modality, respectively. In Figure 6a, the average weight values of the weaker modality are shown. As shown before in Figure 2, the text modality becomes the stronger one starting from the second iteration. Accordingly, from the second iteration, the average weight value in the text-dominated subset becomes less than that in the image-dominated subset, meaning that the average difference λ between the two unimodal contributions is larger in the text-dominated subset than in the image-dominated one. The KineticsSound dataset is divided into two subsets dominated by the video and audio modality, respectively. In Figure 6b, the average weight values of the weaker modality are shown. Similarly, the average difference λ between the two unimodal contributions is larger in the audio-dominated subset than in the video-dominated one. Consequently, on the dataset-level, the sub-dataset dominated by the weaker modality receives less punishment than the sub-dataset dominated by the stronger modality.

Different Fusion Mechanisms. We perform experiments by changing the fusion method from concatenation to summation on Food101 and KineticsSound, while keeping other settings unchanged. We include the performance comparison in the pairwise comparison and present the iterative comparison in the supplementary materials. Furthermore, we change concatenation to NL-gate for mixing video and audio features on the VGGSound dataset, setting the initial budget to 5,000 and the AL budget for each round to 2,000, as NL-gate requires more data to demonstrate its efficiency in fusion. We provide the implementation details in the supplementary materials. As shown in Figure 7, our method achieves comparable multimodal classification performance to BADGE and is worse on auditory classification. However, for the weaker visual classification, our method outperforms the others, demonstrating its effectiveness in balancing weak and strong modalities.
Large-scale Active Learning. We conduct experiments on VGGSound with a larger budget size of 5,000 to validate our method on large-scale active learning for multimodal video classification. The results are averaged and shown in Table 1. On video classification, the performance of BADGE degrades and becomes worse than random selection, while our method improves over both BADGE and random selection. On audio classification, BADGE and our method are comparable and both better than random selection. As a result, our method performs better than BADGE and can save around 5k labels compared with random selection if the target multimodal classification top-1 accuracy is set to 0.435.

Classwise Performance Comparison. We show the classwise performance comparison on the KineticsSound dataset. As shown in Figure 8, the gain is more significant than the drop. Moreover, improved classes such as 'chopping wood', 'bowling' and 'shoveling snow' carry more visual information, while dropped classes are mostly dominated by the auditory modality. Note that KineticsSound is a dataset where audio contributes more than vision, which means that BMMAL avoids biased selection towards the auditory modality and focuses more on the weaker visual modality.

DISCUSSION
In this paper, we evaluate how existing active learning strategies perform on multimodal classification. Our empirical studies show that they might treat different modalities unfairly, which could lead to performance degradation for multimodal learning. We propose BMMAL to mitigate this unfairness by separately scaling unimodal gradient embeddings, which avoids mixing all unimodal information and retains the characteristics of each modality. The method performs well on multiple datasets and can potentially be applied to large-scale multimodal active learning.

A COMPUTATIONAL COMPLEXITY
Computing the Shapley values of each unimodal feature requires performing inference 2^M times in total, where M is the number of modalities. In our two-modality learning case, we need to perform inference four times with different combinations of unimodal features to obtain the Shapley values, which is acceptable. Then, given the computed gradient embeddings of N unlabeled samples, the sampling time complexity of BMMAL is O(bNKd), where b is the query budget of each AL round, Kd is the size of the weight matrix of the last linear classifier and K is the number of classes.
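The 2^M count simply enumerates every subset of modality features, each requiring one forward pass; a minimal illustration (modality names are placeholders):

```python
from itertools import combinations

modalities = ["m1", "m2"]  # the two-modality case discussed above

# One forward pass per subset S of M yields v(S) for the Shapley formula.
subsets = [set(c) for r in range(len(modalities) + 1)
           for c in combinations(modalities, r)]
# Four passes here: v(empty), v({m1}), v({m2}), v({m1, m2})
```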

B IMPLEMENTATION OF NL-GATE
NL-gate [41] is a mid-fusion mechanism that behaves similarly to multi-head attention. We implement it in the video classification task, where ResNet-18 is utilized as the audio backbone and ResNet2P1D-18 as the video backbone. Note that both ResNet-18 and ResNet2P1D-18 have four blocks. We extract the middle 2D audio features from the third block of ResNet-18 and the middle 3D video features from the third block of ResNet2P1D-18 as inputs to the NL-gate.
We show the implementation of NL-gate in Figure 9. The 3D video feature is average-pooled over the spatial channels into a 1D video feature. It is then tiled over the frequency channel into a 2D video feature that has the same size as the 2D audio feature. The concatenation of the 2D video feature and the 2D audio feature is used as the key and value in NL-gate, while the original 3D video feature is used as the query. After the audio and video features are mixed, they are processed by a randomly initialized module with the same layout as the fourth block of ResNet2P1D-18 to produce the final feature. To compute the marginal unimodal contributions, we compute the Shapley values of the features generated by the last shared convolution layers before the NL-gate fusion module.
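The pool-and-tile step can be sketched in plain Python. The shapes are illustrative (single-channel features), not the actual ResNet feature sizes:

```python
def pool_and_tile(video_feat, freq_bins):
    """video_feat: [time][height][width] 3D feature (one channel for
    illustration). Average-pool over the spatial (height, width) axes,
    then tile the 1D temporal result over `freq_bins` frequency rows
    to match the 2D audio feature layout [frequency][time]."""
    pooled = []
    for t_slice in video_feat:               # one [H][W] map per time step
        vals = [v for row in t_slice for v in row]
        pooled.append(sum(vals) / len(vals))  # spatial average pool
    # Tile: every frequency row repeats the pooled temporal feature.
    return [list(pooled) for _ in range(freq_bins)]

# A 2-frame, 2x2 video feature tiled to 3 frequency bins.
tiled = pool_and_tile([[[1.0, 2.0], [3.0, 4.0]],
                       [[5.0, 6.0], [7.0, 8.0]]], freq_bins=3)
```

The tiled 2D video feature can then be concatenated with the 2D audio feature to form the key and value for the attention-style fusion.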

C SPLIT THE LARGE UNLABELED DATA POOL
In large-scale AL experiments, the gradient embeddings produced by all unlabeled data samples could be too large to store in memory. To address this issue, we split the unlabeled data pool into S smaller pools, where S is the split size. After splitting, we query B/S unlabeled samples from each smaller pool and aggregate them to form the final query set. The space complexity of BMMAL is correspondingly reduced by a factor of S. Moreover, the sampling time complexity becomes O(S (N/S)(B/S)DC) = O(NBDC/S), which is also reduced by a factor of S compared with the original time complexity. We use a split size of eight in the large-scale AL experiment with the VGGSound-full dataset. Although splitting might affect AL performance, we observe that both BMMAL and BADGE still perform better than random data selection, which indicates that splitting the unlabeled data pool is acceptable in large-scale AL.
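The splitting scheme can be sketched as follows; the `query` callback is a hypothetical stand-in for the BMMAL (or BADGE) sampler applied to one sub-pool:

```python
def split_and_query(unlabeled_ids, budget, split_size, query):
    """Split the unlabeled pool into `split_size` smaller pools,
    query budget/split_size samples from each, and aggregate
    the selections into the final query set."""
    assert budget % split_size == 0, "budget must divide evenly"
    per_pool_budget = budget // split_size
    pool_len = -(-len(unlabeled_ids) // split_size)  # ceiling division
    selected = []
    for i in range(split_size):
        pool = unlabeled_ids[i * pool_len:(i + 1) * pool_len]
        selected.extend(query(pool, per_pool_budget))
    return selected

# Toy sampler: pick the first b ids of each sub-pool.
picked = split_and_query(list(range(100)), budget=8, split_size=4,
                         query=lambda pool, b: pool[:b])
```

Only one sub-pool's gradient embeddings need to be materialized at a time, which is the source of the S-fold memory saving.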

D AL PERFORMANCE WITH SUMMATION
We visualize the performance comparison of all baselines with our proposed method across all AL rounds on Food101 and KineticsSound with the summation fusion mechanism in Figure 10, and the unimodal contributions of BMMAL, BADGE and random selection in Figure 11. As shown in the figures, our proposed method outperforms BADGE on Food101 and achieves a more balanced unimodal contribution than BADGE. On KineticsSound, our proposed method is comparable with BADGE, which may be due to the weak fusion ability of summation.

Figure 1 :
Figure 1: General active learning process. The dashed lines represent model training. The solid lines represent data selection.

Figure 2 :
Figure 2: Modality contribution Φ across different AL iterations on the Food101 test set.

Figure 3 :
Figure 3: Modality contribution Φ across different AL iterations on the KineticsSound test set.
Unimodal performance comparison across AL iterations on VGGSound.

Figure 4 :
Figure 4: Performance comparison between the proposed method and other conventional AL strategies with the concatenation fusion method. The metric is top-1 accuracy (Top-1) on multimodal and unimodal classification.
Panels: pairwise comparison on multimodal classification, on unimodal classification of the stronger modalities, and on unimodal classification of the weaker modalities (strategies compared: Random, BALD, Entropy, DeepFool, CoreSet, GCNAL, BADGE, BMMAL).

Figure 5 :
Figure 5: Pairwise comparison of all active learning strategies. Each element (i, j) of the matrix represents the number of times strategy i outperforms strategy j. A strategy is considered better if its row-wise values are larger, indicating that it beats other strategies more often; conversely, a strategy is better if its column-wise values are smaller, meaning it is rarely beaten by other strategies. The maximum value of each cell is 5, the total number of experimental settings. The bottom row displays the column-wise average values (lower is better).
Panels: average weight for the weaker modality on the Food101 dataset, and average weight for the weaker modality on the KineticsSound dataset.

Figure 6 :
Figure 6: Average weight for the weaker modality in a subdataset dominated by the other stronger modality.

Figure 7 :
Figure 7: Multimodal and unimodal classification performance comparison with the NL-gate fusion method on the VGGSound dataset.

Figure 8 :
Figure 8: Top 10 improved and dropped classes based on the improvement of BMMAL over BADGE in multimodal classification accuracy on KineticsSound with 5K labeled samples. Bars represent multimodal classification accuracy; stems represent unimodal classification accuracy.

Figure 9 :
Figure 9: The implementation of NL-gate. We use the 3D video feature as the query and the 2D concatenated audio and video feature as the key and value.

Figure 10 :
Figure 10: Performance comparison between the proposed method and other conventional AL strategies with the summation fusion method. The metric is top-1 accuracy (Top-1) on multimodal and unimodal classification.
Panels: modality contribution Φ across different AL iterations on the Food101 test set, and modality contribution Φ across different AL iterations on the KineticsSound test set.

Figure 11 :
Figure 11: Unimodal contribution comparison among the proposed method, BADGE and random selection with summation fusion.
We are given a large unlabeled data pool U_0 = {(x^1, ..., x^M)_{1...N}} of N input samples with M modalities and an empty labeled data pool L_0 = ∅. The labeling budget of each round is set to B. x^1 and x^2 represent the input data from two different modalities. They are processed through encoders g^1 and g^2 respectively to extract unimodal features f^1 ∈ R^{d_1} and f^2 ∈ R^{d_2}. We adopt concatenation, a widely used late-fusion mechanism, to construct the multimodal feature f^m = f^1 ⊕ f^2. The unimodal and multimodal features are fed to the unimodal classifiers h^1, h^2 and the multimodal classifier h^m respectively to produce logits z^1, z^2 and z^m for classification. The final loss is the average cross-entropy loss L_CE of the unimodal and multimodal logits with the true label y:

L = (1/3) [ L_CE(z^1, y) + L_CE(z^2, y) + L_CE(z^m, y) ]
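The averaged training loss described above can be sketched in plain Python on raw logit lists (softmax cross-entropy per sample; illustrative only, with integer class labels assumed):

```python
from math import exp, log

def cross_entropy(logits, label):
    """Softmax cross-entropy of one sample: -log softmax(logits)[label].
    Uses the max-shift trick for numerical stability."""
    m = max(logits)
    log_sum = m + log(sum(exp(z - m) for z in logits))
    return log_sum - logits[label]

def multimodal_loss(z1, z2, zm, label):
    """Average of the two unimodal and one multimodal CE losses."""
    return (cross_entropy(z1, label) + cross_entropy(z2, label)
            + cross_entropy(zm, label)) / 3.0

# Two-class example: audio logits, video logits, fused logits.
loss = multimodal_loss([2.0, 0.0], [1.0, 1.0], [3.0, 0.0], label=0)
```

Averaging the three terms keeps gradients flowing through both unimodal classifiers as well as the fused one, which is what allows per-modality contributions to be measured later.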

Table 1 :
AL performance on the VGGSound dataset with a budget size of 5,000. The best results are highlighted in bold.