Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment

Video quality assessment (VQA) has attracted growing attention in recent years. However, the great expense of annotating large-scale VQA datasets has become the main obstacle for current deep-learning methods. To surmount the constraint of insufficient training data, in this paper, we first consider the complete range of video distribution diversity (i.e., content, distortion, motion) and employ diverse pretrained models (e.g., varying in architecture, pretext task, and pre-training dataset) to benefit quality representation. An Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework is proposed to capture desired quality-related features generated by these frozen pretrained models. By leveraging the Quality-aware Acquisition Module (QAM), the framework is able to extract more essential and relevant features to represent quality. Finally, the learned quality representation is utilized as supplementary supervisory information, along with the supervision of the labeled quality score, to guide the training of a relatively lightweight VQA model in a knowledge distillation manner, which largely reduces the computational cost during inference. Experimental results on three mainstream no-reference VQA benchmarks clearly show the superior performance of Ada-DQA in comparison with current state-of-the-art approaches without using extra VQA training data.


Figure 1: Quality-related information contained in different pretraining modalities. For the image modality (e.g., ImageNet [13]), information pertaining to content and distortion perception is included. For the cross-modality of image-text (e.g., WebImageText [40]), emotional descriptions associated with visual quality may be contained.

INTRODUCTION
With the explosive growth of video content-based social media, a tremendous amount of videos are produced and shared [11]. To guarantee optimal video quality and ensure users' quality of experience (QoE), VQA plays a crucial role in guiding image processing and video coding systems [4,59]. Benefiting from the thriving of Deep Neural Networks (DNN), DNN-based VQA methods [17,29,31,32,61,63] have shown great results on in-the-wild VQA benchmarks [23,45,53,61]. However, DNN-based VQA methods are hindered by the limited scale of existing VQA datasets. As demonstrated in Tab. 1, public VQA datasets are significantly smaller than video classification datasets [25,26]. The primary reason for this discrepancy is the prevalent use of crowdsourcing to annotate subjective video quality [5,9,22,44], which is a time-consuming yet indispensable process for eliminating randomness and enhancing consistency. For example, the KoNViD-1k dataset [23] requires an average of 114 subjective scores to produce a valid label. Recently, there have been attempts to scale up VQA datasets [17,61]. However, these efforts often sacrifice annotation quality, as they reduce the average number of annotations per video to improve efficiency. For instance, an experiment conducted on the KoNViD-150k dataset [17] revealed that the SRCC between randomly sampled five annotations is 0.8, while it is 0.9 for fifty annotations. Thus, in this paper, we primarily focus on enhancing the performance of VQA under the circumstance of limited availability of high-quality annotated data.
To surmount the constraint of insufficient training data, certain works [24,31,32,52] attempt to fine-tune DNNs that have been pretrained on other extensive datasets. The majority of these works employ ImageNet [13], which comprises a vast array of object categories, as the pre-training dataset. By introducing content-aware knowledge that matches the diverse distribution of in-the-wild videos, a considerable enhancement in performance can be obtained on downstream VQA tasks. However, videos captured in the wild often suffer unavoidable distortions resulting from extreme shooting conditions, compression, transmission, or other unprofessional operations performed by users [59]. In such scenarios, relying solely on content-aware information is insufficient to ensure quality representation. Furthermore, some other studies [30,54] seek to leverage motion-aware and compression-aware features obtained from models pretrained using spatiotemporal information, simply by concatenating or averaging the features. As shown in Fig. 1, quality-related factors are encompassed within the information of different modalities, and the contribution of each modality may vary depending on the input. Nevertheless, most existing methods either do not consider the complete range of video distribution diversity or employ simplistic methods to integrate various types of features, which restricts further improvements in VQA.
To address the aforementioned limitations, in this paper, we aim to employ diverse in-the-wild pretrained models to enhance VQA performance from the various aspects that may affect video quality. In the case of the image modality, owing to the presence of content labels and a vast number of images closely resembling the actual distribution, quality-related information pertaining to content and distortion perception is included. As for the video modality, the data distribution contains quality-related information such as compression distortions (e.g., blocky artifacts) or motion blur. For the cross-modality of image-text, we use a trained CLIP model [40], which efficiently learns visual concepts from natural language supervision, for analysis. By devising appropriate templates, CLIP exhibits remarkable predictive ability without access to VQA annotations. We surmise that during the training of CLIP, certain texts may contain emotional descriptions associated with quality. These phenomena attest to the practicality of employing in-the-wild models that have been pretrained on diverse modalities of data.
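The prompt-based probing of CLIP described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes frame and prompt embeddings have already been produced by CLIP's image and text encoders, and the two-prompt "good/bad" template is a common zero-shot formulation rather than the specific template the paper uses.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_quality_score(frame_emb, good_emb, bad_emb, temperature=10.0):
    """Zero-shot quality probe: softmax over the similarities of a frame
    embedding to a 'good quality' and a 'bad quality' prompt embedding.
    Returns P(good) as a quality score in (0, 1)."""
    logits = np.array([cosine(frame_emb, good_emb),
                       cosine(frame_emb, bad_emb)]) * temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[0])
```

A video-level score would then average this probe over sampled frames; the temperature plays the role of CLIP's learned logit scale.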
Based on the above observations, in this paper, we introduce a new Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework for VQA tasks. Initially, we establish a pool of pretrained models exhibiting a wide range of diversity, taking into account their architectures, pretrained pretext tasks, and pretrained datasets. Subsequently, to dynamically capture the desired quality-related features on a per-sample basis during training, we present a Quality-aware Acquisition Module (QAM) that outputs gating weights for the features generated by different pretrained models for aggregation. We also impose an additional sparsity constraint on the distribution of the gating weights, encouraging attention to be focused on the more crucial and pertinent features for quality representation. Lastly, the learned quality representation is employed as auxiliary supervisory information, in conjunction with the supervision of the labeled quality score, to facilitate the training of a comparatively lightweight VQA model using knowledge distillation. The effectiveness of our method is evaluated through extensive experiments on three mainstream high-quality annotated NR-VQA datasets. Ada-DQA models achieve 0.8651, 0.8591, and 0.8729 SRCC on KoNViD-1k, LIVE-VQC, and YouTube-UGC, improving the state-of-the-art results for these datasets by absolute margins of 0.6%, 1.79%, and 3.88%, respectively. Our contributions are as follows:
• To the best of our knowledge, this is the first study to comprehensively investigate the relationships between pretrained models and video quality. We construct a diverse pool that encompasses a broad spectrum of quality-related factors.
• We propose Ada-DQA to leverage these pretrained models for VQA, where the QAM is proposed to capture quality-related features adaptively. Additionally, a sparsity constraint is attached to focus on the most crucial and relevant features.
• We evaluate Ada-DQA on three mainstream NR-VQA benchmarks, surpassing current state-of-the-art methods without using extra QA training data. Sufficient ablation studies validate the effectiveness of each component.

RELATED WORK
According to the availability of reference videos, VQA methods can be classified into full-reference (FR), reduced-reference (RR), and no-reference (NR) [30] ones. As reference videos are often hard to obtain, NR-VQA is more practical in real-world VQA scenarios and is the setting investigated in this paper. According to the difference in construction schema, VQA methods can also be classified into traditional hand-crafted and learning-based ones.

Classical VQA Approaches
Classic VQA methods [28,33,37,41,48,49] rely on handcrafted features to evaluate video quality. With the underlying assumption that perceptual quality can be measured by the disturbance of natural scene statistics (NSS) [60], these works attempt to design handcrafted features with richer representations for VQA. The work [42] is based on 2D discrete cosine transform (DCT) features of video frame-difference statistics, and motion information is further introduced to level up the representation capacity. TLVQM [28] utilizes a combination of spatial high-complexity and temporal low-complexity handcrafted features. However, handcrafted features have gradually been replaced by DNN-based features, due to their sensitivity to distortion types and the superiority DNN features have demonstrated in various computer vision tasks.
However, due to the data-driven characteristics of deep learning, most current VQA models suffer from the lack of sufficient high-quality labeled datasets. There are some attempts to relieve this data-insufficiency challenge, either through patch-level/frame-level augmentation [31,62] or by fine-tuning large computer vision models pretrained on large general knowledge-based datasets [24,32,52]. VSFA [31] extracts frame-wise features with ResNet and uses a gated recurrent unit to model temporal information. LSCT [62] adopts a Transformer to predict video quality based on the frame features extracted by an IQA model. But frame-level augmentation dismisses the effect of temporal concealment, which is widely noticed nowadays. Most fine-tuning works [24,32,52] utilize models pretrained on classification datasets, whose learned information mainly covers content awareness and is not tailored to the task of VQA. Several works [30,54,64] have noticed the insufficiency of content-aware information. CoINVQ [54] leverages distortion-aware and compression-aware representations besides the content-aware representation. Nevertheless, the distortion information is learned from synthetic datasets, and the ability to generalize to in-the-wild distorted data needs to be verified. BVQA [30] introduces motion-aware information learned from an action classification dataset, but dismisses the fact that distortion awareness is crucial to VQA. What is more, these works either utilize a temporal-sampling-and-concatenating strategy to aggregate features or employ temporal average pooling for feature fusion. The final features are not acquired in an adaptive and flexible manner, which prohibits the diverse feature representations from unleashing their full potential. More recent works focus on building spatiotemporal relations. StarVQA [58] builds a Transformer using divided space-time attention. DisCoVQA [57] designs a Transformer-based Spatial-Temporal Distortion Extraction module to tackle temporal quality attention. FastVQA [56] attempts to assess local quality by sampling patches at their raw resolution and covers global quality with contextual relations.

METHOD
To surmount the constraint of limited labeled data and to obtain the quality-related features inherent in diverse modalities, we introduce the Ada-DQA framework for VQA tasks. In Sec. 3.1, we provide an overview of the framework. In Sec. 3.2, we explicate the construction of pretrained models from various aspects. In Sec. 3.3, we elucidate the process of acquiring quality representation using the proposed Quality-aware Acquisition Module (QAM). Finally, in Sec. 3.4, we present the optimization objective during training based on knowledge distillation and regression loss.

Ada-DQA Framework
As shown in Fig. 2, the framework of Ada-DQA can be divided into three components. First, pretrained models, which act as feature extractors, are selected as candidates from the wild. Given an input video V, features are generated by these pretrained models, whose weights are frozen. This significantly reduces the training cost of multiple heavy pretrained models. According to the training paradigm of the pretrained models, these features may contain quality-related information (e.g., content, distortions, and motion). However, since the factors that may affect quality vary across videos, the correlation between the quality of different videos and these features also differs. Second, to adaptively capture desired quality-related features sample-by-sample during training, the proposed QAM is used to produce dynamic weights for feature aggregation. An extra sparsity constraint is attached to the distribution of these gating weights, promoting attention to the more critical and relevant features for quality representation. The video quality feature can then be obtained by a weighted summation. Finally, the learned quality representation is utilized as supplementary supervisory information, along with the supervision of the labeled quality score, to guide the training of a relatively lightweight VQA model in a knowledge distillation manner. During inference, only the optimized VQA model is used, largely reducing the computational cost. More details are provided below.
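The train/inference asymmetry above can be sketched structurally as follows. This is a minimal sketch under our own naming (the class and method names are illustrative, not from the paper): the frozen teachers and the QAM are only consulted to build the supervisory quality feature during training, while inference touches nothing but the lightweight student.

```python
class AdaDQA:
    """Structural sketch: frozen teachers + QAM are used only in training;
    at test time, only the lightweight student model is evaluated."""

    def __init__(self, teachers, qam, student):
        self.teachers = teachers    # frozen pretrained feature extractors
        self.qam = qam              # aggregates teacher features into g
        self.student = student      # lightweight VQA model

    def teacher_feature(self, video):
        # Extract features with every frozen teacher (no gradients flow
        # into them), then aggregate them into one quality feature g.
        feats = [t(video) for t in self.teachers]
        return self.qam(feats)

    def infer(self, video):
        # Inference cost is just one forward pass of the student.
        return self.student(video)
```

In a real implementation the teachers would be heavy networks with frozen weights; here any callables stand in for them.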

Quality-related Pretrained Models
Inspired by the current success of the "pretraining and fine-tuning" paradigm in deep learning [8,20,40], we aim to utilize in-the-wild pretrained models to benefit VQA from diverse aspects of the video, in order to enhance the understanding of video quality and enable personalized treatments to improve it. We contemplate the choice of pretrained models through the lens of multiple factors, as shown in Fig. 3, that may impact the quality of videos:
• Content. Human judgments of visual quality are content-dependent according to previous studies [31,54]. When a video is visually appealing, engaging, and relevant to the viewer's interests (e.g., a cute puppy or beautiful scenery), it can capture their attention and make them more receptive to the video's content. In contrast, if a video is dull, uninteresting, or irrelevant (e.g., a black screen or messy corners), viewers are more likely to rate it low quality. Introducing models pretrained on the task of object recognition (e.g., EfficientNet, Swin Transformer) may therefore benefit VQA.
• Distortion. In addition to content, distortions introduced during video capturing and compression also determine video quality [50]. Thus a model pretrained on a dataset of images or videos (e.g., ImageNet, Kinetics-400) with compression artifacts will have learned to identify the specific patterns and features that are associated with compression artifacts, such as blockiness, blurriness, or pixelation.
• Motion. Unlike the image scenario, motion blur can significantly affect the quality of videos [10,55]. It occurs when there is rapid motion and the camera or objects in the scene are moving too quickly for the camera's shutter to capture.
A pretrained action recognition model (e.g., SlowFast, TimeSformer) may detect specific actions or movements such as running, jumping, or throwing. These can be useful for analyzing the amount of motion, or for looking for specific visual cues that are associated with motion blur, such as streaking around the edges of moving objects.
However, it is important to note that an individual pretrained model may not be able to identify all types of quality-related factors, or may be less accurate in identifying certain types. Some evidence is given in Fig. 4. When encountering different types of distortions, there are obvious differences in the perception ability of the pretrained models. In detail, ConvNeXt-Base (SRCC=0.968) outperforms EfficientNet-b7 (SRCC=0.038) when facing compression, while EfficientNet-b7 performs better for sharpness. Therefore, it is important to use a diverse set of models and combine their results to get a more robust assessment. Thus, we propose to construct a pool of candidate models with large diversity, considering the following aspects:
• Architecture. The efficacy of a network architecture (e.g., CNN, Transformer) hinges upon its capacity to assimilate and convey information. A well-crafted architecture can discern finer details and patterns in the input video, while also influencing the manner in which spatial and temporal information, containing quality-related features, is processed.
• Pretrained pretext task. The type of supervision in the pretext task has an impact on the transferability of the pretrained model to VQA tasks. When the data distributions are similar, a supervised pretext task may lead to superior performance. Conversely, self-supervised pretext tasks, where the model is trained on unlabeled data, may facilitate better generalization when confronting unfamiliar VQA domains.
• Pretrained dataset. Large-scale pretraining datasets can be advantageous to VQA by providing diverse content-, distortion-, and motion-related data. A desired pretraining dataset should include a wide range of categories that closely resemble real-world scenarios, as well as other multi-modal information that can aid in describing video quality. For instance, the WebImageText [40] dataset, which combines text and images, can be helpful in this regard.
Based on the above considerations, in this paper, we select several pretrained models that obtain top performance in their original fields, including (1) EfficientNet-b7 [46] trained on ImageNet-1k.

Figure 3: Videos sampled from the YouTube-UGC dataset [53] and their corresponding labeled MOS, ranging from 1.0 to 5.0. It can be seen that video quality is affected by various aspects, including content, distortions, and motion.

Quality-aware Acquisition Module
As the distribution of content and distortions in videos can be quite complex, a static combination of pretrained models may not always yield optimal performance. In order to adaptively capture the diverse and complementary information from different pretrained models, we propose a Quality-aware Acquisition Module (QAM). It takes the features extracted by various pretrained models as input and produces a consolidated feature as output for the ultimate quality representation. The computational process can be partitioned into two main parts. The first part transforms the extracted features into a uniform feature dimension to enable subsequent aggregation. Structurally, this transformation block comprises two fully-connected layers followed by a normalization layer and a GELU activation layer. The second part generates gating weights to control the aggregation process. The gating network takes the concatenated feature vector as input and outputs a set of gating weights that represent the relative contribution of each pretrained model to the final quality representation. Structurally, this gating network is stacked using a fully-connected layer and a sigmoid layer. The quality representation g can then be obtained by a weighted sum according to the gating weights. Given the features {f_1, f_2, ..., f_M} extracted by the M pretrained models, these procedures can be noted as:

\hat{f}_i = F_T^i(f_i) \in R^d, \quad i = 1, ..., M,
w = F_G([\hat{f}_1; \hat{f}_2; ...; \hat{f}_M]),
g = \sum_{i=1}^{M} w_i \hat{f}_i,

where F_T^i(·) denotes the mapping function of the i-th transformation block, F_G(·) represents the mapping function of the gating network, and d is the aligned feature dimension.
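The two-part computation above can be sketched numerically. This is a minimal NumPy sketch under stated assumptions: the transformation block is taken literally as two linear maps followed by layer normalization and GELU, and the gating network as one linear map plus a sigmoid; the parameter shapes and the tanh GELU approximation are our choices, not the paper's code.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def transform_block(f, W1, b1, W2, b2):
    """Two FC layers, then normalization and GELU: maps f_i to shared dim d."""
    return gelu(layer_norm(W2 @ (W1 @ f + b1) + b2))

def qam(features, block_params, Wg, bg):
    """QAM forward pass: align, gate (FC + sigmoid), weighted sum -> g."""
    aligned = [transform_block(f, *p) for f, p in zip(features, block_params)]
    concat = np.concatenate(aligned)
    gates = 1.0 / (1.0 + np.exp(-(Wg @ concat + bg)))   # one gate per model
    g = sum(w * a for w, a in zip(gates, aligned))       # weighted sum
    return g, gates
```

With M models and aligned dimension d, Wg has shape (M, M*d), so each gate sees the full concatenated feature vector, as described in the text.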
In addition, to emphasize the importance of critical features and enhance the generalization ability, we propose to impose a sparsity constraint as a regularization on the distribution of gating weights. An L1 loss is utilized to penalize non-zero weights, pushing more weights toward zero. This constraint can be written as:

L_sp(w) = ||w||_1 = \sum_{i=1}^{M} |w_i|.

In this way, QAM allows for capturing a broader range of quality-related features, thereby enabling better adaptation to various types of video content, distortions, or movement. The aggregated feature is then sent into a regression head, which is a single fully-connected layer, for quality prediction, resulting in the predicted score \hat{y}_g.

Optimization Objective
In practical scenarios, using these large pretrained models for inference can be computationally expensive. To reduce the computational cost and increase flexibility, we propose to use knowledge distillation [21] to transfer the knowledge from the large and complex models to a lightweight VQA model. In this paper, a Video Swin Transformer-Tiny [35] is selected as the backbone. For an input video V, the quality representation can be obtained by h = H(V), where H(·) represents the mapping function of the lightweight backbone, and h ∈ R^d. Then h is sent into a regression head, which is a single fully-connected layer, for quality prediction, resulting in \hat{y}_h. Note that both \hat{y}_g and \hat{y}_h are supervised by the labeled MOS using a smooth L1 loss. Additionally, we apply a similarity consistency in knowledge distillation between g and h. This allows the VQA model to mimic the robust quality representation generated by the diverse pretrained models, further enhancing its performance. Given the labeled MOS y, the regression loss for the pretrained models can be noted as:

L_reg(y, \hat{y}_g) = SmoothL1(y, \hat{y}_g),

and the regression loss for the lightweight VQA model, L_reg(y, \hat{y}_h), shares the same formulation. An L2 loss is used for the process of knowledge distillation, which can be written as:

L_kd(g, h) = ||g - h||_2^2.

The whole optimization objective can be formulated as:

min L_reg(y, \hat{y}_g) + L_reg(y, \hat{y}_h) + α L_kd(g, h) + β L_sp(w), (5)

where α is a balancing weight for knowledge distillation, and β is a hyper-parameter to balance the level of sparsity.
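The full objective can be computed as a plain sum of the four terms. A minimal numeric sketch follows; the assignment of α to the distillation term and β to the sparsity term follows our reading of the text, and the default values 0.1 and 0.8 are the ones reported in the implementation details.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) loss for scalar predictions."""
    d = abs(pred - target)
    return 0.5 * d**2 / beta if d < beta else d - 0.5 * beta

def ada_dqa_loss(mos, y_teacher, y_student, g, h, gates, alpha=0.1, beta=0.8):
    """Total training objective: two smooth-L1 regression terms,
    an L2 distillation term between g and h, and the L1 sparsity
    term on the gating weights."""
    l_reg_t = smooth_l1(y_teacher, mos)                 # teacher-side regression
    l_reg_s = smooth_l1(y_student, mos)                 # student-side regression
    l_kd = float(np.sum((np.asarray(g) - np.asarray(h)) ** 2))  # L2 distillation
    l_sp = float(np.sum(np.abs(gates)))                 # L1 sparsity on gates
    return l_reg_t + l_reg_s + alpha * l_kd + beta * l_sp
```

Because the teachers are frozen, only the QAM parameters and the student receive gradients from this objective in practice.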

EXPERIMENTS

Dataset and Evaluation Metrics
Dataset. Our method is evaluated on three widely-adopted public NR-VQA datasets: KoNViD-1k [23], LIVE-VQC [45], and YouTube-UGC [53]. Mean opinion scores (MOS) are provided along with the training videos. Specifically, KoNViD-1k contains 1,200 videos fairly filtered from the large public video dataset YFCC-100M. Each video lasts 8 seconds, with a frame rate of 24/25/30 FPS and a resolution of 960 × 540. LIVE-VQC consists of 585 videos with complex authentic distortions, captured by 80 users using 101 different devices, with resolutions ranging from 240P to 1080P. YouTube-UGC has 1,380 UGC videos sampled from YouTube with a duration of 20 seconds and resolutions from 360P to 4K. None of these datasets contains pristine videos, thus only NR methods can be evaluated on them. Following [59], we randomly split each dataset into 80% training videos and 20% testing videos.
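The random 80/20 protocol above is straightforward to reproduce; a small helper (our own naming) makes the split deterministic per seed so repeated runs with different seeds can be averaged later:

```python
import random

def split_dataset(video_ids, train_ratio=0.8, seed=0):
    """Randomly split video ids into train/test sets (80/20 by default)."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)   # seeded shuffle for reproducibility
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]
```

Re-invoking with seeds 0..9 gives the ten random splits used for the averaged results.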
Evaluation Metric. Spearman's Rank-Order Correlation Coefficient (SRCC) and Pearson's Linear Correlation Coefficient (PLCC) are selected as metrics to measure prediction monotonicity and accuracy, respectively. Both range from -1.0 to 1.0, and larger values indicate better results. Besides, the mean of PLCC and SRCC is also reported as a comprehensive criterion.
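Both metrics reduce to Pearson correlation: PLCC on the raw scores, SRCC on their ranks. A self-contained sketch (this simple rank assignment ignores tied scores, which full implementations such as `scipy.stats.spearmanr` handle by averaging):

```python
import numpy as np

def plcc(pred, mos):
    """Pearson's linear correlation coefficient."""
    return float(np.corrcoef(pred, mos)[0, 1])

def srcc(pred, mos):
    """Spearman's rank-order correlation: Pearson correlation of the ranks.
    Note: ties are not averaged here, unlike scipy.stats.spearmanr."""
    def ranks(x):
        r = np.empty(len(x))
        r[np.argsort(x)] = np.arange(len(x))
        return r
    return plcc(ranks(np.asarray(pred)), ranks(np.asarray(mos)))
```

SRCC rewards any monotone relation between predictions and MOS, while PLCC additionally requires the relation to be linear; that is why they are reported as monotonicity and accuracy respectively.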

Implementation Details
Our method is implemented based on PyTorch [39] and MMAction2 [12]. All experiments are conducted on 4 NVIDIA V100 GPUs. For all datasets, we select EfficientNet-b7, ir-CSN-152, CLIP, Swin Transformer Base, TimeSformer, Video Swin Transformer Base, and SlowFast as candidate pretrained models, and choose the Video Swin Transformer Tiny as the lightweight VQA model. Frames are sampled from each video with a fixed temporal step to form a clip input. For frame-wise models (e.g., EfficientNet, CLIP), the feature representation is calculated as the average of the features of all frames. For video clip-based models (e.g., SlowFast, ir-CSN-152), the extracted features are used directly as the video representation. For KoNViD-1k, we sample 16 frames with a frame interval of 2. As videos in LIVE-VQC and YouTube-UGC have longer durations, we sample 64 frames with an interval of 2, and 32 frames with an interval of 8, respectively. Since most augmentations introduce extra interference to the quality of videos (e.g., resize, color jitter) [27], we only use center crop to produce inputs with a resolution of 224 × 224. During optimization, we use the AdamW optimizer with a weight decay of 2e-2. A cosine annealing scheduler with a warmup of 2 epochs is adopted to control the learning rate. The initial learning rate is 1e-3. The balancing weight α is 0.1 by default, and β is set to 0.8. The aligned feature dimension is set to 32. The batch size of the input is set to 1. All models are trained for 60 epochs, and the checkpoint generated by the last iteration is used for evaluation. For inference, we follow a similar procedure as [1] by using 4 × 5 views: a video is uniformly sampled as 4 clips in the temporal dimension, and for each clip, the shorter spatial side is scaled to 256 pixels and 5 crops are taken at the four corners and the center. The final score is computed as the average score over all views. The average result of 10 repeated runs with different random splits is used as the final score for the experiments in Tab. 2.
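The 4 × 5 view protocol can be sketched as follows. This is a simplified illustration assuming frames have already been rescaled so the shorter side is 256 pixels (the rescaling step itself is omitted), with our own function names:

```python
import numpy as np

def five_crops(frame, size=224):
    """Four corner crops plus the center crop of an (H, W, C) frame."""
    H, W = frame.shape[:2]
    ci, cj = (H - size) // 2, (W - size) // 2
    return [
        frame[:size, :size],            # top-left
        frame[:size, W - size:],        # top-right
        frame[H - size:, :size],        # bottom-left
        frame[H - size:, W - size:],    # bottom-right
        frame[ci:ci + size, cj:cj + size],  # center
    ]

def multi_view_score(video_frames, score_fn, n_clips=4, size=224):
    """4 temporal clips x 5 spatial crops; the final score is the mean
    of score_fn applied to each of the 20 cropped clips."""
    clips = np.array_split(np.asarray(video_frames), n_clips)
    scores = [
        score_fn(np.stack([five_crops(f, size)[k] for f in clip]))
        for clip in clips
        for k in range(5)
    ]
    return float(np.mean(scores))
```

Here `score_fn` stands in for the trained lightweight VQA model applied to one cropped clip.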

Comparison with SOTA methods
We report the SRCC and PLCC performance against current SOTA methods on KoNViD-1k, LIVE-VQC, and YouTube-UGC. As shown in Tab. 2, our method achieves new state-of-the-art results on all these datasets without using extra QA training data. Some observations can also be made from these results. Deep learning-based methods outperform traditional hand-crafted methods (e.g., VIIDEO, NIQE) by a large margin. Besides, within deep learning-based methods, VQA methods produce much better performance than IQA methods (e.g., BRISQUE).
Table 2: Comparisons with SOTA methods. The up arrow "↑" means that a larger value indicates better performance. The mark "-" means the results are not reported originally. The mark "*" indicates that the model uses external QA data for training. The best and second-best performances are highlighted and underlined. Ada-DQA outperforms existing SOTA methods by large margins on three datasets. We also report the performance of the Video Swin Tiny without the assistance of diverse pretrained models.

Experimental Analysis and Ablation Studies
Number of pretrained models and effectiveness of the sparsity constraint in QAM. To investigate the impact of the number of pretrained models, we perform experiments reducing the number of models from 7 to 3. As depicted in Tab. 3, without the sparsity constraint in QAM, increasing the number of models does not always lead to better performance. With the help of the sparsity constraint, the model achieves continuous improvement as the number of pretrained models increases. However, adding more models beyond 8 (introducing an extra ViT Base model) does not yield further improvements. This may indicate that the benefit saturates once the pool is sufficiently diverse. As shown in Tab. 4, a solitary pretrained model cannot consistently attain optimal outcomes across all VQA datasets. For instance, while CLIP excels on YouTube-UGC, it falls considerably short on the other two datasets. We posit that this is influenced by the correlation between the pretrained models (e.g., their pretext task, dataset, and architecture) and VQA tasks. Additionally, the results obtained by utilizing singular pretrained models are notably distant from the state-of-the-art. These findings substantiate that a solitary pretrained model is inadequate for diverse application scenarios, and leveraging a variety of pretrained models is imperative.
Different types of distillation loss. Experiments on different knowledge distillation losses are performed on KoNViD-1k, including the L2 loss, the L1 loss, and the Jensen-Shannon (JS) loss [16]. As shown in Tab. 5, the L2 loss delivers the best performance in transferring the features aggregated from multiple teacher models.
Selection of hyper-parameters. We conduct experiments to show how the hyper-parameters α and β in Eq. 5 influence the final results on KoNViD-1k. The results are listed in Tab. 6a and 6b. The best performance is obtained when α is 0.1 and β is 0.8, which are the values used in our experiments.
Contribution of different pretrained models. To investigate the contribution of each model, we analyze the gating weights w generated by the QAM on KoNViD-1k. The statistical average scores for different models are calculated. We count the responses of low-quality (LQ, MOS<3.5) and high-quality (HQ, MOS>3.5) videos in Tab. 7. It can be seen that for LQ videos, models that can provide distortion- and motion-related information (e.g., ir-CSN-152) have larger weights; for HQ videos, models that can provide content-related information (e.g., EfficientNet-b7) own larger weights.
Computational cost. We compare the #Params, #FLOPs, and SRCC of Ada-DQA with some SOTA methods whose models are available: Ada-DQA (29M, 88T, 0.8651), MDTVSFA (24M, 168T, 0.7812), StarVQA (121M, 75T, 0.812), BVQA (24M, 240T, 0.8362). With the help of pretrained models during training, Ada-DQA obtains higher results with a fair cost during inference.

Visualization of the Attention
Some representative videos in KoNViD-1k are selected to show the performance improvement brought by Ada-DQA. Visualizations of the feature attention maps using Grad-CAM [43] are shown in Fig. 5. After introducing the adaptive acquisition strategy, Ada-DQA generates more accurate results, and the attention maps highlight more quality-related regions. For instance, in the first video clip, attention on the vast surface of the ocean (left) is transferred to the sailboat, with some attention kept on the ocean waves (right). Ada-DQA focuses on areas more related to the action (boat sailing) or giving clues about the perceptual quality (edges of waves).

CONCLUSION
To address the issue of insufficient training data in VQA, this paper analyzes the entire spectrum of video distribution diversity that impacts quality and proposes the Ada-DQA framework, which employs a range of diverse pretrained models to improve quality representation. With Ada-DQA, critical and relevant features generated by different frozen pretrained models can be extracted adaptively. Experimental results on three mainstream NR-VQA benchmarks show its effectiveness in the context of limited data. Thorough analysis and ablation studies also validate the necessity of each component. We hope this work inspires future research that leverages pretrained models to aid a wider array of tasks.

Figure 2 :
Figure 2: Overview of our proposed Ada-DQA framework. First, in-the-wild pretrained models are selected as candidates according to diverse aspects. Second, features generated by these frozen pretrained models are aggregated per sample using the QAM adaptively. This allows acquiring quality-related representations. Third, during training, the integrated feature is utilized as supplementary supervision, along with the labeled quality score, to guide the training of a lightweight VQA model. During inference, only the optimized VQA model is used, largely reducing the computational cost.

Figure 4 :
Figure 4: Responses of different pretrained models to synthetic sequences generated by distortions (i.e., compression, sharpness). The correlation of SRCC is computed according to distortion degrees. It is evident that pretrained models may detect certain types of distortions, but their ability to perceive distortion varies across models.

Figure 5 :
Figure 5: Comparison of predictions and attention visualizations.For each video, video frames (left) and attention maps (right) before and after using Ada-DQA are illustrated.

Table 1 :
Comparisons of public VQA and video classification datasets. Note that obtaining reliable quality annotations requires massive subjective annotations per video, which limits the size of VQA datasets.

Dataset | Task | #Videos | Annotations per video
LIVE-VQC [45] | VQA | 585 | 240
YouTube-UGC [53] | VQA | 1,380 | 123
LSVQ [61] | VQA | 39,075 | 35
KoNViD-150k [17] | VQA | 153,841 | 5
Sports-1M [25] | classification | 1,133,158 | - (auto.)
Kinetics-400 [26] | classification | 306,245 | 3-5

Table 3 :
Experimental analysis on different numbers of selected pretrained models with or without the usage of sparsity constraint in QAM.SRCC results are reported.The best result under the setting of with or without is bolded.

Table 4 :
Experiments using a single pretrained model for knowledge distillation.The SRCC results in three datasets are reported.The best results in different datasets are bolded.

Table 5 :
Experiments on different distillation losses.

Table 6 :
Selection of the hyper-parameters of  and .

Table 7 :
Contributions of different pretrained models by .