Deep Learning Based Multimodal with Two-phase Training Strategy for Daily Life Video Classification

In this paper, we present a deep learning based multimodal system for classifying daily life videos. To train the system, we propose a two-phase training strategy. In the first training phase (Phase I), we extract the audio and visual (image) data from the original video and train independent deep learning models on each modality. After training, we obtain audio embeddings and visual embeddings by extracting feature maps from these pre-trained models. In the second training phase (Phase II), we train a fusion layer to combine the audio and visual embeddings and a dense layer to classify the combined embedding into the target daily scenes. Our extensive experiments on the DCASE (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) 2021 Task 1B Development benchmark dataset achieve best classification accuracies of 80.5%, 91.8%, and 95.3% with only audio data, only visual data, and both audio and visual data, respectively. The highest accuracy of 95.3% is an improvement of 17.9% over the DCASE baseline and is very competitive with state-of-the-art systems.


I. INTRODUCTION
Recently, applying deep learning techniques to video analysis has achieved many successes and opened up a variety of real-life applications. Indeed, a wide range of deep learning based systems have been proposed for human-relevant tasks such as emotion recognition [1], lip-reading [2], and human activity detection [3], [4], [5]. A dataset of daily-scene videos [6], proposed by the DCASE challenge [7] for the new task of audio-visual scene classification (AVSC), has since been published and has attracted attention from the video research community. Similar to the systems proposed for analyzing videos of human activities [5], [1], the state-of-the-art systems for the AVSC task also leverage deep learning based models and perform joint audio-visual analysis. For instance, the systems in [8], [9] use convolutional models to extract audio embeddings from audio data and leverage pre-trained deep learning models to extract visual embeddings from visual data. The audio embeddings and the visual embeddings are then concatenated and fed into dense layers for classification. To further enhance the audio/visual embeddings, the authors in [10] proposed a graph-based model to learn the audio/visual feature maps extracted from middle [...] The graph-based audio/visual embeddings are finally fused with the audio/visual embeddings before going to dense layers for classification. Meanwhile, the authors in [11] improved the audio/visual embeddings by proposing a contrastive event-object alignment layer. This layer, which is based on the contrastive learning technique, helps to explore the relationship between audio and visual information by learning the relative distances of event-object pairs occurring in both audio and visual scenes.
In this paper, we also leverage deep learning techniques and propose a deep learning based multimodal system for the AVSC task. Our main contributions are: (a) We propose a mechanism, which combines multi-model fusion and a two-phase training strategy, to generate an audio-visual embedding representing one video input. (b) We evaluate our proposed system on the DCASE 2021 Task 1B Development set, which is the benchmark and largest dataset for the task of audio-visual daily scene classification. The results show that our proposed system is very competitive with state-of-the-art systems.

II. PROPOSED DEEP LEARNING BASED MULTIMODAL FOR AVSC TASK
As Figure 1 shows, the high-level architecture of our proposed deep learning based multimodal system for audio-visual scene classification (AVSC) comprises two individual branches, the audio branch and the visual branch, which process the audio and the visual data extracted from the input video, respectively. Regarding the audio branch, the audio is first transformed [...] The audio and image embeddings are then combined by a Fusion Layer, denoted by the function $f$, to generate an audio-visual embedding. The audio-visual embedding is finally classified into the target categories by a Dense Layer. From the results shown in recent papers [8], [9], [11], we can see that the visual data contributes to the AVSC performance more significantly than the audio data. If we train our proposed AVSC system with an end-to-end training process, it may overfit to the visual branch and reduce the role of the audio branch. We therefore propose a two-phase training strategy. While the first training phase (Phase I) is used to train and obtain the individual Audio and Visual Backbones, the Fusion Layer and the Dense Layer are trained in the second phase (Phase II).

A. Phase I: Train deep learning models on individual audio or visual data to achieve audio and visual backbones
In Phase I, we aim to obtain high-performance individual Audio and Visual Backbones, as shown in Figure 1. To this end, we treat the AVSC task as a combination of two independent tasks: Acoustic Scene Classification (ASC) and Visual Scene Classification (VSC). For the ASC task, we leverage multiple input spectrograms, which has proven effective for improving ASC performance [12], [13].
In particular, we propose the deep learning based systems shown in Figure 2 to train on the audio data. The audio is first re-sampled to 32,000 Hz and then transformed into three types of spectrogram, Mel spectrogram (MEL), Gammatone (GAM), and Constant-Q Transform (CQT), in which both temporal and spectral features are represented. Using two channels and setting the filter number, the window size, and the hop size to 128, 80 ms, and 14 ms, respectively, we generate MEL, GAM, and CQT spectrograms of 128×309×2 from one 10-second audio segment. Delta and delta-delta are then applied to these spectrograms to obtain six-channel spectrograms of 128×305×6. Next, the Mixup [14] data augmentation method is applied to the six-channel spectrograms before they are fed into a residual-inception based network for classification. This network is separated into two main parts: a Residual-Inception block and a Dense block. The Residual-Inception block is the CNN-based backbone of the residual-inception deep neural network architecture, which is reused from our previous work [15]. Meanwhile, [...] with RGB format.
Then, the Mixup [14] data augmentation method is applied to the scaled images before they are fed into the classification models. To construct the classification models, we are inspired by [16], which shows that a combination of Inception based and ConvNet based models is effective for improving the performance of VSC tasks. We therefore select the InceptionV3 and ConvNeXtTiny networks from the Keras library [17] for the VSC task in this paper. As both networks were trained on ImageNet [18] in advance, we reuse their trainable parameters from the first layer to the global pooling layer. We then connect these pre-trained layers to two dense layers, as shown in the lower part of Figure 3.
Given the individual pre-trained models of Aud-MEL, Aud-GAM, and Aud-CQT for audio input and Vis-CONV and Vis-INC for visual input, we remove the header layers of these models (i.e., either the softmax layer or the final dense layer) to form the Audio and Visual Backbones shown in Figure 1. In other words, when we feed audio or visual data into the pre-trained models of Aud-MEL, Aud-GAM, Aud-CQT, Vis-CONV, and Vis-INC, the feature maps extracted at the first fully connected layer FC(1024) or at the second fully connected layer FC(10) are considered the audio and visual embeddings, as shown in Figure 1.
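The Mixup augmentation applied in Phase I can be illustrated with a minimal, framework-free sketch. The function name `mixup_pair`, the toy inputs, and the Beta parameter `alpha=0.4` are assumptions made for illustration only; the paper does not state the value it uses.

```python
import random

def mixup_pair(x1, y1, x2, y2, alpha=0.4, rng=random):
    """Blend two samples and their label vectors with one Beta-distributed weight.

    alpha=0.4 is an assumed value, not one reported in the paper.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Two flattened toy inputs with one-hot labels over three classes.
x_a, y_a = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 0.0]
x_b, y_b = [0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0]
x_mix, y_mix, lam = mixup_pair(x_a, y_a, x_b, y_b, rng=random.Random(0))
```

The mixed label `y_mix` still sums to one but is no longer one-hot, which is why a loss that accepts soft targets is needed downstream.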

B. Phase II: Train the Fusion Layer and the Dense Layer
In Phase II, we aim to train the Fusion Layer and the Dense Layer shown in the lower part of Figure 1. The Fusion Layer combines the audio and visual embeddings, which are extracted from the Audio and Visual Backbones, to generate an audio-visual embedding representing one video input. In this paper, we propose three combination methods for the Fusion Layer. Additionally, we have two types of audio/visual embeddings: the first type is extracted from the first fully connected layer FC(1024) of the pre-trained deep learning based models Aud-MEL, Aud-GAM, Aud-CQT, Vis-CONV, and Vis-INC, and the second type is extracted from the second fully connected layer FC(10) of these models. As a result, we evaluate six types of Fusion Layer in total, referred to as $f_1$, $f_2$, $f_3$, $f_4$, $f_5$, and $f_6$. While $f_1$, $f_2$, $f_3$ are the three combinations for the first type of audio/visual embeddings, $f_4$, $f_5$, $f_6$ are for the second type. Let $\{\mathrm{ae}_g, \mathrm{ae}_m, \mathrm{ae}_c, \mathrm{ve}_i, \mathrm{ve}_c\} \in \mathbb{R}^{1024}$ be the first type of audio and visual embeddings, extracted from the first fully connected layer FC(1024) of the Audio and Visual Backbones. The fusion functions $f_1$, $f_2$, $f_3$ representing the Fusion Layer are described by

$$f_1 = \mathrm{ae}_g \odot w_1 + \mathrm{ae}_m \odot w_2 + \mathrm{ae}_c \odot w_3 + \mathrm{ve}_i \odot w_4 + \mathrm{ve}_c \odot w_5 + b, \tag{1}$$

$$f_2 = (\mathrm{ae}_g \odot w_1 + \mathrm{ae}_m \odot w_2 + \mathrm{ae}_c \odot w_3) \odot w_a + (\mathrm{ve}_i \odot w_4 + \mathrm{ve}_c \odot w_5) \odot w_v + b, \tag{2}$$

$$f_3 = \mathrm{concat}\left[(\mathrm{ae}_g \odot w_1 + \mathrm{ae}_m \odot w_2 + \mathrm{ae}_c \odot w_3),\ (\mathrm{ve}_i \odot w_4 + \mathrm{ve}_c \odot w_5)\right], \tag{3}$$

where $\odot$ denotes the element-wise product and $\{w_1, w_2, w_3, w_4, w_5, w_a, w_v, b\} \in \mathbb{R}^{1024}$ are trainable parameters.
Regarding the fusion function $f_1$, we assume that the individual audio/visual embeddings have a linear relation across each dimension. Therefore, we apply the element-wise product between each of the trainable vectors $w_1$, $w_2$, $w_3$, $w_4$, $w_5$ and the corresponding individual embedding before adding a bias $b$. In this way, we establish a linear function that learns the relation of the audio/visual embeddings across the 1024 dimensions. Meanwhile, in the fusion function $f_2$, we first apply the linear combination to the audio embeddings and to the visual embeddings independently. We then apply a second linear combination to the resulting audio and visual embeddings using the trainable vectors $w_a$, $w_v$, and $b$. For the fusion function $f_3$, we also first apply the linear combination to the audio embeddings and to the visual embeddings independently. We then concatenate the two resulting audio and visual embeddings to form one audio-visual embedding. The fusion functions $f_4$, $f_5$, $f_6$ share the same equations as $f_1$, $f_2$, $f_3$, respectively, with the second type of audio/visual input embeddings $\{\mathrm{ae}_g, \mathrm{ae}_m, \mathrm{ae}_c, \mathrm{ve}_i, \mathrm{ve}_c\} \in \mathbb{R}^{10}$ and the trainable parameters $\{w_1, w_2, w_3, w_4, w_5, w_a, w_v, b\} \in \mathbb{R}^{10}$.
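The three fusion functions can be sketched without any framework as follows. This is an illustrative sketch rather than the implementation used in the paper: the helper names are hypothetical, the toy embeddings have 2 dimensions instead of 1024 or 10, and the weights are fixed constants, whereas in the system they are trainable.

```python
def hadamard(u, v):
    """Element-wise product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]

def vec_sum(*vectors):
    """Element-wise sum of any number of equal-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def weighted_sum(embeddings, weights):
    """Sum of the element-wise products e_k * w_k over all embeddings."""
    return vec_sum(*(hadamard(e, w) for e, w in zip(embeddings, weights)))

def fuse_f1(audio, visual, w, b):
    """f1: one linear combination over all five embeddings, plus bias."""
    return vec_sum(weighted_sum(audio + visual, w), b)

def fuse_f2(audio, visual, w, w_a, w_v, b):
    """f2: per-modality combinations, re-weighted by w_a and w_v, plus bias."""
    a = weighted_sum(audio, w[:3])
    v = weighted_sum(visual, w[3:])
    return vec_sum(hadamard(a, w_a), hadamard(v, w_v), b)

def fuse_f3(audio, visual, w):
    """f3: per-modality combinations, concatenated (output is twice as long)."""
    return weighted_sum(audio, w[:3]) + weighted_sum(visual, w[3:])

# Toy embeddings in the order [ae_g, ae_m, ae_c] and [ve_i, ve_c].
audio = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
visual = [[7.0, 8.0], [9.0, 10.0]]
w = [[1.0, 1.0] for _ in range(5)]          # w_1 .. w_5
w_a, w_v, b = [2.0, 2.0], [0.5, 0.5], [0.5, 0.5]

out_f1 = fuse_f1(audio, visual, w, b)
out_f2 = fuse_f2(audio, visual, w, w_a, w_v, b)
out_f3 = fuse_f3(audio, visual, w)
```

Note that $f_1$ and $f_2$ keep the embedding dimension, while $f_3$ doubles it, so the following dense layer must accept a wider input in that case.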
The output of the Fusion Layer, known as the audio-visual embedding, is finally classified by a Dense Layer, which consists of a fully connected layer FC(10) and a Softmax layer, as shown in Figure 1. Notably, as we freeze the Audio and Visual Backbones during the Phase II training process, the model is forced to learn only the Fusion Layer and the Dense Layer.
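As a rough illustration of the classification head, the following sketch applies a fully connected layer followed by a softmax to a fused embedding. The weights, the 3-dimensional input, and the 4 output classes are toy assumptions; the actual head uses FC(10) with trained parameters.

```python
import math

def dense(x, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

fused = [0.2, -1.0, 0.5]                       # toy audio-visual embedding
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [1.0, 1.0, 1.0]]                    # toy weights for 4 classes
bias = [0.0, 0.0, 0.0, 0.0]

probs = softmax(dense(fused, weights, bias))
predicted = max(range(len(probs)), key=probs.__getitem__)
```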

A. Implementation of deep learning models
We use the TensorFlow framework to implement all deep learning based models in this paper. As Mixup [14] data augmentation is applied to the audio spectrograms, image frames, and audio/visual embeddings to enforce the classifiers, the labels of the augmented data are no longer one-hot. We therefore use the Kullback-Leibler (KL) divergence loss to train the back-end classification models:

$$\mathrm{Loss}_{KL}(\Theta) = \sum_{n=1}^{N} y_n \log\left(\frac{y_n}{\hat{y}_n}\right) + \frac{\lambda}{2}\|\Theta\|_2^2,$$

where $N$ is the number of training samples, $\Theta$ denotes the trainable network parameters, and $\lambda$ denotes the $\ell_2$-norm regularization coefficient. $y_n$ and $\hat{y}_n$ denote the ground truth and the network output, respectively. All training processes in this paper are run on two GeForce RTX 2080 Titan GPUs using the Adam method [19] for optimization.
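A minimal sketch of a KL-divergence loss with $\ell_2$ regularization on soft labels is given below. The factor of one half on the regularization term, the default `lam`, and the clipping constant `eps` are assumptions made for illustration, and the parameters are passed as a flat list for simplicity.

```python
import math

def kl_loss(y_true, y_pred, params, lam=1e-4, eps=1e-12):
    """Sum of KL(y_n || y_hat_n) over samples plus (lam / 2) * ||params||_2^2.

    The lam / 2 scaling and the default values are illustrative assumptions.
    """
    kl = 0.0
    for y, y_hat in zip(y_true, y_pred):
        for p, q in zip(y, y_hat):
            if p > 0.0:  # terms with p == 0 contribute nothing
                kl += p * math.log(p / max(q, eps))
    l2 = sum(w * w for w in params)
    return kl + 0.5 * lam * l2

# Soft (Mixup-style) target and a prediction over three classes.
soft_targets = [[0.7, 0.3, 0.0]]
predictions = [[0.6, 0.3, 0.1]]
loss = kl_loss(soft_targets, predictions, params=[0.1, -0.2, 0.3])
```

Unlike a loss that only reads the index of the correct class, this form uses every non-zero entry of the target distribution, which is what makes it suitable for mixed labels.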

B. Datasets and evaluation metric
This dataset is referred to as DCASE Task 1B Development, which was proposed for the DCASE 2021 challenge [7]. The dataset is slightly imbalanced and contains both acoustic and visual information, recorded in 12 large European cities: Amsterdam, Barcelona, Helsinki, Lisbon, London, Lyon, Madrid, Milan, Prague, Paris, Stockholm, and Vienna. It consists of 10 scene classes: airport, shopping mall (indoor), metro station (underground), pedestrian street, public square, street (traffic), travelling by tram, travelling by bus, travelling by metro (underground), and urban park, which can be grouped into three meta-classes: indoor, outdoor, and transportation. The dataset was recorded by four recording devices simultaneously with the same settings of 48,000 Hz sampling rate and 24-bit resolution. Following the DCASE 2021 Task 1B challenge [7], we split this dataset into a training (Train.) subset for the training process and an evaluation (Eval.) subset for inference. As the evaluation metric, we use Accuracy (Acc. %), which is commonly applied in classification tasks [7] and is also the metric proposed for the DCASE Task 1B challenge.
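The accuracy metric, together with a per-class breakdown of the kind read from a confusion matrix, can be computed as in the sketch below. The function names and the toy labels are illustrative assumptions and are not taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Overall accuracy in percent: correct predictions over all predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

def per_class_accuracy(y_true, y_pred):
    """Accuracy in percent for each ground-truth class."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {c: 100.0 * hits[c] / totals[c] for c in totals}

# Toy ground truth and predictions.
truth = ["airport", "tram", "park", "tram"]
preds = ["airport", "tram", "tram", "tram"]

overall = accuracy(truth, preds)
by_class = per_class_accuracy(truth, preds)
```

Because the dataset is slightly imbalanced, the per-class view helps to show whether a high overall score hides a weak class.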

C. Experimental results and discussion
We first evaluate the performance of our proposed systems with the different fusion methods described in Section II-B. As the results in Table I show, the fusion methods $f_4$, $f_5$, $f_6$ outperform $f_1$, $f_2$, $f_3$, respectively. In other words, fusing the audio/visual embeddings extracted from the second fully connected layer FC(10) is more effective than fusing those extracted from the first fully connected layer FC(1024). We also see that the best accuracy of 95.3% is achieved by the $f_4$ method, which is a linear combination of all five audio/visual embeddings.
We then compare the performance of the proposed systems using the $f_4$ fusion with only audio data, with only visual data, and with both audio and visual data. As Figure 4 shows, the proposed AVSC system using only visual data (91.8%) outperforms the system using only audio data (80.5%) in almost all categories, except 'Tram' and 'Park'. Using both audio and visual data improves the performance in all categories: most categories reach an accuracy above 90%, except 'Airport' with 88.0%. We compare our best systems (i.e., using the $f_4$ fusion) with the state-of-the-art systems. As Table II shows, our proposed systems using only audio data or only visual data outperform the state-of-the-art systems, with accuracies of 80.5% and 91.8%, respectively. Our proposed system using both audio and visual data achieves the second-best result, after the system from [20]. However, the top-1 system [20] is an intensive ensemble of nine large deep learning models (EfficientNet, ResNeSt, and RegNet for directly training on audio data; ResNet-6.4F, ResNeSt-50d, and HRNet-W18 for directly training on visual data; CLIP based networks of ResNet-101, ResNet-50x4, and ViT-B32 for extracting visual embeddings), which requires nine individual processes as well as a post-processing method for one inference. Meanwhile, our proposed system combines five lighter models (three residual-inception based models for audio data with 36 M trainable parameters, and InceptionV3 and ConvNeXtTiny based models for visual data with 69.4 M trainable parameters) and provides an end-to-end inference process.

IV. CONCLUSION
We have proposed a deep learning based multimodal system with a two-phase training strategy for classifying daily life videos. Our proposed model, which makes use of a multi-spectrogram approach for audio data (i.e., MEL, GAM, and CQT) and multiple networks for visual data (InceptionV3 and ConvNeXtTiny), achieves the best performance of 95.3% on the benchmark dataset of DCASE 2021 Task 1B. The experimental results show that our proposed AVSC system is very competitive with the state-of-the-art systems and has potential for real-life applications.

Fig. 1. The high-level architecture of the proposed deep learning based multimodal system for the AVSC task

Fig. 2. Deep learning based models with audio data

Fig. 3. Deep learning based models with visual data. The pre-trained layers and the two dense layers form the InceptionV3 and ConvNeXtTiny based classification models for the VSC task in this paper. The InceptionV3 and ConvNeXtTiny based classifiers, which are fine-tuned on the downstream VSC task, are referred to as the Vis-INC and Vis-CONV models, respectively.

Fig. 4. The confusion matrix results of the proposed systems using the $f_4$ fusion method with only audio data (a), with only visual data (b), and with both audio and visual data (c)

TABLE I
THE PERFORMANCE OF THE PROPOSED SYSTEM (ACC. %) WITH DIFFERENT TYPES OF FUSION FUNCTION: $f_1$, $f_2$, $f_3$, $f_4$, $f_5$, $f_6$.

TABLE II
COMPARISON OF OUR PROPOSED AVSC SYSTEM WITH THE STATE-OF-THE-ART SYSTEMS (ACC. %) USING ONLY AUDIO DATA, ONLY VISUAL DATA, AND BOTH AUDIO & VISUAL DATA