Auditory Attention Decoding with Task-Related Multi-View Contrastive Learning

The human brain can easily focus on one speaker and suppress others in scenarios such as a cocktail party. Recently, researchers found that auditory attention can be decoded from electroencephalogram (EEG) data. However, most existing deep learning methods struggle to use prior knowledge about the different views (namely, that the attended speech and the EEG are task-related views) and therefore extract unsatisfactory representations. Inspired by Broadbent's filter model, we decode auditory attention in a multi-view paradigm and extract the most relevant and important information by utilizing the missing view. Specifically, we propose an auditory attention decoding (AAD) method based on a multi-view VAE with task-related multi-view contrastive (TMC) learning. Employing TMC learning in a multi-view VAE makes it possible to use the missing view to incorporate prior knowledge about the different views into the fusion of representations, and to extract an approximate task-related representation. We evaluate our method on two popular AAD datasets and demonstrate its superiority by comparing it with state-of-the-art methods.


INTRODUCTION
In the acoustic environments people face every day, the brain can focus auditory attention on a particular stimulus while filtering out other stimuli. For example, people can focus on the speaker they are interested in during a cocktail party (see Figure 1). This remarkable phenomenon, the cocktail party effect [7], has attracted long-standing research interest [9,20,27]. The mechanism behind it is often called selective auditory attention [10,20,28].
Recently, with the development of brain-computer interfaces (BCIs), researchers have become interested in decoding auditory attention from brain activity, which is known as auditory attention decoding (AAD). Auditory attention can be decoded from several brain signals, such as electrocorticography (ECoG) [28], magnetoencephalography (MEG) [17], or electroencephalography (EEG) [30]. Because EEG is economical and non-invasive, EEG-based methods have the most promising application potential and may benefit hearing aids and active noise cancellation (ANC) headphones in the future.
In the dual-speaker scenario, the most popular experimental form in recent AAD research, the subject hears two different speeches and attends to one of them, either actively or passively (see Figure 2 for an example). The task of an AAD method is to infer the subject's attended speech from the EEG and the two speeches. Most existing AAD methods extract the representation using all the information in the data [6,26,42]. However, the prior knowledge of the AAD task is that the attended speech and the EEG are two related views, both of which contain information about auditory attention, and this relationship has been ignored by existing deep learning AAD methods. According to Broadbent's filter model [3,4], the attentional processing system in the human brain has an early selection process that uses a selective filter to keep unrelated information out of higher-level processing. With this filtering mechanism, the human brain has a remarkable capability to pay attention to a particular sound source and ignore the surrounding noise, such as intentionally focusing on the attended speaker at a cocktail party or holding a conversation with friends on a noisy train. Therefore, we argue that the representation should be extracted from the task-related part of the data.
Inspired by Broadbent's filter model, we develop our method in a multi-view structure and filter out the unrelated information when fusing the representations. Specifically, we treat the EEG and the speeches as different views of the data and decode auditory attention with a multi-view variational autoencoder (VAE) [24,32,36,40]. The multi-view VAE first transforms the different views into single-view representations and then fuses them into a common representation space. Several decoders are then trained to map the representation from the common space back to the different views. After training, the common space effectively encodes the distribution of the multi-view data. As a result, the multi-view VAE can leverage the underlying relationship between the views and improve the performance of AAD methods.
When applying the multi-view VAE to the AAD task, a critical problem is how to effectively utilize the prior knowledge about the different views of the data. The information about selective attention is contained in the attended speech and the EEG, which we call the task-related views (their representations are the task-related representations). It is therefore important to retain more information from the task-related views and to minimize the interference of the task-unrelated view (the unattended speech) during the fusion of the single-view representations. Since multi-view VAEs support learning a representation of data with a missing view, a straightforward idea is to fuse the task-related representation from the attended speech and EEG views only. Unfortunately, the multi-view VAE needs the fused common representation to carry out the AAD task, while fusing a task-related common representation requires the result of AAD. This dilemma is a major obstacle to applying the task-related representation in the AAD task. To solve this problem, we propose task-related multi-view contrastive (TMC) learning to extract an approximate task-related representation.
TMC learning consists of two main ideas: 1) utilizing the missing-view support of the multi-view VAE to fuse a task-related representation, and 2) approximating the task-related representation using contrastive learning. Specifically, in the training stage we first fuse the task-related representation from the attended speech and the EEG according to the label. Since the label is unavailable in testing, we then fuse a complete representation, which depends on all the speeches and the EEG, and align the complete representation with the task-related one using contrastive learning. In this way, TMC can approximate the task-related representation with the complete one. Since fusing the complete representation does not need label information, we obtain an approximate solution to the non-trivial problem above.
Contributions. Our main contributions are:
1) By applying the multi-view VAE, we construct our method to exploit the information in the multi-view data and learn a more comprehensive representation (see Figure 3 for an overview of the framework).
2) We propose task-related multi-view contrastive (TMC) learning, which uses the prior knowledge about the different views of the data to effectively learn an approximate task-related representation.
3) The experiments show that our method is comparable to or much better than the state-of-the-art methods on two popular AAD datasets.

The ideas of forward encoding and backward decoding methods are consistent with neural encoding and decoding, that is, predicting the brain activities (EEG) given stimuli, or reconstructing the stimuli given brain activities [19]. In the forward encoding methods, an encoder is trained to predict the EEG from the different speech signals [1,39], and the prediction is correlated with the real EEG to decide the attended speech. For the backward decoding methods [11,14,15,23], a stimulus reconstruction pattern is widely accepted [2,29,30]: the envelope of the attended speech is reconstructed from the EEG and then compared with all the speeches using the Pearson correlation coefficient. The traditional methods are mostly based on linear models, which fail to capture the nonlinear characteristics of the human auditory system [43].
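The correlation-based decision used in stimulus reconstruction can be sketched in a few lines. This is only an illustration under stated assumptions: the envelopes are toy lists, and `pearson` and `decode_attention` are hypothetical helper names, not code from any cited method.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # assumption: treat a constant signal as uncorrelated
    return cov / (sx * sy)

def decode_attention(reconstructed_envelope, candidate_envelopes):
    """Return the index of the best-correlated candidate and all scores."""
    scores = [pearson(reconstructed_envelope, c) for c in candidate_envelopes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy data (hypothetical): the reconstruction follows speaker 0 more closely.
recon = [0.1, 0.4, 0.9, 0.5, 0.2, 0.7]
speaker_0 = [0.0, 0.5, 1.0, 0.4, 0.1, 0.8]
speaker_1 = [0.9, 0.2, 0.1, 0.6, 0.8, 0.3]
attended, scores = decode_attention(recon, [speaker_0, speaker_1])
```

In a real pipeline the reconstructed envelope would come from a decoder trained on EEG; here it is hard-coded to keep the sketch self-contained.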

Deep Learning Methods.
With the application of deep learning to brain-computer interfaces, many works have decoded auditory attention using deep neural networks. Although the stimulus reconstruction pattern can be easily transferred to deep learning AAD methods [16], most works choose a more direct, end-to-end approach, i.e., classifying the speeches directly [6,8,26,34,38]. For example, in [6] and [34], the authors build different attention mechanisms and apply them to the channel, band, or temporal dimension of the EEG to extract an effective representation for AAD. However, these methods do not incorporate the prior knowledge about the different views when extracting the representation. As mentioned before, the task-related information is contained in the attended speech and the EEG, and ignoring this prior knowledge hinders the performance of AAD methods. Different from those methods, we propose a multi-view auditory attention decoding method based on a multi-view VAE, and use TMC learning to incorporate the prior knowledge into the fusion of representations and learn an approximate task-related representation.

Multi-View VAEs
Recently, there has been growing research interest in using VAEs as self-supervised multi-view generative models, which has produced important progress [25,32,40]. The greatest advantage of multi-view VAEs is that they can infer the complete representation given incomplete views of the data. The fundamental difference between these works lies in how they construct the complete representation space, i.e., the complete posterior. In MVAE [40], the complete posterior is formulated as a product of single-view posteriors (Product-of-Experts, PoE [21]). In MMVAE [32], the complete posterior is formulated as a mixture of single-view posteriors (Mixture-of-Experts, MoE). Several later works improve the performance of MVAE and MMVAE [13,25,33,35,41]. To combine the advantages of MVAE and MMVAE, MoPoE-VAE [36] uses the Mixture-of-Products-of-Experts (MoPoE), which first applies PoE to subsets of the complete views and then forms the complete posterior using MoE over these subsets.
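To make the difference between PoE and MoE fusion concrete, the following minimal sketch fuses one-dimensional Gaussian experts. It assumes Gaussian single-view posteriors with uniform mixture weights and leaves out the prior expert that some PoE formulations include; the function names and numbers are hypothetical.

```python
import random

def poe_gaussian(means, variances):
    """Product of Gaussian experts: precision-weighted mean, summed precisions."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return mean, 1.0 / total_precision

def moe_sample(means, variances, rng):
    """Uniform mixture of Gaussian experts: pick one expert, then sample from it."""
    k = rng.randrange(len(means))
    return rng.gauss(means[k], variances[k] ** 0.5)

# Hypothetical posteriors for one latent dimension (EEG, speech 1, speech 2).
means = [0.0, 2.0, 4.0]
variances = [1.0, 1.0, 0.5]
poe_mean, poe_var = poe_gaussian(means, variances)   # sharper than any single expert
rng = random.Random(0)
moe_draws = [moe_sample(means, variances, rng) for _ in range(5)]
```

The product concentrates mass where all experts agree, whereas the mixture keeps each expert's mode, which is the trade-off that MoPoE aims to balance.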

METHODOLOGY
The primary purpose of AAD is to identify the attended speaker among multiple speakers. We design our method based on the prior knowledge that the information about the attended speaker is contained in the EEG and the attended speech. In contrast, the unattended speech is unrelated to the goal of decoding auditory attention. Based on this idea, the main challenges are: 1) how to construct the representation space, and 2) how to reduce the interference of unrelated information while retaining the task-related information during training. We describe how our method addresses these two challenges in the following subsections.

Decoding Auditory Attention with Multi-View VAE
We construct our representation space using a multi-view VAE. Specifically, we consider the EEG and the speech stimuli (both the attended and the unattended ones) as different views of the data that may contain information about the subject's auditory attention, and use the multi-view VAE to fuse the different views into a common representation space. An overview of our architecture is shown in Figure 3.
Given the raw EEG and speech stimuli, we extract different features from them in the preprocessing stage. We extract the speech spectrogram from the low-pass-filtered raw speech stimuli using the short-time Fourier transform (STFT). For the EEG signal, we extract different frequency bands to construct a more comprehensive feature. We consider five EEG bands: δ (1-4 Hz), θ (4-8 Hz), α (8-12 Hz), β (12-30 Hz), and low γ (30-50 Hz) [5]. The detailed implementation of the data preprocessing can be found in Section 4.
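The samples from the complete posterior are fed into three decoders to reconstruct the original inputs, and the multi-view VAE is trained by maximizing the evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z \mid X)}\left[\log p_\theta(X \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid X) \,\|\, p_\theta(z)\right),$$

where $X = \{e, s_1, s_2\}$ denotes the EEG and the two speech views, $q_\phi(z \mid X)$ is the complete posterior, and $D_{\mathrm{KL}}(\cdot)$ is the KL-divergence, which measures the statistical distance between the complete posterior and the isotropic Gaussian $p_\theta(z)$.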
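For a diagonal Gaussian posterior and a standard normal prior, the KL term has a well-known closed form. The sketch below computes it and combines it with per-view reconstruction errors. It assumes the reconstruction errors stand in for the negative log-likelihoods of the three decoders (up to constants); the function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def negative_elbo(recon_errors, mu, logvar):
    """Per-view reconstruction errors plus the KL term; this quantity is minimized."""
    return sum(recon_errors) + kl_to_standard_normal(mu, logvar)

# Hypothetical values: three reconstruction errors (EEG, speech 1, speech 2)
# and a 2-dimensional posterior that already matches the prior.
loss = negative_elbo([1.0, 2.0, 3.0], mu=[0.0, 0.0], logvar=[0.0, 0.0])
```

Minimizing this quantity is equivalent to maximizing the ELBO under the stated assumption about the decoders.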
The multi-view VAE can extract a powerful representation in a self-supervised manner. To decode auditory attention in the representation space constructed by the multi-view VAE, we apply a simple classifier in the complete representation space to separate samples with different attended speeches. The classifier is a 3-layer MLP that maps the samples from the complete posterior to a one-hot vector indicating the attended speech among the input speeches.
We minimize the binary cross entropy (BCE) loss of the classifier during training. Let $C(z)$ denote the classification result and $y$ the corresponding label; we compute the classification loss as:

$$\mathcal{L}_{\mathrm{cls}} = -\left[y \log C(z) + (1 - y) \log\left(1 - C(z)\right)\right].$$
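As a small numerical illustration of the BCE loss, the sketch below evaluates it on a two-element output, one probability per candidate speech. Averaging over the elements and clamping the probabilities are assumptions of this sketch, not details taken from the paper.

```python
import math

def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)

# Hypothetical classifier outputs for one window whose label says speech 1 is attended.
loss_good = bce_loss([0.9, 0.1], [1.0, 0.0])   # confident and correct
loss_bad = bce_loss([0.1, 0.9], [1.0, 0.0])    # confident and wrong
```

A confident correct prediction gives a small loss, while a confident wrong prediction is penalized heavily.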

Approximate the Task-Related Representation Using TMC
Although the multi-view VAE can extract a powerful representation from the complete-view data, it has an inherent drawback in representation fusion. As mentioned before, even though the unattended speech carries task-unrelated information, we have no choice but to include it in the complete representation. Removing the task-unrelated unattended view requires label information that is not available in the testing stage, so it is impossible to obtain the task-related representation (the posterior given the EEG and the attended speech) without the ground truth.
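The dependence on the label can be made explicit with a tiny sketch of view selection. The function name and the string placeholders are hypothetical; the point is only that the task-related subset cannot be formed without knowing which speech is attended.

```python
def select_views(eeg, speeches, attended_index=None):
    """Views to fuse: all of them, or only EEG + attended speech when a label exists."""
    if attended_index is None:
        # Complete view: the only option at test time, when no label is available.
        return [eeg] + list(speeches)
    # Task-related view: requires the ground-truth attended speech.
    return [eeg, speeches[attended_index]]

train_views = select_views("eeg", ["speech_1", "speech_2"], attended_index=1)
test_views = select_views("eeg", ["speech_1", "speech_2"])
```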
To solve this problem and make better use of the prior knowledge, we propose TMC learning, which encourages the multi-view VAE to learn an approximate task-related representation. A simple TMC instance between two multi-view samples $X_i$ and $X_j$ is shown in Figure 4.

3.2.1 Task-Related Multi-View Contrastive (TMC) Learning. In general, we use $\{x_i^m\}_{m=1}^{M} = X_i$ to denote a sample from general multi-view data with $M$ views. Moreover, we assume that there is a subset of views $\{x_i^u\}_{u \in U} = X_i^U \subset X_i$ that is task-unrelated. We write $z_i^m = f(x_i^m)$ for the single-view representations extracted from the single-view data, and $z_i^c = f(X_i)$ for the complete representation extracted from all the given views. With the information from the ground truth of the task, we also have the task-related representation $z_i^t = f(X_i^T)$, which is fused from the representations of the views $\{x_i^t\}_{t \in T} = X_i^T \subseteq X_i$ that are related to the task. TMC uses contrastive learning to align the complete representation with the task-related one. Specifically, we compute the similarity of the positive pair as:

$$s_i^{+} = \exp\left(\mathrm{sim}(z_i^c, z_i^t) / \tau\right),$$

where we choose the cosine similarity for $\mathrm{sim}(\cdot, \cdot)$ and $\tau$ is the temperature hyperparameter.
For the negative pairs, we use the similarity between two different samples. We consider the similarity between the complete representation of one sample and the task-related representation of another:

$$s_{ij}^{-} = \exp\left(\mathrm{sim}(z_i^c, z_j^t) / \tau\right), \quad i \neq j.$$

So the TMC loss over $N$ samples has the following form:

$$\mathcal{L}_{\mathrm{TMC}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{s_i^{+}}{s_i^{+} + \sum_{j \neq i} s_{ij}^{-}}.$$

3.2.2 Approximating the Task-Related Representation. As mentioned before, the ideal task-related representation is unavailable in the testing stage, but it can be approximated by the complete representation through TMC learning. Specifically, in the AAD task, where we use the multi-view VAE as the backbone network, the single-view representations $z_i^m = f(x_i^m)$ are sampled from the single-view posteriors learned by the encoders of the different views:

$$z_i^{e} \sim q_\phi(z \mid e), \quad z_i^{s_1} \sim q_\phi(z \mid s_1), \quad z_i^{s_2} \sim q_\phi(z \mid s_2).$$

Taking $s_1$ as the attended speech for example, the complete representation $z_i^c = f(X_i)$ and the task-related representation $z_i^t = f(X_i^T)$ are extracted by the multi-view VAE from different fused posteriors:

$$z_i^{c} \sim q_\phi(z \mid e, s_1, s_2), \quad z_i^{t} \sim q_\phi(z \mid e, s_1).$$
TMC encourages the multi-view VAE to approximate the task-related representation by aligning the complete representation with the task-related one, which is fused from the attended speech and the EEG. We implement this by training the multi-view VAE jointly with the TMC loss. The loss function we use for the AAD task is therefore:

$$\mathcal{L} = -\mathcal{L}_{\mathrm{ELBO}} + \alpha \mathcal{L}_{\mathrm{cls}} + \beta \mathcal{L}_{\mathrm{TMC}},$$

where $\alpha$ and $\beta$ are the weights of the classification loss and the TMC loss.
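The alignment idea behind the TMC loss can be sketched as an InfoNCE-style contrastive loss over a batch. This is a minimal sketch under assumptions: the exact form of the negatives is not recoverable here, so each complete representation is contrasted against the task-related representations of the other samples, with cosine similarity and a temperature `tau`. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tmc_loss(complete, task_related, tau=1.5):
    """Pull each complete z_i toward its own task-related z_i, away from the others."""
    n = len(complete)
    total = 0.0
    for i in range(n):
        logits = [cosine(complete[i], task_related[j]) / tau for j in range(n)]
        # Numerically stable log-sum-exp over the positive and all negatives.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total -= logits[i] - log_denominator
    return total / n

# Toy 2-dimensional representations (hypothetical values).
z_task = [[1.0, 0.0], [0.0, 1.0]]
z_aligned = [[0.9, 0.1], [0.1, 0.9]]
z_misaligned = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = tmc_loss(z_aligned, z_task)
loss_misaligned = tmc_loss(z_misaligned, z_task)
```

The loss is lower when each complete representation is closest to its own task-related counterpart, which is the behavior the joint training is meant to encourage.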
Although we propose TMC learning for the AAD task, we point out that TMC is a general learning method. The intuition behind TMC can be applied to any multi-view data that is affected by task-unrelated views.
In the implementation, we take advantage of the fact that MoE and PoE are special cases of MoPoE. Specifically, when we constrain the subsets of the complete view to the single views $X_1 = \{e\}$, $X_2 = \{s_1\}$, $X_3 = \{s_2\}$, we obtain the MoE posterior:

$$q_{\mathrm{MoE}}(z \mid X) = \frac{1}{3}\left(q_\phi(z \mid e) + q_\phi(z \mid s_1) + q_\phi(z \mid s_2)\right),$$

and when we constrain the MoPoE to a single subset, the complete view itself, we obtain the PoE posterior:

$$q_{\mathrm{PoE}}(z \mid X) \propto q_\phi(z \mid e)\, q_\phi(z \mid s_1)\, q_\phi(z \mid s_2).$$

In Section 4, we give a thorough evaluation of our TMC learning with different fusion methods of the multi-view VAE.
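The relationship between the three fusion schemes can be illustrated by enumerating the view subsets. The sketch below assumes MoPoE mixes over all non-empty subsets of the views; the view names are placeholders.

```python
from itertools import combinations

def nonempty_subsets(views):
    """All non-empty subsets of the views, as mixed over by a MoPoE-style posterior."""
    return [subset
            for r in range(1, len(views) + 1)
            for subset in combinations(views, r)]

views = ("eeg", "speech_1", "speech_2")
mopoe_subsets = nonempty_subsets(views)                            # 2^3 - 1 subsets
moe_subsets = [s for s in mopoe_subsets if len(s) == 1]            # singletons: MoE
poe_subsets = [s for s in mopoe_subsets if len(s) == len(views)]   # full set: PoE
```

Restricting the mixture to the singleton subsets leaves a plain mixture of single-view experts, and restricting it to the full set leaves a single product of experts.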

EXPERIMENTS
4.1 Experiment Setup
4.1.1 AAD Datasets. We test our method on two popular AAD datasets. The first is the KUL dataset [12], which contains EEG data from 16 normal-hearing subjects recorded in a soundproof and electromagnetically shielded room. The EEG data were collected with a 64-channel BioSemi ActiveTwo system at a sample rate of 8192 Hz. The stimuli are Dutch short stories narrated by different male speakers. To help the subjects focus on the experiment, silences longer than 500 ms are truncated to 500 ms. The stimuli have two presentation conditions, HRTF (head-related transfer function) and dry. In the HRTF condition, the stimuli presented to the subject's left and right ears are simulated with head-related transfer functions. In the dry condition, the different story tracks are presented separately to the left ear and the right ear. We use the dry condition in our experiments, which has 4 trials of 6 minutes for every subject.
The second is the DTU dataset [18], which contains EEG data from 18 subjects recorded in a soundproof room. The EEG data were collected with a 64-channel BioSemi ActiveTwo system at a sample rate of 512 Hz. Different from the KUL dataset, the DTU dataset uses Danish speeches narrated by a male and a female speaker. Every subject completes 60 trials of speech stimuli, and every trial lasts 50 s.

4.1.2 Data Preprocessing. The speech stimuli are filtered and downsampled before the spectrogram is extracted. We first pass the speeches through a Chebyshev (type II) low-pass filter with an 8 kHz cut-off frequency and downsample them to 16 kHz. We then split the speeches into decision windows and extract the spectrogram using the short-time Fourier transform with a 32 ms Hann window and a 12 ms hop length.
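The STFT settings above fix the size of the resulting spectrogram. The following sketch works out the number of frames and frequency bins for a decision window, assuming no padding at the window edges and a one-sided spectrum (both are assumptions; the padding convention is not stated here).

```python
def stft_shape(duration_s, sample_rate=16_000, window_ms=32, hop_ms=12):
    """Frame and frequency-bin counts for an STFT without edge padding."""
    n_samples = int(duration_s * sample_rate)
    win = sample_rate * window_ms // 1000    # 512 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000       # 192 samples at 16 kHz
    n_frames = 1 + (n_samples - win) // hop  # only windows that fit entirely
    n_bins = win // 2 + 1                    # one-sided spectrum
    return n_frames, n_bins

shape_3s = stft_shape(3)
shape_2s = stft_shape(2)
```

With centered or padded framing, as some STFT implementations use by default, the frame counts would be slightly larger.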
For the EEG signal, we form a 3D filter bank by passing the EEG signal through different Chebyshev (type II) band-pass filters and concatenating the resulting frequency bands into one tensor. We use the frequency bands 1-4 Hz, 4-8 Hz, 8-12 Hz, 12-30 Hz, and 30-50 Hz, which are known as the δ, θ, α, β, and low-γ bands in EEG [5]. For the EEG channels, we follow the Joint CNN-LSTM [26] and use the electrodes F7, F3, F4, F8, T7, C3, Cz, C4, T8, and Pz instead of all the electrodes. Figure 5 shows the topographic map of the EEG filter bank features.
In our experiments, we use two different decision window settings, 2 s and 3 s. With a longer decision window, the signal contains more auditory attention information. Since our method is based on deep learning, we augment the data by adding overlap between consecutive windows. The overlap is set to 1 s for the 2 s decision window and 2 s for the 3 s decision window. The data volume after data augmentation is listed in Table 1. Since we use the same hop length for both decision windows, the total amount of data under different decision window lengths is the same. This setting eliminates the impact of training data volume on the performance under different decision window lengths, which matters especially for deep learning methods, where data is a critical factor.
4.1.3 Network Settings. Our method is implemented in PyTorch [31]. For the encoders, we adapt the CNN part of the Joint CNN-LSTM [26], which uses 4 convolution layers for the EEG encoder and 5 for the speech encoder. We add one common linear layer and two private linear layers after the CNNs to produce the mean and variance of the single-view posteriors. For the decoders, we use a linear layer and several deconvolution layers (the same number as in the corresponding single-view encoder) to reverse the encoding process. For the classification part, we use a 3-layer MLP as our classifier.
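The overlapping decision windows from the preprocessing step can be sketched as a simple sliding-window split with a 1 s hop. The 128 Hz sample rate and the 50 s trial length below are placeholder values chosen for illustration; the function name is hypothetical.

```python
def decision_windows(n_samples, sample_rate, window_s, hop_s=1.0):
    """Start/end sample indices of overlapping decision windows over one trial."""
    win = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    return [(start, start + win)
            for start in range(0, n_samples - win + 1, hop)]

# Hypothetical 50 s trial sampled at an assumed 128 Hz.
rate = 128
windows_3s = decision_windows(50 * rate, rate, window_s=3.0)   # 2 s overlap
windows_2s = decision_windows(50 * rate, rate, window_s=2.0)   # 1 s overlap
```

Because the hop is the same in both settings, the two window lengths yield nearly the same number of windows per trial.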
We keep the network architecture identical for the different decision windows. When decoding auditory attention in 2 s decision windows, we simply repeat and truncate the 2 s signal so that the input of the encoders has a length of 3 s.

In this part, we first evaluate the performance of the task-related representation in the testing stage. Then we compare our method with several previous works under different decision windows on both datasets. We also include MoPoE-VAE [36] and compare the representation similarity to evaluate the effectiveness of our TMC learning. After that, we take a closer look at TMC learning by evaluating it with different fusion methods. In all the tables except Table 6, * denotes that the TMC-VAE performance is significantly better than that of the compared method (one-tailed unpaired t-test, p<0.05).

Evaluation of Task-Related Representation.
To verify the reliability of our ideas, we first evaluate the decoding performance using the task-related representation. We evaluate the performance in two aspects: 1) the decoding accuracy and 2) the visualization of the task-related representation.
We use the label to construct the task-related representation and classify the auditory attention based on it. The decoding performance is shown in Table 2. We find that the task-related representation yields 100% accuracy on both testing sets. We also visualize the task-related representation in the testing stage. As shown in Figure 6, samples with different attended speeches are well separated in the task-related representation space.
Although these results cannot prove the performance of our method, the superior separation in the task-related representation space supports our motivation, which is to construct an approximate task-related representation.

... [22] to identify the attended speech among the two input speeches. Deep CCA [23] performs correlation-based AAD using a deep neural network with regularization. CNN-FC [6] learns a discriminative representation for AAD by using an attention mechanism in its network and is, to our knowledge, the state-of-the-art method on the KUL and DTU datasets. MoPoE-VAE refers to the same settings as our method but without the TMC loss. We choose MoPoE as the fusion method of our approach in this study and use TMC-VAE to denote this configuration.
We report the results for the 2 s and 3 s decision windows on the KUL dataset in Table 3. Our TMC-VAE yields the state-of-the-art result on the KUL dataset under the 3 s decision window and outperforms the existing methods by a large margin. Compared with the Joint CNN-LSTM [26], which has the same CNN encoder and a more robust sequence modeling capability thanks to its LSTM module [22], TMC-VAE improves the decoding performance by 18.4%. The comparison between MoPoE-VAE and the existing methods also shows the advantage of using a multi-view VAE to decode auditory attention. We notice that our method has a performance drop under the smaller decision window, but we point out that our main contribution is using the multi-view VAE and TMC learning to learn an approximate task-related representation, rather than carefully designing the networks. Our method can easily be combined with a well-designed VAE backbone.
We present the results on the DTU dataset in Table 4. TMC-VAE also outperforms the existing methods by a large margin under the 3 s decision window. Although we use an ordinary CNN architecture in our encoders and decoders, our method achieves results comparable to the elaborately designed attention-based architecture (CNN-FC [6]) under the 2 s decision window. In addition, on both datasets and for all decision windows, the comparison between MoPoE-VAE and our method demonstrates the effectiveness of TMC learning.

To better evaluate TMC learning and to validate that the performance improvement of TMC-VAE originates from the approximate representation, we compare the task-related representation with the approximate one (the complete representation) using the cosine similarity. We use the representations from two models in the testing stage: 1) MoPoE-VAE, which is trained without TMC learning, and 2) TMC-VAE, which is trained with TMC learning. The results are shown in Table 5, where we find that the similarity between the complete representation and the task-related representation increases significantly.

To study the effectiveness of TMC learning thoroughly, we present an ablation study of TMC learning with different multi-view VAEs. We choose three typical multi-view VAEs: 1) MVAE [40], which uses PoE to fuse the single-view posteriors; 2) MMVAE [32], which uses MoE to form the complete posterior; and 3) MoPoE-VAE [36], which combines the advantages of PoE and MoE in a general fusion model. Table 6 shows that TMC learning encourages the multi-view VAE to learn an approximate task-related representation under different posterior fusion formulas. TMC-VAE also yields the best results, which gives quantitative support for choosing MoPoE in TMC-VAE.
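The representation-similarity comparison described above amounts to averaging the cosine similarity over paired representations. The sketch below shows one way this could be computed; the vectors are toy values rather than results from the paper, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pairwise_similarity(complete, task_related):
    """Average cosine similarity between each complete/task-related pair."""
    pairs = list(zip(complete, task_related))
    return sum(cosine_similarity(c, t) for c, t in pairs) / len(pairs)

# Toy representations (hypothetical): identical pairs versus orthogonal pairs.
z_task = [[1.0, 0.0], [0.0, 1.0]]
sim_identical = mean_pairwise_similarity(z_task, z_task)
sim_orthogonal = mean_pairwise_similarity([[0.0, 1.0], [1.0, 0.0]], z_task)
```

A model trained with an alignment objective would be expected to score closer to the first case than a model trained without it.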

Representation Visualization.
In this part of the study, we present qualitative results in Figure 7 to give intuitive evidence of the effectiveness of TMC learning. Specifically, we visualize the representation learned by different multi-view VAEs with or without TMC learning. We use the same multi-view VAEs as in Section 4.2.4. All the visualizations are produced in the testing stage on the KUL dataset under a 3 s decision window. To create the visualizations, we first feed a batch of data into the trained encoders to extract the 128-dimensional representations, and then map these representations to 2 dimensions using t-SNE [37]. We use blue for the task-related representations and red for the complete ones, and distinguish samples with different attended speeches using circles and crosses.
As shown in Figure 7, the task-related representations (in blue) are more separable in the representation space than the complete ones (in red). TMC learning encourages the multi-view VAEs to learn a more separable complete representation by aligning it with the task-related one. We also observe that TMC-VAE shows the most prominent consistency and separability among all the visualizations, which suggests that MoPoE is the best choice of fusion formula for our method. We report the decoding performance of the different posterior fusion methods in the next part of the study.

CONCLUSIONS
In this work, we first introduce a multi-view VAE and a classifier to efficiently learn the multi-view representation for AAD. Then, inspired by Broadbent's filter model, we define the task-related representation in the AAD task and propose TMC learning to encourage the complete representation to align with the task-related one. Finally, the experiments on the KUL and DTU datasets demonstrate the advantages of our method.

Figure 1: At a cocktail party, one can focus on the speaker of interest while ignoring other interfering sounds. (Image from The Great Gatsby: Warner Bros. Pictures and Roadshow Films.)

Figure 2: Auditory attention decoding in the dual-speaker scenario. The subject hears two different speeches during the EEG recording. The AAD methods infer the subject's attention based on the speeches and the EEG.

2.1.1 The Traditional Methods. The traditional correlation-based methods can be divided into two categories: forward encoding methods and backward decoding methods. The ideas of forward encoding methods and backward decoding methods are consistent

Figure 3: Overview of our model architecture. (a) The raw EEG and speech stimuli. (b) Preprocessed EEG and speech stimuli. We use the spectrogram for the speech stimuli and the filter bank feature for the EEG. (c) We use a multi-view VAE architecture to extract the representation from the EEG and speech stimuli. (d) The reconstructed inputs. (e) Our classification network is a three-layer MLP. (f) The classifier outputs a one-hot vector that indicates the AAD result. (g) The process of extracting the filter bank EEG feature. The raw EEG is passed through different band-pass filters to form the filter bank feature.

Figure 4: An example of TMC learning between two multi-view samples $X_i$ and $X_j$. We set the positive pair as the complete representation and the task-related representation of the same sample, and set the negative pairs between different samples.

Figure 5: A topographic map of the EEG filter bank feature we used.

4.1.4 Parameter Settings. For all the experiments, we use $\alpha = 1$ and $\beta = 1$ for the weights of the classification loss and the TMC loss, and set the temperature hyperparameter $\tau = 1.5$. Moreover, for the dimension of the representation learned by the multi-view VAE, we use 128 dimensions for all the posterior fusion methods and conduct the experiments with batches of 128 samples.

Figure 6: Representation visualization of task-related representation.The • and × in the figure denote samples with different attended speeches.

Figure 7: Representation visualization of different multi-view VAEs with or without TMC learning. We perform all the visualizations on the KUL dataset with a 3 s decision window. In the first row, we present the visualization of the representations without TMC learning. We use blue for the task-related representations and red for the complete ones. The • and × in the figure denote samples with different attended speeches.

Table 1: Data volume of the two datasets after data augmentation.

Table 2: Decoding the auditory attention using the task-related representation.

Table 3: Accuracy on the KUL dataset in different decision windows.

Table 4: Accuracy on the DTU dataset in different decision windows.

Table 6: AAD accuracy of different fusion methods on the KUL and the DTU datasets. * denotes that the performance with TMC is significantly better than without TMC (one-tailed unpaired t-test, p<0.05).
95.6% * 78.4% * 91.5% *
TMC-VAE 85.5% 96.6% 80.8% 92.1%

the representation learned by different multi-view VAEs with or without TMC learning. We use the same multi-view VAEs in Section 4.2.4.