Single-Channel Robot Ego-Speech Filtering during Human-Robot Interaction

In this paper, we study how well human speech can automatically be filtered when this overlaps with the voice and fan noise of a social robot, Pepper. We ultimately aim for an HRI scenario where the microphone can remain open when the robot is speaking, enabling a more natural turn-taking scheme where the human can interrupt the robot. To respond appropriately, the robot would need to understand what the interlocutor said in the overlapping part of the speech, which can be accomplished by target speech extraction (TSE). To investigate how well TSE can be accomplished in the context of the popular social robot Pepper, we set out to manufacture a datase composed of a mixture of recorded speech of Pepper itself, its fan noise (which is close to the microphones), and human speech as recorded by the Pepper microphone, in a room with low reverberation and high reverberation. Comparing a signal processing approach, with and without post-filtering, and a convolutional recurrent neural network (CRNN) approach to a state-of-the-art speaker identification-based TSE model, we found that the signal processing approach without post-filtering yielded the best performance in terms of Word Error Rate on the overlapping speech signals with low reverberation, while the CRNN approach is more robust for reverberation. These results show that estimating the human voice in overlapping speech with a robot is possible in real-life application, provided that the room reverberation is low and the human speech has a high volume or high pitch.


INTRODUCTION
Unlike humans who are capable of selective auditory attention [33], social robots currently cannot prioritize particular sounds.More specifically, they generally lack the ability to extract and recognize human speech when they are talking themselves, since state-of-the-art automatic speech recognition (ASR) systems are not able to separately transcribe such audio streams.Because these systems cannot handle overlapping speech, one approach is to use a simplex channel [30] and configure the robots to listen only to their users when the robot is not talking itself.Rigid and unnatural turn-taking schemes based on this approach are often used, where the microphone needs to be switched off when the robot is talking and switched on again when the robot stops talking.This approach raises several limitations during the human-robot interaction (HRI).For example, during the speech, the robot cannot listen to a user's backchanneling to indicate that they are listening.Furthermore, when the user starts answering the question before the robot finishes, parts of the answers will be left out.This will lead to miscommunication in HRI [30].
Another approach is to enable a duplex channel, that is, one that allows the system to hear what the user is saying while it is speaking [30].This approach deploys an additional microphone very close to human users and performs ASR on the signal recorded by this microphone when they start talking [26].In such a setup, the robot voice can be treated as background noise, and ASR systems will most of the time be able to filter it out.A variation on this setup was proposed in [20].They used two separate (sets of) microphones: one close to the robot speaker and the other relatively far from the speaker and closer to the user.The recordings of one microphone were used as a mask to actively denoise the speaker signal in the other recordings.Neither of these approaches is natural [31], as they require human users to adjust to a rigid turn-taking scheme or be positioned next to a dedicated microphone.
A more natural approach would be to keep the robot microphone open for the entire duration of the conversation [29].This requires the robot to apply a target speech extraction (TSE) system [40] to the recordings to separate the voice of the user from that of the robot.From an engineering point of view, the TSE problem is directly related to noise reduction and blind source separation (BSS) [15,40].While regular noise reduction can handle overlapping speech only if the overlapping speaker's voice is known [4], BSS does not require any information about the target speaker, as is typical for HRI scenarios.However, BSS requires the estimation of the number of speakers and can lead to global permutation ambiguity1 [14].
Since the success in solving global permutation by deep-clustering [12] and permutation invariant training (PIT) [38], a large number of neural-based TSE networks have been proposed that are trained and tested on single-channel audio.Jun Du et al. [6] proposed an initial network based on talker-closed2 audio clues to extract the speech of a target talker.Quan Wang et al. [35] proposed a talkeropen 3 network that extracted a representation of the target talker from a clean enrollment sequence and then isolated the talker's voice in a mix.Similar ideas have also been explored by Meng Ge et al. [7] and Katerina Zmolikova et al. [41].Other works [10,23] use visual and/or spatial clues, which are not the focus of the current study.
Although the results of the research mentioned above are promising [7,15,35], the gap between laboratory experiments and their deployments in HRI has not yet been demonstrated.This is likely due to three factors.First, the robot speech signal generated by the text-to-speech (TTS) models differs from what its microphone records due to the non-linear and inconsistent microphone response at different frequency levels.As shown in Fig. 1, the speech to be played by the robot speaker has more spectral characteristics than the same speech recorded by its microphones.As a consequence, it is not practical to use the original speech signal to generate a speech mask (SM) or perform spectral subtraction (SS), one of the most widely deployed noise deduction methods, to extract overlapping human speech.Furthermore, this SS-based speech filtering method can over-subtract and result in severe distortion in audio output [35].It requires post-filtering to restore the authentic target speech.
Second, the significantly low signal-power ratio between human speech and robot speech also complicates this task, as shown in Fig. 2. In most of the current humanoid robot geometry designs, the close distance between the microphones and speakers causes the robot's speech to possess significantly more power than the overlapping human speech in the received signal.This hinders most speech separation systems in extracting the human speech signal.This is also the reason why this focused task is different from the common TSE task, which could be evaluated on the public benchmark dataset.
Third, most TSE models are deep and computationally intensive, which can lead to unworkable delays for real-time HRI.
This paper aims to contribute to HRI by enabling a robot to filter out its speech, as well as its stationary ego noise, from the mixture that its microphone receives and improve the speech recognition   result of the overlapping human speech signal during HRI.In this work, we address three research questions.
(1) Pipeline Design: How can we filter out robot speech and preserve human speech information from a generated overlapping speech signal?(2) Dataset Construction: How can we construct an overlapping speech dataset that resembles the real recordings?(3) Performance Evaluation: To what extent can performance be improved by post-filtering?What is the trade-off between performance and computational load?
We aim to overcome the gaps mentioned above and enable robots to open microphones during HRI and make sense of what the human says during the overlapping speech.In order to do so, we experiment with methods to remove the speech and ego noise produced by the robot itself from a single-channel recorded audio, and to recover the overlapping human speech signals to improve the ASR result.We propose two different audio processing architectures to filter out the robot's speech, with the help from the robot's embedded API and the online TTS API [9] to pre-acquire the interfering speech signal.The experimental results show that the proposed signal processingbased method without post-filtering is most effective in improving the human speech ASR results under the circumstances when the room reverberation is low and the target speaker is high pitched or at a relatively high volume, while the proposed neural network approach shows good robustness to the reverberation condition.
The rest of this paper is organized as follows.In Section 2, we introduce several neural network-based and signal processing-based TSE methods.In Section 3, we present our two proposed pipelines for solving the robot's ego speech filtering problem.In Section 4, we elaborate on the setup of the experiment and the evaluation metrics.In Section 5, we present and analyze the results of the baseline and proposed methods.In Section 6, we draw our conclusions and discuss future work.

RELATED WORK
The field focused on TSE for single-channel recordings, also known as target voice filtering [35], can be summarized into two approaches: signal processing-based and neural network-based.

Signal Processing-Based TSE
In the single-channel approach, signal processing-based TSE methods generally calculate and reduce audio noise in a spectrum space.The enhanced signal Ŝ  is obtained by multiplying the input signal    by non-negative real-value weights,    [2], also known as the signal mask (SM).Ideally, the SM is 0 if only the undesired signal is active and 1 if the desired signal is active in a certain time-frequency (TF) bin.A wide variety of approaches have been proposed to optimize this SM [1], including spectral subtraction, the Wiener filter, minimum mean-square estimation, the factorial hidden Markov model, and minima-controlled recursive averaging.These methods have been commonly designed and implemented to estimate SM during speech pause or silence.They are efficient in attenuating stationary noise [32].Several spectral subtraction schemes have been proposed for robotics [13].But they were mainly aimed at estimating target speech from robot's ego fan noise or joint noise.

Neural Network-Based TSE
In the mid-2010s, deep neural networks were introduced for the first time to address the TSE problem.Katerina Zmolikova et al. [39] introduced SpeakerBeam, which explored three different methods to inform the network to modify the behaviour of the acoustic model.Quan Wang et al. proposed VoiceFilter [32] and its subsequent work, VoiceFilter-Lite [34], as plug-ins before automatic speech recognition.They used a pre-trained speaker diarization network as an additional informant.Shulin He et al. [11] followed this idea and proposed SpeakerFilter, which learned the target speaker's information while producing the SM and no longer required a pretrained speaker identification network.To avoid the adverse effect on performance from different window lengths when analyzing the reference signal and the input mixture signal, Meng Ge et al. [8] proposed a time-domain solution for TSE, which avoided phase estimation in the TF domain.Although the performance of these proposed networks is promising on public datasets [18], they have not been tested or evaluated in real recordings during HRI, where human speech overlaps with robot speech.
In this study, we compare these two different methods, aiming to filter out the robot's speech in the overlapping audio mixture.Inspired by the work [32,35] and their promising results in a related task, we select spectral subtraction and convolutional recurrent neural network (CRNN) as our method to achieve this objective.

METHODS 3.1 Problem Formulation
The generic application can be illustrated in Fig. 3.In this figure, the where  is the spectrum of the target speech signal,  is the interfering signal comprising the speech from other speakers and noise (in the experiment we only consider the reverberation from the interfering speaker), and  and  are time and frequency indices, respectively.There is also a reference speech  from the text-to-speech model, which the robot will play and record during the interaction.In this paper, our objective is to extract the target speech  from the mixture signal  with the help of .

Proposed Structure
We propose two different audio processing pipelines to extract human speech when it overlaps with robot speech.

Audio
Processing Pipeline based on Spectral Subtraction.Inspired by proprioception [3], which enables humans to subconsciously generate a speech mask to filter their own voice when they start talking, and given that a new deployment4 [24] enables the pre-acquisition of the robot speech signal from the TTS APIs, we propose the self-speech filtering pipeline, as shown in Fig. 4. The dashed box highlights the designed ego-speech filtering pipeline.
As mentioned in Section 1, the reference speech signal differs from the recorded speech signal, due to the non-linear microphone frequency response function.To obtain the SM for the interference speech in the recordings, the microphone response function must be precalculated.The frequency response of the speaker can be defined as below: Figure 4: Signal Processing Pipeline.In the pipeline, the input texture is sent to on cloud text-to-speech API using SIC framework.
where   () is the frequency response of the speaker at time , ℎ  is the response function of that speaker, * means convolutional operation, and () is the robot's input signal.Correspondingly, the response of the microphone can be defined as follows: where   () is the frequency response of the microphone, ℎ  is the frequency response function of that microphone, and  () is the ego noise signal.Using the STFT on both sides, we can obtain the following.
where   (,  ), (,  ), and  (,  ) respectively denote the spectrogram of the received mixture audio signal, the robot's input signal, and the recorded ego noise signal.Furthermore, due to the time-invariant characteristics of the frequency response coefficients,   ( ) •   ( ) represents the speaker-microphone frequency response coefficients and can be calculated as follows: From Equation 5, the frequency response function between the robot input signal and the recorded signal can be obtained by recording the robot's ego fan noise, as well as a sine signal whose frequency sweeps over all the possible bins [22].The SM based on the reference signal spectrogram    can be calculated by the following equation: where |  | and |   | are respectively the real-value spectrogram of the overlapping speech signal and the reference speech signal, and  is the over-subtraction factor.Considering the non-linear characteristics of speech [19], we apply a Hanning window on the  produced by Eq.6 and obtain a new Ŝ : where  (2 ×  + 1, 2 ×  + 1) is the Hanning window, 2 •  + 1 is the window dimension in time-frame direction, and 2 •  + 1 is in frequency direction.In our experiment, we set  = 3 and  = 1, resulting in a window dimension of (7, 3).The estimated speech signal can be obtained as follows: where   is the spectrogram of the estimation speech, ( * ) represents the element-wise indication of the sign of the original   , and  is amplify coefficient.And finally, the estimated speech signal can be obtained by inverse STFT (iSTFT).The estimated speech signal obtained may be distorted due to oversubtraction.We will then use a pre-trained VoiceFixer model [17], which shows great performance in restoring strongly degraded human speech, to reconstruct this.Another crucial factor in Eq.6 is the matching  between the robot speech and the reference speech.Because we focus on the estimation of human speech when interrupting robot speech and the time delay is unstable when sending the command to make the robot speak during real-life operation, we propose to use the first 0.5 second length of the reference signal as a detector and calculate the cross-correlation (CC) value  between the detector and the recorded robot speech signal.The maximum value 's corresponding value  will be considered the time delay between the reference signal and the recorded signal.The comparison result is presented in Fig. 5 between the proposed method and the traditional frame power-based voice activity detection (VAD) method [28].We can observe that the recorded signal after trimming the silent part based on cross-correlation aligns better with the reference signal in the time-frequency (TF) domain.In order to alleviate oversubtraction, which is inevitable in spectral subtraction, we tested to adopt a pre-trained model, VoiceFixer [17], to post-filter and restore the estimated signal.We designed a CRNN architecture as shown in Fig. 6.The network predicts a soft mask, which is elementwise multiplied with the mixture magnitude spectrogram to produce an estimated waveform.We directly merge the phase of the noisy audio with the estimated magnitude spectrogram and apply an iSTFT on the result.The network is trained to minimize the difference between the masked magnitude spectrogram and the target magnitude spectrogram computed from the clean audio.The system consists of two separate convolutional neural networks (CNN), each with eight layers and batch normalization layers, and one bidirectional long-short-time memory (BLSTM) layer, followed by two fully connected (FC) layers.All of these layers have ReLU activations except for the last layer, which has a sigmoid activation.
The system takes three inputs for one training step: (1) clean ground truth audio from the target human speaker, (2) noisy audio containing the overlapping speech from the robot and the target speaker, and (3) reference audio generated from the APIs [9].Because we expect the network to estimate the target speech based on the reference signal instead of the reference speaker identification, we did not adopt the Speaker Encoder in our architecture as [35] and [11] did.Instead, we adopted another convolutional neural network to learn from the reference spectrogram, which shares the same hypermeter setting as that for the input spectrogram.Parameter values are provided in Table 1.To train the system, all input audios are truncated with a 5second length and are converted to single-channel with a sampling rate of 16kHz if necessary.

EXPERIMENTS 4.1 Dataset
Instead of recording overlapping speech between Pepper and human interlocutors as a reference dataset, we chose to generate a mixed speech signal instead.There are three reasons why we made this decision.First, we intended to create a dataset that resembled the real recorded overlapping speech as much as possible.Second, we adopted an end-to-end supervised learning method to train our proposed network, which requires human labor to label not only the start of robot speech, but also that of human speech in the recorded data.Generating a mixed signal helps considerably reduce this human labor, as the start times can be controlled.Third, TSE models trained on manufactured overlapping datasets have been reported to have good generalizability to real overlapping recordings [40].
Therefore, we collected three sets of real recorded data for the development and evaluation of the proposed methods: one with the robot playing a sine signal described in Section 3 and its ego fan noise, one with a speaker playing human speech, and the other with the robot playing speech generated by the robot's embedded TTS API and Google Dialogflow's TTS API.We used the first to calculate the speaker-microphone frequency response coefficients, the second to determine the human speech power gain when overlapping with robot speech, and the third to generate the overlapping speech data for training and testing.
We used Pepper for all recordings, using one of the four microphones on top of its head with a forward look direction.The audio signals received by the microphone are strongly affected by the fan noise inside its head.The sampling rate is 16 kHz.The corresponding collected data can be found at this link 5 .
Recording Sine Signal and Ego Noise: We collected these data by recording the sine signal whose frequency sweeps over (0 Hz, 8001 Hz) with a step of 13 Hz.We placed the robot in a large and quiet laboratory room, and programmed it to stand still and look ahead while recording the sound produced by its own speakers at volume 50 (as shown in Fig. 7(a)).We also recorded Pepper's ego fan noise under this condition, without Pepper doing anything but standing still.
Recording Human Speech: We collected the data by recording clean speech played on a loudspeaker placed 1 meter from Pepper in the same laboratory room.We programmed Pepper to look at the speaker when the speaker was playing.A total of 1,163 clean speech fragments from the Librispeech corpus [25] were selected.We altered the speaker volume from 10 to 100, and decided that volume 50 was close enough to the common human's volume when interacting with a robot.With this volume, the speech would retain its characteristics in the TF domain from fan noise, as shown in Fig. 8(c) and Fig. 8(d).
Recording Robot Speech: Robot speech was collected by recording Pepper playing the API-generated speech signal, whose length is longer than 5 seconds, from its embedded speakers at volume 50.We used 17 different speaker voices, including Pepper's own embodied one.We created recordings in the laboratory and a small office room (as shown in Fig. 7), and recorded 7913 audio in total.These 7913 different speech contents were randomly selected from Librispeech [25].We trimmed the silent parts of the recordings to align with the speech signal in the time domain using cross-correlation.Taking into account reverberation, only the first 5-second segment was selected in each recording.Finally, we used 1800 audio segments recorded in the large laboratory and 6200 in the small office to train and evaluate the proposed methods.

Data Generation
We cannot use a "standard" benchmark dataset due to the reasons mentioned in Section 1.Therefore, we use the scheme shown in Fig. 9 to obtain the training triplets.The noisy audio is generated   Based on the distribution of the recordings in room size, we randomly selected 200 segments recorded in the large laboratory room and 600 in the small office as a validation dataset to evaluate different approaches, while the rest were used for network training.

Network Training
We adopted the power-law compressed reconstruction error [35] as a loss function to train our network and monitored the training process with Tensorboard to avoid overfitting.

Baseline Method
To compare the proposed methods with the state-of-the-art method [35], we adopted the pre-trained VoiceFilter model provided by Seung-won Park [37].Since this model requires a reference speech signal to learn which voice to filter out, we randomly selected another recorded robot speech file that belongs to same speaker  identification in the same room.This is in line with what was done in [35].

Evaluation Metrics
To evaluate the performance of the three different proposed methods, we use three metrics: the speech recognition Word Error Rate (WER), the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) between the estimated signal and the target speech, and the computing time.4.5.1 Word Error Rate.As mentioned in Section 1, the main goal of our system is to improve speech recognition when human speech overlaps with robot speech.We chose the state-of-the-art opensource Whisper [27] ASR system for WER evaluation.Because we truncated human speech at the point where speech ends in each fragment, we cannot use ground-truth transcription.Instead, we used the transcription of clean human speech truncated at the same point as the ground truth to calculate the WER.We calculated the WER value after processing the overlapping audio using each of the methods.As a reference, we also calculated the WER on the overlapping audio without any processing performed.A good filtering system should be able to reduce the WER significantly, which means that this system is improving human speech recognition when a robot is actively speaking itself.4.5.2Scale-Invariant Signal-to-Distortion Ratio.The SI-SDR is a common metric to evaluate single-channel speech separation systems [16].It is an energy ratio, expressed in dB, between the orthogonal projection of the estimated signal on the spanned line of the target speech signal.A higher value indicates better performance.

Computing Time.
The computing time required to process the input signal is crucial for the real-life application of a TSE system during HRI.Therefore, we present the computing time for each proposed method to process a noisy speech mixture with a 5 second length.All calculations are done on a local desktop with an Intel(R) Core(TM) i9-9900K CPU and a NVIDIA GeForce RTX 2070 SUPER GPU for network training acceleration.

RESULTS AND ANALYSIS
In Table 2, we present the results of the proposed methods in different rooms compared to the original overlapping files and the baseline model.

Results
We compare the mean, median, and standard deviation values of the estimated speech WER and SI-SDR between the proposed methods and the baseline methods, with the original unfiltered overlapping speech data as a reference.We can observe that the baseline model does not filter out robot speech in most of the files.In fact, the WER and SI-SDR are close to the original unfiltered data.In contrast, the proposed signal processing-based pipeline without post-filtering has the best WER under the weak reverberation condition.There is a significant gap between the results before and after post-filtering in each condition.
However, when overlapping speech is strongly polluted by the reverberation of robot speech, this method does not significantly improve the ASR result.In comparison, the proposed CRNN-based method shows robustness for each condition with only a slightly higher average WER in comparison to the low-reverberation performance, although the WERs are still greater than 50%.The estimated speech from CRNN has the best SI-SDR results, although they are still less than 0.
The computing time required to process 5-second-long audio by the signal processing-based pipeline without post-filtering is the shortest, as low as 854 milliseconds.However, the time required by other methods is close to that.The proposed CRNN consumes an acceptable 28 milliseconds more than the baseline network.This shows promise for application of the proposed methods in real-life HRI.

Discussion
On the basis of the analysis of the results, we find that the reverberation of the room limits our proposed signal processing-based architecture to practical application in real life.This is because the proposed signal processing-based pipeline is able to filter out the robot's ego speech, but has no impact on the reverberation.There are two factors that contribute to this.First, the residual of the robot speech still has a relatively larger power compared to the target speech after filtering.Second, to filter out robot speech and emphasize target speech in the TF domain, the parameters  and  in Equations 6 and 7 result in oversubtraction (e.g.parts of the target speech are also removed), which will further result in distortion in the estimated signal.This distortion is reinforced by the pre-trained post-filtering network.This also explains why SI-SDR after post-filtering drops significantly in each condition.When the reverberation is low, the target human speech will possess a relatively greater power in the estimated speech, and the ASR system will translate this instead of reverberation.However, when the reverberation is strong and has greater power than the target speech, the ASR system cannot recognize the target speech and instead translates the reverberation.We need to emphasize that it is impractical to use a simple filter, such as the Wiener filter, to filter out the reverberation of robot speech, because the filter will regard low-energy human speech as noise and eliminate the target speech in the estimated signal.Furthermore, the amplify coefficient  in Equation 7 cannot be set too high to prevent distortion.
A possible solution to improve the ASR result of the signal processing-based method is to perform mask filtering not only on the magnitude of the overlapping speech spectrogram but also on the phase.For example, Donald S Williamson et al. [36] proposed to perform complex ratio masking for monaural speech separation and got great performance in perceptual evaluation of speech quality.A second potential solution is to use the adaptive step size method [21] to generate the SM.For example, we can time-shift the SM and perform spectral subtraction on the estimated speech  signal iteratively.However, this requires more computational time or an estimate of the number of iterations.Another reason why the ASR results of each estimated signal are significantly different is that the weak or low pitched target speech in our fabricated segments posed a problem for both the signal processing pipeline and the CRNN.This is because some values in the essential frequency bins are discarded in the spectrogram of the estimated speech signal, as shown in Fig. 10.These discarded characteristics in high-frequency bins in the Time-Frequency domain are important for the ASR system because they contain more details on the speakers [5].
For the CRNN, a different problem is at play.In our setup, we included a batch normalization layer, in accordance with [35].The role of this layer is to normalize the spectrum values based on all segments in a given batch.However, robot speech often possesses much more power compared to human speech if they overlay at the same point on the spectrogram.As a result, we found that the normalization procedure reduced the value of human speech to a value that is too small to be learned during training.It is therefore not practical to adopt a batch normalization layer in the TSE models when the interfering speech is significantly more powerful than the target speech.Another reason why the CRNN result is unsatisfactory is that the training data set is as small as 7916 compared to the baseline network, which used 100,000 triplets for training.
Another takeaway from the result is that the WER does not directly relate to the SI-SDR, which means that state-of-the-art ASR systems are tolerant to some distortion in human speech.It demonstrates the necessity to report not only the distortion of the restored speech signal but also the WER of the restored speech contents, which is the key concern in HRI research.
We compared the results of the signal processing pipeline and CRNN with a baseline model based on speaker identification, which yielded considerably worse performance.This is surprising, given that [35] claimed that their model using speaker identification as reference showed strong robustness when the interference signal had more power than the target speech.However, in our focused task, it was not possible to estimate the target speech when the robot speech possesses greater power than the target human speech.In fact, their model failed to filter out robot speech interference for most of the files.

CONCLUSION AND FUTURE WORK
In this paper, we designed and evaluated two different architectures that focus on filtering the robot speech signal from the robot received signal and improving the ability to recognize human speech during interaction when the humanoid robot and the human are both actively speaking.We demonstrated the effectiveness of these proposed methods on a manufactured dataset of realrecorded human and robot speech.We found that the proposed signal processing-based pipeline without post-filtering was able to improve the ASR ability when the reverberation of the room is weak in real time and the target speech is high pitched or at a relatively high volume.The proposed CRNN also showed good robustness to each condition, but the performance was still not satisfactory.
In terms of future work, we will look for more possible methods to improve performance.For the signal processing-based pipeline, a dereverberation speech mask should be designed to filter out the reverberation of robot speech.For the neural network-based architecture, we need to construct a larger dataset for training.Furthermore, instead of applying iSTFT to recover the estimated signal, a decoder network should be adopted.Furthermore, we will train the network jointly with an ASR tokenizer to further increase the improvement in WER.And last, we will also apply the proposed methods to real-life HRI to evaluate the method's effectiveness in identifying the human subjects' interruption and backchanneling.

Figure 1 :
Figure 1: Spectrogram of the recorded speech signal and the corresponding speech signal played by the robot.

Figure 2 :
Figure 2: The recorded overlapping speech signal on time domain.

Figure 3 :
Figure 3: The illustration of the generic application scenario.
(a) Reference signal.(b) Recorded signal after frame power VAD.(c) Recorded signal after crosscorrelation VAD.

Figure 5 :
Figure 5: The spectrogram of different signals at time  0 .
(a) Large laboratory room.(b) Small office room.

Figure 7 :
Figure 7: Experiment setup for Data collection with Pepper.
(a) Recorded human speech at volume 50.(b) Manufactured human speech with -25 power gain.(c) Spectrogram of the recorded signal.(d) Spectrogram of the manufactured signal.

Figure 8 :
Figure 8: Time domain and frequency domain display of the recorded speech signal and the generated human speech signal played by the speaker.

Figure 10 :
Figure 10: The spectrogram of the CRNN estimated signal and the target speech signal.

Table 1 :
Parameter setting of the proposed network.

Table 2 :
TSE model performance on the test set.The laboratory and small room differ in the extent to which the recorded robot speech is interfered by its reverberation.