Abstract
Automatic spoken instruction understanding (SIU) of controller-pilot conversations in air traffic control (ATC) requires not only recognizing the words and semantics of the speech but also determining the role of the speaker. However, few published works on automatic understanding systems for air traffic communication focus on speaker role identification (SRI). In this article, we formulate the SRI task of controller-pilot communication as a binary classification problem and propose text-based, speech-based, and speech-and-text-based multi-modal methods to enable a comprehensive comparison on the SRI task. To isolate the impact of each approach, various advanced neural network architectures are applied to optimize the implementations of the text-based and speech-based methods. Most importantly, a multi-modal speaker role identification network (MMSRINet) is designed to perform the SRI task by considering both acoustic and textual modality features. To aggregate the modality features, a modal fusion module is proposed that fuses and squeezes the acoustic and textual representations with a modal attention mechanism and a self-attention pooling layer, respectively. Finally, the compared approaches are validated on the ATCSpeech corpus collected from a real-world ATC environment. The experimental results demonstrate that all the compared approaches are effective for the SRI task, and the proposed MMSRINet shows competitive performance and robustness on both seen and unseen data, achieving 98.56% and 98.08% accuracy, respectively.
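The two fusion components named in the abstract can be illustrated concretely. Below is a minimal NumPy sketch, not the paper's implementation: a self-attention pooling layer that squeezes a variable-length feature sequence into a single vector, and a gated modal-attention fusion of the pooled acoustic and textual vectors. The parameters `w` (pooling query) and `W_g` (gating weights) are hypothetical stand-ins for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_pool(H, w):
    # H: (T, d) sequence of frame/token features; w: (d,) learned query.
    # Score each time step, then return the attention-weighted sum -> (d,).
    a = softmax(H @ w)          # (T,) attention weights, sum to 1
    return a @ H                # (d,) pooled representation

def modal_attention_fuse(x_audio, x_text, W_g):
    # x_audio, x_text: (d,) pooled modality vectors; W_g: (d, 2d) gate weights.
    # A sigmoid gate decides, per dimension, how much each modality contributes.
    g = 1.0 / (1.0 + np.exp(-(W_g @ np.concatenate([x_audio, x_text]))))
    return g * x_audio + (1.0 - g) * x_text
```

In a setup like MMSRINet's, each modality's encoder output would be pooled this way and the fused vector passed to a binary controller/pilot classifier; the exact gating form here is an assumption for illustration.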
A Comparative Study of Speaker Role Identification in Air Traffic Communication Using Deep Learning Approaches