research-article

A Comparative Study of Speaker Role Identification in Air Traffic Communication Using Deep Learning Approaches

Published: 24 March 2023

Abstract

Automatic spoken instruction understanding (SIU) of controller-pilot conversations in air traffic control (ATC) requires not only recognizing the words and semantics of the speech but also determining the role of the speaker. However, few published works on automatic understanding systems for air traffic communication focus on speaker role identification (SRI). In this article, we formulate the SRI task for controller-pilot communication as a binary classification problem, and we propose text-based, speech-based, and speech-and-text-based multi-modal methods to enable a comprehensive comparison on the SRI task. To isolate the impact of architecture choices on the comparative approaches, various advanced neural network architectures are applied to optimize the implementations of the text-based and speech-based methods. Most importantly, a multi-modal speaker role identification network (MMSRINet) is designed to solve the SRI task by considering both acoustic and textual modality features. To aggregate the modality features, a modal fusion module is proposed that fuses and squeezes the acoustic and textual representations with a modal attention mechanism and a self-attention pooling layer, respectively. Finally, the comparative approaches are validated on the ATCSpeech corpus, collected from a real-world ATC environment. The experimental results demonstrate that all the comparative approaches are effective for the SRI task, and the proposed MMSRINet shows competitive performance and robustness compared with the other methods on both seen and unseen data, achieving 98.56% and 98.08% accuracy, respectively.
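The abstract names two ingredients of the modal fusion module: a self-attention pooling layer that squeezes a variable-length feature sequence into one utterance-level vector, and a modal attention mechanism that weighs the acoustic against the textual representation. The paper's exact layer definitions are not given here, so the following is only a minimal numpy sketch of those two ideas under assumed shapes, with the gating form (a sigmoid over linear projections of both modalities) chosen for illustration rather than taken from MMSRINet.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_pool(H, w):
    # H: (T, d) sequence of frame/token features; w: (d,) learned scoring vector.
    # Attention weights over the T steps, then a weighted sum -> (d,) vector.
    scores = softmax(H @ w)
    return scores @ H

def modal_attention_fuse(a, t, Wa, Wt):
    # a: pooled acoustic vector (d,); t: pooled textual vector (d,).
    # A sigmoid gate decides, per dimension, how much each modality contributes.
    g = 1.0 / (1.0 + np.exp(-(Wa @ a + Wt @ t)))
    return g * a + (1.0 - g) * t

# Toy run with random features (shapes are assumptions, not the paper's).
rng = np.random.default_rng(0)
d, T = 8, 5
H_audio = rng.normal(size=(T, d))   # e.g. acoustic encoder outputs
H_text = rng.normal(size=(T, d))    # e.g. textual encoder outputs
w = rng.normal(size=d)

a = self_attention_pool(H_audio, w)
t = self_attention_pool(H_text, w)
fused = modal_attention_fuse(a, t, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(fused.shape)  # (8,)
```

The fused vector would then feed a small binary classifier (controller vs. pilot), matching the binary-classification formulation of the SRI task described above.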



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4 (April 2023), 682 pages.
ISSN: 2375-4699; EISSN: 2375-4702; DOI: 10.1145/3588902


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 21 September 2021
• Revised: 17 September 2022
• Accepted: 18 November 2022
• Online AM: 24 November 2022
• Published: 24 March 2023
