Abstract
Automatic spoken language identification (LID) is an important research field in the era of multilingual, voice-command-based human-computer interaction. A front-end LID module helps improve the performance of many speech-based applications in multilingual scenarios. India is a populous country with diverse cultures and languages, and the majority of its population needs to use their native languages for verbal interaction with machines. Developing efficient Indian spoken language recognition systems is therefore useful for adopting smart technologies in every section of Indian society. Indian LID research has been gaining momentum since the early 2000s, mainly due to the development of several standard multilingual speech corpora for Indian languages. Although significant progress has been made, to the best of our knowledge there have been few attempts to review this work analytically and collectively. In this work, we present one of the first comprehensive reviews of the Indian spoken language recognition research field. We provide an in-depth analysis that emphasizes the unique challenges of developing LID systems in the Indian context, namely low-resource conditions and mutual influence among languages. We discuss several essential aspects of Indian LID research: a detailed description of the available speech corpora; the major research contributions, from earlier attempts based on statistical modeling to recent approaches based on different neural network architectures; and future research trends. This review will help active researchers and research enthusiasts from related fields assess the present state of Indian LID research.
- [1] . 2013. Spoken language recognition: From fundamentals to practice. Proc. IEEE 101, 5 (2013), 1136–1159.Google Scholar
Cross Ref
- [2] . 2017. An investigation of deep neural networks for multilingual speech recognition training and adaptation. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’17). ISCA, 714–718.Google Scholar
Cross Ref
- [3] . 2020. Towards emotion independent language identification system. In Proceedings of the International Conference on Signal Processing and Communications (SPCOM’20). IEEE, 1–5.Google Scholar
Cross Ref
- [4] . 2017. Analysis of score normalization in multilingual speaker recognition. Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’17) (2017), 1567–1571.Google Scholar
- [5] . 2020. Ethnologue: Languages of the World, Twenty-third Edition. SIL International, Dallas, TX.Google Scholar
- [6] . 2017. Linguistics: An Introduction to Language and Communication. MIT Press.Google Scholar
- [7] . 2003. Introducing Linguistic Morphology. Edinburgh University Press, Edinburgh.Google Scholar
- [8] . 2007. Psychology of Language. Nelson Education.Google Scholar
- [9] . 2011. Language identification: A tutorial. IEEE Circ. Syst. Mag. 11, 2 (2011), 82–108.Google Scholar
Cross Ref
- [10] . 2013. Speech recognition technology: A survey on Indian languages. Int. J. Inf. Sci. Intell. Syst. 2, 4 (2013), 1–38.Google Scholar
- [11] . 2006. Speech recognition for illiterate access to information and technology. In Proceedings of the International Conference on Information and Communication Technologies and Development. IEEE, 83–92.Google Scholar
Cross Ref
- [12] . 2005. Development of Indian language speech databases for large vocabulary speech recognition systems. In Proceedings of the International Conference on Speech and Computer (SPECOM’05). ISCA, 343–347.Google Scholar
- [13] . 2019. ASRoIL: A comprehensive survey for automatic speech recognition of Indian languages. Artif. Intell. Rev. (2019), 1–32.Google Scholar
- [14] . 2018. TDNN-based multilingual speech recognition system for low resource Indian languages. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’18). ISCA, 3197–3201.Google Scholar
Cross Ref
- [15] . 2020. A survey on speech synthesis techniques in Indian languages. Multimedia Syst. 26 (2020), 453–478.Google Scholar
Cross Ref
- [16] . 2018. An investigation of convolution attention based models for multilingual speech synthesis of Indian languages. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’18). ISCA, 2474–2478.Google Scholar
Cross Ref
- [17] . 2012. Multivariability speaker recognition database in Indian scenario. Int. J. Speech Technol. 15, 4 (2012), 441–453.Google Scholar
Cross Ref
- [18] . 2012. IITKGP-MLILSC speech database for language identification. In Proceedings of the National Conference on Communications (NCC’12). IEEE, 1–5.Google Scholar
Cross Ref
- [19] . 2012. Indian language speech database: A review. Int. Comput. Appl. 47, 5 (2012), 17–21.Google Scholar
Cross Ref
- [20] . 2012. Design issues in developing speech corpus for Indian languages—A survey. In Proceedings of the International Conference on Computer Communication and Informatics. IEEE, 1–4.Google Scholar
Cross Ref
- [21] . 2015. A review on speech corpus development for automatic speech recognition in Indian languages. Int. J. Adv. Netw. Appl. 6, 6 (2015), 2556.Google Scholar
- [22] Debapriya Sengupta and Goutam Saha. 2016. Identification of the major language families of India and evaluation of their mutual influence. Current Science 110 (2016), 667–681.Google Scholar
- [23] . 2012. A hierarchical language identification system for Indian languages. Digital Sign. Process. 22, 3 (2012), 544–553.Google Scholar
Digital Library
- [24] . 2015. Study on similarity among Indian languages using language verification framework. Advances in Artificial Intelligence 2015 (2015), 1–24.Google Scholar
- [25] . 2012. Identification of language using Mel-frequency cepstral coefficients (MFCC). Proc. Eng. 38 (2012), 3391–3398.Google Scholar
Cross Ref
- [26] . 2015. Implicit excitation source features for robust language identification. Int. J. Speech Technol. 18, 3 (2015), 459–477.Google Scholar
Digital Library
- [27] . 2018. Language identification using phase information. Int. J. Speech Technol. 21, 3 (2018), 509–519.Google Scholar
Digital Library
- [28] . 2020. Language specific information from LP residual signal using linear sub band filters. In Proceedings of the National Conference on Communications (NCC’20). IEEE, 1–5.Google Scholar
Cross Ref
- [29] . 2011. Phonotactic model for spoken language identification in Indian language perspective. Int. J. Comput. Appl. 19, 9 (2011), 18–24.Google Scholar
Cross Ref
- [30] . 2013. Identification of Indian languages using multi-level spectral and prosodic features. Int. J. Speech Technol. 16, 4 (2013), 489–511.Google Scholar
Digital Library
- [31] . 2019. A pre-classification-based language identification for Northeast Indian languages using prosody and spectral features. Circ. Syst. Sign. Process. 38, 5 (2019), 2266–2296.Google Scholar
Digital Library
- [32] . 2020. Bottleneck feature-based hybrid deep autoencoder approach for Indian language Identification. Arab. J. Sci. Eng. 45, 4 (2020), 3425–3436.Google Scholar
Cross Ref
- [33] . 2020. A hybrid meta-heuristic feature selection method for identification of Indian spoken languages from audio signals. IEEE Access 8 (2020), 181432–181449.Google Scholar
Cross Ref
- [34] . 2020. A language identification system using hybrid features and back-propagation neural network. Appl. Acoust. 164 (2020), 107289.Google Scholar
Cross Ref
- [35] . 2007. Spoken language identification for Indian languages using split and merge EM algorithm. In Proceedings of the International Conference on Pattern Recognition and Machine Intelligence. Springer, 463–468.Google Scholar
Digital Library
- [36] . 2015. Significance of GMM-UBM based modelling for Indian language identification. Proc. Comput. Sci. 54 (2015), 231–236.Google Scholar
Cross Ref
- [37] . 2016. An investigation of deep neural network architectures for language recognition in Indian languages. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’16). ISCA, 2930–2933.Google Scholar
- [38] . 2018. Combining evidences from excitation source and vocal tract system features for Indian language identification using deep neural networks. Int. J. Speech Technol. 21, 3 (2018), 501–508.Google Scholar
Digital Library
- [39] . 2019. Deep neural network based two-stage Indian language identification system using glottal closure instants as anchor points. J. King Saud Univ. Comput. Inf. Sci. 34, 4 (2019), 1439–14574.Google Scholar
- [40] . 2019. Attention based residual-time delay neural network for Indian language identification. In Proceedings of the International Conference on Contemporary Computing (IC3’19). IEEE, 1–5.Google Scholar
Cross Ref
- [41] . 2019. An investigation of LSTM-CTC based joint acoustic model for Indian language identification. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU’19). IEEE, 389–396.Google Scholar
Cross Ref
- [42] . 2003. Acoustic, phonetic, and discriminative approaches to automatic language identification. In Proceedings of the European Conference on Speech Communication and Technology. ISCA, 1345–1348.Google Scholar
Cross Ref
- [43] . 1996. Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech Aud. Process. 4, 1 (1996), 31.Google Scholar
Cross Ref
- [44] . 2018. Spoken Indian language identification: A review of features and databases. Sādhanā 43, 4 (2018), 1–14.Google Scholar
Cross Ref
- [45] . 2013. Fusion of spectral and time domain features for crowd noise classification system. In Proceedings of the International Conference on Intelligent Systems Design and Applications. IEEE, 1–6.Google Scholar
Cross Ref
- [46] . 2007. Springer Handbook of Speech Processing. Springer.Google Scholar
- [47] . 2006. Pattern Recognition and Machine Learning. Springer.Google Scholar
- [48] . 2016. Deep Learning. Vol. 1. MIT Press, Cambridge, MA.Google Scholar
Digital Library
- [49] . 2012. Shifted-delta MLP features for spoken language recognition. IEEE Sign. Process. Lett. 20, 1 (2012), 15–18.Google Scholar
Cross Ref
- [50] . 2002. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. In Proceedings of the International Conference on Spoken Language Processing. 89–92.Google Scholar
Cross Ref
- [51] . 1994. Language identification using shifted delta cepstrum. In Proceedings of the Annual Speech Research Symposium, Vol. 41, 42.Google Scholar
- [52] . 2018. IIITH-ILSC speech database for Indain language identification. In Proceedings of the Spoken Language Technologies for Under-Resourced Languages (SLTU). 56–60.Google Scholar
Cross Ref
- [53] . 2018. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’18). IEEE, 5329–5333.Google Scholar
Digital Library
- [54] . 2019. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 (2019), 8026–8037.Google Scholar
- [55] . 2019. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR’19).Google Scholar
- [56] . 2012. On the use of phone log-likelihood ratios as features in spoken language recognition. In Proceedings of the Spoken Language Technology Workshop (SLT’12). IEEE, 274–279.Google Scholar
Cross Ref
- [57] . 2020. Maximal figure-of-merit framework to detect multi-label phonetic features for spoken language recognition. IEEE/ACM Trans. Aud. Speech Lang. Process. 28 (2020), 682–695.Google Scholar
Digital Library
- [58] . 2013. Spoken language recognition with prosodic features. IEEE Trans. Aud. Speech Lang. Process. 21, 9 (2013), 1841–1853.Google Scholar
Digital Library
- [59] . 2014. Neural network bottleneck features for language identification. In Proceedings of Odyssey 2014: The Speaker and Language Recognition Workshop. ISCA, 299–304.Google Scholar
Cross Ref
- [60] . 2015. Deep neural network approaches to speaker and language recognition. IEEE Sign. Process. Lett. 22, 10 (2015), 1671–1675.Google Scholar
Cross Ref
- [61] . 2015. Multilingual bottleneck features for language recognition. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’15). ISCA, 389–393.Google Scholar
- [62] . 2016. Exploring the role of phonetic bottleneck features for speaker and language recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’16). IEEE, 5575–5579.Google Scholar
Digital Library
- [63] . 2017. An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition. PloS One 12, 8 (2017), e0182580.Google Scholar
Cross Ref
- [64] . 2001. Feature warping for robust speaker verification. In Proceedings of Odyssey 2001: The Speaker Recognition Workshop. European Speech Communication Association, 213–218.Google Scholar
- [65] . 2019. Parametric cepstral mean normalization for robust speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’19). IEEE, 6735–6739.Google Scholar
Cross Ref
- [66] . 1994. RASTA processing of speech. IEEE Trans. Speech Aud. Process. 2, 4 (1994), 578–589.Google Scholar
Cross Ref
- [67] . 1996. A parametric approach to vocal tract length normalization. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP’96), Vol. 1. IEEE, 346–348.Google Scholar
Digital Library
- [68] . 2017. Trainable frontend for robust and far-field keyword spotting. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’17). IEEE, 5670–5674.Google Scholar
Digital Library
- [69] . 2018. Per-channel energy normalization: Why and how. IEEE Sign. Process. Lett. 26, 1 (2018), 39–43.Google Scholar
Cross Ref
- [70] . 2018. BUT/Phonexia bottleneck feature extractor. In Proceedings of Odyssey 2018: The Speaker and Language Recognition Workshop. ISCA, 283–287.Google Scholar
Cross Ref
- [71] . 1977. Toward automatic identification of the language of an utterance. I. Preliminary methodological considerations. J. Acoust. Soc. Am. 62, 3 (1977), 708–713.Google Scholar
Cross Ref
- [72] . 1991. Automatic language recognition using acoustic features. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP’91). IEEE, 813–816.Google Scholar
Cross Ref
- [73] . 1994. Automatic language identification of telephone speech messages using phoneme recognition and N-gram modeling. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’94), Vol. i. IEEE, 305–308.Google Scholar
Cross Ref
- [74] . 1995. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Aud. Process. 3, 1 (1995), 72–83.Google Scholar
Cross Ref
- [75] . 2000. Speaker verification using adapted Gaussian mixture models. Digit. Sign. Process. 10, 1-3 (2000), 19–41.Google Scholar
Digital Library
- [76] . 2013. Indian language identification using k-means clustering and support vector machine (SVM). In Proceedings of the Students Conference on Engineering and Systems (SCES’13). IEEE, 1–5.Google Scholar
Cross Ref
- [77] . 2004. Language recognition with support vector machines. In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop. ISCA, 41–44.Google Scholar
- [78] . 2006. Discriminatively trained language models using support vector machines for language identification. In Proceedings of Odyssey 2006: The Speaker and Language Recognition Workshop. ISCA, 1–6.Google Scholar
Cross Ref
- [79] . 2007. Language identification using acoustic models and speaker compensated cepstral-time matrices. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’07), Vol. 4. IEEE, 1013–1016.Google Scholar
Cross Ref
- [80] . 2011. Language recognition via i-vectors and dimensionality reduction. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’11). ISCA, 857–860.Google Scholar
Cross Ref
- [81] . 2015. Study of senone-based deep neural network approaches for spoken language recognition. IEEE/ACM Trans. Aud. Speech Lang. Process. 24, 1 (2015), 105–116.Google Scholar
Digital Library
- [82] . 2017. Direct optimization of the detection cost for I-vector-based spoken language recognition. IEEE/ACM Trans. Aud. Speech Lang. Process. 25, 3 (2017), 588–597.Google Scholar
Digital Library
- [83] . 2019. Attention based hybrid i-vector BLSTM model for language recognition. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’19). ISCA, 1263–1267.Google Scholar
Cross Ref
- [84] . 2010. Front-end factor analysis for speaker verification. IEEE Trans. Aud. Speech Lang. Process. 19, 4 (2010), 788–798.Google Scholar
Digital Library
- [85] . 2015. Frame-by-frame language identification in short utterances using deep neural networks. Neural Netw. 64 (2015), 49–58.Google Scholar
Digital Library
- [86] . 2014. Automatic language identification using deep neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). IEEE, 5337–5341.Google Scholar
Cross Ref
- [87] . 2009. Deep learning for spoken language identification. In Proceedings of the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications. Citeseer, 1–4.Google Scholar
- [88] . 2014. Application of convolutional neural networks to language identification in noisy conditions. In Proceedings of Odyssey 2014: The Speaker and Language Recognition Workshop. IEEE, 287–292.Google Scholar
Cross Ref
- [89] . 2016. End-to-end language identification using attention-based recurrent neural networks. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’16). ISCA, 2944–2948.Google Scholar
Cross Ref
- [90] . 2014. Automatic language identification using long short-term memory recurrent neural networks. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’14). ISCA, 2155–2159.Google Scholar
Cross Ref
- [91] . 2016. Evaluation of an LSTM-RNN system in different NIST language recognition frameworks. In Proceedings of Odyssey 2016: The Speaker and Language Recognition Workshop. ISCA, 231–236.Google Scholar
Cross Ref
- [92] . 2017. Bidirectional modelling for short duration language identification. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’17). ISCA, 2809–2813.Google Scholar
Cross Ref
- [93] . 2019. End-to-end language recognition using attention based hierarchical gated recurrent unit models. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’19). IEEE, 5966–5970.Google Scholar
Cross Ref
- [94] . 2016. Stacked long-term TDNN for spoken language recognition. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’16). ISCA, 3226–3230.Google Scholar
Cross Ref
- [95] . 2019. A new time-frequency attention mechanism for TDNN and CNN-LSTM-TDNN, with application to language identification. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’19). ISCA, 4080–4084.Google Scholar
Cross Ref
- [96] . 2018. Spoken language recognition using x-vectors. In Proceedings of Odyssey 2018: The Speaker and Language Recognition Workshop. ISCA, 105–111.Google Scholar
Cross Ref
- [97] . 2018. End-to-end versus embedding neural networks for language recognition in mismatched conditions. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’18). IEEE, 112–119.Google Scholar
- [98] . 2020. Compensation on x-vector for short utterance spoken language identification. In Proceedings of Odyssey 2018: The Speaker and Language Recognition Workshop. ISCA, 47–52.Google Scholar
Cross Ref
- [99] . 2018. Semi-orthogonal low-rank matrix factorization for deep neural networks. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’18). ISCA, 3743–3747.Google Scholar
Cross Ref
- [100] . 2019. The JHU speaker recognition system for the VOiCES 2019 challenge. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’19). ISCA, 2468–2472.Google Scholar
Cross Ref
- [101] . 2020. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’20). ISCA, 1–5.Google Scholar
Cross Ref
- [102] . 2019. A comparative study on Transformer vs RNN in speech applications. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU’19). IEEE, 449–456.Google Scholar
Cross Ref
- [103] . 2018. Improved language identification using stacked SDC features and residual neural network. In Proceedings of the Spoken Language Technologies for Under-Resourced Languages (SLTU’18)). 210–214.Google Scholar
Cross Ref
- [104] . 2020. Knowledge distillation-based representation learning for short-utterance spoken language identification. IEEE/ACM Trans. Aud. Speech Lang. Process. 28 (2020), 2674–2683.Google Scholar
Digital Library
- [105] . 2009. A systematic analysis of performance measures for classification tasks. Inf. Process. Manage. 45, 4 (2009), 427–437.Google Scholar
Digital Library
- [106] . 2017. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 73 (2017), 220–239.Google Scholar
Digital Library
- [107] . 2006. Application-independent evaluation of speaker detection. Comput. Speech Lang. 20, 2-3 (2006), 230–275.Google Scholar
Cross Ref
- [108] . 2018. The 2017 NIST language recognition evaluation. In Proceedings of Odyssey 2018: The Speaker and Language Recognition Workshop. ISCA, 82–89.Google Scholar
Cross Ref
- [109] . 2012. The 2011 NIST language recognition evaluation. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’12). ISCA, 34–37.Google Scholar
Cross Ref
- [110] . 2010. The 2009 NIST language recognition evaluation. In Proceedings of Odyssey 2010: The Speaker and Language Recognition Workshop, Vol. 30. ISCA, 165–171.Google Scholar
- [111] . 2003. NIST 2003 language recognition evaluation. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech’03). ISCA, 1341–1344.Google Scholar
Cross Ref
- [112] . 2020. AP20-OLR challenge: Three tasks and their baselines. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC’20). IEEE, 550–555.Google Scholar
- [113] . 2019. AP19-OLR challenge: Three tasks and their baselines. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC’19). IEEE, 1917–1921.Google Scholar
Cross Ref
- [114] . 2017. AP17-OLR challenge: Data, plan, and baseline. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC’17). IEEE, 749–753.Google Scholar
Cross Ref
- [115] . 2009. An experimental comparison of performance measures for classification. Pattern Recogn. Lett. 30, 1 (2009), 27–38.Google Scholar
Digital Library
- [116] . 2010. Measuring, Refining and Calibrating Speaker and Language Information Extracted from Speech. Ph.D. Dissertation. University of Stellenbosch.Google Scholar
- [117] . 2018. A Concise Introduction to Linguistics. Routledge.Google Scholar
Cross Ref
- [118] . 2018. A Bayesian phylogenetic study of the Dravidian language family. Roy. Soc. Open Sci. 5, 3 (2018), 171504.Google Scholar
Cross Ref
- [119] . 1990. Dravidian Linguistics: An Introduction. Pondicherry Institute of Linguistics and Culture.Google Scholar
- [120] . 2021. Improving Indian spoken-language identification by feature selection in duration mismatch framework. SN Comput. Sci. 2, 6 (2021), 1–16.Google Scholar
Cross Ref
- [121] . 2014. Automatic speech recognition for under-resourced languages: A survey. Speech Commun. 56 (2014), 85–100.Google Scholar
Digital Library
- [122] . 2014. NIST language recognition evaluation-past and future. In Proceedings of Odyssey 2014: The Speaker and Language Recognition Workshop. ISCA, 145–151.Google Scholar
Cross Ref
- [123] . 1906. Linguistic Survey of India. Vol. 4. Office of the Superintendent of Government Printing, India.Google Scholar
- [124] . 1956. India as a lingustic area. Language 32, 1 (1956), 3–16.Google Scholar
Cross Ref
- [125] . 2019. Phonet: A tool based on gated recurrent neural networks to extract phonological posteriors from speech. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’19). ISCA, 549–553.Google Scholar
Cross Ref
- [126] . 2006. The influence of Gujarati and Tamil L1s on Indian English: A preliminary study. World Engl. 25, 1 (2006), 91–104.Google Scholar
Cross Ref
- [127] . 2017. Exploiting acoustic similarities between Tamil and Indian English in the development of an HMM-based bilingual synthesiser. IET Sign. Process. 11, 3 (2017), 332–340.Google Scholar
Cross Ref
- [128] . 2010. The acoustic characteristics of diphthongs in Indian English. World Engl. 29, 1 (2010), 27–44.Google Scholar
Cross Ref
- [129] . 2018. On the issues of intra-speaker variability and realism in speech, speaker, and language recognition tasks. Speech Commun. 101 (2018), 94–108.Google Scholar
Cross Ref
- [130] . 2014. A simple method to determine if a music information retrieval system is a “horse.”IEEE Trans. Multimedia 16, 6 (2014), 1636–1644.Google Scholar
Cross Ref
- [131] . 2015. Factors affecting i-vector based foreign accent recognition: A case study in spoken Finnish. Speech Commun. 66 (2015), 118–129.Google Scholar
Digital Library
- [132] . 2011. Automatic Dialect and Accent Recognition and Its Application to Speech Recognition. Ph.D. Dissertation. Columbia University.Google Scholar
- [133] . 2010. Multilevel and session variability compensated language recognition: AVS-UAM systems at NIST LRE 2009. IEEE J. Select. Top. Sign. Process. 4, 6 (2010), 1084–1093.Google Scholar
Cross Ref
- [134] . 2004. Developing Asian language corpora: Standards and practice. In Asian Language Resources.Google Scholar
- [135] . 1992. The OGI multi-language telephone speech corpus. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’92). ISCA, 895–898.Google Scholar
Cross Ref
- [136] . 2017. Multi-language conversational telephone speech 2011—South Asian LDC2017S14. Linguistic Data Consortium, Philadelphia, PA.Google Scholar
- [137] . 2020. Voxceleb: Large-scale speaker verification in the wild. Comput. Speech Lang. 60 (2020), 101027.Google Scholar
Digital Library
- [138] . 2021. VoxLingua107: A dataset for spoken language recognition. In Proceedings of the Spoken Language Technology (SLT’21). IEEE, 895–898.Google Scholar
Cross Ref
- [139] . 2021. Multilingual speech corpus in low-resource Eastern and Northeastern Indian languages for speaker and language identification. Circ. Syst. Sign. Process. (2021), 1–28.Google Scholar
- [140] . 2000. Language identification from short segments of speech. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’00). ISCA, 1033–1036.Google Scholar
Cross Ref
- [141] . 2004. Language identification for multilingual speech recognition systems. In Proceedings of the Conference Speech and Computer.Google Scholar
- [142] . 2004. Autoassociative neural network models for language identification. In Proceedings of the International Conference on Intelligent Sensing and Information Processing. IEEE, 317–320.Google Scholar
Cross Ref
- [143] . 2013. Analysis of language identification performance based on gender and hierarchial grouping approaches. In Proceedings of the International Conference on Natural Language Processing. 127.Google Scholar
- [144] . 2017. Spoken Indian language classification using artificial neural network—An experimental study. In Proceedings of the International Conference on Signal Processing and Integrated Networks (SPIN’17). IEEE, 424–430.Google Scholar
Cross Ref
- [145] . 2021. A GMM supervector approach for spoken Indian language identification for mismatch utterance length. Bull. Electr. Eng. Inf. 10, 2 (2021), 1114–1121.Google Scholar
Cross Ref
- [146] . 2017. Automatic language identification for seven Indian languages using higher level features. In Proceedings of the International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES’17). IEEE, 1–6.Google Scholar
Cross Ref
- [147] . 2020. Cascade convolutional neural network-long short-term memory recurrent neural networks for automatic tonal and non-tonal preclassification-based Indian language identification. Expert Syst. 37, 5 (2020), e12544.Google Scholar
- [148] . 2021. Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system. Lang. Resourc. Eval. (2021), 1–42.Google Scholar
- [149] . 2019. Deep learning for spoken language identification: Can we visualize speech signal patterns?Neural Comput. Appl. 31, 12 (2019), 8483–8501.Google Scholar
Cross Ref
- [150] . 2020. A lazy learning-based language identification from speech using MFCC-2 features. Int. J. Mach. Learn. Cybernet. 11, 1 (2020), 1–14.Google Scholar
Cross Ref
- [151] . 2021. FuzzyGCP: A deep learning architecture for automatic spoken language identification from speech signals. Expert Syst. Appl. 168 (2021), 114416.
- [152] . 2021. Performance evaluation of language identification on emotional speech corpus of three Indian languages. In Intelligence Enabled Research. Springer, 55–63.
- [153] . 2021. Feature selection for improving Indian spoken language identification in utterance duration mismatch condition. Bull. Electr. Eng. Inf. 10, 5 (2021), 2578–2587.
- [154] . 2021. Noise-robust spoken language identification using language relevance factor based embedding. In Proceedings of the Spoken Language Technology Workshop (SLT’21). IEEE, 644–651.
- [155] . 2021. Spoken language identification in unseen target domain using within-sample similarity loss. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’21). IEEE, 7223–7227.
- [156] . 2021. DenseRecognition of spoken languages. In Proceedings of the International Conference on Pattern Recognition (ICPR’21). IEEE, 9674–9681.
- [157] . 2021. Cross-corpora language recognition: A preliminary investigation with Indian languages. In Proceedings of the European Signal Processing Conference (EUSIPCO’21). IEEE, 546–550.
- [158] . 2021. Self-supervised phonotactic representations for language identification. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’21), 1514–1518.
- [159] . 2022. A novel approach for spoken language identification and performance comparison using machine learning-based classifiers and neural network. In Proceedings of the International e-Conference on Intelligent Systems and Signal Processing. Springer, 547–555.
- [160] . 2022. Automatic spoken language identification using MFCC based time series features. Multimedia Tools Appl. (2022), 1–31.
- [161] . 2021. Indian regional spoken language identification using deep learning approach. In Proceedings of the International Conference on Mathematics and Computing. Springer, 263–274.
- [162] . 2019. Enabling spoken dialogue systems for low-resourced languages–end-to-end dialect recognition for North Sami. In Proceedings of the 9th International Workshop on Spoken Dialogue System Technology. Springer, 221–235.
- [163] . 2021. Identification of Scandinavian languages from speech using bottleneck features and X-vectors. In Proceedings of the International Conference on Text, Speech, and Dialogue. Springer, 371–381.
- [164] . 2009. Development of a spoken language identification system for South African languages. SAIEE Afr. Res. J. 100, 4 (2009), 97–103.
- [165] . 2020. A robust ensemble model for spoken language recognition. Appl. Comput. Sci. 16, 3 (2020).
- [166] . 2016. AP16-OL7: A multilingual database for oriental languages and a language recognition baseline. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA’16). IEEE, 1–5.
- [167] . 2019. Residual convolutional neural network with attentive feature pooling for end-to-end language identification from short-duration speech. Comput. Speech Lang. 58 (2019), 364–376.
- [168] . 2021. Language recognition on unknown conditions: The LORIA-Inria-MULTISPEECH system for AP20-OLR challenge. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’21). ISCA, 3256–3260.
- [169] . 2021. Dynamic multi-scale convolution for dialect identification. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’21). ISCA, 3261–3265.
- [170] . 2018. Analysis of BUT-PT submission for NIST LRE 2017. In Proceedings of Odyssey 2018: The Speaker and Language Recognition Workshop. ISCA, 47–53.
- [171] . 2020. Identification of seven low-resource North-Eastern languages: An experimental study. In Intelligence Enabled Research. Springer, 71–81.
- [172] . 2020. Spoken language recognition on open-source datasets. SMU Data Sci. Rev. 3, 2 (2020), 3.
- [173] . 2020. Common Voice: A massively-multilingual speech corpus. In Proceedings of the Language Resources and Evaluation Conference (LREC’20). 4218–4222.
- [174] . 2018. Designing an IVR based framework for telephony speech data collection and transcription in under-resourced languages. In Proceedings of the Spoken Language Technologies for Under-Resourced Languages (SLTU’18). 47–51.
- [175] . 2019. A survey of zero-shot learning: Settings, methods, and applications. ACM Trans. Intell. Syst. Technol. 10, 2 (2019), 1–37.
- [176] . 2020. Multi-task self-supervised learning for robust speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’20). IEEE, 6989–6993.
- [177] . 2019. Self-supervised speaker embeddings. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’19). ISCA, 2863–2867.
- [178] . 2019. vq-wav2vec: Self-supervised learning of discrete speech representations. In Proceedings of the International Conference on Learning Representations (ICLR’19).
- [179] . 2017. Generalization of spoofing countermeasures: A case study with ASVspoof 2015 and BTAS 2016 corpora. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’17). IEEE, 2047–2051. DOI: http://dx.doi.org/10.1109/ICASSP.2017.7952516
- [180] . 2020. On cross-corpus generalization of deep learning based speech enhancement. IEEE/ACM Trans. Aud. Speech Lang. Process. 28 (2020), 2489–2499.
- [181] . 2010. Cross-corpus acoustic emotion recognition: Variances and strategies. IEEE Trans. Affect. Comput. 1, 2 (2010), 119–131. DOI: http://dx.doi.org/10.1109/T-AFFC.2010.8
- [182] . 2020. Magneto: X-vector magnitude estimation network plus offset for improved speaker recognition. In Proceedings of Odyssey 2020: The Speaker and Language Recognition Workshop. ISCA, 1–8.
- [183] . 2014. Data augmentation for low resource languages. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’14). ISCA, 810–814.
- [184] . 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’19). ISCA, 2613–2617.
- [185] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations (ICLR’18).
- [186] . 2021. Micaugment: One-shot microphone style transfer. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’21). IEEE, 3400–3404.
- [187] . 2020. X-vectors meet emotions: A study on dependencies between emotion and speaker recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’20). IEEE, 7169–7173.
- [188] . 2018. Transfer learning for improving speech emotion classification accuracy. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’18). ISCA, 257–261.
- [189] . 2020. Improving cross-lingual transfer learning for end-to-end speech recognition with speech translation. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’20). ISCA, 4731–4735.
- [190] . 2018. Language-adversarial transfer learning for low-resource speech recognition. IEEE/ACM Trans. Aud. Speech Lang. Process. 27, 3 (2018), 621–630.
- [191] . 2014. Autoencoder-based unsupervised domain adaptation for speech emotion recognition. IEEE Sign. Process. Lett. 21, 9 (2014), 1068–1072.
- [192] . 2017. An unsupervised deep domain adaptation approach for robust speech recognition. Neurocomputing 257 (2017), 79–87.
- [193] . 2018. Domain adversarial training for accented speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’18). IEEE, 4854–4858.
- [194] . 2019. Channel adversarial training for cross-channel text-independent speaker recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’19). IEEE, 6221–6225.
- [195] . 2013. Code-Switching in Conversation: Language, Interaction and Identity. Routledge.
- [196] . 2020. Multilingual bottleneck features for improving ASR performance of code-switched speech in under-resourced languages. In Proceedings of the Workshop on Speech Technologies for Code-Switching in Multilingual Communities (WSTCSMC’20), 65.
- [197] . 2013. Language diarization for code-switch conversational speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’13). IEEE, 7314–7318.
- [198] . 2021. Spoken language diarization using an attention based neural network. In Proceedings of the National Conference on Communications (NCC’21). IEEE, 1–6.
- [199] . 2018. SVM based language diarization for code-switched bilingual Indian speech using bottleneck features. In Proceedings of the Spoken Language Technologies for Under-Resourced Languages (SLTU’18). ISCA, 132–136.
- [200] . 2021. Training hybrid models on noisy transliterated transcripts for code-switched speech recognition. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’21), 2906–2910.
- [201] . 2021. Dual script E2E framework for multilingual and code-switching ASR. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’21). ISCA, 2441–2445.
- [202] . 2021. SRI-B end-to-end system for multilingual and code-switching ASR challenges for low resource Indian languages. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH’21). ISCA, 2456–2460.
- [203] . 2020. Exploiting spectral augmentation for code-switched spoken language identification. In Proceedings of the Workshop on Speech Technologies for Code-Switching in Multilingual Communities (WSTCSMC’20), 36.
- [204] . 2020. Language identification for code-mixed Indian languages in the wild. In Proceedings of the Workshop on Speech Technologies for Code-Switching in Multilingual Communities (WSTCSMC’20), 48.
- [205] . 2022. Applications of multilingual phone recognition in code-switched and non-code-switched scenarios. In Multilingual Phone Recognition in Indian Languages. Springer, 67–83.
- [206] . 2020. Using X-vectors to automatically detect Parkinson’s disease from speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’20). IEEE, 1155–1159.
- [207] . 2020. Alzheimer’s disease and automatic speech analysis: A review. Expert Syst. Appl. 150 (2020), 113213.