Abstract
Text-independent speaker verification remains an active research topic, especially when enrollment and/or test data are limited. In low-resource few-shot settings, the lack of sufficient training data makes models prone to overfitting and low recognition accuracy. To address this problem, a bidirectional sampling aggregation-based meta-metric learning method is proposed, termed bidirectional sampling multi-scale Fisher feature fusion (BSML). First, BSML performs feature enhancement in the feature extraction stage; second, a large number of similar but disjoint tasks are used to train the model to compare sample similarity; finally, on new tasks, unknown samples are identified by computing sample similarity. Extensive experiments are conducted on a short-duration text-independent speaker verification dataset derived from the low-resource Uyghur corpus THUYG-20, comprising speech samples of diverse lengths. The results show that the metric learning approach is effective in avoiding overfitting and improving generalization, with significant gains in few-shot, short-duration speaker verification for low-resource Uyghur. BSML also outperforms state-of-the-art deep-embedding speaker recognition architectures and recent metric learning approaches by 18%–67% on the few-shot test set. Ablation experiments further show that the proposed approach achieves substantial improvements over prior methods in both performance and generalization ability.
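The episodic procedure summarized in the abstract (train on many small disjoint tasks, then label unknown samples by similarity) can be illustrated with a minimal sketch. This is not the BSML model: the mean-of-support prototype, the cosine similarity measure, the toy three-dimensional "embeddings", and all function and speaker names below are assumptions chosen for illustration, and no feature extractor or training loop is included.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sample_episode(data, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot task: a support set and a labeled query set."""
    speakers = rng.sample(sorted(data), n_way)
    support, query = {}, []
    for spk in speakers:
        picks = rng.sample(data[spk], k_shot + n_query)
        support[spk] = picks[:k_shot]
        query += [(vec, spk) for vec in picks[k_shot:]]
    return support, query


def classify(vec, support):
    """Assign vec to the speaker whose support mean is most similar."""
    prototypes = {spk: mean_vector(vs) for spk, vs in support.items()}
    return max(prototypes, key=lambda spk: cosine(vec, prototypes[spk]))


# Hypothetical toy data: three speakers, six noisy embeddings each.
rng = random.Random(0)
centers = {
    "spk_a": [1.0, 0.0, 0.0],
    "spk_b": [0.0, 1.0, 0.0],
    "spk_c": [0.0, 0.0, 1.0],
}
toy_data = {
    spk: [[c + rng.gauss(0, 0.05) for c in center] for _ in range(6)]
    for spk, center in centers.items()
}

support, query = sample_episode(toy_data, n_way=3, k_shot=2, n_query=2, rng=rng)
accuracy = sum(classify(v, support) == spk for v, spk in query) / len(query)
print(f"episode accuracy: {accuracy:.2f}")
```

In an actual few-shot system the vectors would come from a learned embedding network, and the episode loss would be backpropagated through that network; the sketch only shows how a task is sampled and how similarity decides the label.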
Index Terms
BSML: Bidirectional Sampling Aggregation-based Metric Learning for Low-resource Uyghur Few-shot Speaker Verification