
BSML: Bidirectional Sampling Aggregation-based Metric Learning for Low-resource Uyghur Few-shot Speaker Verification

Published: 10 March 2023

Abstract

In recent years, text-independent speaker verification has remained an active research topic, particularly when enrollment and/or test data are limited. In low-resource few-shot settings, the lack of sufficient training data also makes models prone to overfitting and low recognition accuracy. To address this, a bidirectional sampling aggregation-based meta-metric learning method, termed bidirectional sampling multi-scale Fisher feature fusion (BSML), is proposed for speaker verification in low-resource environments with limited data. First, BSML performs effective feature enhancement in the feature extraction stage; second, the model is trained on a large number of similar but mutually disjoint tasks so that it learns to compare sample similarity; finally, unknown samples in new tasks are identified by computing their similarity to enrolled samples. Extensive experiments are conducted on a short-duration text-independent speaker verification dataset, comprising speech samples of diverse lengths, generated from the low-resource Uyghur THUYG-20 corpus. The results show that the metric learning approach effectively avoids model overfitting and improves generalization, with significant gains on short-duration, few-shot speaker verification for low-resource Uyghur. BSML also outperforms state-of-the-art deep-embedding speaker recognition architectures and a recent metric learning approach by 18%–67% on the few-shot test set. Ablation experiments further show that the proposed components yield substantial improvements over prior methods in both performance and generalization.
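The training recipe the abstract describes, comparing query utterances against a handful of enrolled samples inside many small episodic tasks, is the standard few-shot metric-learning loop. The sketch below illustrates that loop in PyTorch using prototype averaging and cosine scoring; the function names, the temperature value, and the choice of cosine similarity are illustrative assumptions, not the paper's BSML implementation, which additionally applies bidirectional sampling and multi-scale Fisher feature fusion in the feature extraction stage.

```python
import random

import torch
import torch.nn.functional as F


def sample_episode(embeddings_by_speaker, n_way=5, k_shot=1, n_query=5):
    """Draw one N-way K-shot task from {speaker_id: Tensor[num_utts, dim]}.
    All names here are illustrative, not the paper's API."""
    speakers = random.sample(list(embeddings_by_speaker), n_way)
    support, query, labels = [], [], []
    for label, spk in enumerate(speakers):
        utts = embeddings_by_speaker[spk]
        idx = torch.randperm(utts.size(0))[: k_shot + n_query]
        support.append(utts[idx[:k_shot]])   # enrolled samples for this speaker
        query.append(utts[idx[k_shot:]])     # samples to be verified
        labels += [label] * n_query
    return torch.stack(support), torch.cat(query), torch.tensor(labels)


def episode_loss(encoder, support, query, labels, temperature=0.1):
    """Average each speaker's encoded support samples into a prototype, then
    score queries by cosine similarity to the prototypes (a simple stand-in
    for BSML's learned metric)."""
    n_way, k_shot, dim = support.shape
    protos = encoder(support.view(-1, dim)).view(n_way, k_shot, -1).mean(dim=1)
    q = encoder(query)
    logits = F.cosine_similarity(q.unsqueeze(1), protos.unsqueeze(0), dim=-1)
    return F.cross_entropy(logits / temperature, labels)
```

A training step would then be, for example, `loss = episode_loss(encoder, *sample_episode(train_embeddings)); loss.backward()`, repeated over many randomly drawn episodes. At evaluation, the same similarity scoring is applied to episodes built from speakers never seen during training, which is what allows the model to generalize from few shots.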



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3 (March 2023), 570 pages. ISSN: 2375-4699; EISSN: 2375-4702; Issue DOI: 10.1145/3579816.


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 May 2022
• Revised: 20 July 2022
• Accepted: 20 September 2022
• Online AM: 29 September 2022
• Published: 10 March 2023
