Combining Self-supervised Learning and Active Learning for Disfluency Detection

Abstract
Spoken language differs fundamentally from written language in that it contains frequent disfluencies, i.e., parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle this training data bottleneck, we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled text and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) a sentence classification task to distinguish original sentences from grammatically incorrect ones. We then combine these two tasks to jointly pre-train a neural network, which is subsequently fine-tuned on human-annotated disfluency detection data. The self-supervised learning method captures task-specific knowledge for disfluency detection and achieves better performance than other supervised methods when fine-tuned on a small annotated dataset. However, because the pseudo training data are generated by simple heuristics and cannot cover all disfluency patterns, a performance gap remains relative to supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning into the fine-tuning process. Active learning strives to reduce annotation cost by choosing the most informative examples to label, and can thus address the weakness of self-supervised learning with a small annotated dataset. We show that by combining self-supervised learning with active learning, our model matches state-of-the-art performance with only about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
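The pseudo-data construction described above (randomly adding or deleting words and labeling the result for the two pre-training tasks) can be sketched as follows. This is a minimal illustrative sketch: the function name, noise probabilities, and the D/O tagging scheme are assumptions, not the authors' exact implementation.

```python
import random

def make_pseudo_example(tokens, vocab, p_add=0.15, p_del=0.15):
    """Corrupt a fluent sentence by randomly inserting words drawn from a
    vocabulary and randomly deleting words, producing token-level tags for
    the self-supervised tagging task: 'D' marks an inserted (noisy) token,
    'O' marks an original one."""
    corrupted, tags = [], []
    for tok in tokens:
        # Randomly insert a noise word before the current token.
        if random.random() < p_add:
            corrupted.append(random.choice(vocab))
            tags.append("D")
        # Randomly drop the current token (deleted words leave no tag).
        if random.random() < p_del:
            continue
        corrupted.append(tok)
        tags.append("O")
    # Sentence-level label for the classification task:
    # 1 if the sentence was altered, 0 if it is the original.
    changed = int(corrupted != tokens)
    return corrupted, tags, changed
```

Pre-training on such data costs no annotation, since both the token tags and the sentence label fall out of the corruption procedure itself.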
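The active learning step of "choosing the most critical examples to label" is commonly realized as uncertainty sampling. A minimal entropy-based sketch is shown below, assuming the model outputs a class distribution per unlabeled sentence; the paper's actual acquisition function may differ.

```python
import math

def uncertainty_select(probs, k):
    """Pick the k unlabeled sentences whose predictions are least
    confident, measured by predictive entropy. `probs` maps a sentence
    id to its predicted class distribution (a list summing to 1)."""
    def entropy(dist):
        # Skip zero-probability classes to avoid log(0).
        return -sum(p * math.log(p) for p in dist if p > 0)
    # Highest-entropy (most uncertain) sentences first.
    ranked = sorted(probs, key=lambda sid: entropy(probs[sid]), reverse=True)
    return ranked[:k]
```

The selected sentences are then sent to human annotators, and the model is fine-tuned on the growing labeled pool; this loop repeats until the annotation budget is spent.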