Abstract
Modern Automatic Speech Recognition (ASR) systems can achieve high recognition accuracy. However, even a perfectly accurate transcript can still be challenging to read, due to grammatical errors, disfluencies, and other noise common in spoken communication. These readability issues, introduced by both speakers and ASR systems, impair the performance of downstream tasks and the comprehension of human readers. In this work, we present a task called ASR post-processing for readability (APR) and formulate it as a sequence-to-sequence text generation problem. The APR task aims to transform noisy ASR output into text that is readable for humans and downstream tasks while preserving the semantic meaning of the speaker. We further study the APR task in terms of a benchmark dataset, evaluation metrics, and baseline models. First, to address the lack of task-specific data, we propose a method to construct a dataset for the APR task from data collected for grammatical error correction. Second, we adapt or borrow metrics from similar tasks to evaluate model performance on the APR task. Lastly, we use several typical or adapted pre-trained models as baselines for the APR task. We fine-tune the baseline models on the constructed dataset and compare their performance with a traditional pipeline method under the proposed evaluation metrics. Experimental results show that all fine-tuned baseline models outperform the traditional pipeline method, and our adapted RoBERTa model surpasses the pipeline method by 4.95 and 6.63 BLEU points on the two test sets, respectively. A human evaluation and case study further demonstrate the ability of the proposed model to improve the readability of ASR transcripts.
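As a rough illustration of the BLEU-based evaluation described above, the sketch below scores a hypothetical noisy transcript and its post-processed version against a readable reference. The example sentences, and the minimal add-one-smoothed BLEU implementation, are our own illustrative assumptions, not the paper's data or evaluation code (the paper's experiments would use a standard BLEU implementation).

```python
# Minimal single-reference BLEU sketch (uniform 1- to 4-gram weights,
# add-one smoothing, brevity penalty) for scoring APR outputs.
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())   # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        # Add-one smoothing keeps a zero-overlap order from zeroing the score.
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: exp(1 - r/c) when the hypothesis is shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec)

# Invented example: raw ASR output vs. a post-processed, readable version.
reference = "hello , how are you today ?"
raw_asr   = "hello how are you you today"
post_proc = "hello , how are you today ?"

print(round(bleu(raw_asr, reference), 3))   # noisy transcript scores lower
print(round(bleu(post_proc, reference), 3)) # readable version scores higher
```

The gap between the two scores mirrors how the APR evaluation rewards restoring punctuation and removing disfluencies while keeping the speaker's meaning.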
Improving Readability for Automatic Speech Recognition Transcription