
Improving Readability for Automatic Speech Recognition Transcription

Published: 09 May 2023

Abstract

Modern Automatic Speech Recognition (ASR) systems can achieve high recognition accuracy. However, even a perfectly accurate transcript can be challenging to read because of grammatical errors, disfluencies, and other noise common in spoken communication. These readability issues, introduced by both speakers and ASR systems, impair the performance of downstream tasks and the comprehension of human readers. In this work, we present a task called ASR post-processing for readability (APR) and formulate it as a sequence-to-sequence text generation problem. The APR task aims to transform noisy ASR output into text that is readable for humans and downstream tasks while preserving the semantic meaning of the speaker. We study the APR task along three dimensions: benchmark dataset, evaluation metrics, and baseline models. First, to address the lack of task-specific data, we propose a method to construct an APR dataset from data originally collected for grammatical error correction. Second, we adapt or borrow metrics from similar tasks to evaluate model performance on APR. Lastly, we use several standard or adapted pre-trained models as baselines. We fine-tune the baseline models on the constructed dataset and compare their performance with a traditional pipeline method under the proposed evaluation metrics. Experimental results show that all fine-tuned baseline models outperform the traditional pipeline method, and our adapted RoBERTa model surpasses the pipeline method by 4.95 and 6.63 BLEU points on the two test sets, respectively. A human evaluation and case study further demonstrate the proposed model's ability to improve the readability of ASR transcripts.
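As a concrete illustration of the APR formulation, the sketch below treats a noisy ASR transcript as the source sequence, generates a readable hypothesis with a pre-trained sequence-to-sequence model, and scores it against a reference with BLEU. The model choice (facebook/bart-base), the toy sentence pair, and the sacrebleu-based scoring are illustrative assumptions, not the paper's exact configuration; in the paper's setting, each baseline is first fine-tuned on the constructed APR dataset.

```python
# Minimal sketch of the APR setup: noisy ASR output -> seq2seq model ->
# readable text, scored with BLEU. Model and data are illustrative
# assumptions, not the authors' exact pipeline.
import sacrebleu
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
# In the paper's setting, the model would first be fine-tuned on the APR
# dataset constructed from grammatical error correction data.

# Source: a disfluent ASR transcript without punctuation or casing.
asr_output = "so um i think we should you know move the meeting to friday"
inputs = tokenizer(asr_output, return_tensors="pt")
output_ids = model.generate(**inputs, num_beams=4, max_length=64)
hypothesis = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Target: a readable reference that preserves the speaker's meaning.
reference = "I think we should move the meeting to Friday."
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
print(f"BLEU: {bleu.score:.2f}")
```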



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 5 (May 2023), 653 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3596451


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 8 March 2022
• Revised: 2 August 2022
• Accepted: 9 August 2022
• Online AM: 22 August 2022
• Published: 9 May 2023
