Abstract
In this article, we use sequence-to-sequence (seq2seq) models for spelling correction in Turkish, an agglutinative language. In the baseline system, misspelled and target words are split into letters, and the letter sequences are fed into the seq2seq model. We use letters as the model unit because the agglutinative nature of Turkish makes the dictionary impractically large when words are used as the dictionary unit. To improve on the baseline, we incorporate the left and right context of the misspelled words. In the context-dependent model, every context word is represented by its first three consonants. We train the seq2seq models on a large text corpus of approximately 4 million sentences collected automatically from the Internet, into which we randomly introduce substitution, deletion, and insertion spelling errors. We evaluate the proposed context-dependent seq2seq model on a synthetic and a realistic test set. The synthetic test set is constructed in the same way as the training set; the realistic test set contains human-made misspellings from Twitter messages. In our experiments, the proposed context-dependent model performs significantly better than the baseline system, reaching a correction accuracy of 94% on the synthetic dataset. In addition, the proposed method provides a 2.1% absolute improvement over a state-of-the-art Turkish spelling correction system on the Twitter test set.
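The data preparation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`corrupt`, `first_consonants`, `build_source`), the `<w>` and `</w>` boundary markers, the one-error-per-word policy, and the uniform choice among error types are all assumptions made for illustration; the abstract specifies only that substitution, deletion, and insertion errors are introduced at random, that words are split into letters, and that context words are reduced to their first three consonants. Input is assumed to be lowercase already.

```python
import random

# The 29 lowercase letters of the Turkish alphabet.
TURKISH_LETTERS = "abcçdefgğhıijklmnoöprsştuüvyz"
VOWELS = set("aeıioöuü")


def corrupt(word, rng):
    """Return `word` with one random substitution, deletion, or insertion.

    Assumption: exactly one error per word, error type chosen uniformly.
    One-letter words always receive an insertion so the result is never empty.
    """
    op = rng.choice(("substitute", "delete", "insert"))
    if op == "insert" or len(word) < 2:
        pos = rng.randrange(len(word) + 1)
        return word[:pos] + rng.choice(TURKISH_LETTERS) + word[pos:]
    pos = rng.randrange(len(word))
    if op == "delete":
        return word[:pos] + word[pos + 1:]
    # Substitute with a letter different from the original one.
    replacement = rng.choice([c for c in TURKISH_LETTERS if c != word[pos]])
    return word[:pos] + replacement + word[pos + 1:]


def first_consonants(word, n=3):
    """Abbreviate a context word to its first `n` consonants."""
    consonants = [c for c in word if c in TURKISH_LETTERS and c not in VOWELS]
    return "".join(consonants[:n])


def build_source(left_context, misspelled, right_context):
    """Build a source token sequence for a seq2seq model.

    Hypothetical layout: abbreviated left context, the misspelled word as
    single letters between <w> and </w>, then abbreviated right context.
    """
    left = [first_consonants(w) for w in left_context]
    right = [first_consonants(w) for w in right_context]
    return left + ["<w>"] + list(misspelled) + ["</w>"] + right


if __name__ == "__main__":
    rng = random.Random(0)
    print(corrupt("kitap", rng))       # one synthetic misspelling of "kitap"
    print(first_consonants("okudum"))  # kdm
    print(build_source(["bugün"], "kitep", ["okudum"]))
```

Under these assumptions, the training pair for a misspelled word would be the sequence returned by `build_source` on the source side and the letters of the correct word on the target side. Passing an explicit `random.Random` instance keeps the synthetic errors reproducible.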
Index Terms
Context-Dependent Sequence-to-Sequence Turkish Spelling Correction