Impact of Tokenization on Language Models: An Analysis for Turkish

Published: 25 March 2023

Abstract

Tokenization is an important text preprocessing step that prepares input tokens for deep language models. WordPiece and BPE are the de facto methods employed by prominent models such as BERT and GPT. However, the impact of tokenization can differ for morphologically rich languages, such as the Turkic languages, in which many words can be generated by adding suffixes and prefixes. We compare five tokenizers at different granularity levels; that is, their outputs range from the smallest pieces of characters to the surface forms of words, and they include a morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the morphological-level tokenizer is competitive with the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of the morphological- and word-level tokenizers more than that of the de facto tokenizers. The ratio of vocabulary parameters to total model parameters can be chosen empirically as 20% for the de facto tokenizers and 40% for the other tokenizers to obtain a reasonable trade-off between model size and performance.
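To make the granularity distinction concrete, the following is a minimal, illustrative sketch of the classic BPE procedure (Sennrich et al., 2016) that the abstract contrasts with morphological- and word-level tokenization. The toy corpus, word frequencies, and merge count are invented for illustration and are not the paper's data or implementation; the example only shows how BPE merges frequent character pairs into subwords that may or may not align with Turkish morpheme boundaries (ev = house, evler = houses, evlerde = in the houses).

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict (toy, Sennrich-style)."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Apply learned merges, in order, to split a word into subword tokens."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy Turkish corpus: ev (house), evler (houses), evlerde (in the houses).
corpus = {"ev": 10, "evler": 6, "evlerde": 4}
merges = learn_bpe(corpus, 3)
print(segment("evlerde", merges))  # → ['evle', 'r', 'd', 'e']
```

Note that the learned subword "evle" crosses the morpheme boundary ev+ler+de; a morphological-level tokenizer would instead emit the morphemes directly, which is the granularity difference the paper evaluates.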


• Published in: ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4 (April 2023), 682 pages. ISSN: 2375-4699. EISSN: 2375-4702. DOI: 10.1145/3588902


Publisher: Association for Computing Machinery, New York, NY, United States

Publication History
• Published: 25 March 2023
• Accepted: 23 December 2022
• Revised: 21 September 2022
• Received: 20 April 2022


          Qualifiers

          • research-article
