Abstract
Tokenization is an important text preprocessing step that prepares input tokens for deep language models. WordPiece and BPE are the de facto methods used by prominent models such as BERT and GPT. However, the impact of tokenization can differ for morphologically rich languages, such as the Turkic languages, in which many words can be generated by adding affixes to a root. We compare five tokenizers at different granularity levels, that is, their outputs range from the smallest character pieces to the surface forms of words, and they include a morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the morphological-level tokenizer is competitive with the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of the morphological- and word-level tokenizers more than that of the de facto tokenizers. As a reasonable trade-off between model size and performance, the ratio of vocabulary parameters to total model parameters can be chosen empirically as 20% for the de facto tokenizers and 40% for the other tokenizers.
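To make the granularity contrast concrete, the following is a minimal pure-Python sketch of the BPE merge-learning loop in the style of Sennrich et al. (2016), not the exact training code used in the paper. The toy word-frequency dict is hypothetical; it mimics Turkish agglutination, where a stem such as "ev" ("house") recurs inside suffixed forms like "evler" and "evlerde", so frequent merges tend to recover stem-like units.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict (Sennrich et al., 2016 style)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the best pair
        # with a single merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy Turkish-like corpus: shared stem "ev" with suffixed forms.
corpus = {"evler": 3, "evlerde": 2, "ev": 5}
merges = learn_bpe(corpus, 2)
print(merges)  # first merge joins 'e' and 'v' into the stem 'ev'
```

A morphological-level tokenizer would instead segment at analyzer-derived morpheme boundaries (e.g. ev+ler+de), which is the distinction the comparison above evaluates.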
Impact of Tokenization on Language Models: An Analysis for Turkish