Analysis of the Effectiveness of Subword Tokenizers in a Low-Resource Linguistic Environment: Implementation Experience for the Tajik Language

Main Article Content

Mullosharaf Kurbonovich Arabov
Svetlana Sergeevna Khaybullina

Abstract

This paper examines modern approaches to subword tokenization as applied to the low-resource Tajik language, which is characterized by complex morphology and a high degree of word-form variability. For the study, a large-scale heterogeneous corpus was compiled and preprocessed, comprising 99 books and 134,497 articles of various genres and topics, with a total volume exceeding 33 million tokens. The corpus was cleaned of noise, normalized, and used as the basis for training and subsequently testing the subword models.
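The cleaning and normalization step described above could be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the specific regular expressions, the NFC normalization choice, and the function name are assumptions for the example.

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Illustrative noise cleaning and normalization for Tajik Cyrillic text."""
    # Unicode NFC normalization so composed Tajik letters are represented uniformly
    text = unicodedata.normalize("NFC", text)
    # Strip HTML remnants left over from web-scraped articles
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove control characters (noise)
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()
```

Applied to every book and article before training, such a step ensures the subword models learn from text rather than markup or encoding debris.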


Based on this corpus, five tokenization models implementing the BPE, WordPiece, and Unigram algorithms were trained and analyzed using the Hugging Face Tokenizers and SentencePiece libraries. Comparative evaluation was conducted on a set of key metrics, including the proportion of out-of-vocabulary (OOV) words, the degree of text compression, tokenization speed, and the characteristics of the n-gram distribution, which together indicate how well the models capture the morphological and structural organization of the language. The experimental results revealed the strengths and weaknesses of the different approaches to subword segmentation and identified the most effective tokenization strategies given the morphological complexity of Tajik. The findings can inform the development of language models and applied NLP tools for Tajik and other low-resource languages, contributing to the expansion of their presence in the digital environment.
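The evaluation metrics named above can be computed directly from tokenizer output. The following is a minimal pure-Python sketch; the function name, the `[UNK]` placeholder, and the whitespace-based word count are illustrative assumptions, not the paper's exact procedure.

```python
from typing import Dict, List

def evaluate_tokenization(texts: List[str],
                          tokenized: List[List[str]],
                          unk_token: str = "[UNK]") -> Dict[str, float]:
    """Compute simple corpus-level tokenization quality metrics."""
    total_tokens = sum(len(toks) for toks in tokenized)
    total_chars = sum(len(t) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    unk_count = sum(toks.count(unk_token) for toks in tokenized)
    return {
        # share of tokens the model could not represent (OOV proxy)
        "oov_rate": unk_count / total_tokens if total_tokens else 0.0,
        # characters per token: higher values mean stronger compression
        "compression": total_chars / total_tokens if total_tokens else 0.0,
        # subword tokens per whitespace-delimited word (fertility)
        "fertility": total_tokens / total_words if total_words else 0.0,
    }
```

For a morphologically rich language such as Tajik, high fertility combined with a low OOV rate typically signals that a tokenizer is splitting words along productive morpheme boundaries rather than falling back to unknown tokens.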

Article Details

How to Cite
Arabov, M. K., and S. S. Khaybullina. “Analysis of the Effectiveness of Subword Tokenizers in a Low-Resource Linguistic Environment: Implementation Experience for the Tajik Language”. Russian Digital Libraries Journal, vol. 29, no. 2, Apr. 2026, pp. 546-64, doi:10.26907/1562-5419-2026-29-2-546-564.

References

1. Ataman D., Aziz W., Federico M. Neural Machine Translation by Minimising the Lexicon Gap with Subword Units // Proc. of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Valencia, Spain, 2017. P. 432–443.
2. Vasiliev V.O., Petrov A.A. Problemy obrabotki maloizvestnykh yazykov v sovremennykh NLP sistemakh [Problems of Processing Low-Resource Languages in Modern NLP Systems] // Zhurnal vychislitel'noi lingvistiki i intellektual'nykh tekhnologii. 2021. No. 2(25). P. 45–58.
3. Sennrich R., Haddow B., Birch A. Neural Machine Translation of Rare Words with Subword Units // Proc. of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). Berlin, Germany, 2016. Vol. 1. P. 1715–1725.
4. Kudo T. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates: preprint arXiv:1804.10959 [cs.CL]. 2018. 10 p. URL: https://arxiv.org/abs/1804.10959 (accessed: 06.04.2025).
5. Kudo T., Richardson J. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing // Proc. of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP). Brussels, Belgium, 2018. P. 66–71.
6. Arabov M.K., Sedykh V.V. Sravnitel'nyi analiz metodov modelirovaniya semanticheskikh predstavlenii slov v usloviyakh ogranichennykh yazykovykh resursov: sluchai tadzhikskogo yazyka [Comparative Analysis of Methods for Modeling Semantic Representations of Words in Low-Resource Settings: The Case of Tajik Language] // Nauchno-tekhnicheskii vestnik Povolzh'ya. 2025. No. 6. P. 196–198.
7. Arabov M.K., Makhmadaliev Kh.S., Khabibullozoda K.Kh. Creating a multiformat text corpus for the Tajik language to train modern language models // Science and Innovation. Series of Geological and Technical Sciences. 2025. No. 2. P. 131–136.
8. Devlin J., Chang M. W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // Proc. of NAACL-HLT. Minneapolis, USA, 2019. P. 4171–4186.
9. Gage P. A New Algorithm for Data Compression // C Users Journal. 1994. Vol. 12, No. 2. P. 23–38.
10. khovar.tj – News Portal of the Republic of Tajikistan. URL: https://www.khovar.tj (accessed: 06.04.2025).
11. Asia Plus – Tajik Information Service. URL: https://asiaplustj.info (accessed: 06.04.2025).
12. Ovozi Tojik – Independent Online Media. URL: https://ovozitojik.tj (accessed: 06.04.2025).
13. Farazh – Online Newspaper of Dushanbe. URL: https://farazh.tj (accessed: 06.04.2025).
14. Bartenev O.O. Otsenka effektivnosti metodov tokenizatsii teksta [Evaluating the Efficiency of Text Tokenization Methods] // Vestnik MEI. 2023. No. 6. P. 15–28.
15. Bostrom A., Durrett G. Comparative Analysis of BPE and Unigram Tokenization in RoBERTa Models. Research Report. 2024. URL: https://iris.ru.is/ws/files/240198035/Language_Representation_Models_for_Low_and_Medium_Resource_Languages.pdf (accessed: 10.12.2025).
16. Comparative Analysis of Subword Tokenization Approaches for Indian Languages // Emergent Mind. 2025. URL: https://www.emergentmind.com/articles/2505.16868 (accessed: 10.12.2025).
17. Mikaberidze B., Nadareishvili T., Abashidze M. A Comparison of Different Tokenization Methods for the Georgian Language // Proc. of ICNLSP 2024. 2024. P. 199–208.
18. Park D., Mehta S., Kudo T. Effects of Subword Segmentation on Multilingual Language Models // Proc. of EMNLP. 2023. P. 3504–3518.
19. Arabov M.K. Tajik Language Tokenizers (v1.1). URL: https://huggingface.co/ArabovMK/tajik-tokenizers-v1 (accessed: 06.04.2025).