Representation of Intraword Syntagmatic Relations in Vector Language Models
Main Article Content
Abstract
The paper discusses semantic structure representation of derivatives in language models, taking into account the intraword syntagmatic relations between derivational morphemes. Experiments were conducted using morphemic models developed by the Russian National Corpus (RNC), as well as fastText and ruRoBERTa models. The study is aimed at the verification of the hypothesis dealing with compositionality of derived words which are represented as aggregated morpheme vectors. In experiments we explore the representation of semantic relationships using fastText morpheme vectors and standard subword vectors in ruRoBERTa. The results indicate moderate sensitivity of fastText vectors to syntagmatic relations between morphemes as well as to derivational types. At the same time, it was found that aggregating morpheme vectors in fastText provides better representation of semantic relations between words compared to aggregating subword vectors in ruRoBERTa.
Standard BPE (Byte-Pair Encoding) and WordPiece tokenizers used in Transformer-based models are poorly interpretable with respect to linguistic data, as word segments do not always correspond to morphemes. The research problem lies in the need to assess the extent to which modern language models can capture linguistic features that characterize the relationships of derived words within word-formation families. The aim of the study is to evaluate the ability of predictive distributed vector embedding models to reproduce syntagmatic connections between morphemes within derived words and at the level of word-formation families in the Russian language.
The obtained results encourage the development of neural network architectures that take into account syntagmatic relations between morphemes, the improvement of morpheme tokenizers, and their integration into language models.
Article Details
References
2. Bolshakova E.I., Sapin A.S. Building a Combined Morphological Model for Russian Word Forms. In: Burnaev, E., et al. Analysis of Images, Social Networks and Texts. AIST 2021. Lecture Notes in Computer Science, vol. 13217. Springer, Cham, 2022. P. 45–55. https://doi.org/10.1007/978-3-031-16500-9_5
3. Bolshakova E.I., Sapin A.S. Building Dataset and Morpheme Segmentation Model for Russian Word Forms. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue”. Moscow, 2021. P. 154–161. https://doi.org/10.28995/2075-7182-2021-20-154-161
4. Morozov D., Shcherbakova O., Glazkova A. Russian Neural Morpheme Segmentation: From Lemmata to Wordforms. In: Bakaev M. et al. Internet and Modern Society. IMS 2025. Communications in Computer and Information Science, vol. 2671. Springer, Cham, 2025. P. 157–167. https://doi.org/10.1007/978-3-032-04958-2_12
5. Morozov D., Astapenka L., Glazkova A., Garipov T., Lyashevskaya O. BERT-like Models for Slavic Morpheme Segmentation. In: Che W., Nabende J., Shutova E., Pilehvar M.T. (Eds.) Proceedings of the Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2025. P. 6795–6815 (Proceedings of the Annual Meeting of the Association for Computational Linguistics). https://doi.org/10.18653/v1/2025.acl-long.337
6. Sorokin A., Kravtsova A. Deep convolutional networks for supervised morpheme segmentation of Russian language. In: Ustalov D., Filchenkov A., Pivovarova L., Zizka J. (Eds.) Artificial Intelligence and Natural Language. P. 3–10. Springer, Cham, 2018. https://doi.org/10.1007/978-3-030-01204-5_1
7. Selkirk E. The syntax of words. Camb. (Mass), 1982. 136 p.
8. Skalička V. Hyposyntax. In: Slovo a slovesnost. Vol. 31. 1970. P. 1–6.
9. Kubryakova E.S. Fundamentals of Morphological Analysis. Moscow, 1974. 320 p.
10. Lopatin V.V. Grammatical Description of Slavic Languages // Word Formation as an Object of Grammatical Description. Moscow, 1974.
11. Lees R. The Grammar of English nominalizations. The Hague, 1963.
12. Marchand H. The Categories and Types of Present-day English Word-Formation. Wiesbaden, 1960.
13. Fiveyskaya E.A. Word-Formation Modeling of the Semantics of Verbal Nouns in the Aspect of Proposition Theory // Siberian Philological Journal. 2010(3). P. 127–133.
14. Fillmore C. The Case for Case // New in Foreign Linguistics. Issue 10. Moscow, 1981.
15. Shadrin V.I. The Semantics of Morphological Components of Derived Words in the English Language in Light of the Categories of Case Grammar // Morphemics. Principles of Segmentation, Identification, and Classification of Morphological Units / Ed. by S.I. Bogdanov, A.S. Gerd. St. Petersburg, 1997. P. 171–177.
16. Morfessor. URL: https://github.com/aalto-speech/morfessor, last access 24.03.2026
17. RussianMorphParsing. URL: https://github.com/alesapin/RussianMorphParsing, last access 24.03.2026
18. ruMorpheme. URL: https://github.com/EvilFreelancer/ruMorpheme, last access 24.03.2026
19. Neuromodels. URL: https://ruscorpora.ru/license-content/neuromodels/, last access 24.03.2026
20. Asgari E., El Kheir Y., Sadraei Javaheri M.A. MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies, 2025. https://doi.org/10.48550/arXiv.2502.00894
21. Teklehaymanot et al. MoVoC: Morphology-Aware Subword Construction for Ge’ez Script Languages. In: Findings of the Association for Computational Linguistics: EMNLP 2025, p. 13131–13144, Suzhou, China. Association for Computational Linguistics, 2025. https://doi.org/10.48550/arXiv.2509.08812
22. Nzeyimana A., Niyongabo Rubungo A. KinyaBERT: a Morphology-aware Kinyarwanda Language Model. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), P. 5347–5363, Dublin, Ireland. Association for Computational Linguistics, 2022.
https://doi.org/10.48550/arXiv.2203.08459
23. Potikha Z.A. School Dictionary of Word Structure of the Russian Language: A Guide for Students. 2nd ed., revised. Moscow: Prosveshchenie, 1999. 318 p.
24. Tikhonov A.N. Morphemic-Orthographic Dictionary of the Russian Language. Moscow: AST: Astrel, 2002. 704 p.
25. .Bojanowski P., Grave E., Joulin A., Mikolov T. Enriching Word Vectors with Subword Information. In: Transactions of the Association for Computational Linguistics, 2017. P. 135–146. https://doi.org/10.48550/arXiv.2309.10931
26. RusVectōrēs. URL: https://rusvectores.org/ru/models/, last access 24.03.2026
27. Zmitrovich D., Abramov A., Kalmykov A., Kadulin V., Tikhonova M., Taktasheva E., Astafurov D., Baushenko M., Snegirev A., Shavrina T., Markov S., Mikhailov V., Fenogenova A. A Family of Pretrained Transformer Language Models for Russian. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, Italia, 2024. P. 507–524. https://doi.org/10.48550/arXiv.2309.10931
28. ruRoBERTa-large. URL: https://huggingface.co/ai-forever/ruRoBERTa-large, last access 24.03.2026

This work is licensed under a Creative Commons Attribution 4.0 International License.
Presenting an article for publication in the Russian Digital Libraries Journal (RDLJ), the authors automatically give consent to grant a limited license to use the materials of the Kazan (Volga) Federal University (KFU) (of course, only if the article is accepted for publication). This means that KFU has the right to publish an article in the next issue of the journal (on the website or in printed form), as well as to reprint this article in the archives of RDLJ CDs or to include in a particular information system or database, produced by KFU.
All copyrighted materials are placed in RDLJ with the consent of the authors. In the event that any of the authors have objected to its publication of materials on this site, the material can be removed, subject to notification to the Editor in writing.
Documents published in RDLJ are protected by copyright and all rights are reserved by the authors. Authors independently monitor compliance with their rights to reproduce or translate their papers published in the journal. If the material is published in RDLJ, reprinted with permission by another publisher or translated into another language, a reference to the original publication.
By submitting an article for publication in RDLJ, authors should take into account that the publication on the Internet, on the one hand, provide unique opportunities for access to their content, but on the other hand, are a new form of information exchange in the global information society where authors and publishers is not always provided with protection against unauthorized copying or other use of materials protected by copyright.
RDLJ is copyrighted. When using materials from the log must indicate the URL: index.phtml page = elbib / rus / journal?. Any change, addition or editing of the author's text are not allowed. Copying individual fragments of articles from the journal is allowed for distribute, remix, adapt, and build upon article, even commercially, as long as they credit that article for the original creation.
Request for the right to reproduce or use any of the materials published in RDLJ should be addressed to the Editor-in-Chief A.M. Elizarov at the following address: amelizarov@gmail.com.
The publishers of RDLJ is not responsible for the view, set out in the published opinion articles.
We suggest the authors of articles downloaded from this page, sign it and send it to the journal publisher's address by e-mail scan copyright agreements on the transfer of non-exclusive rights to use the work.