Representation of Intraword Syntagmatic Relations in Vector Language Models

Daria Kirillovna Rodionova
Olga Aleksandrovna Mitrofanova

Abstract

The paper examines how language models represent the semantic structure of derived words, taking into account the intraword syntagmatic relations between derivational morphemes. Experiments were conducted with the morphemic models developed by the Russian National Corpus (RNC) and with the fastText and ruRoBERTa models. The study tests the hypothesis that derived words are compositional, that is, that a derived word can be represented as an aggregate of its morpheme vectors. The experiments compare how semantic relations are represented by fastText morpheme vectors and by standard subword vectors in ruRoBERTa. The results indicate that fastText vectors are moderately sensitive to syntagmatic relations between morphemes and to derivational types. Aggregated fastText morpheme vectors were also found to represent semantic relations between words better than aggregated ruRoBERTa subword vectors.
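The compositionality hypothesis can be illustrated with a minimal sketch. It assumes that aggregation means averaging morpheme vectors and that semantic relatedness is measured by cosine similarity; the abstract does not specify either choice. The three-dimensional vectors, the morpheme segmentations, and the names `compose` and `cosine` are hypothetical and stand in for real fastText morpheme vectors.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def compose(morphemes, morpheme_vectors):
    """Represent a word as the mean of its morpheme vectors (assumed aggregation)."""
    return np.mean([morpheme_vectors[m] for m in morphemes], axis=0)

# Toy vectors invented for illustration; they are not taken from any trained model.
morpheme_vectors = {
    "уч":   np.array([1.0, 0.0, 0.0]),   # root of "учитель" (teacher)
    "чит":  np.array([0.8, 0.2, 0.1]),   # root of "читатель" (reader)
    "школ": np.array([0.0, 0.1, 1.0]),   # root of "школа" (school)
    "тель": np.array([0.0, 1.0, 0.0]),   # agentive suffix
    "и":    np.array([0.0, 0.1, 0.1]),   # thematic suffix
    "а":    np.array([0.1, 0.0, 0.1]),   # thematic suffix / ending
}

teacher = compose(["уч", "и", "тель"], morpheme_vectors)
reader = compose(["чит", "а", "тель"], morpheme_vectors)
school = compose(["школ", "а"], morpheme_vectors)

# Words built with the same derivational suffix end up closer in this toy space.
sim_same_type = cosine(teacher, reader)
sim_other_type = cosine(teacher, school)
print(f"teacher vs reader: {sim_same_type:.3f}")
print(f"teacher vs school: {sim_other_type:.3f}")
```

With real models, the toy dictionary would be replaced by vectors looked up for each morpheme, and the comparison would be run over word pairs from the same and from different derivational types.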


The standard BPE (Byte-Pair Encoding) and WordPiece tokenizers used in Transformer-based models are hard to interpret linguistically, because the word segments they produce do not always correspond to morphemes. The research problem is to assess the extent to which modern language models capture the linguistic features that characterize relationships between derived words within word-formation families. The aim of the study is to evaluate the ability of predictive distributed vector embedding models to reproduce syntagmatic connections between morphemes, both within derived words and at the level of word-formation families, in the Russian language.
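One way to quantify the mismatch between subword segments and morphemes is to compare boundary positions. The sketch below is only an illustration of that idea and is not the evaluation procedure of the study. The function names are hypothetical, and the subword split shown is invented for the example rather than produced by the ruRoBERTa tokenizer.

```python
def boundaries(segments):
    """Return the set of internal cut positions (character offsets) of a segmentation."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def boundary_scores(predicted, gold):
    """Precision and recall of predicted boundaries against gold morpheme boundaries."""
    p, g = boundaries(predicted), boundaries(gold)
    hits = len(p & g)
    precision = hits / len(p) if p else 1.0
    recall = hits / len(g) if g else 1.0
    return precision, recall

# Gold morpheme segmentation of "переписывать" (to rewrite): prefix, root, suffix, ending.
gold = ["пере", "пис", "ыва", "ть"]
# Hypothetical subword split, invented for illustration only.
subwords = ["пер", "епис", "ывать"]

# Both segmentations must spell the same word for the offsets to be comparable.
assert "".join(gold) == "".join(subwords)

precision, recall = boundary_scores(subwords, gold)
print(f"boundary precision: {precision:.2f}, recall: {recall:.2f}")
```

Low precision and recall on a sample of words would indicate that the tokenizer's segments rarely coincide with morpheme boundaries.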


The results motivate the development of neural network architectures that take into account syntagmatic relations between morphemes, the improvement of morpheme tokenizers, and the integration of such tokenizers into language models.

How to Cite
Rodionova, D. K., and O. A. Mitrofanova. “Representation of Intraword Syntagmatic Relations in Vector Language Models”. Russian Digital Libraries Journal, vol. 29, no. 3, June 2026, pp. 898-918, doi:10.26907/1562-5419-2026-29-3-898-918.
