Types of Embeddings and their Application in Intellectual Academic Genealogy
Abstract
The paper addresses the problem of constructing interpretable vector representations of scientific texts for intellectual academic genealogy. A typology of embeddings is proposed, comprising three classes: statistical, learned neural, and structured symbolic. The study argues for combining the strengths of neural embeddings (high semantic accuracy) with those of symbolic embeddings (interpretable dimensions). To operationalize this hybrid approach, an algorithm for learned symbolic embeddings is introduced, which utilizes a regression-based mapping from a model’s internal representation to an interpretable vector of scores.
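As a minimal illustration of this mapping, the sketch below encodes a text fragment with a compact transformer and passes the pooled representation through a regression head whose outputs form an interpretable vector of concept scores. The encoder name (cointegrated/rubert-tiny2), the concept list, and the head design are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Illustrative concept inventory; the paper's actual topic list is not reproduced here.
CONCEPTS = ["adaptive learning", "assessment", "curriculum design", "digital literacy"]

class SymbolicEmbedder(nn.Module):
    """Compact encoder + regression head: text -> interpretable concept-relevance vector."""

    def __init__(self, encoder_name="cointegrated/rubert-tiny2"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(             # one of several plausible head designs
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, len(CONCEPTS)),  # one output per interpretable dimension
        )

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)  # mean pooling
        return self.head(pooled)               # shape: (batch, len(CONCEPTS))

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = SymbolicEmbedder()
batch = tokenizer(["A fragment of a dissertation abstract in pedagogy."],
                  return_tensors="pt", padding=True, truncation=True)
scores = model(batch["input_ids"], batch["attention_mask"])

In training, such a head would be fit with a mean-squared-error loss against the teacher model's scores, which is what makes each output dimension directly readable as a concept-relevance estimate.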
The approach is evaluated on a corpus of fragments from dissertation abstracts in pedagogy. A compact transformer encoder with a regression head was trained to reproduce topic relevance scores produced by a state-of-the-art generative language model. A comparison of six training setups (three regression-head architectures and two encoder settings) shows that fine-tuning the upper encoder layers is the primary driver of quality improvements. The best configuration achieves R² = 0.57 and a Top-3 accuracy of 74% in identifying the most relevant concepts. These results suggest that, for tasks requiring formalized output representations, a compact encoder with a regression head can approximate a generative model's behavior at substantially lower computational cost. More broadly, further development of algorithms for constructing learned symbolic embeddings contributes to a model of formal knowledge representation in which the convergence of neural and symbolic methods provides both scalable processing of scientific texts and interpretable vector representations of their content.
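One plausible reading of this setup, continuing the sketch above, is given below. It assumes a BERT-style encoder.layer layout inside the SymbolicEmbedder, and it takes "Top-3 accuracy" to mean that the single most relevant true concept appears among the three highest-scoring predictions; the paper may define these details differently.

import numpy as np
from sklearn.metrics import r2_score

def unfreeze_upper_layers(model, n_upper=2):
    """Freeze the whole model, then re-enable gradients for the top encoder
    blocks and the regression head (BERT-style `encoder.layer` layout assumed)."""
    for p in model.parameters():
        p.requires_grad = False
    for block in model.encoder.encoder.layer[-n_upper:]:
        for p in block.parameters():
            p.requires_grad = True
    for p in model.head.parameters():
        p.requires_grad = True

def top3_accuracy(y_true, y_pred):
    """Share of samples whose most relevant true concept falls within the
    three highest predicted scores (one reading of the Top-3 criterion)."""
    hits = [np.argmax(t) in np.argsort(p)[-3:] for t, p in zip(y_true, y_pred)]
    return float(np.mean(hits))

# R² as reported in the abstract, averaged over concept dimensions:
# r2 = r2_score(y_true, y_pred)   # y_true, y_pred: arrays of shape (n_samples, n_concepts)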
References
2. David S.V., Hayden B.Y. Neurotree: A Collaborative, Graphical Database of the Academic Genealogy of Neuroscience // PLoS ONE. 2012. Vol. 7. No. 10. e46608. https://doi.org/10.1371/journal.pone.0046608
3. Lerner I.M., Marinosyan A.Kh., Grigoriev S.G., Yusupov A.R., Anikieva M.A., Garifullina G.A. An Approach to the Formation of Intellectual Academic Genealogy Using Large Language Models // Electromagnetic Waves and Electronic Systems. 2024. Vol. 29. No. 4. P. 108–120. https://doi.org/10.18127/j5604128-202404-09 (In Russ.)
4. Grigoriev S.G., Lerner I.M., Marinosyan A.Kh., Grigorieva M.A. On the Issue of Educational and Methodological Information Selection for Implementing an Adaptive Learning Management System: Algorithm of A Priori Authors Classification // Informatics and Education / Informatika i obrazovanie. 2025. Vol. 40. No. 2. P. 66–78. https://doi.org/10.32517/0234-0453-2025-40-2-66-78 (In Russ.)
5. Marinosyan A.Kh., Grigoriev S.G. Scientific Publications and the Embedding Space of Knowledge // Electronic Libraries / Elektronnye biblioteki. 2026. Vol. 29. No. 2. (In press.) (In Russ.)
6. Salton G., Buckley C. Term-Weighting Approaches in Automatic Text Retrieval // Information Processing & Management. 1988. Vol. 24. No. 5. P. 513–523. https://doi.org/10.1016/0306-4573(88)90021-0
7. Sparck Jones K. A Statistical Interpretation of Term Specificity and Its Application in Retrieval // Journal of Documentation. 1972. Vol. 28. No. 1. P. 11–21. https://doi.org/10.1108/eb026526
8. Mikolov T., Chen K., Corrado G., Dean J. Efficient Estimation of Word Representations in Vector Space // arXiv preprint. 2013. arXiv:1301.3781.
9. Pennington J., Socher R., Manning C.D. GloVe: Global Vectors for Word Representation // Proceedings of EMNLP. 2014. P. 1532–1543.
10. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019. Vol. 1. P. 4171–4186. https://doi.org/10.18653/v1/N19-1423
11. Reimers N., Gurevych I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks // Proceedings of EMNLP. 2019. P. 3982–3992. https://doi.org/10.18653/v1/D19-1410
12. Beltagy I., Lo K., Cohan A. SciBERT: A Pretrained Language Model for Scientific Text // Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2019. P. 3615–3620. https://doi.org/10.18653/v1/D19-1371
13. Wang L., Yang N., Huang X., Yang L., Majumder R., Wei F. Multilingual E5 Text Embeddings: A Technical Report // arXiv preprint. 2024. arXiv:2402.05672.
14. Marinosyan A.Kh., Grigoriev S.G., Lerner I.M., Anikieva M.A. Automated Comparison of Scientific Research Based on Academic Genealogy // Informatics and Education / Informatika i obrazovanie. 2025. Vol. 40. No. 6. P. 16–27. https://doi.org/10.32517/0234-0453-2025-40-6-16-27 (In Russ.)
15. Elizarov A.M., Kirillovich A.V., Lipachev E.K., Nevzorova O.A., Solovyev V.D., Zhiltsov N.G. Mathematical Knowledge Representation: Semantic Models and Formalisms // Lobachevskii Journal of Mathematics. 2014. Vol. 35. No. 4. P. 348–354. https://doi.org/10.1134/S1995080214040143
16. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Attention Is All You Need // Advances in Neural Information Processing Systems. 2017. Vol. 30. P. 5998–6008.
17. Shimanaka H., Kajiwara T., Komachi M. Machine Translation Evaluation with BERT Regressor // arXiv preprint. 2019. arXiv:1907.12679.
18. Viskov V., Kokush G., Larionov D., Eger S., Panchenko A. Semantically-Informed Regressive Encoder Score // Proceedings of the Eighth Conference on Machine Translation (WMT). 2023. P. 815–821. https://doi.org/10.18653/v1/2023.wmt-1.69
19. Gombert S., Menzel L., Di Mitri D., Drachsler H. Predicting Item Difficulty and Item Response Time with Scalar-Mixed Transformer Encoder Models and Rational Network Regression Heads // Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). 2024. P. 483–492. URL: https://aclanthology.org/2024.bea-1.40/ (date accessed: 02.02.2026).
20. Alain G., Bengio Y. Understanding Intermediate Layers Using Linear Classifier Probes // arXiv preprint. 2017. arXiv:1610.01644.
21. Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting // Journal of Machine Learning Research. 2014. Vol. 15. No. 1. P. 1929–1958.
22. Hoerl A.E., Kennard R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems // Technometrics. 1970. Vol. 12. No. 1. P. 55–67. https://doi.org/10.1080/00401706.1970.10488634
23. Pichai S., Hassabis D., Kavukcuoglu K. A new era of intelligence with Gemini 3 // Google. The Keyword. URL: https://blog.google/products-and-platforms/products/gemini/gemini-3/#note-from-ceo (date accessed: 02.02.2026).
24. Elizarov A.M., Kirillovich A.V., Lipachev E.K., Nevzorova O.A. Digital Ecosystem OntoMath as an Approach to Building the Space of Mathematical Knowledge // Russian Digital Libraries Journal. 2023. Vol. 26. No. 2. P. 154–202. https://doi.org/10.26907/1562-5419-2023-26-2-154-202 (In Russ.)

This work is licensed under a Creative Commons Attribution 4.0 International License.
By submitting an article for publication in the Russian Digital Libraries Journal (RDLJ), the authors automatically consent to grant Kazan (Volga Region) Federal University (KFU) a limited license to use the materials, provided that the article is accepted for publication. This means that KFU has the right to publish the article in the next issue of the journal (on the website or in printed form), to reprint it in the RDLJ archive CDs, or to include it in an information system or database produced by KFU.
All copyrighted materials are placed in RDLJ with the consent of the authors. If any author objects to the publication of their materials on this site, the materials can be removed upon written notification to the Editor.
Documents published in RDLJ are protected by copyright, and all rights are reserved by the authors. Authors independently monitor compliance with their rights to reproduce or translate their papers published in the journal. If material published in RDLJ is reprinted with permission by another publisher or translated into another language, a reference to the original publication must be provided.
By submitting an article for publication in RDLJ, authors should take into account that online publication, on the one hand, provides unique opportunities for access to the content, but, on the other hand, represents a new form of information exchange in the global information society, in which authors and publishers are not always protected against unauthorized copying or other use of copyrighted materials.
RDLJ is copyrighted. When using materials from the journal, the URL must be indicated: index.phtml?page=elbib/rus/journal. No changes, additions, or edits to the author's text are allowed. Copying individual fragments of articles from the journal is permitted: users may distribute, remix, adapt, and build upon an article, even commercially, as long as they credit the original publication.
Requests for the right to reproduce or use any of the materials published in RDLJ should be addressed to the Editor-in-Chief A.M. Elizarov at amelizarov@gmail.com.
The publishers of RDLJ are not responsible for the opinions expressed in the published articles.
We suggest that authors download the copyright agreement on the transfer of non-exclusive rights to use the work from this page, sign it, and e-mail a scanned copy to the journal publisher's address.