Types of Embeddings and Their Application in Intellectual Academic Genealogy

Abstract

The paper addresses the problem of constructing interpretable vector representations of scientific texts for intellectual academic genealogy. A typology of embeddings is proposed, comprising three classes: statistical, learned neural, and structured symbolic. The study argues for combining the strengths of neural embeddings (high semantic accuracy) with those of symbolic embeddings (interpretable dimensions). To operationalize this hybrid approach, an algorithm for learned symbolic embeddings is introduced, which utilizes a regression-based mapping from a model’s internal representation to an interpretable vector of scores.
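To make the regression-based mapping concrete, below is a minimal Python sketch, not the paper's implementation: abstract fragments are embedded with a compact multilingual encoder from the E5 family [13], and a ridge regression [22] maps the dense embedding to an interpretable vector of per-concept relevance scores. The concept list, model choice, and training scores are illustrative placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sentence_transformers import SentenceTransformer

CONCEPTS = ["adaptive learning", "assessment", "curriculum design"]  # hypothetical concept vocabulary

# Compact multilingual encoder [13]; the specific model is an assumption.
encoder = SentenceTransformer("intfloat/multilingual-e5-small")

def embed(texts):
    # E5 models expect a task prefix; returns L2-normalized sentence vectors.
    return encoder.encode(["passage: " + t for t in texts], normalize_embeddings=True)

# X: dense embeddings of abstract fragments; Y: per-concept relevance scores
# in [0, 1] produced by a generative "teacher" model (placeholders here).
train_texts = ["...fragment of a dissertation abstract...",
               "...another fragment..."]
teacher_scores = np.array([[0.9, 0.2, 0.4],
                           [0.1, 0.8, 0.3]])

# Ridge regression [22]: an L2-regularized linear map from the dense
# embedding space to the interpretable score space.
mapper = Ridge(alpha=1.0)
mapper.fit(embed(train_texts), teacher_scores)

# The result is a learned symbolic embedding: one readable score per concept.
symbolic = mapper.predict(embed(["new abstract fragment"]))[0]
for concept, score in zip(CONCEPTS, symbolic):
    print(f"{concept}: {score:.2f}")
```

A regularized linear map is a natural first choice here: the teacher scores are continuous, and embedding dimensions are typically correlated, which the L2 penalty handles gracefully.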


The approach is evaluated on a corpus of fragments from dissertation abstracts in pedagogy. A compact transformer encoder with a regression head was trained to reproduce topic relevance scores produced by a state-of-the-art generative language model. A comparison of six training setups (three regression-head architectures and two encoder settings) shows that fine-tuning the upper encoder layers is the primary driver of quality improvement. The best configuration achieves R² = 0.57 and a Top-3 accuracy of 74% in identifying the most relevant concepts. These results suggest that, for tasks requiring formalized output representations, a compact encoder with a regression head can approximate a generative model’s behavior at substantially lower computational cost. More broadly, further development of algorithms for constructing learned symbolic embeddings contributes to a model of formal knowledge representation in which the convergence of neural and symbolic methods provides both scalable processing of scientific texts and interpretable vector representations of their content.
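The following PyTorch sketch illustrates one such configuration under stated assumptions: a compact encoder whose top layers are unfrozen and trained jointly with a dropout-plus-linear regression head [21] under an MSE objective. The model name, layer split, head shape, and hyperparameters are hypothetical, not the configuration evaluated in the paper.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL = "cointegrated/rubert-tiny2"  # hypothetical compact encoder for Russian text
N_CONCEPTS = 20                      # hypothetical size of the concept vocabulary
N_TRAINABLE = 2                      # fine-tune only the top encoder layers

tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

# Freeze the whole encoder, then unfreeze its upper transformer layers:
# per the comparison above, this is where most of the quality gain comes from.
for p in encoder.parameters():
    p.requires_grad = False
for layer in encoder.encoder.layer[-N_TRAINABLE:]:
    for p in layer.parameters():
        p.requires_grad = True

# One of several possible regression heads: dropout [21] plus a linear map
# from the [CLS] representation to per-concept relevance scores.
head = nn.Sequential(
    nn.Dropout(0.1),
    nn.Linear(encoder.config.hidden_size, N_CONCEPTS),
)

params = [p for p in list(encoder.parameters()) + list(head.parameters()) if p.requires_grad]
optimizer = torch.optim.AdamW(params, lr=2e-5)
loss_fn = nn.MSELoss()

def train_step(texts, teacher_scores):
    """One optimization step: encode fragments, pool [CLS], regress onto LLM scores."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] token representation
    loss = loss_fn(head(cls), teacher_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the lower layers keeps the general-purpose linguistic features intact while letting the upper layers specialize for the scoring task, which matches the finding that encoder fine-tuning, not head architecture, drives the quality gains.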

Article Details

How to Cite
Marinosyan, A. K. “Types of Embeddings and Their Application in Intellectual Academic Genealogy”. Russian Digital Libraries Journal, vol. 29, no. 1, Feb. 2026, pp. 240-61, doi:10.26907/1562-5419-2026-29-1-240-261.

References

1. Mulcahy C. The Mathematics Genealogy Project comes of age at twenty-one // Notices of the AMS. 2017. Vol. 64. No. 5. P. 466–470.
2. David S.V., Hayden B.Y. Neurotree: A Collaborative, Graphical Database of the Academic Genealogy of Neuroscience // PLoS ONE. 2012. Vol. 7. No. 10. e46608. https://doi.org/10.1371/journal.pone.0046608
3. Lerner I.M., Marinosyan A.Kh., Grigoriev S.G., Yusupov A.R., Anikieva M.A., Garifullina G.A. An Approach to the Formation of Intellectual Academic Genealogy Using Large Language Models // Electromagnetic Waves and Electronic Systems. 2024. Vol. 29. No. 4. P. 108–120. https://doi.org/10.18127/j5604128-202404-09 (In Russ.)
4. Grigoriev S.G., Lerner I.M., Marinosyan A.Kh., Grigorieva M.A. On the Issue of Educational and Methodological Information Selection for Implementing an Adaptive Learning Management System: Algorithm of A Priori Authors Classification // Informatics and Education / Informatika i obrazovanie. 2025. Vol. 40. No. 2. P. 66–78. https://doi.org/10.32517/0234-0453-2025-40-2-66-78 (In Russ.)
5. Marinosyan A.Kh., Grigoriev S.G. Scientific Publications and the Embedding Space of Knowledge // Electronic Libraries / Elektronnye biblioteki. 2026. Vol. 29. No. 2. (In press.) (In Russ.)
6. Salton G., Buckley C. Term-Weighting Approaches in Automatic Text Retrieval // Information Processing & Management. 1988. Vol. 24. No. 5. P. 513–523. https://doi.org/10.1016/0306-4573(88)90021-0
7. Sparck Jones K. A Statistical Interpretation of Term Specificity and Its Application in Retrieval // Journal of Documentation. 1972. Vol. 28. No. 1. P. 11–21. https://doi.org/10.1108/eb026526
8. Mikolov T., Chen K., Corrado G., Dean J. Efficient Estimation of Word Representations in Vector Space // arXiv preprint. 2013. arXiv:1301.3781.
9. Pennington J., Socher R., Manning C.D. GloVe: Global Vectors for Word Representation // Proceedings of EMNLP. 2014. P. 1532–1543.
10. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019. Vol. 1. P. 4171–4186. https://doi.org/10.18653/v1/N19-1423
11. Reimers N., Gurevych I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks // Proceedings of EMNLP. 2019. P. 3982–3992. https://doi.org/10.18653/v1/D19-1410
12. Beltagy I., Lo K., Cohan A. SciBERT: A Pretrained Language Model for Scientific Text // Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2019. P. 3615–3620. https://doi.org/10.18653/v1/D19-1371
13. Wang L., Yang N., Huang X., Yang L., Majumder R., Wei F. Multilingual E5 Text Embeddings: A Technical Report // arXiv preprint. 2024. arXiv:2402.05672.
14. Marinosyan A.Kh., Grigoriev S.G., Lerner I.M., Anikieva M.A. Automated Comparison of Scientific Research Based on Academic Genealogy // Informatics and Education / Informatika i obrazovanie. 2025. Vol. 40. No. 6. P. 16–27. https://doi.org/10.32517/0234-0453-2025-40-6-16-27 (In Russ.)
15. Elizarov A.M., Kirillovich A.V., Lipachev E.K., Nevzorova O.A., Solovyev V.D., Zhiltsov N.G. Mathematical Knowledge Representation: Semantic Models and Formalisms // Lobachevskii Journal of Mathematics. 2014. Vol. 35. No. 4. P. 348–354. https://doi.org/10.1134/S1995080214040143
16. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Attention Is All You Need // Advances in Neural Information Processing Systems. 2017. Vol. 30. P. 5998–6008.
17. Shimanaka H., Kajiwara T., Komachi M. Machine Translation Evaluation with BERT Regressor // arXiv preprint. 2019. arXiv:1907.12679.
18. Viskov V., Kokush G., Larionov D., Eger S., Panchenko A. Semantically-Informed Regressive Encoder Score // Proceedings of the Eighth Conference on Machine Translation (WMT). 2023. P. 815–821. https://doi.org/10.18653/v1/2023.wmt-1.69
19. Gombert S., Menzel L., Di Mitri D., Drachsler H. Predicting Item Difficulty and Item Response Time with Scalar-Mixed Transformer Encoder Models and Rational Network Regression Heads // Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). 2024. P. 483–492. URL: https://aclanthology.org/2024.bea-1.40/ (date accessed: 02.02.2026).
20. Alain G., Bengio Y. Understanding Intermediate Layers Using Linear Classifier Probes // arXiv preprint. 2017. arXiv:1610.01644.
21. Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting // Journal of Machine Learning Research. 2014. Vol. 15. No. 1. P. 1929–1958.
22. Hoerl A.E., Kennard R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems // Technometrics. 1970. Vol. 12. No. 1. P. 55–67. https://doi.org/10.1080/00401706.1970.10488634
23. Pichai S., Hassabis D., Kavukcuoglu K. A new era of intelligence with Gemini 3 // Google. The Keyword. URL: https://blog.google/products-and-platforms/products/gemini/gemini-3/#note-from-ceo (date accessed: 02.02.2026).
24. Elizarov A.M., Kirillovich A.V., Lipachev E.K., Nevzorova O.A. Digital Ecosystem OntoMath as an Approach to Building the Space of Mathematical Knowledge // Russian Digital Libraries Journal. 2023. Vol. 26. No. 2. P. 154–202. https://doi.org/10.26907/1562-5419-2023-26-2-154-202 (In Russ.)