Types of Embeddings and their Application in Intellectual Academic Genealogy

Andreas Khachaturovich Marinosyan

240-261

Abstract:

The paper addresses the problem of constructing interpretable vector representations of scientific texts for intellectual academic genealogy. A typology of embeddings is proposed, comprising three classes: statistical, learned neural, and structured symbolic. The study argues for combining the strengths of neural embeddings (high semantic accuracy) with those of symbolic embeddings (interpretable dimensions). To operationalize this hybrid approach, an algorithm for learned symbolic embeddings is introduced, which utilizes a regression-based mapping from a model’s internal representation to an interpretable vector of scores.

The approach is evaluated on a corpus of fragments from dissertation abstracts in pedagogy. A compact transformer encoder with a regression head was trained to reproduce topic relevance scores produced by a state-of-the-art generative language model. A comparison of six training setups (three regression-head architectures and two encoder settings) shows that fine-tuning the upper encoder layers is the primary driver of quality improvements. The best configuration achieves R² = 0.57 and a Top-3 accuracy of 74% in identifying the most relevant concepts. These results suggest that, for tasks requiring formalized output representations, a compact encoder with a regression head can approximate a generative model’s behavior at substantially lower computational cost. More broadly, the further development of algorithms for constructing learned symbolic embeddings contributes to building a model of formal knowledge representation in which the convergence of neural and symbolic methods ensures both the scalability of scientific text processing and the interpretability of vector representations that encode their content.
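As an illustration of the two reported metrics, the sketch below computes R² and a Top-3 hit for a single text fragment. It assumes that "Top-3 accuracy" means the teacher model's highest-scoring concept appears among the student's three highest predictions; the abstract does not state the exact definition, and the function names and scores are hypothetical.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination between teacher and student scores."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def top_k_hit(target, predicted, k=3):
    """True if the teacher's top concept is among the student's top-k concepts.

    This definition of a Top-k hit is an assumption made for illustration.
    """
    best = max(range(len(target)), key=target.__getitem__)
    ranked = sorted(range(len(predicted)), key=predicted.__getitem__, reverse=True)
    return best in ranked[:k]


# Hypothetical relevance scores over five concepts for one fragment.
teacher = [0.9, 0.1, 0.4, 0.2, 0.0]
student = [0.5, 0.2, 0.6, 0.55, 0.1]
print(r2_score(teacher, student), top_k_hit(teacher, student))
```

Averaging `top_k_hit` over all fragments would give a corpus-level Top-3 accuracy under this definition.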

Keywords: embeddings, academic genealogy, transformer encoder, regression head, symbolic embeddings, topic profile, natural language processing, interpretability, large language models, scientometrics.

Representation of Intraword Syntagmatic Relations in Vector Language Models

Daria Kirillovna Rodionova, Olga Aleksandrovna Mitrofanova

898-918

Abstract:

The paper discusses how language models represent the semantic structure of derived words, taking into account the intraword syntagmatic relations between derivational morphemes. Experiments were conducted using morphemic models developed by the Russian National Corpus (RNC), as well as fastText and ruRoBERTa models. The study tests the hypothesis that derived words are compositional, that is, that a derived word can be represented as an aggregate of its morpheme vectors. The experiments explore the representation of semantic relationships using fastText morpheme vectors and standard subword vectors in ruRoBERTa. The results indicate moderate sensitivity of fastText vectors to syntagmatic relations between morphemes as well as to derivational types. At the same time, aggregating morpheme vectors in fastText was found to represent semantic relations between words better than aggregating subword vectors in ruRoBERTa.
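The compositionality hypothesis can be illustrated with a minimal sketch: a derived word is represented by the mean of its morpheme vectors, and two words are compared by cosine similarity. Mean pooling is an assumption here (the abstract says only that vectors are aggregated), and the three-dimensional morpheme vectors are invented for the example rather than taken from fastText.

```python
import math


def mean_vector(vectors):
    """Aggregate morpheme vectors by element-wise averaging (assumed scheme)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy morpheme vectors (hypothetical values, not real fastText output).
morphemes = {
    "чит": [0.9, 0.1, 0.0],
    "уч": [0.8, 0.2, 0.1],
    "тель": [0.1, 0.9, 0.3],
}

reader = mean_vector([morphemes["чит"], morphemes["тель"]])   # чит + тель
teacher = mean_vector([morphemes["уч"], morphemes["тель"]])   # уч + тель
print(cosine(reader, teacher))
```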

Standard BPE (Byte-Pair Encoding) and WordPiece tokenizers used in Transformer-based models are poorly interpretable with respect to linguistic data, as word segments do not always correspond to morphemes. The research problem lies in the need to assess the extent to which modern language models can capture linguistic features that characterize the relationships of derived words within word-formation families. The aim of the study is to evaluate the ability of predictive distributed vector embedding models to reproduce syntagmatic connections between morphemes within derived words and at the level of word-formation families in the Russian language.

The obtained results encourage the development of neural network architectures that take into account syntagmatic relations between morphemes, the improvement of morpheme tokenizers, and their integration into language models.

Keywords: language models, morphemic analysis, word-formation methods, compositionality.

Development of an Intelligent Search System for the Mathematical Archive of Publications

Aleksey Alekseevich Nasibulin, Olga Muratovna Ataeva

860-876

Abstract:

A study was conducted on searching for similar documents. The goal was to create a recommendation algorithm for finding similar scientific articles in mathematics using a prioritized search of mathematical formulas with textual support.

The text was converted from graphical to textual representation using OCR technology for subsequent analysis and indexing. During the analysis process, the text was divided into blocks, followed by the extraction of significant formulas, keywords, and phrases from the text. During the indexing process, a vector database was formed based on vector representations of formulas obtained through the embedding process. The indexing results were used to search for articles that are similar to the document submitted by the user to the algorithm input. A list of similar articles is displayed with results sorted by the metric of closeness of vector representations of formulas.

The source data consisted of approximately 5,000 scientific articles devoted to various studies on mathematical topics and presented as PDF files. The experiment was conducted based on data from specific library system content, but the proposed technology can be extended to other library systems, including those containing articles on other topics, such as physics and other exact sciences.

Keywords: formula search, semantics, knowledge extraction, mathematical search, semantic search.

Method of Pre-Assessment of Students' Answers Based on the Vector Model of Documents

Chulpan Bakievna Minnegalieva, Gulshat Alfisovna Sabitova, Almaz Maratovich Gayaliev

324-339

Abstract:

This article discusses the application of vector models for the preliminary analysis of students' free-form answers. Vector representations of words and documents were obtained using word2vec, doc2vec, and BERT models. The similarity between the answer given by the student and the correct answer was determined using the cosine measure. It was found that vector models allow identifying obviously incorrect answers with sufficient accuracy. For answers that are close in wording, an additional verification step is proposed. Using word2vec, binary classification of answers to certain questions was performed, and accuracy, precision, recall and F1-measure estimates were given.
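The screening step can be sketched as follows. To keep the example self-contained, simple bag-of-words counts stand in for the word2vec, doc2vec, and BERT vectors named in the abstract, and the threshold value is hypothetical. Answers below the threshold are treated as obviously incorrect; all others are passed on for additional verification.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a stand-in for the embedding models in the abstract."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pre_assess(answer, reference, threshold=0.3):
    """Reject clearly dissimilar answers; send the rest to further checking.

    The threshold of 0.3 is an illustrative assumption, not a value from the paper.
    """
    similarity = cosine(bow(answer), bow(reference))
    return "reject" if similarity < threshold else "verify"


reference = "a triangle has three sides"
print(pre_assess("a triangle has three sides and three angles", reference))
print(pre_assess("bananas are yellow", reference))
```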

Keywords: vector model, word2vec, doc2vec, BERT, cosine similarity, vector representation.

Word Search in Handwritten Text Based on Stroke Segmentation

Ivan Dmitrievich Morozov, Leonid Moiseevich Mestetskiy

1435-1453

Abstract:

Handwritten archival documents form a fundamental part of humanity's cultural heritage. However, their analysis remains a labor-intensive task for professional researchers, such as historians, philologists, and linguists. Unlike commercial OCR applications, working with historical manuscripts requires a fundamentally different approach due to the extreme diversity of handwriting, the presence of corrections, and material degradation.

This paper proposes a method for searching within handwritten texts based on stroke segmentation. Instead of performing full text recognition, which is often unattainable for historical documents, this method allows for efficiently answering researcher search queries. The key idea involves decomposing the text into elementary strokes, forming semantic vector representations using contrastive learning, followed by clustering and classification to create an adaptive handwriting dictionary.

It is experimentally shown that search by comparing tuples of reduced sequences of the most informative strokes using the Levenshtein distance provides sufficient quality for the task at hand. The method demonstrates resilience to individual handwriting characteristics and writing variations, which is particularly important for working with authors' archives and historical documents.
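A minimal sketch of the matching step, assuming each handwritten word has already been reduced to a tuple of stroke-cluster labels (the integer labels and the distance cutoff below are hypothetical). The Levenshtein distance is computed over label sequences rather than characters.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]


def search(query, words, max_dist=1):
    """Return (distance, index) pairs for words within max_dist of the query."""
    hits = ((levenshtein(query, w), idx) for idx, w in enumerate(words))
    return sorted((d, idx) for d, idx in hits if d <= max_dist)


# Each word is a tuple of stroke-cluster labels (illustrative values).
page_words = [(3, 7, 7, 1), (5, 5, 2), (3, 7, 1)]
print(search((3, 7, 1), page_words))
```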

The proposed approach opens up new possibilities for accelerating scientific research in the humanities, reducing the time required to find relevant information from weeks to minutes, thereby qualitatively transforming research capabilities when working with large archives of handwritten documents.

Keywords: handwritten text, search, stroke analysis, segmentation, vector representation, contrastive learning, clustering.

Archival Handwritten Letter Attribution using Siamese Neural Networks

Nataliia Mikhailovna Pronina

1454-1480

Abstract:

This paper presents a method for the automated attribution of archival handwritten letters based on a Siamese neural network, addressing a key challenge in digital humanities – the authentication of historical documents. The research is motivated by the mass digitization of 17th to 19th-century archives, where attribution is often hindered by incomplete or inaccurate metadata about the authors.

The method is designed for real-world document collections and accounts for challenges typical of archival materials: poor-quality scans, significant handwriting variation, and substantial class imbalance (from 1 to over 50 samples per author). The use of a Siamese network architecture enables the extraction of discriminative vector representations (embeddings). Based on these embeddings, the method not only classifies documents by known authors but also effectively identifies manuscripts that do not match any known author in the archive. This significantly narrows down the pool of candidates for subsequent expert verification.
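The open-set decision described above can be sketched as nearest-centroid matching with a rejection threshold. This assumes embeddings have already been produced by the trained network; the centroid scheme, the threshold, and the two-dimensional vectors are illustrative assumptions rather than the method reported in the paper.

```python
import math


def centroid(vectors):
    """Mean embedding of all samples available for one author."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def attribute(embedding, centroids, threshold):
    """Return the nearest known author, or None if no author is close enough."""
    author, dist = min(
        ((name, math.dist(embedding, c)) for name, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return author if dist <= threshold else None


# Hypothetical embeddings for two known authors.
centroids = {
    "author_A": centroid([[0.0, 0.0], [0.2, -0.2]]),
    "author_B": centroid([[10.0, 0.0]]),
}
print(attribute([1.0, 0.0], centroids, threshold=2.0))   # close to author_A
print(attribute([5.0, 5.0], centroids, threshold=2.0))   # no known author nearby
```

Returning `None` corresponds to flagging a manuscript as not matching any known author, which leaves the final judgment to expert verification.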

The study introduces a data preprocessing algorithm and provides a comparative analysis of two approaches to text analysis: at the image fragment level (300×300 px) and at the individual text line level. The developed tool offers archivists and philologists an effective solution for the preliminary sorting and attribution of large collections of handwritten documents.

Keywords: siamese neural network, identification, verification, attribution, handwritten text, archival documents, convolutional neural network, recurrent neural network.

On the Synonym Search Model

Olga Muratovna Ataeva, Vladimir Alekseevich Serebriakov, Natalia Pavlovna Tuchkova

1006-1022

Abstract:

The paper considers the problem of finding the most relevant documents in response to an extended and refined query. To this end, a search model and a text preprocessing mechanism are proposed, together with the joint use of a search engine and a neural network model built on top of the index using word2vec algorithms. The model generates an extended query with synonyms and refines search results based on a selection of similar documents in a digital semantic library. The paper investigates the construction of paragraph-based vector representations of documents for the data array of the digital semantic library LibMeta. Each piece of text is labeled; both a whole document and its separate parts can be labeled. To enrich user queries with synonyms, the search model was built together with word2vec algorithms using an "indexing first, then training" approach, which covers more information and gives more accurate search results. The model was trained on the library's mathematical content. Examples of training, extended queries, and search quality assessment using training and synonyms are given.
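Query expansion with synonyms can be sketched as below. The neighbour table stands in for the nearest-neighbour lists a trained word2vec model would return; its contents, the function name, and the `top_n` cutoff are hypothetical.

```python
def expand_query(query, neighbours, top_n=2):
    """Append up to top_n related terms for each query word, without duplicates."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for synonym in neighbours.get(term, [])[:top_n]:
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


# Hypothetical neighbour lists, ordered from most to least similar.
neighbours = {
    "equation": ["formula", "identity", "relation"],
    "solution": ["root", "answer"],
}
print(expand_query("Equation solution", neighbours))
```

The expanded term list would then be passed to the search engine in place of the original query.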

Keywords: search model, word2vec algorithm, synonyms, information query, query extension.

Hiding in Meaning: Semantic Encoding for Generative Text Steganography

Oleg Yurievich Rogov, Dmitrii Evgenievich Indenbom, Dmitrii Sergeevich Korzh, Darya Valeryaevna Pugacheva, Vsevolod Alexandrovich Voronov, Elena Viktorovna Tutubalina

1165-1185

Abstract:

We propose a novel framework for steganographic text generation that hides binary messages within semantically coherent natural language using latent-space conditioning of large language models (LLMs). Secret messages are first encoded into continuous vectors via a learned binary-to-latent mapping, which is used to guide text generation through prefix tuning. Unlike prior token-level or syntactic steganography, our method avoids explicit word manipulation and instead operates entirely within the latent semantic space, enabling more fluent and less detectable outputs. On the receiver side, the latent representation is recovered from the generated text and decoded back into the original message. As a key theoretical contribution, we provide a robustness guarantee: if the recovered latent vector lies within a bounded distance of the original, exact message reconstruction is ensured, with the bound determined by the decoder’s Lipschitz continuity and the minimum logit margin. This formal result offers a principled view of the reliability–capacity trade-off in latent steganographic systems. Empirical evaluation on both synthetic data and real-world domains such as Amazon reviews shows that our method achieves high message recovery accuracy (above 91%), strong text fluency and competitive capacity up to 6 bits per sentence element while maintaining resilience against neural steganalysis. These findings demonstrate that latent conditioned generation offers a secure and practical pathway for embedding information in modern LLMs.
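The flavour of the robustness guarantee can be shown on a toy linear decoder. This sketch assumes a decoder of the form f(z) = Wz + b with one logit per message bit, where a bit is 1 when its logit is positive. Each logit is then Lipschitz with constant equal to the norm of its weight row, so any perturbation smaller than (minimum logit margin) / (largest row norm) cannot flip a bit. The exact bound in the paper may differ; the matrices below are invented for the example.

```python
import math


def logits(W, b, z):
    """Linear decoder logits, one per message bit (assumed decoder form)."""
    return [sum(w * x for w, x in zip(row, z)) + bias for row, bias in zip(W, b)]


def decode(W, b, z):
    """Decode bits by thresholding each logit at zero."""
    return [1 if value > 0 else 0 for value in logits(W, b, z)]


def safe_radius(W, b, z):
    """Perturbation norm below which every decoded bit provably stays the same."""
    margin = min(abs(value) for value in logits(W, b, z))
    lipschitz = max(math.sqrt(sum(w * w for w in row)) for row in W)
    return margin / lipschitz


W = [[1.0, 0.0], [0.0, 2.0]]      # hypothetical decoder weights
b = [0.0, 0.0]
z = [1.0, -1.0]                   # original latent vector
z_recovered = [0.7, -0.8]         # latent recovered from the generated text

radius = safe_radius(W, b, z)
print(radius, math.dist(z, z_recovered) < radius)
print(decode(W, b, z) == decode(W, b, z_recovered))
```

A larger margin tolerates more recovery noise but leaves fewer distinguishable messages per latent vector, which is one way to read the reliability and capacity trade-off mentioned above.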

Keywords: steganography, semantic encoding, language models, prefix tuning, knowledge graphs, natural language generation, latent conditioning, neural steganalysis.

A Recommendation System for Finding Semantically Similar Fragments of Program Code

Vitaly Ivanovich Zorin, Evgeny Konstantinovich Lipachev

751-781

Abstract:

Recommendation systems in the scientific information space serve as essential tools for search and navigation when working with scientific documents. Software code is now regarded as an object of scientific knowledge. As a result, it is important to create systems that support the software lifecycle, in particular systems that find similar software solutions, detect code borrowing, and analyze and evaluate code quality.

This paper proposes a content-based recommender system that provides users with a personalized list of code fragments that are functionally equivalent to the input query code presented in one of the programming languages from the established set.

The basic algorithm of the system is based on the representation of the program code in the form of an abstract syntax tree followed by the construction of a vector space of program codes. The semantic similarity of program codes is determined by the distance between code vectors in a multidimensional space.
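A minimal sketch of the idea, restricted to Python source and using the standard `ast` module: each fragment is mapped to a vector of AST node-type counts, and fragments are compared by cosine similarity. Node-type histograms are a crude stand-in chosen for illustration; the abstract does not specify how the system builds its vector space, and the system itself handles multiple languages.

```python
import ast
import math
from collections import Counter


def ast_vector(source):
    """Count AST node types in a Python fragment (illustrative code embedding)."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


increment_a = "def f(x):\n    return x + 1\n"
increment_b = "def g(y):\n    return y + 1\n"      # same structure, renamed identifiers
loop = "for i in range(3):\n    print(i)\n"

print(cosine(ast_vector(increment_a), ast_vector(increment_b)))
print(cosine(ast_vector(increment_a), ast_vector(loop)))
```

Because identifier names do not affect node types, the two renamed fragments map to the same vector, while the structurally different loop scores lower.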

The personalization of recommendations is achieved through a filtering module that ranks the retrieved fragments according to the user's profile. The factors considered are the user's language preferences and areas of scientific interest, extracted through integration with ORCID.

To ensure the system's operation, a specialized dataset was created based on the CodeNet corpus. The problem of automated language detection from a snippet of the presented code in one of the 19 languages included in the current rating list of programming languages has also been solved.

Keywords: abstract syntax tree, code embedding, content-based filtering, cross-language clone, cross language code search, code similarity, recommender system.

Semantic Similarity for Aspect-Based Sentiment Analysis

Evgeny Vyacheslavovich Kotelnikov, Pavel Dmitrievich Blinov

120-137

Abstract:

The article investigates the problem of aspect-based sentiment analysis. This form of analysis is more challenging than the general task of sentiment detection. It requires solving a number of related subtasks, such as aspect term extraction, aspect term polarity detection, and aspect category polarity detection. Solving the aspect-based sentiment analysis problem significantly extends the capabilities of natural language processing systems.

The article gives an overview of previous work in the field and describes the training and test data from the Russian evaluation workshop SentiRuEval. For the task of aspect term extraction, a vector space of distributed word representations was used. Aspect term detection is based on the mutual information method and semantic similarity. The paper presents a number of experimental results and closes with conclusions.

Keywords: aspect-based sentiment analysis, mutual information, distributed representations of words, machine learning, SentiRuEval.

Formation of Structured Representations of Scientific Journals for Integration into a Knowledge Graph and Semantic Search

Olga Muratovna Ataeva, Mikhail Gennadievich Kobuk

1306-1323

Abstract:

This paper examines the development of SciLibRu, a library of scientific subject areas, as a continuation of the semantic description of scientific works in the LibMeta library project. The library is based on a conceptual data model whose structure and semantics follow the principles of ontological modeling. This approach ensures a strict description of the subject area, formalization of the relationships between entities, and the possibility of further automated data analysis. The goal of the study is to develop and experimentally apply methods for structuring scientific journal data in LaTeX format for their integration into the library ontology and to support semantic search.

An algorithm for translating data represented by multiple files into XML format is proposed for integration into the library ontology. A vector search module based on embedding calculation using language models is implemented. Patterns in the distribution of embeddings and factors influencing the accuracy of search results ranking are identified. Testing of the two components is conducted.

The developed method forms the basis for automatically incorporating scientific journal data into the SciLibRu knowledge graph and creating training corpora for language models limited to scientific subject areas. The obtained results contribute to the development of journal knowledge graph navigation systems, recommendation engines, and intelligent search tools for Russian-language scientific texts.

Keywords: semi-structured data, text structuring, LaTeX, vector representations of text, full-text search, semantic search.
