This thematic issue of the journal «Электронные библиотеки» ("Electronic Libraries") comprises articles based on papers presented at the All-Russian Conference with International Participation "Current Problems of Semantic Data Analysis", held on 16–19 February 2026 in Kolomna. It was the first conference on this topic and was dedicated to the 80th anniversary of the birth of Vladimir Alekseevich Serebryakov (1946–2024).
Key topics of the conference:
o data mining;
o knowledge extraction and analysis of scientific data;
o ontologies, knowledge graphs, and knowledge management;
o neuro-symbolic and LLM-oriented methods of data analysis;
o information retrieval and text analysis;
o digital libraries, metadata, and scholarly communication;
o semantic search, quality assessment, and security;
o data mining for information security tasks.
The main goals of the conference were to bring together specialists, researchers, and students to discuss current problems of semantic data analysis and to exchange results and experience, as well as to foster interdisciplinary scientific dialogue. The conference was organized by staff of the Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences and supported by the Russian Association for Artificial Intelligence.
Compiling editors: K. V. Vorontsov, E. K. Lipachev, N. P. Tuchkova
Published: 16.06.2026
Part 1: Special issue "Semantic Data Analysis: Models and Applications"
Orchestration of Methods of Scientific Data Analysis in the Review Processes
This paper explores the problem of combining methods in the semantic analysis of scientific data and publications during review. At different stages of data processing in the SciLibRu system, various methods are used, a multi-level ontology is constructed, and a knowledge graph is populated, resulting in the formation of a new data structure distinct from the original. Each method individually serves its purpose in such a system, while their combined use leads to the emergence of new properties, which became the subject of this research. An example of an automatic peer review agent with explainable results is provided.
Improving Short Text Classification Robustness to Stochastic Noise Based on Density-Driven Training Data Cleaning
The paper addresses the problem of short text request classification under conditions of significant class imbalance and high noise levels in real-world communication flows. The limited effectiveness of synthetic oversampling techniques when dealing with noisy labeling is demonstrated. A hybrid method is proposed, combining preliminary density-based data cleaning and multi-level model ensembling. The application of a density-based clustering algorithm enabled the exclusion of 16.5% of informational noise from the total sample volume. The final model features a two-level architecture and is optimized using Bayesian hyperparameter search. A Recall@3 (R@3) metric of 97.4% was achieved on a hold-out test set. The proposed method allows for the automation of the request distribution process, significantly reducing operator workload and decreasing dispatch time.
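The Recall@3 figure above can be read as a top-k hit rate. The following is a minimal sketch, assuming each request has exactly one true class and the model returns classes ranked by confidence; the function name `recall_at_k` and the request categories are hypothetical and do not come from the paper.

```python
from typing import Sequence


def recall_at_k(true_labels: Sequence[str],
                ranked_predictions: Sequence[Sequence[str]],
                k: int = 3) -> float:
    """Share of samples whose true label is among the top-k ranked predictions."""
    if len(true_labels) != len(ranked_predictions):
        raise ValueError("inputs must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for truth, ranked in zip(true_labels, ranked_predictions)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical request categories, each list ordered by model confidence.
y_true = ["billing", "outage", "billing", "install"]
y_ranked = [
    ["billing", "install", "outage"],
    ["install", "billing", "outage"],
    ["outage", "install", "other", "billing"],  # true class is ranked 4th: a miss
    ["install", "other", "billing"],
]

score_at_3 = recall_at_k(y_true, y_ranked, k=3)
score_at_1 = recall_at_k(y_true, y_ranked, k=1)
```

Under this reading, a request counts as correctly routed when the right class appears anywhere in the three suggestions shown to an operator.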
Methods for Automatic Assignment of UDC Codes to Mathematical Articles: an Evaluation of Classical and Neural Approaches
Universal Decimal Classification (UDC) is a hierarchical indexing system in which a publication may be assigned one or several codes. Manual UDC indexing is labor-intensive and often inconsistent. This paper addresses the automatic assignment of UDC codes to Russian-language mathematical research articles. The aim is to compare combinations of text representations and classification models on a unified corpus and to identify the most effective configurations. A corpus of 4194 articles was collected from Math-Net.Ru, including full texts, abstracts, metadata, and UDC codes. The preprocessing pipeline comprised PDF text extraction, removal of layout artifacts, and normalization of UDC labels. We compared TF-IDF, Word2Vec, SciRus-tiny, and SciRus-tiny3.5 representations combined with logistic regression, Complement Naive Bayes (CNB), and CatBoost. In both the single-label and multi-label settings, the best performance was achieved by TF-IDF + LogReg, while TF-IDF + CNB showed closely competitive results. The proposed approach can be used in automatic subject indexing systems for digital libraries and scientific archives, in UDC recommendation tools for authors and editors, and in metadata quality control workflows.
Ontological Approach to Knowledge Graph Assessment in the Domain of Mechanical Product Lifecycle Management Systems
This paper examines the application of an ontological approach to constructing a dataset for evaluating and comparing context enrichment systems for large language models using knowledge graphs in the domain of mechanical product lifecycle management systems. In this domain, obtaining the required amount of textual data with a formal logical structure to form an evaluation set without using generated synthetic data is challenging. To avoid introducing distortions and hallucinations when generating evaluation data, a novel solution to the data deficiency is proposed. This solution involves extracting an ontology directly from product and assembly files compliant with the STandard for the Exchange of Product model data (STEP, ISO 10303). This potentially enables the use of all product data as a source for scaling evaluation data. The goal of this paper is to create a dataset of structured textual data in the domain of mechanical product lifecycle management systems, develop an evaluation methodology, and implement context enrichment pipelines for large language models with and without knowledge graphs to analyze the contribution of data-structure-extracting systems to the quality of generated responses. The paper makes four contributions: it proposes a new source of evaluation data, develops a new methodology for generating textual evaluation data while preserving its logical structure, implements a pipeline for using the generated evaluation data, and reports evaluation results that confirm the positive contribution of structured-data extraction systems to the quality of generated responses in this domain.
AI Copilot for Designing Radiation Protection Shield
This paper examines the current challenge of developing an intelligent agent for modeling the characteristics of electronic equipment protection shields.
The aim is to develop a methodology and software implementation for an intelligent agent that will simplify the analysis of various design solutions and provide decision support for design engineers. An intelligent agent has been developed that automates the process of preparing a description of an alternative design solution for subsequent modeling using the Geant4 software package. Integrating the software module into computing platforms will improve the work of design engineers by reducing routine manual operations, minimizing human error, and ensuring reproducible results.
A Recommendation System for Finding Semantically Similar Fragments of Program Code
Recommendation systems in the scientific information space serve as essential tools for search and navigation when working with scientific documents. Software code is now regarded as an object of scientific knowledge; as a result, an important task is to create software lifecycle support systems that, in particular, find similar software solutions, detect code borrowing, and analyze and evaluate code quality.
This paper proposes a content-based recommender system that provides users with a personalized list of code fragments that are functionally equivalent to the input query code presented in one of the programming languages from the established set.
The basic algorithm of the system is based on the representation of the program code in the form of an abstract syntax tree followed by the construction of a vector space of program codes. The semantic similarity of program codes is determined by the distance between code vectors in a multidimensional space.
The personalization of recommendations is achieved through a filtering module that ranks the retrieved fragments according to the user's profile. The factors considered are the user's language preferences and areas of scientific interest, extracted through integration with ORCID.
To support the system's operation, a specialized dataset was created based on the CodeNet corpus. The problem of automatically detecting the language of a submitted code snippet, written in one of the 19 languages included in the current ranking of programming languages, has also been solved.
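The idea of comparing programs through their abstract syntax trees can be illustrated with a deliberately simplified sketch. It assumes Python-only input, represents each snippet as a bag of AST node types, and compares snippets by cosine similarity; the vectorization used in the paper is not described in the abstract, so `ast_vector` and `cosine_similarity` are hypothetical helpers rather than the authors' algorithm.

```python
import ast
import math
from collections import Counter


def ast_vector(source: str) -> Counter:
    """Count AST node types in a Python snippet (a crude bag-of-nodes vector)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two snippets with the same structure but different identifiers,
# and one structurally different snippet.
loop_sum = "total = 0\nfor x in items:\n    total += x\n"
loop_sum_renamed = "acc = 0\nfor value in data:\n    acc += value\n"
call_only = "print(len(items))\n"

sim_same = cosine_similarity(ast_vector(loop_sum), ast_vector(loop_sum_renamed))
sim_diff = cosine_similarity(ast_vector(loop_sum), ast_vector(call_only))
```

Because identifiers are ignored, renaming variables leaves the vector unchanged, which is the property that makes tree-based representations useful for detecting functionally similar code.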
Construction and Annotation of a Russian-Language News Corpus for Automated Detection of Political Manipulation
This paper addresses the challenge of developing specialized corpus resources for the automated analysis of political manipulation in Russian-language media discourse. Although semantic text analysis and computational discourse analysis have advanced substantially in recent years, most existing corpora and annotation schemes are designed for English-language data and do not adequately capture the linguistic and discursive characteristics of Russian-language news media. The objective of this study is to construct a specialized corpus of Russian-language news texts and to develop an annotation scheme tailored to the automated analysis of political manipulation, with explicit consideration of the linguistic and discursive features of the Russian-language media environment. The study introduces a corpus of sentence-level fragments extracted from Russian-language news texts published between 2010 and 2019, together with an annotation scheme for manipulative techniques. The scheme is based on an adaptation of established international classifications of manipulative strategies and is reduced to a limited set of interpretable techniques relevant to Russian-language news discourse. The proposed framework covers emotional, argumentative, and contextual forms of manipulative influence. The resulting corpus and annotation scheme provide an empirical foundation for the development and evaluation of automated methods for analyzing political manipulation in Russian-language news media and may also support further research in media and political discourse.
Engineering and Automatic Construction of a Knowledge Graph “Mathematical Equations”
We propose an approach to engineering and implementing a knowledge graph for representing and storing knowledge about mathematical equations. We have developed a knowledge graph prototype that represents knowledge about the main types of mathematical equations, including algebraic equations, ordinary differential equations, partial differential equations, and integral equations. We designed the knowledge graph of mathematical equations as a mathematical artifact. We are integrating this artifact into the digital ecosystem of the Lobachevskii Digital Mathematical Library; therefore, we took the ecosystem's general compatibility requirements into account during the design. We have developed software tools for extracting and processing information about equations presented in digital libraries and electronic scientific resources. The current version of the knowledge graph prototype is based on the OntoMathPRO ontology of professional mathematics and a taxonomy of equations built on information extracted from the web pages of the EqWorld portal ("The World of Mathematical Equations"). We expanded the OntoMathPRO ontology with new equation classes and new relationships to align with the equation type hierarchy presented on the EqWorld portal. We implemented a set of software modules that support the full cycle of knowledge graph generation, including a module for automatically extracting entities from external sources, a module for linking entities to OntoMathPRO ontology concepts, and a module for converting the acquired knowledge into an RDF representation and then storing it in a data warehouse. The knowledge graph supports SPARQL queries.
An Ontological Approach to Designing a Microservice Architecture
Despite the widespread use of microservice architecture in the development of software systems, there is no formalized approach that ensures consistent and guaranteed interaction of microservices at the level of transmitted data, which leads to integration errors and complicates the maintenance of distributed systems. The purpose of the study is to develop an approach to organizing microservice interaction based on ontological modeling, providing formalization of data structures and automated validation of messages. The paper presents a method for converting formal descriptions of data schemas into ontological models based on the GraphQL schema specification. This method automates the data validation process and reduces the number of integration errors. An ontological model has been developed that provides an analysis of dependencies between microservices and a mechanism for validating message contracts.
The practical significance of the work lies in achieving a consistent description of microservices, operations, and message formats as a result of using an ontological approach. The representation of the ontology in the form of a graph makes it possible to analyze the dependencies between microservices and simplifies the maintenance of large distributed systems.
Integration of Semantic Mathematical Modeling for the Analysis of Energy Security Problems
The study addresses the problem of integrating cognitive and mathematical modeling in research on the development directions of the fuel and energy complex, taking into account energy security requirements. The work is relevant because, in the existing two-level research methodology, the transition from the results of qualitative analysis using cognitive modeling to the parameters of the mathematical model is largely performed manually, which reduces the reproducibility of numerical experiments and limits how efficiently accumulated knowledge is used. The aim of the work is to develop a software component that ensures the combined use of cognitive and mathematical models within an Energy Knowledge Ecosystem. A software component is proposed, implemented as part of the INTEC‑SAW suite, which transforms changes in the cognitive model into the parameters of the economic-mathematical model and performs the reverse interpretation of calculation results. A technology for conducting numerical experiments has been developed, including the construction of semantic (ontological and cognitive) models, formation of computational scenarios, execution of optimization calculations, and presentation of results; it is distinguished by the automation of the joint use of ontological, cognitive, and economic-mathematical models. To account for uncertainty, a numerical method of stochastic parameter adjustment based on cognitive weights is proposed. The effectiveness of the approach is demonstrated through a numerical experiment investigating the impact of CO₂ emission constraints on the energy balances of the Siberian Federal District. The practical significance of the work lies in increasing the validity and reproducibility of research on the development of the fuel and energy complex through the coordinated use of qualitative and quantitative analysis tools.
Development of an Intelligent Search System for the Mathematical Archive of Publications
A study was conducted on searching for similar documents. The goal was to create a recommendation algorithm for finding similar scientific articles in mathematics using a prioritized search of mathematical formulas with textual support.
The text was converted from graphical to textual representation using OCR technology for subsequent analysis and indexing. During the analysis process, the text was divided into blocks, followed by the extraction of significant formulas, keywords, and phrases from the text. During the indexing process, a vector database was formed based on vector representations of formulas obtained through the embedding process. The indexing results were used to search for articles that are similar to the document submitted by the user to the algorithm input. A list of similar articles is displayed with results sorted by the metric of closeness of vector representations of formulas.
The source data consisted of approximately 5,000 scientific articles devoted to various studies on mathematical topics and presented as PDF files. The experiment was conducted based on data from specific library system content, but the proposed technology can be extended to other library systems, including those containing articles on other topics, such as physics and other exact sciences.
Model and Architecture of Multi-Level Similarity Analysis of Android Applications based on Static Features
The paper addresses the problem of multi-level similarity analysis of Android applications based on static features in digital application collections. Such collections may contain duplicates, forks, repackaged builds, and other modified variants; malicious payloads are treated as a special case of modification rather than as a synonym of repackaging. The paper formulates a similarity function for Android applications, introduces a static application model as the working object of comparison, and presents a multi-level pipeline that separates candidate screening, in-depth pairwise analysis, result interpretation, and a decision layer. Meaningful similarity signals are sought not only in classes.dex bytecode, but also in AndroidManifest.xml, resources, APK-internal metadata, and library dependencies. A numerical similarity score is computed only when static models are built successfully; otherwise the pipeline records a dedicated technical failure status together with a normalized failure reason. Preliminary evidence is reported on a local pilot set of five core pairs and two boundary cases. These results indicate that explicit handling of shared library code may improve interpretability, but they do not yet constitute a full validation of the proposed architecture on large collections.
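The rule that a score exists only when both static models were built can be sketched as follows. This is an illustration under stated assumptions, not the paper's similarity function: the per-layer Jaccard index, the weighted average, the layer names, the weights, and the failure reason string `static_model_unavailable` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional

# Assumption: a static model maps a layer name to a set of extracted features.
StaticModel = Dict[str, FrozenSet[str]]


@dataclass(frozen=True)
class ComparisonResult:
    status: str                           # "ok" or "technical_failure"
    score: Optional[float] = None         # present only when status == "ok"
    failure_reason: Optional[str] = None  # present only on failure


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard index; two empty layers are treated as identical (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compare(model_a: Optional[StaticModel],
            model_b: Optional[StaticModel],
            weights: Dict[str, float]) -> ComparisonResult:
    """Weighted per-layer similarity, or a failure record if a model is missing."""
    if model_a is None or model_b is None:
        return ComparisonResult("technical_failure",
                                failure_reason="static_model_unavailable")
    total_weight = sum(weights.values())
    weighted = sum(
        weight * jaccard(model_a.get(layer, frozenset()),
                         model_b.get(layer, frozenset()))
        for layer, weight in weights.items()
    )
    return ComparisonResult("ok", score=weighted / total_weight)


weights = {"code": 0.5, "manifest": 0.25, "resources": 0.25}
app_a = {
    "code": frozenset({"a", "b", "c", "d"}),
    "manifest": frozenset({"INTERNET", "CAMERA"}),
    "resources": frozenset({"icon.png", "strings.xml"}),
}
app_b = {
    "code": frozenset({"a", "b", "c", "e"}),
    "manifest": frozenset({"INTERNET", "CAMERA"}),
    "resources": frozenset({"icon.png"}),
}

result = compare(app_a, app_b, weights)
failure = compare(None, app_b, weights)
```

Keeping the failure status separate from the numeric score avoids reporting a misleading 0.0 for a pair that could not be analyzed at all.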
Representation of Intraword Syntagmatic Relations in Vector Language Models
The paper discusses the representation of the semantic structure of derivatives in language models, taking into account the intraword syntagmatic relations between derivational morphemes. Experiments were conducted using morphemic models developed by the Russian National Corpus (RNC), as well as fastText and ruRoBERTa models. The study aims to verify the hypothesis of the compositionality of derived words represented as aggregated morpheme vectors. In the experiments, we explore the representation of semantic relationships using fastText morpheme vectors and standard subword vectors in ruRoBERTa. The results indicate moderate sensitivity of fastText vectors to syntagmatic relations between morphemes as well as to derivational types. At the same time, aggregating morpheme vectors in fastText was found to represent semantic relations between words better than aggregating subword vectors in ruRoBERTa.
Standard BPE (Byte-Pair Encoding) and WordPiece tokenizers used in Transformer-based models are poorly interpretable with respect to linguistic data, as word segments do not always correspond to morphemes. The research problem lies in the need to assess the extent to which modern language models can capture linguistic features that characterize the relationships of derived words within word-formation families. The aim of the study is to evaluate the ability of predictive distributed vector embedding models to reproduce syntagmatic connections between morphemes within derived words and at the level of word-formation families in the Russian language.
The obtained results encourage the development of neural network architectures that take into account syntagmatic relations between morphemes, the improvement of morpheme tokenizers, and their integration into language models.
Methods for the Automated Extraction of Program Parameters and Descriptions for their Integration into Computing Systems
This article addresses the problem of coordinating heterogeneous software tools in heterogeneous distributed application execution environments. Here, manually configuring launch parameters for newly installed programs on a computing cluster (such as command-line switches, environment variable values, and configuration file settings) poses significant challenges for domain researchers due to the large volume of utility information and the need to store and aggregate information in a fixed format. We propose a method for the automated extraction of launch parameters based on a hybrid neural network training architecture that combines the generation of training samples using large language models with the subsequent fine-tuning of a compact transformer encoder. This approach eliminates the need for expensive graphics accelerators by applying the Low-Rank Adaptation (LoRA) technique to models with up to 1 billion parameters, enabling model execution (inference) on standard CPUs in control nodes. To formalize the quality of extraction, a two-component metric has been developed that aggregates the structural correctness of the output JSON schema (the presence of required fields and program parameter types in the obtained data) and the semantic accuracy of parameter values (correspondence with the description in the documentation). The experimental evaluation of the method focuses on a corpus of software package documentation (man pages, README files). The design results confirm the possibility of approximating the documentation analysis process with a compact model, which contributes to the automation of the software deployment lifecycle and the reduction of task flow management errors in distributed computing systems.
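A two-component quality score of the kind described above might be computed as in the sketch below. The abstract does not state how the components are aggregated, so the weighted average with `alpha`, the required field names, and exact string matching of values are assumptions made purely for illustration.

```python
import json
from typing import Dict

# Hypothetical schema: every extracted parameter needs these string fields.
REQUIRED_FIELDS = {"name": str, "type": str, "value": str}


def _load_records(raw_output: str) -> list:
    """Parse model output; anything other than a JSON list yields no records."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


def structural_score(raw_output: str) -> float:
    """Share of records that contain all required fields with the right types."""
    records = _load_records(raw_output)
    if not records:
        return 0.0
    valid = sum(
        1
        for record in records
        if isinstance(record, dict)
        and all(isinstance(record.get(field), kind)
                for field, kind in REQUIRED_FIELDS.items())
    )
    return valid / len(records)


def semantic_score(raw_output: str, reference: Dict[str, str]) -> float:
    """Share of reference parameters whose extracted value matches exactly."""
    if not reference:
        return 0.0
    extracted = {
        record["name"]: record.get("value")
        for record in _load_records(raw_output)
        if isinstance(record, dict) and isinstance(record.get("name"), str)
    }
    matches = sum(1 for name, value in reference.items()
                  if extracted.get(name) == value)
    return matches / len(reference)


def combined_score(raw_output: str, reference: Dict[str, str],
                   alpha: float = 0.5) -> float:
    """Weighted average of both components (the aggregation is an assumption)."""
    return (alpha * structural_score(raw_output)
            + (1.0 - alpha) * semantic_score(raw_output, reference))


model_output = json.dumps([
    {"name": "--threads", "type": "int", "value": "4"},
    {"name": "--input", "type": "path"},  # missing "value": structurally invalid
])
reference = {"--threads": "4", "--input": "data.txt"}

quality = combined_score(model_output, reference)
```

Separating the two components makes it possible to tell a model that emits malformed JSON apart from one that emits well-formed but incorrect values.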
The System for the Automatic Generation, Processing, and Management of Document Metadata in Digital Collections
The publishing cycle is currently undergoing significant technological changes: automated publication management systems are being implemented, neural network technologies are being used for content processing, and tools for the intelligent analysis of scientific data are being actively developed. One of the key trends is the automation of the publishing cycle, aimed at accelerating manuscript processing, improving the quality of metadata, and ensuring the interoperability of information resources. In this context, metadata serves as a connecting element for machine processing and navigation within the scientific knowledge space, ensuring the structuring, interpretation, and integration of information into digital library systems. However, metadata for scientific publications are often erroneous, inaccurate, or incomplete, and their manual creation and refinement is time-consuming and does not ensure high accuracy. The aim of this work is to design and develop a system for the automatic generation, processing, and management of metadata for scientific documents based on data obtained from scientific publication search services and open knowledge bases. The system can be used to automate the process of extracting, refining, and supplementing the metadata of scientific publications for the purpose of subsequently creating electronic collections of scientific documents.
On the Applicability of Neural Networks in the Publishing Industry
The paper assesses the limits of applicability of large language models in editorial tasks within the publishing process and identifies the optimal format of interaction between humans and algorithmic systems.
The methodological basis of the study is a comparative experiment in which several popular neural network models — Alice AI, GigaChat, DeepSeek, Gemini, and ChatGPT — performed a statistical analysis of a control text in Russian. The quantitative characteristics of the text were determined: the number of words, characters with and without spaces, and the number of paragraphs. The obtained results were compared with reference values established using the MS Word text editor, which applies a deterministic character-counting algorithm.
The results of the experiment showed that neural network models demonstrate varying degrees of accuracy when performing tasks of quantitative text analysis. The main reason for such errors lies in the architecture of large language models and the use of tokenization algorithms, which break the direct connection between characters and the model’s internal representation of the text.
Based on the results obtained, the paper proposes the concept of a hybrid architecture for publishing information systems, in which generative language models are used to perform creative and analytical tasks, while operations requiring strict formal accuracy are assigned to specialized deterministic microservices. The proposed approach makes it possible to improve the reliability and predictability of intelligent publishing systems.
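The kind of operation that the hybrid architecture delegates to a deterministic service can be sketched in a few lines. The counting conventions below (words split on whitespace, paragraphs as non-empty lines, paragraph breaks excluded from character counts) are assumptions chosen for the example and are not claimed to reproduce the MS Word algorithm used as the reference in the paper; `text_stats` is a hypothetical name.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TextStats:
    words: int
    chars_with_spaces: int
    chars_without_spaces: int
    paragraphs: int


def text_stats(text: str) -> TextStats:
    """Deterministic counts that operate directly on characters, not tokens."""
    paragraphs = [line for line in text.splitlines() if line.strip()]
    body = "".join(paragraphs)  # line breaks are not counted as characters
    return TextStats(
        words=len(text.split()),
        chars_with_spaces=len(body),
        chars_without_spaces=len(re.sub(r"\s", "", body)),
        paragraphs=len(paragraphs),
    )


sample = "First paragraph of text.\nSecond paragraph."
stats = text_stats(sample)
```

The same input always produces the same counts, which is the property that a language model working on tokenized text cannot guarantee.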
Part 2: Original articles
Cognitive Model for Control of a Peltier Thermoelement
The article presents an ontological model of a control system for a Peltier thermoelectric element. The ontology describes the structure of the system by identifying objects, transformation processes within these objects, and the attributes of the relationships between them. Based on the developed ontological model, a cascade control system has been designed, integrating a PID controller, a fuzzy-digital filter, and an exponential-averaging filter, with its cognitive behavior governed by fuzzy logic rules. Improvement of the dynamic characteristics of transient processes in the Peltier element control system is achieved through the application of the mathematical and ontological solutions specified in the model. The cascade control system reduces the amplitude of the first harmonic of the control signal by 12% and decreases the transient response time by 31.9%.
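Two of the components named above, the exponential-averaging filter and the PID controller, can be sketched generically. The sketch assumes that the filtered sensor reading feeds the PID controller; the gains, the smoothing factor, and the temperature values are hypothetical, and the fuzzy-digital filter and fuzzy rule base from the article are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExponentialFilter:
    alpha: float                   # smoothing factor in (0, 1]
    state: Optional[float] = None  # last filtered value

    def update(self, sample: float) -> float:
        """Blend the new sample with the previous filtered value."""
        if self.state is None:
            self.state = sample
        else:
            self.state = self.alpha * sample + (1.0 - self.alpha) * self.state
        return self.state


@dataclass
class PID:
    kp: float
    ki: float
    kd: float
    dt: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, setpoint: float, measurement: float) -> float:
        """Discrete PID step: proportional + integral + derivative terms."""
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (0.0 if self.prev_error is None
                      else (error - self.prev_error) / self.dt)
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical values: noisy temperature readings and a 25-degree setpoint.
sensor_filter = ExponentialFilter(alpha=0.5)
controller = PID(kp=2.0, ki=0.5, kd=0.1, dt=1.0)
raw_temperatures = [20.0, 24.0, 23.0]

control_signals = []
for raw in raw_temperatures:
    smoothed = sensor_filter.update(raw)
    control_signals.append(controller.update(setpoint=25.0, measurement=smoothed))
```

Smoothing the measurement before it reaches the derivative term is a common way to keep sensor noise from being amplified in the control signal.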
Algorithms for Individualizing Learning based on the Composition of the Results of Pedagogical Experiments
This paper presents various aspects of the practical implementation of individualized learning algorithms (based on the results of pedagogical experiments) for both teacher-led instruction (in the classroom, remotely, or in a hybrid mode) and independent student work. The described system simultaneously teaches students course materials and independent learning techniques, that is, educational technologies that shape an individualized educational trajectory. A subset of educational technologies is determined individually for each student in the group. The educational technologies are independent of the course and universal, so they can be applied in subsequent or parallel courses. Teachers can describe new educational technologies as Python scripts without the involvement of developers. The proposed implementation integrates with the Mirera digital educational platform to expand the platform's capabilities.
Administration of the Scientific Heritage of Russia Electronic Library Content
The electronic library "Scientific Heritage of Russia" (EL SHR) has been openly accessible on the Internet since 2010. The library integrates information about scientists who have contributed to the development of Russian science, their scientific publications, related archival materials, online resources and museum objects. The current version of the EL SHR is being developed as a model of a fragment of the Common Digital Space of Scientific Knowledge (CDSSK) and includes a number of functional blocks (metadata generation, publication of digitized documents and museum objects, organization of collections and exhibitions, content administration). The article describes the functionality of the electronic library's administrative block. The block is accessible to authorized users with the appropriate permissions. It supports editing the metadata elements of each object type and the relationships between them, monitoring the processing stages of specific objects entered in the electronic library, and exporting a specified set of related objects.
On the Integration of Museum Objects into the Common Digital Space of Scientific Knowledge
The work addresses the issues of integrating museum objects into the Common Digital Space of Scientific Knowledge (CDSSK). It examines the evolution of a museum item from an isolated artifact to an "intelligent interface" – a linked element of a knowledge network. The technology for digitizing three-dimensional museum objects using spin-scanning is described. Using the collection of mushroom models from the State Biological Museum as an example, the process of incorporating objects into the CDSSK using structured data and interactive 3D models is demonstrated. The work is carried out within the framework of a state assignment and demonstrates the potential of the CDSSK as a universal environment for preserving and disseminating scientific heritage.
Generating Temporal Signals from Static Images for Spiking Neural Networks
Spiking neural networks (SNNs), i.e., neural architectures that represent and transmit information in the form of temporally distributed spikes, require time-dependent input, whereas data in computer vision are most commonly available as static images. This study addresses the transformation pipeline “image → temporal signal → spikes” and examines how the choice of input encoding influences SNN training dynamics, spike activity density, and computational cost. The experimental section implements and compares two encoding families: first-spike-time encoding (Latency) and intensity-based Poisson encoding (Poisson). Within these families, four operating modes are considered: baseline Latency without background suppression, modified Latency with a silence threshold, stochastic Poisson, and deterministic Poisson. The evaluation employs the following metrics: the average number of spikes per sample, the number of synaptic operations, an energy-related proxy metric, and indicators characterizing competition among hidden-layer neurons. Experiments conducted on the MNIST dataset (60000 training and 10000 test images) using a network with a hidden layer of 100 neurons and a simulation horizon of 200 time steps demonstrate that all examined modes support stable learning without activity collapse. Among them, the modified Latency mode with a silence threshold of 0.05 achieves the most favorable balance between useful activity and computational cost: at 323.41 spikes per sample, it requires 14925.09 synaptic operations, whereas the baseline Latency mode without background filtering, despite exhibiting a comparable level of output activity (311.22 spikes per sample), requires 78400 synaptic operations.
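The two encoding families can be illustrated with a small sketch. It assumes pixel intensities normalized to [0, 1], a linear mapping from intensity to first-spike time for Latency encoding, and an independent Bernoulli draw per time step for stochastic Poisson encoding; these choices, and the function names, are assumptions rather than the paper's exact implementation, and the deterministic Poisson mode is not covered. As a side observation, the reported 78400 operations equal 784 × 100, which is consistent with each of the 784 MNIST pixels emitting one spike to each of 100 hidden neurons, although the abstract does not spell out this accounting.

```python
import random
from typing import List, Optional


def latency_encode(pixels: List[float], steps: int,
                   silence_threshold: Optional[float] = None
                   ) -> List[Optional[int]]:
    """First-spike time per pixel: brighter pixels fire earlier.

    With a silence threshold, pixels below it never fire (None).
    Without one, even background pixels fire, at the latest time steps.
    """
    times: List[Optional[int]] = []
    for p in pixels:
        if silence_threshold is not None and p < silence_threshold:
            times.append(None)
        else:
            times.append(int((1.0 - p) * (steps - 1)))
    return times


def poisson_encode(pixels: List[float], steps: int,
                   rng: random.Random) -> List[List[int]]:
    """Spike train of shape [steps][pixels]; P(spike) equals pixel intensity."""
    return [[1 if rng.random() < p else 0 for p in pixels]
            for _ in range(steps)]


def count_spikes(times: List[Optional[int]]) -> int:
    """Number of input spikes produced by a latency code."""
    return sum(1 for t in times if t is not None)


pixels = [0.0, 0.02, 0.5, 1.0]  # hypothetical 4-pixel "image"
baseline = latency_encode(pixels, steps=200)
suppressed = latency_encode(pixels, steps=200, silence_threshold=0.05)

rng = random.Random(0)  # fixed seed only to make the demo repeatable
train = poisson_encode([0.0, 1.0], steps=5, rng=rng)
```

In this toy example the threshold removes the two background pixels from the input, halving the number of input spikes while leaving the informative pixels untouched.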