This special issue of the journal contains papers presented at the section "Current Problems of Semantic Analysis", dedicated to the memory of Vladimir Alekseevich Serebryakov, and recommended by the program committee of the ММРО-2025 conference (the 22nd All-Russian Conference with International Participation "Mathematical Methods for Pattern Recognition", Murom, September 22–26, 2025).
Published: 18.12.2025
Articles
Detection of Hallucinations Based on the Internal States of Large Language Models
In recent years, large language models (LLMs) have achieved substantial progress in natural language processing tasks and have become key instruments for addressing a wide range of applied and research problems. However, as their scale and capabilities grow, the issue of hallucinations, i.e., the generation of false, unreliable, or nonexistent information presented in a credible manner, has become increasingly acute. Consequently, analyzing the nature of hallucinations and developing methods for their detection have acquired both scientific and practical significance.
This study examines the phenomenon of hallucinations in large language models, reviews their existing classification, and investigates potential causes. Using the Flan-T5 model, we analyze differences in the model's internal states when generating hallucinations versus correct responses. Based on these discrepancies, we propose two approaches to hallucination detection: one leveraging attention maps and the other utilizing the model's hidden states. These methods are evaluated on data from the HaluEval and SHROOM 2024 benchmarks in tasks such as summarization, question answering, paraphrasing, machine translation, and definition generation. Additionally, we assess the transferability of the trained detectors across different hallucination types in order to evaluate the robustness of the proposed methods.
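As an illustration of the hidden-state approach, the sketch below trains a simple probe on pooled decoder hidden states. It is a minimal example, not the paper's implementation: the Flan-T5 checkpoint, the mean pooling, the logistic-regression probe, and the tiny `examples` list of labeled (question, answer) pairs are all assumptions made for illustration.

```python
# Minimal sketch: training a probe on hidden states to flag hallucinations.
# Assumptions (not from the paper): Flan-T5 via Hugging Face transformers,
# mean-pooled last-layer decoder hidden states, a logistic-regression probe,
# and a hypothetical `examples` list of (question, answer, is_hallucination) triples.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()

def hidden_state_features(question: str, answer: str) -> torch.Tensor:
    """Mean-pool the last decoder layer over the answer tokens."""
    enc = tokenizer(question, return_tensors="pt")
    dec = tokenizer(answer, return_tensors="pt")
    with torch.no_grad():
        out = model(
            input_ids=enc.input_ids,
            attention_mask=enc.attention_mask,
            decoder_input_ids=dec.input_ids,
            output_hidden_states=True,
        )
    return out.decoder_hidden_states[-1].mean(dim=1).squeeze(0)

# Hypothetical labeled data; real experiments would use HaluEval / SHROOM samples.
examples = [
    ("What is the capital of France?", "Berlin", 1),
    ("What is the capital of France?", "Paris", 0),
]
X = torch.stack([hidden_state_features(q, a) for q, a, _ in examples]).numpy()
y = [label for _, _, label in examples]
probe = LogisticRegression(max_iter=1000).fit(X, y)  # 1 = hallucination
```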
Formation of Structured Representations of Scientific Journals for Integration into a Knowledge Graph and Semantic Search
This paper examines the development of SciLibRu, a library of scientific subject areas, continuing the work on the semantic description of scientific publications begun in the LibMeta digital library project. The library is based on a conceptual data model whose structure and semantics follow the principles of ontological modeling. This approach ensures a rigorous description of the subject area, formalization of the relationships between entities, and the possibility of further automated data analysis. The goal of the study is to develop and experimentally apply methods for structuring scientific journal data in LaTeX format for integration into the library ontology and for supporting semantic search.
An algorithm is proposed for translating data spread across multiple files into XML for integration into the library ontology. A vector search module is implemented, based on embeddings computed with language models. Patterns in the distribution of embeddings and factors influencing the ranking accuracy of search results are identified. Both components are tested.
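A minimal sketch of the embedding-based vector search idea is given below. The multilingual sentence-transformers model, the cosine ranking, and the `abstracts` placeholder list are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of vector search: embed texts with a language model and rank
# them by cosine similarity to the query. Assumptions (not from the paper):
# sentence-transformers with a multilingual checkpoint; `abstracts` is a
# hypothetical list of article texts extracted from LaTeX sources.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

abstracts = ["...", "..."]                      # placeholder document texts
doc_vecs = model.encode(abstracts, normalize_embeddings=True)

def search(query: str, top_k: int = 5):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                       # cosine similarity (vectors are normalized)
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```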
The developed method forms the basis for automatically incorporating scientific journal data into the SciLibRu knowledge graph and creating training corpora for language models limited to scientific subject areas. The obtained results contribute to the development of journal knowledge graph navigation systems, recommendation engines, and intelligent search tools for Russian-language scientific texts.
SciLibRu, the Library of Scientific Subject Domains
The work is devoted to the problem of data integration for representing scientific subject areas based on their semantic description in the SciLibRu digital library. The LibMeta library's ontology and knowledge graph are used as the data model. SciLibRu is populated by adding data from scientific journals. The paper demonstrates how the stages of processing semi-structured scientific publications for their integration into the library's ontology are implemented. Completing all data preprocessing stages yields a dataset that can be used to train language models for queries in Russian-language scientific subject areas.
A Tool for Rapid Diagnostics of Memory in Neural Network Architectures of Language Models
Large Language Models (LLMs) have evolved from simple n-gram systems to modern universal architectures; however, a key limitation remains the quadratic complexity of the self-attention mechanism with respect to input sequence length. This significantly increases memory consumption and computational costs and, with the emergence of tasks requiring extremely long contexts, creates the need for new architectural solutions. Since evaluating a proposed architecture typically requires long and expensive full-scale training, it is necessary to develop a tool that allows for a rapid preliminary assessment of a model's internal memory capacity.
This paper presents a method for quantitative evaluation of the internal memory of neural network architectures based on synthetic tests that do not require large data corpora. Internal memory is defined as the amount of information a model can reproduce without direct access to its original inputs.
To validate the approach, a software framework was developed and tested on the GPT-2 and Mamba architectures. The experiments employed copy, inversion, and associative retrieval tasks. Comparison of prediction accuracy, error distribution, and computational cost enables a fast assessment of the efficiency and potential of various LLM architectures.
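The copy task mentioned above can be sketched as follows. This is an illustrative probe only, assuming a GPT-2 checkpoint from Hugging Face and a plain "repeat the sequence" prompt; the actual framework may use dedicated input formats and per-architecture training.

```python
# Minimal sketch of a synthetic copy test. Assumptions (not from the paper):
# a pretrained GPT-2 checkpoint and a prompt-based protocol; accuracy is the
# fraction of random digit sequences the model reproduces verbatim.
import random
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def copy_accuracy(seq_len: int = 10, trials: int = 20) -> float:
    """Fraction of trials where the model reproduces a random digit sequence."""
    hits = 0
    for _ in range(trials):
        digits = " ".join(random.choice("0123456789") for _ in range(seq_len))
        prompt = f"Repeat exactly: {digits}\nAnswer:"
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model.generate(ids, max_new_tokens=2 * seq_len,
                                 do_sample=False, pad_token_id=tok.eos_token_id)
        completion = tok.decode(out[0, ids.shape[1]:])
        hits += digits in completion
    return hits / trials
```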
A System for Testing Controllers Based on On-Screen Text Recognition
A solution to the problem of testing controllers by reading information from their screens is described. A hardware and software system has been developed for this purpose, consisting of a camera and software modules implementing the necessary algorithms and methods: an image preprocessing module; a menu type detection module; a font character processing module; a module for reading text, including text rendered in various fonts; and the testing module itself. The system has been developed for a specific type of controller with a monochrome 128×64-pixel display. All methods are implemented in Python using popular libraries. The system has been put into trial operation and currently automates several of the most labor-intensive tests. The test set can be expanded using plugins.
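A minimal sketch of what the image preprocessing step for such a monochrome display might look like is shown below, assuming OpenCV and a camera frame already roughly cropped to the screen area; the paper's actual preprocessing module may differ.

```python
# Minimal sketch of preprocessing a camera frame of a monochrome 128×64 display:
# grayscale conversion, adaptive thresholding, and resizing to the native
# resolution. Assumptions (not from the paper): OpenCV; the frame is already
# roughly cropped to the screen.
import cv2

def preprocess_screen(frame_bgr, width: int = 128, height: int = 64):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 5)
    return cv2.resize(binary, (width, height), interpolation=cv2.INTER_AREA)
```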
Post-Correction of Weak Transcriptions by Large Language Models in the Iterative Process of Handwritten Text Recognition
This paper addresses the problem of accelerating the construction of accurate editorial annotations for handwritten archival texts within an incremental training cycle based on weak transcription. Unlike our previously published results, the present work focuses on integrating automatic post-correction of weak transcriptions using large language models (LLMs). We propose and implement a protocol for applying LLMs at the line level in a few-shot setup with carefully designed prompts and strict output format control (preservation of pre-reform orthography, protection of proper names and numerals, prohibition of structural changes to lines). Experiments are conducted on the corpus of diaries by A.V. Sukhovo-Kobylin. As the base recognition model, we use the line-level variant of the Vertical Attention Network (VAN). Results show that LLM post-correction, exemplified by the ChatGPT-4o service, substantially improves the readability of weak transcriptions and significantly reduces the word error rate (by about 12 percentage points in our experiments) without degrading the character error rate. Another service tested, DeepSeek-R1, demonstrated less stable behavior. We discuss practical prompt engineering and limitations (context-length constraints, the risk of "hallucinations") and provide recommendations for the safe integration of LLM post-correction into an iterative annotation pipeline to reduce expert annotators' workload and speed up the digitization of historical archives.
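The line-level few-shot protocol with strict output control could be sketched roughly as follows, assuming the OpenAI Python client and a hypothetical set of expert-prepared few-shot pairs; this illustrates the idea only and does not reproduce the authors' exact prompts.

```python
# Minimal sketch of line-level few-shot post-correction with strict output
# control. Assumptions (not from the paper): the OpenAI Python client;
# FEW_SHOT is a hypothetical list of (noisy_line, corrected_line) pairs.
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You correct recognition errors in single lines of a 19th-century Russian diary. "
    "Preserve pre-reform orthography, proper names and numerals. "
    "Do not merge, split or reorder lines. Return the corrected line only."
)

FEW_SHOT = [("<noisy line>", "<corrected line>")]  # hypothetical expert examples

def correct_line(noisy_line: str) -> str:
    messages = [{"role": "system", "content": SYSTEM}]
    for noisy, fixed in FEW_SHOT:
        messages.append({"role": "user", "content": noisy})
        messages.append({"role": "assistant", "content": fixed})
    messages.append({"role": "user", "content": noisy_line})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages,
                                          temperature=0)
    return resp.choices[0].message.content.strip()
```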
Some Approaches to Improving Prediction Accuracy Using Ensemble Methods
This study presents the results of an experimental analysis evaluating the effectiveness of Extra Trees within gradient boosting models, as well as in a newly proposed ensemble framework in which the forest is generated under conditions of enhanced internal divergence. Additionally, the paper explores the performance of Extra Trees when applied to novel feature representations computed as distances to a selected set of reference examples. It is shown that using Extra Trees in gradient boosting and divergent-forest models improves generalization ability, and that the expanded feature sets improve it further.
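A minimal sketch of the distance-to-reference-examples feature construction combined with Extra Trees is shown below, assuming scikit-learn, Euclidean distances, and reference points chosen by k-means; the reference-selection strategy in the paper may differ.

```python
# Minimal sketch: expand the feature set with distances to reference examples
# and fit Extra Trees on it. Assumptions (not from the paper): scikit-learn,
# Euclidean distances, k-means centroids as references, and synthetic
# stand-in data instead of the paper's benchmark datasets.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import pairwise_distances

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # stand-in data

def distance_features(X, references):
    """Re-describe each object by its distances to the reference examples."""
    return pairwise_distances(X, references)

references = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X).cluster_centers_
X_dist = np.hstack([X, distance_features(X, references)])   # expanded feature set

clf = ExtraTreesClassifier(n_estimators=500, random_state=0).fit(X_dist, y)
```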
Word Search in Handwritten Text Based on Stroke Segmentation
Handwritten archival documents form a fundamental part of humanity's cultural heritage. However, their analysis remains a labor-intensive task for professional researchers, such as historians, philologists, and linguists. Unlike commercial OCR applications, working with historical manuscripts requires a fundamentally different approach due to the extreme diversity of handwriting, the presence of corrections, and material degradation.
This paper proposes a method for searching within handwritten texts based on stroke segmentation. Instead of performing full text recognition, which is often unattainable for historical documents, the method allows researchers' search queries to be answered efficiently. The key idea is to decompose the text into elementary strokes, form semantic vector representations of them using contrastive learning, and then apply clustering and classification to build an adaptive handwriting dictionary.
It is experimentally shown that search by comparing tuples of reduced sequences of the most informative strokes using the Levenshtein distance provides sufficient quality for the task at hand. The method demonstrates resilience to individual handwriting characteristics and writing variations, which is particularly important for working with authors' archives and historical documents.
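Below is a minimal sketch of ranking candidate words by Levenshtein distance over stroke-cluster sequences, assuming the strokes have already been clustered into discrete IDs; the reduction to the most informative strokes described above is omitted here for brevity.

```python
# Minimal sketch of word search over stroke-cluster sequences ranked by
# Levenshtein distance. Assumptions (not from the paper): stroke embeddings
# are already clustered, so each word is a tuple of cluster IDs;
# `word_index` is a hypothetical mapping word_id -> tuple of cluster IDs.
def levenshtein(a, b):
    """Edit distance between two sequences of stroke-cluster IDs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def search(query_strokes, word_index, top_k=10):
    """Return the top_k words whose stroke sequences are closest to the query."""
    ranked = sorted(word_index.items(),
                    key=lambda kv: levenshtein(query_strokes, kv[1]))
    return ranked[:top_k]
```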
The proposed approach opens up new possibilities for accelerating scientific research in the humanities, reducing the time required to find relevant information from weeks to minutes, thereby qualitatively transforming research capabilities when working with large archives of handwritten documents.
Archival Handwritten Letter Attribution Using Siamese Neural Networks
This paper presents a method for the automated attribution of archival handwritten letters based on a Siamese neural network, addressing a key challenge in the digital humanities: the authentication of historical documents. The research is motivated by the mass digitization of 17th- to 19th-century archives, where attribution is often hindered by incomplete or inaccurate metadata about the authors.
The method is designed for real-world document collections and accounts for challenges typical of archival materials: poor-quality scans, significant handwriting variation, and substantial class imbalance (from 1 to over 50 samples per author). The use of a Siamese network architecture enables the extraction of discriminative vector representations (embeddings). Based on these embeddings, the method not only classifies documents by known authors but also effectively identifies manuscripts that do not match any known author in the archive. This significantly narrows down the pool of candidates for subsequent expert verification.
The study introduces a data preprocessing algorithm and provides a comparative analysis of two approaches to text analysis: at the image fragment level (300×300 px) and at the individual text line level. The developed tool offers archivists and philologists an effective solution for the preliminary sorting and attribution of handwritten documents in large collections.
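The open-set identification described above can be sketched as a nearest-centroid decision with a rejection threshold on top of the Siamese embeddings. The cosine distance, the centroid representation, and the threshold value below are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of open-set attribution on top of Siamese embeddings: assign
# a document to the closest known author unless even the best match is too
# far, in which case flag it as "unknown". Assumptions (not from the paper):
# cosine distance, per-author mean embeddings, and a validation-tuned threshold.
import numpy as np

def attribute(doc_embedding, author_centroids, threshold=0.35):
    """author_centroids: dict author_id -> mean embedding of that author's samples."""
    best_author, best_dist = None, float("inf")
    for author, centroid in author_centroids.items():
        d = 1.0 - np.dot(doc_embedding, centroid) / (
            np.linalg.norm(doc_embedding) * np.linalg.norm(centroid))
        if d < best_dist:
            best_author, best_dist = author, d
    return best_author if best_dist <= threshold else "unknown"
```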
Automatic and Semi-Automatic Methods for Domain Knowledge-Graph Construction and Ontology Expansion
We present a combined pipeline for knowledge-graph construction and ontology expansion. The approach builds a BIO-tagged corpus via fully automatic LLM-based pseudo-annotation and introduces dedicated UNK reserve categories to capture previously unseen classes and relations. A specialized NER/RE model is trained on a 3-million-token dataset with 92 labels. The model exhibits a conservative quality profile (high precision with moderate recall) suited for safe graph enrichment: integrating the extracted facts expands the graph to ~0.98 million triples, while the expansion ratio (total inferred facts to explicit triples) increases from 2.65 to 3.52, with logical consistency preserved. UNK label pools are converted into stable synsets, enabling semi-automatic ontology expansion; 12 new classes derived from unstructured texts were added. We also demonstrate practical value for querying and analytics using an LLM + SPARQL setup.
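A minimal sketch of the LLM + SPARQL querying setup: an LLM drafts a SPARQL query for a natural-language question and the query is executed against the knowledge graph. The SPARQLWrapper client, the endpoint URL, and the example query below are assumptions made for illustration, not the authors' configuration.

```python
# Minimal sketch of executing an LLM-drafted SPARQL query against the graph.
# Assumptions (not from the paper): SPARQLWrapper and a hypothetical local
# endpoint; in practice the generated query should be validated before use.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:3030/kg/sparql"   # hypothetical endpoint URL

def run_query(sparql_text: str):
    wrapper = SPARQLWrapper(ENDPOINT)
    wrapper.setQuery(sparql_text)
    wrapper.setReturnFormat(JSON)
    return wrapper.query().convert()["results"]["bindings"]

# Example of a query an LLM might produce for "list classes in the ontology":
results = run_query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?cls ?label WHERE { ?cls a rdfs:Class ; rdfs:label ?label } LIMIT 10
""")
```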