Detection of Hallucinations Based on the Internal States of Large Language Models
Abstract
In recent years, large language models (LLMs) have achieved substantial progress in natural language processing tasks and have become key instruments for addressing a wide range of applied and research problems. However, as their scale and capabilities grow, the issue of hallucinations, i.e., the generation of false, unreliable, or nonexistent information presented in a credible manner, has become increasingly acute. Consequently, analyzing the nature of hallucinations and developing methods for their detection has acquired both scientific and practical significance.
This study examines the phenomenon of hallucinations in large language models, reviews their existing classification, and investigates potential causes. Using the Flan-T5 model, we analyze differences in the model’s internal states when generating hallucinations versus correct responses. Based on these discrepancies, we propose two approaches for hallucination detection: one leveraging attention maps and the other utilizing the model’s hidden states. These methods are evaluated on data from the HaluEval and SHROOM 2024 benchmarks in tasks such as summarization, question answering, paraphrasing, machine translation, and definition generation. Additionally, we assess how well the trained detectors transfer across different hallucination types, in order to evaluate the robustness of the proposed methods.
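To illustrate how internal states can be turned into a detector, the following is a minimal sketch rather than the authors' exact pipeline: it extracts last-layer decoder hidden states (and attention outputs) from Flan-T5 via Hugging Face Transformers and fits a simple classifier on labeled (input, output) pairs. The model checkpoint, mean pooling, and the logistic-regression classifier are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch, NOT the authors' exact pipeline: pooled decoder hidden
# states from Flan-T5 are used as features for a simple hallucination
# classifier. Checkpoint, pooling, and classifier choice are assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()


def internal_features(source: str, generated: str) -> torch.Tensor:
    """Mean-pooled last-layer decoder hidden state for a (source, output) pair."""
    enc = tokenizer(source, return_tensors="pt", truncation=True)
    dec = tokenizer(generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(
            input_ids=enc.input_ids,
            attention_mask=enc.attention_mask,
            labels=dec.input_ids,      # shifted right internally to form decoder inputs
            output_hidden_states=True,
            output_attentions=True,    # cross-attention maps are in out.cross_attentions
        )
    # decoder_hidden_states[-1]: (1, target_len, d_model); average over target tokens
    return out.decoder_hidden_states[-1].mean(dim=1).squeeze(0)


def train_detector(examples):
    """examples: iterable of (source_text, generated_text, label), label 1 = hallucination."""
    feats = torch.stack([internal_features(src, gen) for src, gen, _ in examples])
    labels = [lab for _, _, lab in examples]
    return LogisticRegression(max_iter=1000).fit(feats.numpy(), labels)
```

An attention-based variant of the same sketch would derive features from out.cross_attentions instead of the pooled hidden state, for example the share of attention mass the decoder places on the source context, in the spirit of the Lookback Lens approach [25].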
References
2. Huang L., Yu W., Ma W. et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions // ACM Transactions on Information Systems. 2025. Vol. 43, No. 2. P. 1–55. https://doi.org/10.1145/3703155
3. Li J., Cheng X., Zhao W. X. et al. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. P. 6449–6464. https://doi.org/10.18653/v1/2023.emnlp-main.397
4. Mickus T., Zosa E., Vázquez R. et al. SemEval-2024 Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes // International Workshop on Semantic Evaluation. 2024. https://doi.org/10.18653/v1/2024.semeval-1.273
5. Carlini N., Ippolito D., Jagielski M. et al. Quantifying Memorization Across Neural Language Models // The Eleventh International Conference on Learning Representations. 2023. https://doi.org/10.48550/arXiv.2202.07646
6. Lin S., Hilton J., Evans O. TruthfulQA: Measuring How Models Mimic Human Falsehoods // Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022. Vol. 1. P. 3214–3252. https://doi.org/10.18653/v1/2022.acl-long.229
7. Li D., Rawat A.S., Zaheer M. et al. Large Language Models with Controllable Working Memory // Findings of the Association for Computational Linguistics: ACL 2023. 2023. P. 1774–1793. https://doi.org/10.18653/v1/2023.findings-acl.112
8. Sharma M., Tong M., Korbak T. et al. Towards Understanding Sycophancy in Language Models // The Twelfth International Conference on Learning Representations. 2024. https://doi.org/10.48550/arXiv.2310.13548
9. Reinforcement Learning from Human Feedback: Progress and Challenges // YouTube. URL: https://www.youtube.com/watch?v=hhiLw5Q_UFg (accessed: 04.05.2025)
10. Chuang Y.S., Xie Y., Luo H. et al. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models // ArXiv. 2023. Vol. abs/2309.03883. https://doi.org/10.48550/arXiv.2309.03883
11. Voita E., Talbot D., Moiseev F. et al. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. P. 5797–5808. https://doi.org/10.18653/v1/P19-1580
12. Min S., Krishna K., Lyu X. et al. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. P. 12076–12100. https://doi.org/10.18653/v1/2023.emnlp-main.741
13. Luo Z., Xie Q., Ananiadou S. ChatGPT as a Factual Inconsistency Evaluator for Text Summarization // ArXiv. 2023. Vol. abs/2303.15621. https://doi.org/10.48550/arXiv.2303.15621
14. Manakul P., Liusie A., Gales M.J. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. P. 9004–9017. https://doi.org/10.18653/v1/2023.emnlp-main.557
15. Cohen R., Hamri M., Geva M. et al. LM vs LM: Detecting Factual Errors via Cross Examination // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. P. 12621–12640. https://doi.org/10.18653/v1/2023.emnlp-main.778
16. Xiao Y., Wang W.Y. On Hallucination and Predictive Uncertainty in Conditional Language Generation // Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021. P. 2734–2744. https://doi.org/10.18653/v1/2021.eacl-main.236
17. Miao N., Teh Y.W., Rainforth T. SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning // The Twelfth International Conference on Learning Representations. 2024. https://doi.org/10.48550/arXiv.2308.00436
18. Adlakha V., BehnamGhader P., Lu X.H. et al. Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering // Transactions of the Association for Computational Linguistics. 2024. Vol. 12. P. 681–699. https://doi.org/10.1162/tacl_a_00667
19. Lin C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries // Text Summarization Branches Out. 2004. P. 74–81. ISBN: 9781932432466
20. Venkit P.N., Gautam S., Panchanadikar R. et al. Nationality Bias in Text Generation // Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023. P. 116–122. https://doi.org/10.18653/v1/2023.eacl-main.9
21. Goodrich B., Rao V., Liu P.J. et al. Assessing The Factual Accuracy of Generated Text // Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2019. P. 166–175. https://doi.org/10.1145/3292500.3330955
22. Laban P., Schnabel T., Bennett P.N. et al. SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization // Transactions of the Association for Computational Linguistics. 2022. Vol. 10. P. 163–177. https://doi.org/10.1162/tacl_a_00453
23. Xu W., Agrawal S., Briakou E. et al. Understanding and Detecting Hallucinations in Neural Machine Translation via Model Introspection // Transactions of the Association for Computational Linguistics. 2023. Vol. 11. P. 546–564. https://doi.org/10.1162/tacl_a_00563
24. Zhang T., Qiu L., Guo Q. et al. Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus // Conference on Empirical Methods in Natural Language Processing. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.58
25. Chuang Y.S., Qiu L., Hsieh C.Y. et al. Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. P. 1419–1436. https://doi.org/10.18653/v1/2024.emnlp-main.84
26. Yin Z., Sun Q., Guo Q. et al. Do Large Language Models Know What They Don't Know? // Annual Meeting of the Association for Computational Linguistics. 2023. https://doi.org/10.18653/v1/2023.findings-acl.551
27. Marks S., Tegmark M. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets // First Conference on Language Modeling. 2024. https://doi.org/10.48550/arXiv.2310.06824
28. Su W., Wang C., Ai Q. et al. Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models // Annual Meeting of the Association for Computational Linguistics. 2024. https://doi.org/10.48550/arXiv.2403.06448
29. Chung H.W., Hou L., Longpre S. et al. Scaling Instruction-Finetuned Language Models // Journal of Machine Learning Research. 2024. Vol. 25, No. 70. P. 1–53. https://doi.org/10.5555/3722577.3722647
30. Hochreiter S., Schmidhuber J. Long Short-Term Memory // Neural Computation. 1997. Vol. 9, No. 8. P. 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
31. Principal Component Analysis // Wikipedia. URL: https://en.wikipedia.org/wiki/Principal_component_analysis (accessed: 13.06.2025).

This work is licensed under a Creative Commons Attribution 4.0 International License.
By submitting an article for publication in the Russian Digital Libraries Journal (RDLJ), the authors automatically grant Kazan (Volga Region) Federal University (KFU) a limited license to use the materials, provided the article is accepted for publication. This means that KFU has the right to publish the article in the next issue of the journal (on the website or in printed form), to reprint it in RDLJ archives on CD, and to include it in information systems or databases produced by KFU.
All copyrighted materials are published in RDLJ with the consent of the authors. If an author objects to the publication of their material on this site, the material can be removed upon written notification to the Editors.
Documents published in RDLJ are protected by copyright, and all rights are reserved by the authors. Authors independently monitor compliance with their rights to reproduce or translate their papers published in the journal. If material first published in RDLJ is reprinted with permission by another publisher or translated into another language, a reference to the original publication must be included.
By submitting an article for publication in RDLJ, authors should take into account that publication on the Internet, on the one hand, provides unique opportunities for access to their content, but, on the other hand, represents a new form of information exchange in the global information society in which authors and publishers are not always protected against unauthorized copying or other use of copyrighted materials.
RDLJ is protected by copyright. When using materials from the journal, the URL index.phtml?page=elbib/rus/journal must be indicated. Any change, addition, or editing of the author's text is not allowed. Copying individual fragments of articles from the journal is permitted in order to distribute, remix, adapt, and build upon the article, even commercially, as long as the original article is credited.
Requests for the right to reproduce or use any of the materials published in RDLJ should be addressed to the Editor-in-Chief A.M. Elizarov at amelizarov@gmail.com.
The publishers of RDLJ are not responsible for the views set out in published opinion articles.
We ask that authors download the copyright agreement on the transfer of non-exclusive rights to use the work from this page, sign it, and send a scanned copy to the publisher's e-mail address.