Detection of Hallucinations Based on the Internal States of Large Language Models

Timur Rustemovich Aisin
Tatiana Vyacheslavovna Shamardina

Abstract

In recent years, large language models (LLMs) have achieved substantial progress in natural language processing tasks and have become key instruments for addressing a wide range of applied and research problems. However, as their scale and capabilities grow, the issue of hallucinations, i.e., the generation of false, unreliable, or nonexistent information presented in a credible manner, has become increasingly acute. Consequently, analyzing the nature of hallucinations and developing methods for their detection have acquired both scientific and practical significance.


This study examines the phenomenon of hallucinations in large language models, reviews their existing classifications, and investigates potential causes. Using the Flan-T5 model, we analyze how the model's internal states differ when it generates hallucinations versus correct responses. Based on these differences, we propose two approaches for hallucination detection: one leveraging attention maps and the other utilizing the model's hidden states. The methods are evaluated on data from the HaluEval and SHROOM 2024 benchmarks across summarization, question answering, paraphrasing, machine translation, and definition generation tasks. Additionally, we assess how well the trained detectors transfer across different hallucination types in order to evaluate the robustness of the proposed methods.
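
By way of illustration only (this is a minimal sketch, not the authors' published code), the example below shows how decoder hidden states and cross-attention maps might be extracted from Flan-T5 with the Hugging Face transformers library and fed to a simple linear probe. The checkpoint, feature pooling, and toy labelled pairs are assumptions made for the example.

```python
# A minimal sketch, assuming the Hugging Face transformers and scikit-learn
# libraries; the checkpoint, feature pooling, and toy labels are illustrative
# assumptions, not the authors' implementation.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()

def internal_features(prompt: str, response: str) -> torch.Tensor:
    """Pool decoder hidden states and cross-attention maps for one (prompt, response) pair."""
    enc = tokenizer(prompt, return_tensors="pt")
    dec = tokenizer(response, return_tensors="pt")
    with torch.no_grad():
        out = model(
            input_ids=enc.input_ids,
            attention_mask=enc.attention_mask,
            decoder_input_ids=dec.input_ids,  # simplified: no right-shift of decoder input
            output_hidden_states=True,
            output_attentions=True,
        )
    # Mean-pool the last decoder layer's hidden states over the response tokens.
    hidden = out.decoder_hidden_states[-1].mean(dim=1).squeeze(0)
    # Summarize last-layer cross-attention (batch, heads, tgt, src) into a few scalars.
    attn = out.cross_attentions[-1].mean(dim=(1, 2)).squeeze(0)
    attn_stats = torch.stack([attn.mean(), attn.max(), attn.std()])
    return torch.cat([hidden, attn_stats])

# Hypothetical labelled pairs: 1 = hallucinated response, 0 = faithful response.
pairs = [
    ("Summarize: The meeting was moved to Friday.", "The meeting was moved to Friday.", 0),
    ("Summarize: The meeting was moved to Friday.", "The meeting was cancelled entirely.", 1),
]
X = torch.stack([internal_features(p, r) for p, r, _ in pairs]).numpy()
y = [label for _, _, label in pairs]
probe = LogisticRegression(max_iter=1000).fit(X, y)  # simple linear hallucination detector
```

In practice, a detector of this kind would be trained on many labelled examples per task (summarization, question answering, etc.); the sketch only illustrates which internal signals are involved.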

Article Details

How to Cite
Aisin, T. R., and T. V. Shamardina. “Detection of Hallucinations Based on the Internal States of Large Language Models”. Russian Digital Libraries Journal, vol. 28, no. 6, Dec. 2025, pp. 1282–1305, doi:10.26907/1562-5419-2025-28-6-1282-1305.

References

1. Vaswani A., Shazeer N., Parmar N. et al. Attention is all you need // Advances in Neural Information Processing Systems. 2017. Vol. 30.
2. Huang L., Yu W., Ma W. et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions // ACM Transactions on Information Systems. 2025. Vol. 43, No. 2. P. 1–55. https://doi.org/10.1145/3703155
3. Li J., Cheng X., Zhao W. X. et al. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. P. 6449–6464. https://doi.org/10.18653/v1/2023.emnlp-main.397
4. Mickus T., Zosa E., Vázquez R. et al. SemEval-2024 Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes // International Workshop on Semantic Evaluation. 2024. https://doi.org/10.18653/v1/2024.semeval-1.273
5. Carlini N., Ippolito D., Jagielski M. et al. Quantifying Memorization Across Neural Language Models // The Eleventh International Conference on Learning Representations. 2023. https://doi.org/10.48550/arXiv.2202.07646
6. Lin S., Hilton J., Evans O. TruthfulQA: Measuring How Models Mimic Human Falsehoods // Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022. Vol. 1. P. 3214–3252. https://doi.org/10.18653/v1/2022.acl-long.229
7. Li D., Rawat A.S., Zaheer M. et al. Large Language Models with Controllable Working Memory // Findings of the Association for Computational Linguistics: ACL 2023. 2023. P. 1774–1793. https://doi.org/10.18653/v1/2023.findings-acl.112
8. Sharma M., Tong M., Korbak T. et al. Towards Understanding Sycophancy in Language Models // The Twelfth International Conference on Learning Representations. 2024. https://doi.org/10.48550/arXiv.2310.13548
9. Reinforcement Learning from Human Feedback: Progress and Challenges // YouTube. URL: https://www.youtube.com/watch?v=hhiLw5Q_UFg (accessed: 04.05.2025)
10. Chuang Y.S., Xie Y., Luo H. et al. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models // ArXiv. 2023. Vol. abs/2309.03883. https://doi.org/10.48550/arXiv.2309.03883
11. Voita E., Talbot D., Moiseev F. et al. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned // Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. P. 5797–5808. https://doi.org/10.18653/v1/P19-1580
12. Min S., Krishna K., Lyu X. et al. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. P. 12076–12100. https://doi.org/10.18653/v1/2023.emnlp-main.741
13. Luo Z., Xie Q., Ananiadou S. ChatGPT as a Factual Inconsistency Evaluator for Text Summarization // ArXiv. 2023. Vol. abs/2303.15621. https://doi.org/10.48550/arXiv.2303.15621
14. Manakul P., Liusie A., Gales M.J. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. P. 9004–9017. https://doi.org/10.18653/v1/2023.emnlp-main.557
15. Cohen R., Hamri M., Geva M. et al. LM vs LM: Detecting Factual Errors via Cross Examination // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. P. 12621–12640. https://doi.org/10.18653/v1/2023.emnlp-main.778
16. Xiao Y., Wang W.Y. On Hallucination and Predictive Uncertainty in Conditional Language Generation // Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021. P. 2734–2744. https://doi.org/10.18653/v1/2021.eacl-main.236
17. Miao N., Teh Y.W., Rainforth T. SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning // The Twelfth International Conference on Learning Representations. 2024. https://doi.org/10.48550/arXiv.2308.00436
18. Adlakha V., BehnamGhader P., Lu X.H. et al. Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering // Transactions of the Association for Computational Linguistics. 2024. Vol. 12. P. 681–699. https://doi.org/10.1162/tacl_a_00667
19. Lin C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries // Text Summarization Branches Out. 2004. P. 74–81. ISBN: 9781932432466
20. Venkit P.N., Gautam S., Panchanadikar R. et al. Nationality Bias in Text Generation // Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023. P. 116–122. https://doi.org/10.18653/v1/2023.eacl-main.9
21. Goodrich B., Rao V., Liu P.J. et al. Assessing The Factual Accuracy of Generated Text // Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2019. P. 166–175. https://doi.org/10.1145/3292500.3330955
22. Laban P., Schnabel T., Bennett P.N. et al. SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization // Transactions of the Association for Computational Linguistics. 2022. Vol. 10. P. 163–177. https://doi.org/10.1162/tacl_a_00453
23. Xu W., Agrawal S., Briakou E. et al. Understanding and Detecting Hallucinations in Neural Machine Translation via Model Introspection // Transactions of the Association for Computational Linguistics. 2023. Vol. 11. P. 546–564. https://doi.org/10.1162/tacl_a_00563
24. Zhang T., Qiu L., Guo Q. et al. Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus // Conference on Empirical Methods in Natural Language Processing. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.58
25. Chuang Y.S., Qiu L., Hsieh C.Y. et al. Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. P. 1419–1436. https://doi.org/10.18653/v1/2024.emnlp-main.84
26. Yin Z., Sun Q., Guo Q. et al. Do Large Language Models Know What They Don't Know? // Annual Meeting of the Association for Computational Linguistics. 2023. https://doi.org/10.18653/v1/2023.findings-acl.551
27. Marks S., Tegmark M. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets // First Conference on Language Modeling. 2024. https://doi.org/10.48550/arXiv.2310.06824
28. Su W., Wang C., Ai Q. et al. Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models // Annual Meeting of the Association for Computational Linguistics. 2024. https://doi.org/10.48550/arXiv.2403.06448
29. Chung H.W., Hou L., Longpre S. et al. Scaling Instruction-Finetuned Language Models // Journal of Machine Learning Research. 2024. Vol. 25, No. 70. P. 1–53. https://doi.org/10.5555/3722577.3722647
30. Hochreiter S., Schmidhuber J. Long Short-Term Memory // Neural Computation. 1997. Vol. 9, No. 8. P. 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
31. PCA // Wikipedia. URL: https://en.wikipedia.org/wiki/Principal_component_analysis (accessed: 13.06.2025).