Exploring Post-Training Quantization of Large Language Models with a Focus on Russian Evaluation
Abstract
The rapid adoption of large language models (LLMs) has made quantization a central technique for efficient deployment under real-world hardware and memory constraints. While English-centric evaluations of low-bit quantization are increasingly available, much less is known about its effects on morphologically rich and resource-diverse languages such as Russian. This gap matters all the more given the recent emergence of high-performing Russian and multilingual LLMs. In this work, we conduct a systematic study of 2-, 3-, and 4-bit post-training quantization (PTQ) for state-of-the-art Russian LLMs at two model scales (4B and 32B). Our experimental setup covers both standard uniform quantization and specialized low-bit formats, as well as lightweight finetuning for recovery in the most extreme 2-bit setting. Our findings highlight several trends: (i) the tolerance of Russian LLMs to quantization differs across model families and scales; (ii) 4-bit quantization is generally robust, especially when advanced formats are used; (iii) 3-bit models are sensitive to calibration data and scaling strategies; and (iv) 2-bit models, while severely degraded under naive PTQ, can be partially restored through short finetuning. Empirical results show that the model's domain must be taken into account when choosing a quantization technique.
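To make the bit widths named in the abstract concrete, the following is a minimal sketch of asymmetric uniform round-to-nearest quantization applied to a single group of weights. It is an illustration only, not the procedure used in the study: the group size of 128, the Gaussian synthetic weights, and the names `quantize_group` and `mean_squared_error` are all assumptions introduced here.

```python
import random


def quantize_group(weights, bits):
    """Asymmetric uniform round-to-nearest quantization of one weight group.

    Returns the integer codes and the dequantized (restored) values.
    """
    levels = 2 ** bits - 1                      # highest integer code
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Map each weight to the nearest grid point, clamped to the valid range.
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    restored = [lo + c * scale for c in codes]
    return codes, restored


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weight group: 128 synthetic values, not taken from any real model.
random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(128)]

errors = {}
for bits in (4, 3, 2):
    codes, restored = quantize_group(group, bits)
    errors[bits] = mean_squared_error(group, restored)
    print(f"{bits}-bit: {2 ** bits} levels, MSE = {errors[bits]:.3e}")
```

Each bit removed halves the number of grid points, so the rounding error of this naive scheme grows quickly. The specialized low-bit formats and the finetuning-based recovery mentioned in the abstract are beyond what this sketch covers.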

This work is licensed under a Creative Commons Attribution 4.0 International License.
By submitting an article for publication in the Russian Digital Libraries Journal (RDLJ), the authors automatically grant Kazan (Volga) Federal University (KFU) a limited license to use the material, provided the article is accepted for publication. Under this license, KFU may publish the article in the next issue of the journal (on the website or in print), reprint it in the RDLJ archive CDs, and include it in any information system or database produced by KFU.