Development of an Intelligent Search System for the Mathematical Archive of Publications

Aleksey Alekseevich Nasibulin
Olga Muratovna Ataeva

Abstract

This study addresses the search for similar documents. The goal was to create a recommendation algorithm that finds similar scientific articles in mathematics by searching mathematical formulas first and using the surrounding text as support.


The articles were converted from graphical to textual form with OCR for subsequent analysis and indexing. During analysis, the text was divided into blocks, and significant formulas, keywords, and phrases were extracted from it. During indexing, a vector database was built from vector representations (embeddings) of the extracted formulas. The index is then used to find articles similar to a document that the user submits as input. The resulting list of similar articles is sorted by the closeness of the formulas' vector representations.
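
The indexing and ranking steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `$...$` formula delimiters, the hashed character n-gram embedding, the dimension `DIM`, and the scoring rule (average best cosine match per query formula) are all assumptions chosen to keep the example self-contained, and a real system would use a trained formula embedding model and a dedicated vector database.

```python
import math
import re
import zlib

DIM = 256  # assumed size of the toy embedding vectors


def extract_formulas(text):
    """Return formulas delimited by $...$ (an assumed convention for the OCR output)."""
    return re.findall(r"\$([^$]+)\$", text)


def embed(formula, n=3):
    """Toy embedding: hashed character n-gram counts, L2-normalized.

    Stands in for a learned formula embedding model.
    """
    compact = re.sub(r"\s+", "", formula)  # ignore whitespace differences
    vec = [0.0] * DIM
    for i in range(max(len(compact) - n + 1, 1)):
        gram = compact[i:i + n]
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(articles):
    """Map each article id to the list of its formula vectors."""
    return {aid: [embed(f) for f in extract_formulas(text)]
            for aid, text in articles.items()}


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_similar(query_text, index):
    """Rank indexed articles by closeness of their formulas to the query's formulas."""
    query_vecs = [embed(f) for f in extract_formulas(query_text)]
    scores = []
    for aid, vecs in index.items():
        if not query_vecs or not vecs:
            continue  # nothing to compare
        # For each query formula take its best match in the article, then average.
        score = sum(max(cosine(q, v) for v in vecs) for q in query_vecs) / len(query_vecs)
        scores.append((aid, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical two-article archive and a query document.
articles = {
    "a": r"We study $x^2 + y^2 = z^2$ over the integers.",
    "b": r"The integral $\int_0^1 f(t)\,dt$ is bounded.",
}
index = build_index(articles)
ranked = rank_similar(r"Consider the equation $x^2+y^2=z^2$.", index)
for article_id, score in ranked:
    print(article_id, round(score, 3))
```

In this sketch the text only supplies the formulas. Keyword or phrase similarity could be added as a secondary score to reflect the textual support mentioned in the abstract.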


The source data consisted of approximately 5,000 scientific articles on mathematical topics, stored as PDF files. The experiment used the content of one specific library system, but the proposed technology can be extended to other library systems, including those that hold articles in other fields, such as physics and other exact sciences.

Article Details

How to Cite
Nasibulin, A. A., and O. M. Ataeva. “Development of an Intelligent Search System for the Mathematical Archive of Publications”. Russian Digital Libraries Journal, vol. 29, no. 3, June 2026, pp. 860-76, doi:10.26907/1562-5419-2026-29-3-860-876.


