Development of an Intelligent Search System for the Mathematical Archive of Publications
Abstract
This study addresses the search for similar documents. The goal was to create a recommendation algorithm that finds similar scientific articles in mathematics using a search that prioritizes mathematical formulas and is supported by textual features.
The articles were converted from graphical to textual representation using OCR for subsequent analysis and indexing. During analysis, the text was divided into blocks, and significant formulas, keywords, and phrases were extracted. During indexing, a vector database was built from vector representations (embeddings) of the formulas. The resulting index was used to search for articles similar to a document submitted by the user as input to the algorithm. The list of similar articles is displayed sorted by the closeness of the formulas' vector representations.
The source data consisted of approximately 5,000 scientific articles on mathematical topics, stored as PDF files. The experiment was conducted on the content of a specific library system, but the proposed technology can be extended to other library systems, including those containing articles on other topics, such as physics and other exact sciences.
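The indexing and search steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `$...$` formula delimiters, the hashed character-trigram embedding, the dimension `DIM`, and the function names `extract_formulas`, `embed`, `build_index`, and `search` are all assumptions made for illustration. A real system would use a trained formula embedding model and a dedicated vector database, and would add the keyword and phrase signals that the abstract mentions.

```python
import math
import re
import zlib

DIM = 256  # size of the toy embedding space (assumption, not from the article)


def extract_formulas(text):
    """Return inline formulas delimited by $...$ (assumed format of the OCR output)."""
    return re.findall(r"\$([^$]+)\$", text)


def embed(formula):
    """Toy stand-in for a formula embedding model: hashed character trigrams,
    scaled to unit length so that a dot product equals cosine similarity."""
    s = re.sub(r"\s+", "", formula)
    vec = [0.0] * DIM
    for i in range(max(len(s) - 2, 1)):
        gram = s[i:i + 3]
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(u, v):
    """Cosine similarity of two unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_index(corpus):
    """Map each document id to the embeddings of its formulas."""
    return {doc_id: [embed(f) for f in extract_formulas(text)]
            for doc_id, text in corpus.items()}


def search(query_text, index, top_k=5):
    """Rank documents by the mean, over query formulas, of the best-matching
    document formula. Returns a list of (doc_id, score), closest first."""
    query_vecs = [embed(f) for f in extract_formulas(query_text)]
    if not query_vecs:
        return []
    scores = {}
    for doc_id, doc_vecs in index.items():
        if not doc_vecs:
            continue  # documents without formulas cannot be ranked by formula
        best = [max(cosine(q, d) for d in doc_vecs) for q in query_vecs]
        scores[doc_id] = sum(best) / len(best)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Calling `build_index` on a dictionary of document texts and then `search` with the text of a query document returns document ids ordered by formula closeness, which mirrors the sorted result list described above. The max-then-mean aggregation is one possible design choice; the abstract does not state how per-formula scores are combined.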

This work is licensed under a Creative Commons Attribution 4.0 International License.
By submitting an article for publication in the Russian Digital Libraries Journal (RDLJ), the authors automatically consent to grant Kazan (Volga) Federal University (KFU) a limited license to use the materials (only if the article is accepted for publication). This means that KFU has the right to publish the article in the next issue of the journal (on the website or in printed form), to reprint it in the RDLJ archives on CD, and to include it in any information system or database produced by KFU.
All copyrighted materials are placed in RDLJ with the consent of the authors. If any author objects to the publication of their materials on this site, the material can be removed upon written notification to the Editor.
Documents published in RDLJ are protected by copyright, and all rights are reserved by the authors. Authors independently monitor compliance with their rights to reproduce or translate their papers published in the journal. If material published in RDLJ is reprinted with permission by another publisher or translated into another language, a reference to the original publication must be included.
By submitting an article for publication in RDLJ, authors should take into account that publication on the Internet, on the one hand, provides unique opportunities for access to their content but, on the other hand, is a new form of information exchange in the global information society, in which authors and publishers are not always protected against unauthorized copying or other use of copyrighted materials.
RDLJ is copyrighted. When using materials from the journal, the URL must be indicated: index.phtml page = elbib / rus / journal?. Any change, addition or editing of the author's text is not allowed. Individual fragments of articles from the journal may be copied in order to distribute, remix, adapt, and build upon the article, even commercially, as long as the original article is credited.
Requests for the right to reproduce or use any of the materials published in RDLJ should be addressed to the Editor-in-Chief, A.M. Elizarov, at the following address: amelizarov@gmail.com.
The publishers of RDLJ are not responsible for the views expressed in published opinion articles.
We suggest that authors download from this page the copyright agreement on the transfer of non-exclusive rights to use the work, sign it, and send a scanned copy by e-mail to the journal publisher's address.