Extraction of Wikidata Knowledge for the Metadata Formation for Documents of Digital Mathematical Collections

Abstract

We present methods for building digital mathematical collections from unstructured sets of documents. These sets include materials from scientific conferences and articles from the archives of mathematical journals of the "pre-digital" period.


Using the software tools of the metadata factory of the Lobachevskii-DML digital mathematical library, we formed the mandatory set of metadata for the documents of each digital collection. To refine and enrich these metadata sets, we applied methods for extracting knowledge from Wikidata.


We developed a system of SPARQL queries that searches Wikidata for information about the documents of a digital collection and their authors. We also defined a set of Wikidata entities that determine the features of the search and the subsequent filtering of the results.
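
As an illustration only, a query of this kind might be assembled as in the Python sketch below. It is not the query system described in the article. The helper names (`escape_literal`, `build_author_query`, `run_query`) are hypothetical, and the filtering entities used here (`Q5` "human", `P106` "occupation", `Q170790` "mathematician") are an assumed choice that the abstract does not specify.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint; it predefines the wd:, wdt: and rdfs: prefixes.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_author_query(name: str, lang: str = "en", limit: int = 10) -> str:
    """Build a query for people with an exact label match (hypothetical helper).

    Assumed filters: instance of (P31) human (Q5) and
    occupation (P106) mathematician (Q170790).
    """
    label = escape_literal(name)
    return f"""
SELECT ?person ?birth ?death WHERE {{
  ?person rdfs:label "{label}"@{lang} ;
          wdt:P31 wd:Q5 ;
          wdt:P106 wd:Q170790 .
  OPTIONAL {{ ?person wdt:P569 ?birth . }}  # date of birth
  OPTIONAL {{ ?person wdt:P570 ?death . }}  # date of death
}}
LIMIT {limit}
""".strip()


def run_query(query: str) -> dict:
    """Send the query to the endpoint and return the parsed JSON result (needs network access)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            "User-Agent": "metadata-sketch/0.1 (illustrative example)",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query = build_author_query("Nikolai Lobachevsky")
```

Matching on an exact label is the simplest possible search. A production query would probably also consult aliases and labels in other languages, which is where the choice of filtering entities becomes important.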


We propose methods for refining and supplementing the bibliographic references given in the articles. When forming the metadata for documents of retro-collections, we searched Wikidata for the authors' years of birth and death, as well as for URLs of web pages with information about the articles and their authors. We present the results of forming several new digital collections of the Lobachevskii-DML digital library.
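
The step of turning query results into author metadata could look roughly like the following Python sketch. It assumes the results arrive in the standard SPARQL JSON results format with `person`, `birth` and `death` variables. The function names, the output field names, and the sample data (including the placeholder entity identifiers) are illustrative and are not taken from the article.

```python
import re
from typing import Optional


def year_of(binding: dict, key: str) -> Optional[int]:
    """Return the year from a date literal such as '1850-03-15T00:00:00Z', or None if absent."""
    cell = binding.get(key)
    if cell is None:
        return None
    match = re.match(r"^([+-]?\d+)-", cell["value"])
    return int(match.group(1)) if match else None


def bindings_to_metadata(result: dict) -> list:
    """Convert SPARQL JSON bindings into simple author records (hypothetical field names)."""
    records = []
    for binding in result["results"]["bindings"]:
        records.append({
            "wikidata_url": binding["person"]["value"],
            "birth_year": year_of(binding, "birth"),
            "death_year": year_of(binding, "death"),
        })
    return records


# Made-up sample in the SPARQL JSON results layout; the entity IDs are placeholders.
sample = {
    "head": {"vars": ["person", "birth", "death"]},
    "results": {"bindings": [
        {
            "person": {"type": "uri", "value": "http://www.wikidata.org/entity/Q00000001"},
            "birth": {"type": "literal", "value": "1850-03-15T00:00:00Z"},
            "death": {"type": "literal", "value": "1921-07-02T00:00:00Z"},
        },
        {
            # No death date recorded for this placeholder entity.
            "person": {"type": "uri", "value": "http://www.wikidata.org/entity/Q00000002"},
            "birth": {"type": "literal", "value": "1902-01-01T00:00:00Z"},
        },
    ]},
}

records = bindings_to_metadata(sample)
```

Missing values are kept as `None` rather than guessed, so a later merge step can tell "not found in Wikidata" apart from a known year.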
