Data Extraction from Similarly Structured Scanned Documents

Rustem Damirovich Saitgareev
Bulat Rifatovich Giniyatullin
Vladislav Yurievich Toporov
Artur Aleksandrovich Atnagulov
Farid Radikovich Aglyamov

Abstract

Most transmitted and stored data today is unstructured, and its volume grows rapidly every year, even though such data is hard to search, cannot be queried, and is not processed automatically. At the same time, electronic document management systems are becoming more widespread. This paper proposes a solution for extracting data from photos of paper documents that takes their structure and layout into account. We examine several approaches, including neural networks and plain algorithmic methods, and present and discuss their results.

Author Biographies

Rustem Damirovich Saitgareev

Master's student at the Department of Software Engineering, Institute of Information Technologies and Intelligent Systems, Kazan Federal University.

Bulat Rifatovich Giniyatullin

Master's student at the Department of Software Engineering, Institute of Information Technologies and Intelligent Systems, Kazan Federal University.

Vladislav Yurievich Toporov

Master's student at the Department of Software Engineering, Institute of Information Technologies and Intelligent Systems, Kazan Federal University.

Artur Aleksandrovich Atnagulov

Master's student at the Department of Software Engineering, Institute of Information Technologies and Intelligent Systems, Kazan Federal University.

Farid Radikovich Aglyamov

Master's student at the Department of Software Engineering, Institute of Information Technologies and Intelligent Systems, Kazan Federal University.


