Automatic Annotation of HTML Documents using the Microdata Standard

Abstract

This article describes the development of an application that uses machine learning (ML) methods to annotate web pages automatically according to the Microdata standard, with the possibility of extending it to other standards and of injecting the data into JSX files. Datasets were collected and prepared for training the ML models, and the models' metrics were collected and analyzed.

References

1. HTML5 (HyperText Markup Language). URL: https://html.spec.whatwg.org/multipage/introduction.html.
2. JSX. URL: https://www.typescriptlang.org/docs/handbook/jsx.html.
3. Microdata. URL: https://html.spec.whatwg.org/multipage/microdata.html.
4. JSON-LD. URL: https://json-ld.org.
5. Brinkmann A., Primpeli A., Bizer Ch. The Web Data Commons Schema.org Data Set Series. URL: https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_Research/Web-based_Systems/pub/Brinkmann-etal-TheWDCSchemaorgDataSetSeries-WWW2023.pdf.
6. Schemas. URL: https://schema.org/docs/schemas.html.
7. RDFa. URL: https://www.w3.org/TR/html-rdfa.
8. Microformats. URL: https://microformats.org.
9. Local Business Schema Generator – MicroData & JSON-LD. URL: https://microdatagenerator.org/localbusiness-microdata-generator.
10. Structured Data Markup Helper. URL: https://www.google.com/webmasters/markup-helper/u/0.
11. Entity SEO Tools. URL: https://inlinks.com/.
12. Web-segment. URL: https://github.com/liaocyintl/web-segment.
13. Utiu N., Ionescu V.-S. Learning Web Content Extraction with DOM Features. URL: http://dx.doi.org/10.1109/ICCP.2018.8516632.
14. Peters M. E., Lecocq D. Content extraction using diverse feature sets // WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil, May 13–17, 2013. New York, NY, USA: Association for Computing Machinery, 2013. P. 89–90.
15. Wu G., Li L., Hu X., Wu X. Web news extraction via path ratios. URL: https://dl.acm.org/doi/abs/10.1145/2505515.2505558.
16. Vadrevu S., Gelgi F., Davulcu H. Semantic partitioning of web pages // Web Information Systems Engineering–WISE 2005: 6th International Conference on Web Information Systems Engineering, New York, NY, USA, November 20–22, 2005. Proceedings 6. – Springer Berlin Heidelberg, 2005. P. 107–118.
17. Extraction Results from the October 2022 Common Crawl Corpus. URL: https://webdatacommons.org/structureddata/#results-2022-1.
18. Common Crawl September/October 2022 Crawl Archive (CC-MAIN-2022-40). URL: https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-40/index.html.
19. SPARQL Query Language. URL: https://www.w3.org/TR/sparql11-query.
20. BERT. URL: https://arxiv.org/abs/1810.04805.
21. Babel. URL: https://babeljs.io/.
22. TypeScript. URL: https://www.typescriptlang.org/.