Russian Digital Libraries Journal

Published since 1998

ISSN 1562-5419

Search Results

An Algorithmic Framework for Accurately Extracting Main Content from News Websites

Hamza Salem, Alexander Sergeevich Toschev

931-942

Abstract:

A new, precise main content extraction (MCE) algorithm for news websites is presented. The algorithm analyzes the Document Object Model (DOM) structure and content density metrics to identify and extract the informational core of a web page. It combines three key features: the maximum number of direct child elements containing text, the maximum textual content without child elements containing text, and the closest position to the average node depth. On a dataset of 500 diverse web pages, the algorithm outperformed existing solutions such as Boilerpipe and Readability, achieving 99.96% precision, 99.69% recall, and a 99.80% F1-score. Its language-independent design makes it particularly effective for extracting multilingual content, including languages with complex structures such as Arabic.

Keywords: NLP, Data Extraction, Language-Independent Algorithm, RAG (Retrieval-Augmented Generation).
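
The three features named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the features are measured or combined, so the readings used here (own text length for the second feature, 1 / (1 + distance) for depth closeness, and an equal-weight sum of normalized features) are assumptions, as are all function names and the sample page.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}  # their contents are not page text


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = attrs or {}
        self.parent = parent
        self.children = []
        self.texts = []  # text fragments directly inside this element
        self.depth = 0 if parent is None else parent.depth + 1


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree; good enough for a sketch."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs), self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.current.tag not in SKIPPED_TAGS:
            self.current.texts.append(text)


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def subtree_text(node):
    return " ".join(t for n in walk(node) for t in n.texts)


def extract_main_node(html):
    """Return the element with the highest combined score (assumed scoring)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = [n for n in walk(builder.root) if n is not builder.root]
    if not nodes:
        return None
    avg_depth = sum(n.depth for n in nodes) / len(nodes)

    def features(n):
        text_children = sum(1 for c in n.children if c.texts)  # feature 1
        own_text_len = sum(len(t) for t in n.texts)             # feature 2
        closeness = 1.0 / (1.0 + abs(n.depth - avg_depth))      # feature 3
        return text_children, own_text_len, closeness

    feats = [features(n) for n in nodes]
    maxima = [max(column) or 1 for column in zip(*feats)]
    # Assumption: equal-weight sum of features, each scaled by its maximum.
    scores = [sum(v / m for v, m in zip(f, maxima)) for f in feats]
    best = max(range(len(nodes)), key=scores.__getitem__)
    return nodes[best]


SAMPLE_NEWS_HTML = """
<html><body>
<div id="nav"><a>Home</a><a>About</a></div>
<div id="article">
<p>First paragraph of the story with plenty of text.</p>
<p>Second paragraph continues the story in detail.</p>
<p>Third paragraph wraps the story up.</p>
</div>
<div id="footer"><span>Copyright</span></div>
</body></html>
"""

if __name__ == "__main__":
    main = extract_main_node(SAMPLE_NEWS_HTML)
    print(main.tag, main.attrs, subtree_text(main))
```

On the sample page, the article container has the most text-bearing direct children and sits closest to the average depth, so it scores highest. Real news pages would need a tolerant HTML parser and tuned weights.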

Taking into Account the Structure of the Document in the Method of Automatic Annotation of Mathematical Concepts in Educational Texts

Konstantin Sergeevich Nikolaev

558-577

Abstract:

Enriching educational texts with semantic content (in particular, adding hyperlinks to the pages of a service that displays detailed information about the concepts in the text) helps students assimilate the material more efficiently. Existing methods of semantic markup of educational texts do not take the structural features of such documents into account, which leads to excessive recognition of concepts. This article describes how the method of automatic annotation of mathematical concepts in educational mathematical texts was extended to account for the structure of an educational document. The main purpose of the method is to process the educational materials of the distance education course "Technology for solving planimetric problems". Because the course pages follow a single template, the method can analyze the web page markup and the keywords used by the course creators. The main task in this process is to determine the type of each table cell containing text fragments of educational materials. In accordance with the recommendations of the course creators, definitions should be highlighted in the cells containing the task statement and in the blocks that give the input data of the task. The type of a table cell is determined by analyzing its attributes and searching for keywords in its contents. Restricting the recognized text fragments in this way will improve the student's perception of the course pages and the quality of learning.

Keywords: semantic analysis, mathematical ontology, didactic relations, mathematical education, document markup.
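
The cell-typing step described in the abstract (attributes first, then keywords in the contents) might look roughly like the sketch below. The class names, keywords, cell types, and sample table are all hypothetical, since the abstract does not give the actual course template; nested tables are not handled.

```python
from html.parser import HTMLParser

# Hypothetical markers: the real template's attributes and keywords
# are not stated in the abstract.
CLASS_TO_TYPE = {"task-statement": "statement", "task-given": "given"}
KEYWORD_TO_TYPE = {"problem": "statement", "given": "given"}
ANNOTATED_TYPES = {"statement", "given"}  # cells where concepts are highlighted


class CellCollector(HTMLParser):
    """Collects (attributes, text) for every <td> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._attrs = None  # attributes of the cell currently open, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._attrs = dict(attrs)
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._attrs is not None:
            self.cells.append((self._attrs, " ".join(self._parts)))
            self._attrs = None

    def handle_data(self, data):
        text = data.strip()
        if self._attrs is not None and text:
            self._parts.append(text)


def classify_cell(attrs, text):
    """Type a cell by its class attribute first, then by a leading keyword."""
    for cls in (attrs.get("class") or "").split():
        if cls in CLASS_TO_TYPE:
            return CLASS_TO_TYPE[cls]
    lowered = text.lower()
    for keyword, cell_type in KEYWORD_TO_TYPE.items():
        if lowered.startswith(keyword):
            return cell_type
    return "other"


def cells_to_annotate(html):
    """Return the text of cells in which concepts should be annotated."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    return [text for attrs, text in collector.cells
            if classify_cell(attrs, text) in ANNOTATED_TYPES]


SAMPLE_COURSE_HTML = """
<table>
<tr><td class="task-statement">Prove that the medians of a triangle meet at one point.</td></tr>
<tr><td>Given: triangle ABC with medians AM and BN.</td></tr>
<tr><td>Solution: consider the point where AM and BN intersect.</td></tr>
</table>
"""

if __name__ == "__main__":
    for cell_text in cells_to_annotate(SAMPLE_COURSE_HTML):
        print(cell_text)
```

Only the statement and input-data cells are passed on to concept recognition; the solution cell is skipped, which is the restriction the abstract credits with reducing excessive recognition.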
