Russian Digital Libraries Journal

Published since 1998
ISSN 1562-5419

Search Results

An Algorithmic Framework for Accurately Extracting Main Content from News Websites

Hamza Salem, Alexander Sergeevich Toschev
931-942
Abstract:

A new, precise main content extraction (MCE) algorithm for news websites is presented. The algorithm analyzes the Document Object Model (DOM) structure and content density metrics to identify and extract the informational core of a web page. The approach combines three key features: the maximum number of direct child elements containing text, the maximum textual content without child elements containing text, and the closest position to the average node depth. The algorithm outperformed existing solutions such as Boilerpipe and Readability, achieving 99.96% precision, 99.69% recall, and a 99.80% F1-score on a dataset of 500 diverse web pages. Because its design is language-independent, the algorithm is particularly effective for extracting multilingual content, including languages with complex structures such as Arabic.

Keywords: NLP, Data Extraction, Language-Independent Algorithm, RAG (Retrieval-Augmented Generation).
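
The three features named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the features are measured or combined, so the reading of each feature, the normalized sum used as a score, and the names Node, TreeBuilder, and extract_main_content are all assumptions made for illustration. Only the standard library html.parser module is used.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}  # their contents are not page text


class Node:
    """Hypothetical minimal DOM node: tag, attributes, children, direct text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_parts = []
        self.depth = 0 if parent is None else parent.depth + 1

    @property
    def own_text(self):
        return " ".join(self.text_parts)


class TreeBuilder(HTMLParser):
    """Builds a rough element tree; tolerant of unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.current.tag not in SKIPPED_TAGS:
            self.current.text_parts.append(text)


def text_children(node):
    """Direct children that hold text of their own."""
    return [child for child in node.children if child.own_text]


def extract_main_content(html):
    """Return (best_node, text) under the assumed scoring described above."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = builder.nodes
    if not nodes:
        return None, ""
    avg_depth = sum(n.depth for n in nodes) / len(nodes)

    rows = []
    for n in nodes:
        kids = text_children(n)
        f1 = len(kids)  # feature 1: direct children containing text
        # Feature 2 (assumed reading): text held by direct children that
        # have no text-bearing children of their own.
        f2 = sum(len(c.own_text) for c in kids if not text_children(c))
        dev = abs(n.depth - avg_depth)  # feature 3: distance from mean depth
        rows.append((n, f1, f2, dev))

    max_f1 = max(r[1] for r in rows) or 1
    max_f2 = max(r[2] for r in rows) or 1
    max_dev = max(r[3] for r in rows) or 1
    # Assumed combination: reward features 1 and 2, penalize depth deviation.
    best = max(rows, key=lambda r: r[1] / max_f1 + r[2] / max_f2 - r[3] / max_dev)[0]
    return best, "\n".join(c.own_text for c in text_children(best))


SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/about">About</a></div>
<div id="article">
<p>First paragraph of the story with a reasonable amount of text.</p>
<p>Second paragraph continues the story with more detail.</p>
<p>Third paragraph wraps the story up.</p>
</div>
<div id="footer"><span>Copyright</span></div>
</body></html>
"""

best_node, main_text = extract_main_content(SAMPLE_HTML)
print(best_node.attrs.get("id"))
print(main_text)
```

On this toy page the container holding the three paragraphs scores highest, because it has the most text-bearing children and the most leaf text while sitting near the mean depth. Real news pages would need a weighting tuned on labeled data, which the abstract does not describe.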

Taking into Account the Structure of the Document in the Method of Automatic Annotation of Mathematical Concepts in Educational Texts

Konstantin Sergeevich Nikolaev
558-577
Abstract:

Enriching educational texts with semantic content (in particular, adding hyperlinks to the pages of a service that displays detailed information about the concepts in the text) helps students assimilate the material more efficiently. Existing methods for the semantic markup of educational texts do not take the structural features of such documents into account, which leads to excessive recognition of concepts. This article describes an extension of the method for automatic annotation of mathematical concepts in educational mathematical texts that adds functionality to account for the structure of an educational document. The main purpose of the method is to process the educational materials of the distance education course "Technology for solving planimetric problems". Because all course pages follow a single template, it is possible to analyze the web page markup and the keywords used by the course creators. The main task in this process is to determine the type of each table cell containing text fragments of the educational materials. In accordance with the recommendations of the course creators, definitions should be highlighted in the cells containing the task statement and in the blocks where the input data of the task is given. The type of a table cell is determined by analyzing its attributes and searching for keywords in its contents. Limiting the recognized text fragments in this way is expected to improve students' perception of the course pages and the quality of learning.

Keywords: semantic analysis, mathematical ontology, didactic relations, mathematical education, document markup.
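
The cell-typing step described in the abstract above (inspect a table cell's attributes, then search its contents for keywords) can be sketched as follows. The class names, keywords, and type labels in CLASS_TO_TYPE and KEYWORD_TO_TYPE are hypothetical placeholders, since the abstract does not list the actual attributes or keywords used in the course template, and the helper names are likewise assumptions. The sketch also assumes flat tables and does not handle nested ones.

```python
from html.parser import HTMLParser

# Hypothetical rules; the real course template may use different markers.
CLASS_TO_TYPE = {"task": "statement", "given": "input_data"}
KEYWORD_TO_TYPE = {"problem": "statement", "given": "input_data",
                   "solution": "solution"}
# Per the abstract, only task statements and input data get annotated.
ANNOTATED_TYPES = {"statement", "input_data"}


class CellCollector(HTMLParser):
    """Collects (attributes, text) for every <td> in a flat table."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._attrs = None  # attributes of the cell being read, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._attrs = dict(attrs)
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._attrs is not None:
            self.cells.append((self._attrs, " ".join(self._parts)))
            self._attrs = None

    def handle_data(self, data):
        text = data.strip()
        if self._attrs is not None and text:
            self._parts.append(text)


def classify_cell(attrs, text):
    """Attributes are checked first, then leading keywords in the text."""
    for css_class in (attrs.get("class") or "").split():
        if css_class in CLASS_TO_TYPE:
            return CLASS_TO_TYPE[css_class]
    lowered = text.lower()
    for keyword, cell_type in KEYWORD_TO_TYPE.items():
        if lowered.startswith(keyword):
            return cell_type
    return "other"


def cells_to_annotate(html):
    """Return the text of cells whose type qualifies for concept annotation."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    return [text for attrs, text in collector.cells
            if classify_cell(attrs, text) in ANNOTATED_TYPES]


SAMPLE_PAGE = """
<table>
<tr><td class="task">Problem 1. Find the area of triangle ABC.</td></tr>
<tr><td>Given: AB = 3, BC = 4, angle B is right.</td></tr>
<tr><td>Solution. The area is half the product of the legs.</td></tr>
</table>
"""

for fragment in cells_to_annotate(SAMPLE_PAGE):
    print(fragment)
```

Restricting annotation to the returned fragments mirrors the goal stated in the abstract: concepts in solution cells are left unmarked, so the page is not cluttered with redundant links.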

© 2015-2026 Kazan Federal University; Institute of the Information Society