Automatic Addition of SEO Metadata to News Articles Using Qwen-Coder
Main Article Content
Abstract
This paper summarizes a previously developed pipeline for enriching news articles with structured data and presents an updated configuration in which GPT-3, OpenAI's third-generation natural language processing model, is replaced with Qwen-Coder. As before, the enrichment pipeline uses a dataset of 400 pages selected from Google News, Google's free news aggregator, and remains compatible with the Google Rich Results Test, Google's tool for validating pages eligible for rich results. The updated configuration demonstrates that output quality comparable to GPT-3 can be achieved on a low-power desktop PC. We describe how this substitution reduces dependence on paid GPT services and report an evaluation of how similar the outputs produced by Qwen-Coder are to those of the GPT-based baseline. The results also show higher performance for the Qwen-Coder version than for the GPT version. The proposed tools lower the barrier to adopting semantic markup practices and thereby broaden their application in digital journalism. Overall, the findings support Qwen-Coder as a cost-effective alternative to large proprietary models for metadata enrichment tasks.
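To make "enriching news articles with structured data" concrete, the following minimal sketch builds a schema.org NewsArticle description in JSON-LD and embeds it in a page. It is an illustration only, not the authors' pipeline: the function names (build_news_article_jsonld, inject_jsonld), the chosen fields, and the example values are assumptions, and the step in which a language model extracts those values from the article is omitted.

```python
import json


def build_news_article_jsonld(headline, author, date_published, url):
    """Return a schema.org NewsArticle description as a JSON-LD string.

    The field selection is an assumption made for illustration; a real
    pipeline may emit more properties.
    """
    data = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,
        "mainEntityOfPage": url,
    }
    text = json.dumps(data, ensure_ascii=False, indent=2)
    # Escape "</" so that a value containing "</script>" cannot close the tag;
    # "\/" is a valid JSON escape for "/".
    return text.replace("</", "<\\/")


def inject_jsonld(html, jsonld):
    """Insert a JSON-LD <script> block just before </head>.

    Uses a naive string search (hypothetical simplification); if no
    </head> tag is found, the block is prepended to the document.
    """
    script = f'<script type="application/ld+json">\n{jsonld}\n</script>\n'
    pos = html.lower().find("</head>")
    if pos == -1:
        return script + html
    return html[:pos] + script + html[pos:]


if __name__ == "__main__":
    page = "<html><head><title>Example</title></head><body>...</body></html>"
    jsonld = build_news_article_jsonld(
        "Example headline", "Jane Doe", "2024-10-08", "https://example.com/news/1"
    )
    print(inject_jsonld(page, jsonld))
```

A page enriched in this way can then be checked with the Google Rich Results Test mentioned in the abstract.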
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.
By submitting an article for publication in the Russian Digital Libraries Journal (RDLJ), the authors automatically consent to grant Kazan (Volga) Federal University (KFU) a limited license to use the materials (only if the article is accepted for publication). This means that KFU has the right to publish the article in the next issue of the journal (on the website or in printed form), to reprint it in RDLJ archives on CD, and to include it in any information system or database produced by KFU.
All copyrighted materials are placed in RDLJ with the consent of the authors. If any author objects to the publication of their materials on this site, the material can be removed upon written notification to the Editor.
Documents published in RDLJ are protected by copyright, and all rights are reserved by the authors. Authors are themselves responsible for monitoring compliance with their rights to reproduce or translate their papers published in the journal. If material published in RDLJ is reprinted with permission by another publisher or translated into another language, a reference to the original publication must be included.
By submitting an article for publication in RDLJ, authors should take into account that publication on the Internet provides unique opportunities for access to their content but is also a new form of information exchange in the global information society, in which authors and publishers are not always protected against unauthorized copying or other use of copyrighted materials.
RDLJ is copyrighted. When using materials from the journal, the URL must be indicated: index.phtml page = elbib / rus / journal?. No change, addition, or editing of the author's text is allowed. Users may copy individual fragments of articles from the journal and may distribute, remix, adapt, and build upon an article, even commercially, as long as they credit that article for the original creation.
Requests for the right to reproduce or use any of the materials published in RDLJ should be addressed to the Editor-in-Chief, A.M. Elizarov, at the following address: amelizarov@gmail.com.
The publishers of RDLJ are not responsible for the views expressed in published opinion articles.
We ask authors to download from this page the copyright agreement on the transfer of non-exclusive rights to use the work, sign it, and send a scanned copy by e-mail to the journal publisher's address.