A method for detecting artificial and non-scientific texts in the collection of documents


Олег Юрьевич Бахтеев
Маргарита Валерьевна Кузнецова
Алексей Владимирович Романов
Юрий Викторович Чехович

Abstract

We propose a method for detecting machine-generated and non-scientific texts in a collection of scientific papers. The method applies language modeling to the lexical and morphological features of the examined document. This makes it possible to estimate the probability that a text belongs to the class of scientific documents. Experiments demonstrate the feasibility of the approach.
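
The abstract does not specify the model, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes two add-one smoothed word bigram models, one trained on scientific text and one on other text, and combines them with Bayes' rule to estimate the probability that a document is scientific. The class `BigramModel`, the function `scientific_probability`, the toy corpora, and the 0.5 prior are all hypothetical, and the morphological features mentioned above are omitted.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would add morphological analysis.
    return text.lower().split()


class BigramModel:
    """Add-one smoothed word bigram model (illustrative assumption)."""

    def __init__(self, texts):
        self.contexts = Counter()  # how often each word occurs as a left context
        self.bigrams = Counter()   # how often each (previous, current) pair occurs
        vocab = set()
        for text in texts:
            tokens = ["<s>"] + tokenize(text)
            vocab.update(tokens)
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def log_prob(self, text):
        # Sum of smoothed log P(current | previous) over the document.
        tokens = ["<s>"] + tokenize(text)
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def scientific_probability(text, sci_model, other_model, prior=0.5):
    # Posterior P(scientific | text) from the two class likelihoods (Bayes' rule).
    log_sci = sci_model.log_prob(text) + math.log(prior)
    log_other = other_model.log_prob(text) + math.log(1.0 - prior)
    top = max(log_sci, log_other)  # subtract the max for numerical stability
    p_sci = math.exp(log_sci - top)
    p_other = math.exp(log_other - top)
    return p_sci / (p_sci + p_other)


# Toy corpora, invented purely for illustration.
sci_model = BigramModel([
    "we propose a method for text classification",
    "the experiments show the method is accurate",
])
other_model = BigramModel([
    "buy cheap pills online now",
    "click here to win a free prize",
])

print(scientific_probability("we propose a method", sci_model, other_model))
print(scientific_probability("win a free prize", sci_model, other_model))
```

In this sketch a document whose word sequences resemble the scientific corpus receives a probability above 0.5, and a document resembling the other corpus receives one below 0.5. Real collections would need far larger training sets and a threshold chosen on held-out data.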


Author Biographies

Олег Юрьевич Бахтеев

Senior researcher, Antiplagiat Company.

Маргарита Валерьевна Кузнецова

Head of research department, Antiplagiat Company.

Алексей Владимирович Романов

Assistant, ABBYY Company.

Юрий Викторович Чехович

Chief Executive Officer, Antiplagiat Company, PhD (Mathematics).


