Abstract:
In this paper, we propose a method for detecting machine-generated and non-scientific text in a collection of scientific papers. The method is based on lexical and morphological analysis of the examined document with the help of language modeling. This technique makes it possible to estimate the probability that a text belongs to the class of scientific documents. Experimental evidence shows the feasibility of the approach.
Keywords:
natural language processing, document classification, text mining, statistical language models, machine-generated text detection.
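Illustration:
The idea of using language models to estimate the probability that a text belongs to the class of scientific documents can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the paper: it uses word-level unigram models with add-one smoothing, equal class priors, and no morphological analysis. All names (tokenize, UnigramModel, scientific_probability) and the toy corpora are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class UnigramModel:
    """Word unigram model with add-one (Laplace) smoothing (an assumption)."""

    def __init__(self, texts, vocabulary):
        self.vocabulary = vocabulary
        self.counts = Counter(
            tok for text in texts for tok in tokenize(text) if tok in vocabulary
        )
        self.total = sum(self.counts.values())

    def log_prob(self, token):
        # Out-of-vocabulary tokens share one extra "unknown" slot,
        # hence the +1 in the denominator.
        count = self.counts.get(token, 0) if token in self.vocabulary else 0
        return math.log((count + 1) / (self.total + len(self.vocabulary) + 1))

    def log_likelihood(self, text):
        """Sum of token log-probabilities; 0.0 for a text with no tokens."""
        return sum(self.log_prob(tok) for tok in tokenize(text))


def scientific_probability(text, sci_model, other_model):
    """Posterior P(scientific | text) under equal class priors (an assumption)."""
    diff = sci_model.log_likelihood(text) - other_model.log_likelihood(text)
    # Numerically stable logistic function of the log-likelihood ratio.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Toy corpora, invented purely for illustration.
scientific_texts = [
    "we propose a method and evaluate the results on a dataset",
    "the experiment shows that the model improves accuracy",
]
other_texts = [
    "buy cheap watches now click here",
    "win money fast click now",
]

vocabulary = {
    tok for text in scientific_texts + other_texts for tok in tokenize(text)
}
sci_model = UnigramModel(scientific_texts, vocabulary)
other_model = UnigramModel(other_texts, vocabulary)

for sample in ["we evaluate the model on a dataset", "click here to win money"]:
    print(sample, "->", round(scientific_probability(sample, sci_model, other_model), 3))
```

A document whose probability falls below a chosen threshold would be flagged as likely non-scientific or machine-generated. In practice, higher-order n-gram models and much larger training corpora would be needed, and the threshold would have to be tuned on held-out data.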