
Study results for the detection of matching content using citation analysis

Vadim Nikolaevich Gureev, Nikolay Alekseevich Mazov

322-331

Abstract:

Translated plagiarism has become widespread in the scientific world and poses a serious problem because it is difficult to detect automatically. However, some progress has been made in this area over the last five years. The authors of this paper and, independently, a foreign research team from several universities have proposed an approach to plagiarism detection based on citation analysis: for a suspected paper, the original source is sought among papers with the same or similar references. The developed methods for detecting the illegitimate use of borrowed text have successfully passed several tests. The report presents the results we have obtained over the last four years.

Keywords: detection of matching content, translated plagiarism, plagiarism detection, citation analysis, bibliographic database.
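
The citation-based search described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `threshold` parameter, the use of Jaccard similarity, and the assumption that references are already normalized to language-independent keys (such as DOIs) are all hypothetical choices made for illustration.

```python
def reference_overlap(refs_a, refs_b):
    """Jaccard similarity between two collections of normalized reference keys."""
    a, b = set(refs_a), set(refs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_sources(suspect_refs, corpus, threshold=0.5):
    """Return (paper_id, score) pairs whose reference lists overlap the suspect's.

    `corpus` maps a paper id to its list of reference keys. The threshold is an
    illustrative assumption, not a value taken from the paper.
    """
    scored = [(pid, reference_overlap(suspect_refs, refs)) for pid, refs in corpus.items()]
    hits = [(pid, score) for pid, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy data: reference keys are made-up identifiers.
corpus = {
    "paper-1": ["doi:a", "doi:b", "doi:c", "doi:d"],
    "paper-2": ["doi:x", "doi:y"],
}
suspect = ["doi:a", "doi:b", "doi:c"]
result = candidate_sources(suspect, corpus)
```

Because reference lists usually survive translation almost unchanged, a high overlap score flags a candidate source even when the body text is in another language; a flagged pair would still need manual review.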

The Use of Thematic Analysis Methods in Scientometric Systems

Alexander Sergeevich Kozitsyn, Sergey Alexandrovich Afonin, Dmitry Alekseevich Shachnev

315-338

Abstract:

Modern scientometric and citation systems use various mechanisms for thematic search and thematic filtering of information. In most cases, articles and journals are analyzed thematically with a full-text approach, which has a number of limitations. Algorithms based on graph analysis, used either on their own or together with full-text algorithms, eliminate these limitations and improve the recall and precision of subject search. The algorithm developed by the authors and presented in this work uses the co-authorship graph to analyze the thematic proximity of journals. The algorithm is insensitive to the language of the journal and selects similar journals in different languages, which is difficult to achieve with algorithms based on the analysis of full-text information. The algorithm was tested in the scientometric system IAS ISTINA. In the interface developed for this purpose, users can select one journal close to their subject area, and the system automatically generates a selection of journals that may interest them, both as sources of material to study and as venues for publishing their own articles. In the future, the algorithm can be adapted to search for similar conferences, collections of publications and scientific projects. Such a tool will increase the publication activity of young employees, the citation rate of articles and the rate of citation between journals. The results of the algorithm for determining thematic proximity between journals, collections, conferences and scientific projects can also be used to build rules in models that differentiate access to data based on domain ontologies.

Keywords: thematic classification, bibliographic data, co-authorship graph, information systems.
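
As a rough illustration of why an author-based measure is language-independent, the sketch below ranks journals by the overlap of their author sets. This is a deliberate simplification of a co-authorship graph analysis and not the algorithm used in IAS ISTINA; the function names, the Jaccard score, and the sample data are assumptions.

```python
from collections import defaultdict


def journal_author_sets(publications):
    """Map each journal to the set of authors who have published in it."""
    authors_by_journal = defaultdict(set)
    for journal, authors in publications:
        authors_by_journal[journal].update(authors)
    return dict(authors_by_journal)


def similar_journals(target, publications, top_n=5):
    """Rank other journals by Jaccard overlap of author sets with `target`."""
    sets = journal_author_sets(publications)
    base = sets.get(target, set())
    scores = []
    for journal, authors in sets.items():
        if journal == target:
            continue
        union = base | authors
        score = len(base & authors) / len(union) if union else 0.0
        if score > 0:
            scores.append((journal, score))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-item[1], item[0]))[:top_n]


# Toy data: (journal, authors of one publication). All names are made up.
publications = [
    ("Journal A", ["ivanov", "petrov"]),
    ("Journal B", ["petrov", "sidorov"]),
    ("Journal C", ["smith"]),
    ("Zhurnal A (RU)", ["ivanov", "petrov"]),
]
ranked = similar_journals("Journal A", publications)
```

No text from the journals is inspected, so a Russian-language journal and an English-language journal with the same contributors score as close neighbours.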

Description of Context-Free Grammars in JSON Format for Parser Generators

Oleg Konstantinovich Osipov

1301-1323

Abstract:

Various representations of context-free grammars accepted by parser generators are analyzed. A new description format for context-free grammars is proposed, and a representation of a context-free grammar in JSON format is given. The concept of a new parser generator based on the JSON description of context-free grammars is presented, and a parser generation scheme based on that concept is described.

Keywords: JSON-document, context free grammars, lexeme, Backus Naur Form, parsing tree, terminal symbols (tokens), deterministic finite state automata, parser, Parglare, ANTLR.
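
To make the idea of a JSON grammar description concrete, here is a small sketch. The schema shown (`start`, `terminals`, `rules`) is a hypothetical layout invented for this example; the format actually proposed in the paper is not reproduced in the abstract and may differ.

```python
import json

# Hypothetical JSON layout: terminals map token names to regular expressions,
# rules map each nonterminal to a list of alternatives (lists of symbols).
GRAMMAR_JSON = """
{
  "start": "expr",
  "terminals": {"NUM": "[0-9]+", "PLUS": "[+]"},
  "rules": {
    "expr": [["expr", "PLUS", "term"], ["term"]],
    "term": [["NUM"]]
  }
}
"""


def load_grammar(text):
    """Parse the JSON text and check that every symbol used is defined."""
    grammar = json.loads(text)
    rules = grammar["rules"]
    terminals = grammar["terminals"]
    if grammar["start"] not in rules:
        raise ValueError("start symbol has no productions")
    for head, alternatives in rules.items():
        for alternative in alternatives:
            for symbol in alternative:
                if symbol not in rules and symbol not in terminals:
                    raise ValueError(f"undefined symbol {symbol!r} in rule {head!r}")
    return grammar


def to_bnf(grammar):
    """Render the rules in a BNF-like notation, one nonterminal per line."""
    lines = []
    for head, alternatives in grammar["rules"].items():
        rhs = " | ".join(" ".join(alternative) for alternative in alternatives)
        lines.append(f"{head} ::= {rhs}")
    return "\n".join(lines)


grammar = load_grammar(GRAMMAR_JSON)
bnf = to_bnf(grammar)
```

A practical advantage of such a description is that any language with a JSON library can read and validate the grammar before a parser generator consumes it.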

Recommender system of text analytics of legal documents

Denis Sergeevich Zuev, Marat Faritovich Nasrutdinov, Airat Faridovich Khasyanov

435-449

Abstract:

The paper discusses the use of machine learning, natural language analysis and intelligent search in the field of jurisprudence. The main expected result is a methodology for applying text analytics and semantic natural language processing (NLP) algorithms to knowledge management in different types of legal practice. The results can be applied to education and knowledge management in a wider context, since the study lies at the intersection of jurisprudence and mathematical and computational linguistics.

We describe a prototype of a multi-agent system for the intelligent analysis of legal texts that is capable of identifying general dependencies in an existing database of legal documents, retrieving legal cases on similar topics, and recommending the most likely outcomes of judicial review.

Keywords: data analytics and data mining, data intensive domains, digital libraries, clustering, classification of judicial acts, recommender system, micro-service architecture.

Analysis of Word Embeddings for Semantic Role Labeling of Russian Texts

Leysan Maratovna Kadermyatova, Elena Victorovna Tutubalina

1026-1043

Abstract: A large body of work is dedicated to semantic role labeling of English texts [1–3]. Semantic role labeling of Russian texts, however, remained unexplored for many years because training and test corpora were lacking, and it became widespread only after the FrameBank corpus appeared [4]. In this work, we analyzed the influence of word embedding models on the quality of semantic role labeling of Russian texts. Micro- and macro-F1 scores were calculated for the word2vec [5], fastText [6] and ELMo [7] embedding models. The experiments showed that, on the Russian FrameBank corpus, fastText models performed slightly better on average than word2vec models. The deep contextualized word representation model ELMo achieved higher micro- and macro-F1 scores than the classical shallow embedding models.

Keywords: machine learning, ML-model, natural language processing, word embedding, semantic role labeling.
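
The micro- and macro-F1 scores mentioned in the abstract above differ in how they aggregate over role labels. The following sketch computes both for single-label predictions; the role names and data are illustrative and are not taken from FrameBank or from the paper's experiments.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for parallel lists of gold and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    labels = sorted(set(gold) | set(pred))
    # Macro: average of per-label F1, so rare roles weigh as much as frequent ones.
    macro = sum(f1(tp[label], fp[label], fn[label]) for label in labels) / len(labels)
    # Micro: F1 over pooled counts, so frequent roles dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Illustrative role labels only.
gold = ["agent", "agent", "patient", "agent"]
pred = ["agent", "patient", "patient", "agent"]
micro, macro = micro_macro_f1(gold, pred)
```

Reporting both numbers is useful for role labeling because role frequencies are typically skewed: a model can score well on micro-F1 while performing poorly on rare roles, which macro-F1 exposes.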

Taking into Account the Structure of the Document in the Method of Automatic Annotation of Mathematical Concepts in Educational Texts

Konstantin Sergeevich Nikolaev

558-577

Abstract:

Enriching educational texts with semantic content (in particular, adding hyperlinks to the pages of a service that displays detailed information about the concepts in the text) helps students assimilate the material more efficiently. Existing methods for the semantic markup of educational texts do not take into account the structural features of such documents, which leads to excessive recognition of concepts. This article describes how the method of automatic annotation of mathematical concepts in educational mathematical texts was extended with functionality that accounts for the structure of an educational document. The main purpose of the method is to process the educational materials of the distance education course "Technology for solving planimetric problems". Because the course pages follow a single template, it is possible to analyze the web page markup and the keywords used by the course creators. The main task in this process is to determine the type of each table cell containing text fragments of educational materials. In accordance with the recommendations of the course creators, definitions should be highlighted in the cells containing the task statement, as well as in the blocks where the input data of the task is given. The type of a table cell is determined by analyzing its attributes and searching for keywords in its contents. Limiting the recognized text fragments in this way will improve the student's perception of the course pages and the quality of learning.

Keywords: semantic analysis, mathematical ontology, didactic relations, mathematical education, document markup.

Semantic Annotation of Mathematical Formulas in PDF-Documents

Olga Avenirovna Nevzorova, Konstantin Sergeevich Nikolaev

616-639

Abstract:

This article provides an overview of existing solutions for semantic analysis of mathematical documents, and also presents a method for automatic semantic analysis of documents in PDF format. This method searches for local variables in the text of the article, extracts their definitions and connects concepts with formulas. The advantage of the method over the existing ones is independence from the markup of the original PDF document, which expands the scope of the method. We provide estimates of recall, precision and F-measure for algorithms for finding variables and linking local variables with formulas. The resulting semantic markup of the document will be used to create a collection of documents suitable for the semantic formula search service, which is part of the set of services of the Lobachevskii-DML digital publishing system.

Keywords: semantic analysis, PDF, document processing, scientific journals, Lobachevskii-DML.

Software Tool for Videoproduction Optimisation

Rustem Faridovich Davletshin, Irina Sergeevna Shakhova

478-502

Abstract:

The paper proposes software mechanisms aimed at improving video production processes for the authors of artistic video materials. We propose a mechanism for creating animated three-dimensional shooting plans (storyboards) that uses augmented reality to position actors and animate their movement. To overcome the limitations of the iOS operating system on access to sensors, we developed a mechanism that captures the audio and video streams from the device sensors separately and then synchronizes them by timestamps before saving them to device memory. Computer vision technologies are used to check compliance with the rules of composition and to analyze image quality. The paper also presents mechanisms for working with the script, including text processing algorithms for displaying subtitles on the screen and speech recognition algorithms for comparing the recognized speech of actors with the text of the script.

Keywords: video production, movie making, mobile cinema, augmented reality, storyboard, video recordings, automation, software solution.

A method for detecting artificial and non-scientific texts in the collection of documents

Oleg Yurievich Bakhteev, Margarita Valerievna Kuznetsova, Alexey Vladimirovich Romanov, Yury Viktorovich Chekhovich

298-304

Abstract: In this paper, we propose a method for detecting machine-generated and non-scientific texts in a collection of scientific papers. The method is based on lexical and morphological analysis of the examined document with the help of language modeling. This technique makes it possible to estimate the probability that the text belongs to the class of scientific documents. Experimental evidence shows the feasibility of the approach.

Keywords: natural language processing, document classification, text mining, statistical language models, machine-generated text detection.

Teaching Mathematical Disciplines Using the Mirera Digital Educational Platform

Alexander Georgievich Leonov

312-323

Abstract:

The article describes the experience of the digital transformation of mathematical disciplines based on the author's digital educational platform (DEP) Mirera. The Mirera DEP is optimized for the Russian system of higher education and is focused on developing and delivering courses that combine online and offline technologies in the educational process. The platform provides course authors with tools for developing computerized courses in which artificial intelligence methods automatically verify the correctness and independence of the current and control tasks completed by students. Various original types of tests are built into the platform. They support content in various formats, both in task descriptions and in answer options, including TeX, sequences of elements (for automated testing of students' knowledge of the structure of proofs of course theorems or of schemes for solving typical problems), semantic analysis of text responses, etc.

Keywords: adaptive learning, DEP Mirera, digital educational platform, programming, web applications.

Experimental Study of Cognitive Function of Generating Elliptical Sentences in Planimetric Tasks

Vladimir Andreevich Parkhomenko, Xenia Aleksandrovna Naidenova, Tat’yana Aleksandrovna Martirova, Alexander Valentinovich Schukin

316-335

Abstract:

The paper is devoted to the study of the cognitive function associated with the generation of elliptical sentences in the Russian language. The study tests this cognitive ability using a computer system specially developed by the authors for this purpose. Testing of this cognitive ability is proposed and implemented for the first time. The system is an extension of Moodle and is openly hosted in a GitHub repository. Elliptical constructions are limited to verbal and nominal ellipses, which can in theory be fully reconstructed from the context of the sentence. SPbPU students took part in the study as respondents. The texts of planimetric tasks were chosen as the subject area. Analysis of the testing data gave the following results: the respondent's knowledge of the subject area (planimetry) influences the test results; respondents show a tendency towards self-study, which manifests itself in shorter times and higher scores as they pass more tests; and respondents are poorly motivated if they do not see feedback on their answer to a completed task. The paper discusses the problems of further developing the testing system and of using it to adapt questionnaires (tasks) for assessing the knowledge of SPbPU students in the field of automated bug detection in programs, as well as for diagnosing the functional state of operator specialists and for the express diagnosis of dementia. It also seems promising to use the system to improve the syntactic parsing of elliptical sentences and to automate the restoration of ellipses in the subject area of planimetry.

Keywords: online testing system, development, experiments, cognitive function, ellipsis, planimetry.

Determining the Thematic Proximity of Scientific Journals and Conferences Using Big Data Technologies

Alexander Sergeevich Kozitsin, Sergey Alexandrovich Afonin, Dmitry Alekseevich Shachnev

514-525

Abstract: The number of scientific journals published in the world is very large, so software tools are needed for analyzing the thematic links between journals. The algorithm presented in this paper uses co-authorship graphs to analyze the thematic proximity of journals. It is insensitive to the language of the journal and can find similar journals in different languages, a task that is difficult for algorithms based on the analysis of full-text information. The algorithm was tested in the scientometric system IAS ISTINA. Using a special interface, a user can select one journal of interest, and the system then automatically generates a selection of journals that may also interest the user. In the future, the algorithm can be adapted to search for similar conferences, collections of publications and research projects. Such tools will increase the publication activity of young employees, the citation of articles and cross-citation between journals. In addition, the results of the algorithm for determining thematic proximity between journals, collections, conferences and research projects can be used to build rules in ontology models for access control systems.

Keywords: thematic classification, bibliographic data, graph of co-authorship, information systems.

Authors Identification within the Subject Area in the Semantic Library

Olga Muratovna Ataeva, Vladimir Alekseevich Serebriakov, Natalia Pavlovna Tuchkova

198-217

Abstract:

The paper considers the specifics of identifying authors and determining an author's contribution to publications in digital bibliographic codes. Insufficient identification manifests itself in repeated information, duplication, authors with completely identical names, self-citation, self-plagiarism, and plagiarism proper. It is proposed to use publication information already accumulated in the digital library in the form of related subject-area data and a variety of target thesaurus data for the author as a user of the library. This information contains links through which keyword contexts, sets of co-authors, and term associations in dictionaries and thesauri can be used to identify authorship. It is important that the array under consideration consists of scientific publications, since they have an established traditional structure that allows fixed text elements (abstracts, keywords, classifier codes, etc.) to be compared. Thus, even if the names in the publications match completely, authorship can be questioned if the publications in the digital library belong to different subject areas. Such contradictions are resolved by evaluating the set of links among all elements of the secondary publication information. The comparison may result either in adding the author to a specific area, i.e. extending the addressee's thesaurus and the author's personal thesaurus, or in the appearance of full namesakes in the library who work in different areas of knowledge. It is shown that modern data analysis tools make it possible to evaluate an author's contribution to a publication, although, of course, only the scientific community can evaluate the real contribution to scientific research.

Keywords: comparison of scientific texts, semantic search, thesaurus for the ontology of knowledge information, query using the thesaurus methods of authors identification, addressee thesaurus, secondary information, individual frequency dictionary, LibMeta.

Extraction of aspects of goods and services from consumers reviews using conditional random fields model

Yuliya Vladimirovna Rubtsova, Sergey Andreevich Koshelnikov

203-221

Abstract:

This paper describes the information extraction system that was presented at SentiRuEval-2015, an evaluation of aspect-based sentiment analysis of users' reviews in Russian. The proposed system uses a conditional random field algorithm to extract aspect terms mentioned in the text. A set of morphological features was used for machine learning. The system was designed to perform two subtasks: Task A, automatic extraction of explicit aspects, and Task B, automatic extraction of all aspects (explicit, implicit and sentiment facts). It was tested on two domains: restaurants and automobiles. Our system performed competitively and showed results comparable to those of the other 10 participants.

Keywords: information retrieval, CRF, aspect extraction, content analysis.

Interactive Structure Editor for Scenario Prototyping Tool

Gulnara Faritovna Sahibgareeva, Vlada Vladimirovna Kugurakova

1184-1202

Abstract:

This work continues the task, set out in earlier works, of automating the routine work of computer game writers and narrative designers. It considers the visualization of branching narrative structures in computer games and analyzes various approaches to visualizing the plot and other important components of a video game. A technology stack is selected, and specific solutions are given for storing the narrative as a structured script, which allows continuing narrative branches to be generated and the narrative prototyping stage to be tested using an automatically generated text novelette.

Keywords: interactive storytelling, computer games, game script, visualization, branched structures, graphs, narrative prototyping, script prototype, GPT-2, ruGPT3, Python, Unity.
