
Development of a Data Validation Module to Satisfy the Retention Policy Metric

Aigul Ildarovna Sibgatullina, Azat Shavkatovich Yakupov

159-178

Abstract:

The global big data market grows every year, and analysing these data is essential for good decision-making. When large amounts of information must be stored, big data technologies such as cloud services and distributed file systems significantly reduce costs. The quality of data analytics depends on the quality of the data themselves. This is especially important when the data are subject to a retention policy and migrate from one source to another, which increases the risk of data loss. The negative consequences of data migration are prevented through data reconciliation: a comprehensive verification of large amounts of information to confirm their consistency.

This article discusses probabilistic data structures that can be used to solve this problem and proposes an implementation: a data integrity verification module based on a Counting Bloom filter. The module is integrated into Apache Airflow so that it is invoked automatically.

Keywords: big data, retention policy, partition, parquet file, Bloom filter.
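
As an illustration of the data structure named in the abstract above, the following is a minimal sketch of a Counting Bloom filter used for reconciliation. It is not the authors' module: the class name, the `find_missing` helper, the filter size, the number of hash functions, and the use of salted SHA-256 are all assumptions chosen for brevity.

```python
import hashlib
from typing import Iterable, Iterator, List


class CountingBloomFilter:
    """Bloom filter variant with one counter per slot, so items can be removed."""

    def __init__(self, size: int = 1024, num_hashes: int = 3) -> None:
        # size and num_hashes are illustrative defaults, not tuned values.
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, item: str) -> Iterator[int]:
        # Derive num_hashes slot indexes by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for i in self._indexes(item):
            self.counters[i] += 1

    def remove(self, item: str) -> bool:
        # Decrement only if the item may be present, so counters never go negative.
        if item not in self:
            return False
        for i in self._indexes(item):
            self.counters[i] -= 1
        return True

    def __contains__(self, item: str) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all(self.counters[i] > 0 for i in self._indexes(item))


def find_missing(source_keys: Iterable[str], target_keys: Iterable[str]) -> List[str]:
    """Return source keys that are definitely absent from the target (hypothetical helper)."""
    target_filter = CountingBloomFilter()
    for key in target_keys:
        target_filter.add(key)
    return [key for key in source_keys if key not in target_filter]


# Hypothetical row keys before and after a migration.
missing = find_missing(["row-1", "row-2", "row-3"], ["row-1", "row-3"])
print("keys reported missing after migration:", missing)
```

A key reported as missing is certainly absent from the target, while a lost key can occasionally go unreported because of a false positive. The per-slot counters are what allow removal, which suits data that expire under a retention policy.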

Intelligent search of complex objects in Big Data

Alexander Mikhailovich Gusenkov

40-76

Abstract: This article considers an approach to the intelligent search of complex objects in different types of texts with structural markup, which can be used for Big Data processing. We study two types of input data: relational databases, whose schemas serve as structural markup, and full-text scientific documents containing mathematical expressions (formulae). For such full-text documents we propose additional automated markup to enable formula search. In both cases we use natural-language texts, which are semi-structured data, as the data source for building an ontology and later conducting the search. For relational databases these are the comments on table and table attribute names; for scientific documents (articles, monographs, etc.) it is the text content of the marked-up documents.

Keywords: Big Data, semantic search, semi-structured data, ontology, relational databases, science texts, mathematical expressions markup.

Building Subject Domain Ontology on the Base of a Logical Data Model

Alexander M. Gusenkov, Naille R. Bukharaev, Evgeny V. Biryaltsev

390-417

Abstract: A technology for the automated construction of a subject domain ontology, based on information extracted from the comments in the relational databases of the TATNEFT oil company, is considered. The technology is based on building a converter (compiler) that translates the Epicenter logical data model of the Petrotechnical Open Software Corporation (POSC), presented as ER diagrams and a set of descriptions in the EXPRESS object-oriented language, into the OWL ontology description language recommended by the W3C consortium. The basic syntactic and semantic aspects of the transformation are described.

Keywords: subject domain ontology, relational databases, POSC, OWL.

Trends in big data processing technologies and tools for multi-format data storage and analytics

Marat Ramilevich Biktimirov, Alexander Mikhailovich Elizarov, Andrey Yurievich Shcherbakov

390-407

Abstract: This article analyzes development trends in tools for processing Big Data and for storing and analyzing multi-format data. The analysis was carried out as part of the basic research program of the Department of Mathematical Sciences, Russian Academy of Sciences, “Algebraic and Combinatorial Methods of Mathematical Cybernetics and Information Systems of the New Generation”, as well as under RFBR grant number 14-07-00783, “Ways to store and process large volumes of scientific and reference data on modern hardware platforms”.

Keywords: Big Data, storage systems, analysis, information, software, grid computing, cloud computing.

The History Of Genius Discovery Software

Roman Valerʹevich Mosolov

1239-1278

Abstract:

This article describes the concept of the History of Genius Discovery (History GD) software. The software has some similarities with GitHub, which has become widely known in the professional developer community. It is intended to address two main scientific issues. History GD will preserve the scientific and cultural heritage of Russian scientists and accumulate initial data for measuring trends in the formation of scientific theorems. The latter will make it possible to supplement The Structure of Scientific Revolutions by Thomas S. Kuhn with numerical big data. The software will also reduce the probability of scientific manuscripts being lost when scientists die. Software engineering, sociology, philosophy, law, and history are the five scientific fields on which the software is based. The idea arose at Kazan Federal University while we were studying Big Data Science.

Keywords: The History of Genius Discovery, History GD, scientific heritage, cultural heritage, genius patterns, scientific software, software for scientists, GitHub for scientists.

Semantic analysis of documents in the control system of digital scientific collections

Shamil Makhmutovich Khaydarov

61-85

Abstract: Methods for the semantic parsing of documents in a control system for digital scientific collections, including electronic journals, are proposed. Methods for processing documents that contain mathematical formulas and for converting documents from OpenXML format to TeX format are considered. A search algorithm for mathematical formulas in collections of documents stored in OpenXML format is designed. The algorithm is implemented as an online service on the science.tatarstan platform.

Keywords: semantic analysis, publishing systems.

Towards Virtual Data Centres for Remote Sensing

E.B. Kudashev, M.A. Popov

Abstract: Remote sensing from satellites allows a global perspective on observations of the Earth to be developed. This paper gives an overview of some of the international initiatives that have been created to improve the exploitation of remotely sensed data for environmental studies. The focus is on the activities and scientific challenges facing GEO/GEOSS on Earth Observation. Other relevant international initiatives are also presented, such as CEOS, GMES and APARSEN. The benefits of creating a Virtual Centre of Remote Sensing Data in are also discussed.

Keywords: Remote Sensing, Infrastructure for Scientific Information Resources, GeoPortal, CEOS - Committee on Earth Observation Satellites, GMES - Global Monitoring for Environment and Security, APARSEN - Alliance for Permanent Access to Records of Science.

Determining the Thematic Proximity of Scientific Journals and Conferences Using Big Data Technologies

Alexander Sergeevich Kozitsin, Sergey Alexandrovich Afonin, Dmitriy Alekseevich Shachnev

514-525

Abstract: The number of scientific journals published in the world is very large, so software tools are needed for analyzing the thematic links between journals. The algorithm presented in this paper uses co-authorship graphs to analyze the thematic proximity of journals. It is insensitive to the language of the journal and can find similar journals in different languages, a task that is difficult for algorithms based on the analysis of full-text information. The algorithm was tested in the scientometric system IAS ISTINA. Using a special interface, a user selects one journal of interest, and the system automatically generates a selection of journals that may also be of interest to the user. In the future, the algorithm can be adapted to search for similar conferences, collections of publications and research projects. The use of such tools will increase the publication activity of young employees and increase both the citation of articles and cross-citation between journals. In addition, the results of determining thematic proximity between journals, collections, conferences and research projects can be used to build rules in ontology models for access control systems.

Keywords: thematic classification, bibliographic data, graph of co-authorship, Information Systems.
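
The abstract above does not give the algorithm itself, so the following is only a simplified sketch of the general idea: two journals are treated as thematically close when the same authors publish in both. It uses Jaccard similarity over author sets as a stand-in for a full co-authorship graph analysis. The function names, journal names and author names are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of authors common to both sets, relative to all authors in either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_journals(
    target: str, journal_authors: Dict[str, Set[str]], top_n: int = 3
) -> List[Tuple[str, float]]:
    """Rank the other journals by author overlap with the target journal."""
    scores = [
        (name, jaccard(journal_authors[target], authors))
        for name, authors in journal_authors.items()
        if name != target
    ]
    # Highest similarity first; ties are broken alphabetically for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


# Hypothetical bibliographic data: journal name -> set of author identifiers.
journal_authors = {
    "Journal A": {"ivanov", "petrov", "sidorov"},
    "Journal B": {"petrov", "sidorov", "smirnov"},
    "Journal C": {"kuznetsov"},
}

print(similar_journals("Journal A", journal_authors))
# → [('Journal B', 0.5), ('Journal C', 0.0)]
```

Because the score depends only on who publishes where, and not on the text of the articles, a measure of this kind is independent of the journal's language, which matches the property the abstract highlights.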
