
Russian Digital Libraries Journal

Published since 1998
ISSN 1562-5419
16+

Search Results

Development of a Data Validation Module to Satisfy the Retention Policy Metric

Aigul Ildarovna Sibgatullina, Azat Shavkatovich Yakupov
159-178
Abstract:

The global big data market grows every year, and analysing these data is essential for sound decision-making. When large amounts of information must be stored, big data technologies built on cloud services and distributed file systems lead to significant cost reductions. The quality of data analytics depends on the quality of the data themselves. This is especially important when the data are subject to a retention policy and migrate from one source to another, which increases the risk of data loss. Negative consequences of data migration are prevented through data reconciliation: a comprehensive verification of large amounts of information in order to confirm their consistency.

This article discusses probabilistic data structures that can be used to solve this problem and proposes an implementation: a data integrity verification module based on a Counting Bloom filter. The module is integrated into Apache Airflow to automate its invocation.

Keywords: big data, retention policy, partition, parquet file, Bloom filter.
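The abstract names a Counting Bloom filter as the core structure of the verification module but gives no implementation details. As a rough, hypothetical sketch (slot count, hash count, and the SHA-256-based hashing scheme are all illustrative choices, not the authors' design), such a filter replaces each bit of a classic Bloom filter with a counter, so records can be removed as well as added:

```python
import hashlib


class CountingBloomFilter:
    """Minimal Counting Bloom filter: each slot holds a counter
    instead of a single bit, so items can be removed as well as added."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Derive num_hashes slot indexes from one SHA-256 digest
        # by slicing it into 4-byte chunks.
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i : 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.counters[idx] += 1

    def remove(self, item):
        for idx in self._indexes(item):
            self.counters[idx] -= 1

    def might_contain(self, item):
        # False positives are possible; false negatives are not,
        # as long as only previously added items are removed.
        return all(self.counters[idx] > 0 for idx in self._indexes(item))


cbf = CountingBloomFilter()
cbf.add("record-42")
print(cbf.might_contain("record-42"))  # True
cbf.remove("record-42")
print(cbf.might_contain("record-42"))  # False
```

In a reconciliation setting, records from the source can be added and records found at the destination removed; surviving non-zero counters then signal a likely mismatch.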

Creating a comparison method for relational tables

Azat Shavkatovich Yakupov, Daniil Andreevich Klinov
173-183
Abstract: The article is devoted to creating a fast method for comparing very large data tables in relational database management systems. Creating an effective method for comparing relational tables is highly relevant today. A study of existing solutions was conducted. The algorithm presented in this article was built on the probabilistic data structure known as the Counting Bloom filter and on the Monte Carlo method. The proposed solution is distinctive in its field, as it requires the least amount of time resources. A probabilistic model of the algorithm is constructed, and the algorithm can be parallelized.
Keywords: multiset, comparison of relational tables, heterogeneous system, Counting Bloom filter, Monte Carlo method, replication, Oracle, PostgreSQL, probabilistic data structure.
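The abstract treats each table as a multiset of rows. One way to sketch the comparison idea (the slot count, hash count, and row-encoding below are illustrative assumptions, not the published algorithm) is to add every row of table A into a counter array, remove every row of table B, and check whether all counters return to zero:

```python
import hashlib


def _slots(row, size=2048, k=4):
    # Map one row to k counter slots via slices of a SHA-256 digest.
    digest = hashlib.sha256(repr(row).encode()).digest()
    return [int.from_bytes(digest[4 * i : 4 * i + 4], "big") % size
            for i in range(k)]


def tables_probably_equal(rows_a, rows_b, size=2048):
    """Compare two tables as multisets of rows: increment counters
    for every row of A, decrement for every row of B. If every
    counter is zero afterwards, the tables are probably identical
    (a false match is possible but can be made arbitrarily unlikely
    by enlarging the counter array)."""
    counters = [0] * size
    for row in rows_a:
        for s in _slots(row, size):
            counters[s] += 1
    for row in rows_b:
        for s in _slots(row, size):
            counters[s] -= 1
    return all(c == 0 for c in counters)


a = [(1, "alice"), (2, "bob")]
b = [(2, "bob"), (1, "alice")]
print(tables_probably_equal(a, b))      # True: row order does not matter
print(tables_probably_equal(a, b[:1]))  # False: a row is missing
```

Because the counter array is a fixed-size summary, the two tables never need to be sorted or joined, which is what makes this approach attractive for huge tables replicated across heterogeneous systems such as Oracle and PostgreSQL.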
1 - 2 of 2 items


© 2015-2026 Kazan Federal University; Institute of the Information Society