Missing Data Filling Algorithm for Big Data-Based Map-Reduce Technology

Fugui Li, Ashutosh Sharma
Copyright: © 2022 | Pages: 11
DOI: 10.4018/IJeC.304036

Abstract

In big data, the large number of missing values poses a serious obstacle to correct decision making. Missing values degrade the quality of information queries, distort data mining and analysis, and mislead decisions. To address missing values in real databases, we pre-populate the missing data and fill in categorical attributes using probabilistic reasoning. The reasoning process is carried out in a Bayesian network, and the processing of big data is parallelized. The proposed algorithm is implemented in the Map-Reduce framework. Experimental results show that the Bayesian network construction method and probabilistic inference are effective for categorical data processing, and that the algorithm parallelizes well on Hadoop.

1. Introduction

Map-Reduce is a parallel programming framework proposed by Google and the basic computing model of cloud computing platforms. Its basic idea is to divide and conquer massive data, and the imputation process in this paper is parallelized on it. Map-Reduce operates on data in key-value form (Liu, 2019). The basic process is as follows: first, the input data are partitioned into small files of equal size and assigned to different sub-nodes; all nodes assigned an input split execute the Map operation at the same time; the output of the Map function is then sorted and merged, and the Reduce operation is carried out on the value set under each key. The core of the model is the design of the Map and Reduce operations: Map realizes the parallelization, while the design of the keys consumed by Reduce determines the concurrency of the algorithm and its load balance (Fan, 2018).
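To make this flow concrete, the following minimal sketch simulates the three steps (Map, sort/merge, Reduce) in memory. It illustrates the key-value model described above rather than reproducing code from the paper; the record format and the counting task are our own assumptions.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Emit a (key, value) pair per record; a real Map task runs this
    # on its own input slice, in parallel with the other nodes.
    for record in split:
        yield (record, 1)

def shuffle(mapped_pairs):
    # Sort/merge step: gather the value set belonging to each key.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all values that share one key, here a simple count.
    return key, sum(values)

if __name__ == "__main__":
    splits = [["red", "blue", "red"], ["blue", "blue", "green"]]  # two input slices
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    print(result)  # {'red': 2, 'blue': 3, 'green': 1}
```

Counting attribute values by key in this fashion is also the kind of aggregation a probabilistic filling algorithm relies on, since conditional probability tables are built from such counts.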

Missing values often occur in real databases due to various subjective and objective causes, such as storage device failures, data entry violations, or limitations of data acquisition devices. The traditional approach to missing values is to retain only the complete records for analysis and querying, but this applies only when the missing rate is very low: discarding a large amount of information skews the data distribution and misleads analysis conclusions (Zhai, Zhang, Wang, Shen, & Liu, 2018). A more reasonable approach is to fill in the missing values and restore the lost information as far as possible. Given its importance, researchers have proposed a number of missing value filling methods (A. Sharma, Singh, Sharma, & Kumar, 2019). The simplest filling method is to replace missing values with the mean or the most frequent value, and some filling algorithms use it to preprocess data and improve their effectiveness (a minimal sketch is given after this paragraph). Although this interpolation is simple and easy to apply, it ignores the relationships among attributes, and uniformly filling a fixed value for every missing entry of the same attribute is not advisable (Wei, Zeng, & Zhou, 2018). Many models from statistics and machine learning address the problem more effectively. The most commonly used statistical filling methods are the EM algorithm, regression prediction, sampling, and multiple imputation. The first three are single-imputation methods, which reduce the variance of the estimate; multiple imputation instead produces several plausible values for each missing entry, yielding several complete data sets whose separate analysis results are then combined into a comprehensive inference. Statistical filling methods deal with purely numerical data sets and are mostly used for estimating statistics (Khalil, Alshayeji, & Ahmad, 2019). In machine learning, KNN, clustering, classification algorithms, and neural networks have been studied extensively for missing value filling: such algorithms train a model on the complete records and then estimate the incomplete records with it, in order to enhance the information mining ability of subsequent algorithms on the data set. Many of them are effective and feasible on small amounts of data, but as the data volume grows, the cost of the algorithm makes the imputation outweigh its gains (Zheng & Wang, 2014). Map-Reduce is the most frequently studied parallel framework at present, and many algorithms, such as decision trees, neural networks, and frequent pattern mining, have been parallelized on it.
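As a minimal sketch of the simple fill mentioned above (mean for numeric attributes, most frequent value for categorical ones), assuming rows are dictionaries in which None marks a missing entry:

```python
import statistics

def simple_fill(rows, numeric_cols):
    # Fill every missing entry of a column with one fixed value:
    # the column mean for numeric attributes, the mode otherwise.
    filled = [dict(r) for r in rows]
    for col in rows[0]:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = statistics.mean(observed) if col in numeric_cols \
            else statistics.mode(observed)  # most frequent value
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled

rows = [
    {"age": 30, "color": "red"},
    {"age": None, "color": "blue"},
    {"age": 50, "color": None},
    {"age": 40, "color": "blue"},
]
print(simple_fill(rows, numeric_cols={"age"}))
# The age gap becomes 40 (mean); the color gap becomes 'blue' (mode).
```

Note that every gap in a column receives the same value regardless of the record's other attributes, which is precisely the weakness criticized above.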

The innovation of this paper is a Bayesian network inference filling algorithm based on Map-Reduce; experiments demonstrate the effectiveness and parallel efficiency of the algorithm. The contributions of this paper are: 1) a filling method for missing data in big data under Map-Reduce, which experiments show achieves a near-linear speedup on big data; 2) a probabilistic filling algorithm that attaches uncertainty information to the filled values, so that the quality of the data after filling can be evaluated; and 3) a Bayesian network constructed from correlations, which suits data sets without obvious causal relationships.
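The paper's algorithm builds a general Bayesian network; as a hedged illustration of probabilistic filling with uncertainty information (contribution 2), the sketch below uses the simplest possible network structure, a naive Bayes model in which the missing attribute is the parent of all observed attributes. The attribute names and the smoothing scheme are our assumptions, not the paper's.

```python
from collections import Counter, defaultdict

def fill_with_posterior(complete_rows, target, observed):
    # Estimate P(target) and P(attr = value | target) from the complete
    # records, with add-one smoothing to avoid zero probabilities.
    prior = Counter(r[target] for r in complete_rows)
    cond = defaultdict(Counter)
    for r in complete_rows:
        for attr in observed:
            cond[(attr, r[target])][r[attr]] += 1

    scores = {}
    for cls, n in prior.items():
        p = n / len(complete_rows)
        for attr, val in observed.items():
            counts = cond[(attr, cls)]
            p *= (counts[val] + 1) / (sum(counts.values()) + len(counts) + 1)
        scores[cls] = p
    total = sum(scores.values())
    posterior = {cls: s / total for cls, s in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior[best]  # filled value plus its confidence

complete = [
    {"weather": "sunny", "wind": "low", "play": "yes"},
    {"weather": "sunny", "wind": "high", "play": "no"},
    {"weather": "rainy", "wind": "high", "play": "no"},
    {"weather": "rainy", "wind": "low", "play": "yes"},
]
# A record whose 'play' attribute is missing: infer it from the rest.
value, confidence = fill_with_posterior(complete, "play", {"weather": "sunny", "wind": "low"})
print(value, round(confidence, 3))  # 'yes' with posterior 0.75
```

Returning the posterior alongside the filled value is what allows the quality of the filled data to be evaluated afterwards: a fill made at probability 0.95 deserves more trust than one made at 0.55.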
