Missing Data Filling Algorithm for Big Data-Based Map-Reduce Technology

Fugui Li, Ashutosh Sharma
Copyright: © 2022 | Pages: 11
DOI: 10.4018/IJeC.304036

Abstract

In big data, the large number of missing values poses a serious obstacle to correct decision making. Missing values degrade the quality of information queries, distort data mining and analysis, and mislead decisions. To address missing values in real databases, we pre-populate the missing data and fill in categorical attributes using probabilistic reasoning. The reasoning process is carried out in a Bayesian network, and the processing of big data is parallelized. The proposed algorithm is implemented in the Map-Reduce framework. Experimental results show that the Bayesian network construction method and probabilistic inference are effective for categorical data processing, and that the algorithm parallelizes well on Hadoop.

1. Introduction

Map-Reduce is a parallel programming framework proposed by Google and the basic computing model of cloud computing platforms. Its basic idea is to divide and conquer massive data, and the imputation process in this paper is parallelized on it. Map-Reduce operates on data in key-value form (Liu, 2019). The basic process is as follows: first, the input data are partitioned into small files of equal size and assigned to different sub-nodes; all nodes assigned an input split execute the Map operation at the same time; the output of the Map function is then sorted and merged, and the Reduce operation is carried out on the value set under each key. The core of the model is the design of the Map and Reduce operations: Map realizes the parallelization, while the design of the keys consumed by Reduce determines the concurrency of the algorithm and its load balance (Fan, 2018).
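To make this flow concrete, the following minimal sketch simulates the three steps (Map, sort/merge, Reduce) in memory. It illustrates the key-value model described above rather than reproducing code from the paper; the record format and the counting task are our own assumptions.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Emit a (key, value) pair per record; a real Map task runs this
    # on its own input slice, in parallel with the other nodes.
    for record in split:
        yield (record, 1)

def shuffle(mapped_pairs):
    # Sort/merge step: gather the value set belonging to each key.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all values that share one key, here a simple count.
    return key, sum(values)

if __name__ == "__main__":
    splits = [["red", "blue", "red"], ["blue", "blue", "green"]]  # two input slices
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    print(result)  # {'red': 2, 'blue': 3, 'green': 1}
```

Counting attribute values by key in this fashion is also the kind of aggregation a probabilistic filling algorithm relies on, since conditional probability tables are built from such counts.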

Missing values often occur in real databases due to various subjective and objective causes, such as storage device failures, data entry violations, or limitations of data acquisition devices. The traditional approach to missing values is to retain only the complete records for analysis and querying, but this applies only when the missing rate is very low: discarding a large amount of information skews the data distribution and misleads analysis conclusions (Zhai, Zhang, Wang, Shen, & Liu, 2018). A more reasonable approach is to fill in the missing values and restore the lost information as far as possible. Given its importance, researchers have proposed a number of missing value filling methods (A. Sharma, Singh, Sharma, & Kumar, 2019). The simplest filling method is to replace missing values with the mean or the most frequent value, and some filling algorithms use it to preprocess data and improve their effectiveness (a minimal sketch is given after this paragraph). Although this interpolation is simple and easy to apply, it ignores the relationships among attributes, and uniformly filling a fixed value for every missing entry of the same attribute is not advisable (Wei, Zeng, & Zhou, 2018). Many models from statistics and machine learning address the problem more effectively. The most commonly used statistical filling methods are the EM algorithm, regression prediction, sampling, and multiple imputation. The first three are single-imputation methods, which reduce the variance of the estimate; multiple imputation instead produces several plausible values for each missing entry, yielding several complete data sets whose separate analysis results are then combined into a comprehensive inference. Statistical filling methods deal with purely numerical data sets and are mostly used for estimating statistics (Khalil, Alshayeji, & Ahmad, 2019). In machine learning, KNN, clustering, classification algorithms, and neural networks have been studied extensively for missing value filling: such algorithms train a model on the complete records and then estimate the incomplete records with it, in order to enhance the information mining ability of subsequent algorithms on the data set. Many of them are effective and feasible on small amounts of data, but as the data volume grows, the cost of the algorithm makes the imputation outweigh its gains (Zheng & Wang, 2014). Map-Reduce is the most frequently studied parallel framework at present, and many algorithms, such as decision trees, neural networks, and frequent pattern mining, have been parallelized on it.
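As a minimal sketch of the simple fill mentioned above (mean for numeric attributes, most frequent value for categorical ones), assuming rows are dictionaries in which None marks a missing entry:

```python
import statistics

def simple_fill(rows, numeric_cols):
    # Fill every missing entry of a column with one fixed value:
    # the column mean for numeric attributes, the mode otherwise.
    filled = [dict(r) for r in rows]
    for col in rows[0]:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = statistics.mean(observed) if col in numeric_cols \
            else statistics.mode(observed)  # most frequent value
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled

rows = [
    {"age": 30, "color": "red"},
    {"age": None, "color": "blue"},
    {"age": 50, "color": None},
    {"age": 40, "color": "blue"},
]
print(simple_fill(rows, numeric_cols={"age"}))
# The age gap becomes 40 (mean); the color gap becomes 'blue' (mode).
```

Note that every gap in a column receives the same value regardless of the record's other attributes, which is precisely the weakness criticized above.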

The innovation of this paper is a Bayesian network inference filling algorithm based on Map-Reduce; experiments demonstrate the effectiveness and parallel efficiency of the algorithm. The contributions of this paper are: 1) a filling method for missing data in big data under Map-Reduce, which experiments show achieves a near-linear speedup on big data; 2) a probabilistic filling algorithm that attaches uncertainty information to the filled values, so that the quality of the data after filling can be evaluated; and 3) a Bayesian network constructed from correlations, which suits data sets without obvious causal relationships.
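The paper's algorithm builds a general Bayesian network; as a hedged illustration of probabilistic filling with uncertainty information (contribution 2), the sketch below uses the simplest possible network structure, a naive Bayes model in which the missing attribute is the parent of all observed attributes. The attribute names and the smoothing scheme are our assumptions, not the paper's.

```python
from collections import Counter, defaultdict

def fill_with_posterior(complete_rows, target, observed):
    # Estimate P(target) and P(attr = value | target) from the complete
    # records, with add-one smoothing to avoid zero probabilities.
    prior = Counter(r[target] for r in complete_rows)
    cond = defaultdict(Counter)
    for r in complete_rows:
        for attr in observed:
            cond[(attr, r[target])][r[attr]] += 1

    scores = {}
    for cls, n in prior.items():
        p = n / len(complete_rows)
        for attr, val in observed.items():
            counts = cond[(attr, cls)]
            p *= (counts[val] + 1) / (sum(counts.values()) + len(counts) + 1)
        scores[cls] = p
    total = sum(scores.values())
    posterior = {cls: s / total for cls, s in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior[best]  # filled value plus its confidence

complete = [
    {"weather": "sunny", "wind": "low", "play": "yes"},
    {"weather": "sunny", "wind": "high", "play": "no"},
    {"weather": "rainy", "wind": "high", "play": "no"},
    {"weather": "rainy", "wind": "low", "play": "yes"},
]
# A record whose 'play' attribute is missing: infer it from the rest.
value, confidence = fill_with_posterior(complete, "play", {"weather": "sunny", "wind": "low"})
print(value, round(confidence, 3))  # 'yes' with posterior 0.75
```

Returning the posterior alongside the filled value is what allows the quality of the filled data to be evaluated afterwards: a fill made at probability 0.95 deserves more trust than one made at 0.55.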
