An Efficient Parallel Hybrid Feature Selection Approach for Big Data Analysis

Mohamed Amine Azaiz, Djamel Amar Bensaber
Copyright © 2022 | Pages: 22
DOI: 10.4018/IJSIR.308291

Abstract

Classification algorithms suffer from high runtime complexity when the data dimension is large, especially in the context of big data. Feature selection (FS) is a technique for reducing dimensionality and improving learning performance. In this paper, the authors proposed a hybrid FS algorithm for classification in the context of big data. First, only the most relevant features are selected, using symmetric uncertainty (SU) as a correlation measure; the features are distributed into subsets using Apache Spark so that the SU between each feature and the target class is calculated in parallel. Then a binary PSO (BPSO) algorithm is used to find the optimal feature subset. Because BPSO has limited convergence and restricted inertia weight adjustment, the authors suggested a multiple inertia weight strategy to influence the changes in particle motions so that the search process is more varied. The authors also proposed a parallel fitness evaluation of particles under Spark to accelerate the algorithm. The results showed that the proposed FS approach achieved higher classification performance with a smaller feature subset in a reasonable time.

Introduction

Big Data has contributed to the growing complexity of data mining and machine learning techniques (Rong et al., 2019), such as classification, whose algorithms already require a lot of processing. Feature selection is a preprocessing step that aims to reduce the data dimension as much as possible without degrading learning performance. In the literature, there are various feature selection approaches, including filter (Fatima Bibi et al., 2015), wrapper, and hybrid methods.

In filter methods, two types of features must be deleted: irrelevant and redundant features (Yu et al., 2003; Song et al., 2013; Lashkia et al., 2004). This is done using statistical measures such as a correlation measure or a test of statistical independence. An irrelevant feature is one that has weak or no correlation with the target class, and a redundant feature is one that is strongly correlated with one or more other features. "The first does not contribute to predictive accuracy and the second does not respond to obtain a better indicator. It mainly provides information already existing in one or more other attributes" (Song et al., 2013). Filter methods are fast to execute, but good results are not always guaranteed because there is no unified, comprehensive definition of statistical correlation. For example, two variables that are not linearly correlated are not necessarily independent; they may be non-linearly correlated (Shen et al., 2009). Wrapper methods rely on learning algorithms to evaluate the subsets of selected features. They usually produce high-quality results, but they suffer from high complexity and long execution times. Hybrid methods combine the filter and wrapper approaches to pair speed of execution with quality of results.

Feature selection is the process of finding a subset of features in a large search space, which is known to be an NP-hard problem. In such cases, evolutionary algorithms are among the most effective solutions. In this work, the authors used the BPSO algorithm (Kennedy et al., 1997), with some improvements, to find an appropriate set of features. Since BPSO has limited convergence and restricted inertia weight adjustment, the authors propose using the multiple inertia weight strategy inspired by (Too et al., 2019) to influence the changes in particle motions, so that the search process is more diverse. In BPSO, the position of a particle is a sequence of bits, each holding either 0 or 1, and the number of bits equals the dimension of the data. The second improvement to BPSO therefore modifies the position update so that, in very limited cases, one or more bits of the new position can take their values from the best global solution. In PSO-based feature selection algorithms, fitness evaluation is the most time-consuming part, because it usually relies on a classifier. Therefore, the authors divided the particles into independent groups that execute simultaneously; within each group, each particle evaluates its own fitness value. Apache Spark is among the best frameworks for big data analytics, so it was used for the distribution and parallel execution.
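To make the update rule concrete, the following is a minimal Python sketch of a standard BPSO bit update driven by a group-specific inertia weight. The three weight schedules, the particle grouping, and all names here (inertia, update_particle, C1, C2, V_MAX) are illustrative assumptions rather than the paper's exact formulation; in the proposed PHFS, each group would additionally run as an independent parallel Spark task.

```python
# Sketch: BPSO bit update with a group-specific inertia weight.
# Assumed constants and schedules; not the paper's exact parameters.
import math
import random

C1, C2, V_MAX = 2.0, 2.0, 6.0

def inertia(strategy, t, t_max):
    """A few common inertia-weight schedules; one per particle group."""
    if strategy == "constant":
        return 0.7
    if strategy == "linear":                      # linearly decreasing
        return 0.9 - (0.9 - 0.4) * t / t_max
    if strategy == "random":
        return 0.5 + random.random() / 2.0
    raise ValueError(strategy)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def update_particle(x, v, pbest, gbest, w):
    """One BPSO step: continuous velocity, sigmoid-mapped binary position."""
    new_v, new_x = [], []
    for d in range(len(x)):
        vd = (w * v[d]
              + C1 * random.random() * (pbest[d] - x[d])
              + C2 * random.random() * (gbest[d] - x[d]))
        vd = max(-V_MAX, min(V_MAX, vd))          # clamp velocity
        new_v.append(vd)
        new_x.append(1 if random.random() < sigmoid(vd) else 0)
    return new_x, new_v

if __name__ == "__main__":
    # Tiny demo: one particle, 8 candidate features, 50 iterations.
    dim, t_max = 8, 50
    x = [random.randint(0, 1) for _ in range(dim)]
    v = [0.0] * dim
    pbest, gbest = x[:], x[:]
    for t in range(t_max):
        w = inertia("linear", t, t_max)
        x, v = update_particle(x, v, pbest, gbest, w)
```

Assigning each group its own schedule (e.g., one group constant, one linearly decreasing, one random) is what diversifies the search: groups with large weights keep exploring while groups with small weights exploit the current best regions.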

Therefore, the proposed algorithm, named PHFS (Parallel Hybrid Feature Selection), has two steps. In the first step, all irrelevant features are removed (only the most relevant features are kept) (Song et al., 2013) to reduce the search space.
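As a hedged sketch of that first step, the snippet below computes symmetric uncertainty, SU(X, C) = 2·IG(X; C) / (H(X) + H(C)), between each feature column and the class in parallel with PySpark, then keeps only features above a cutoff. The helper names, the toy data, and the 0.1 threshold are illustrative assumptions; the paper's exact partitioning scheme may differ.

```python
# Sketch: SU-based relevance filter over feature columns, in parallel.
# Assumes discrete features stored column-wise; names are illustrative.
import math
from collections import Counter
from pyspark import SparkContext

def entropy(values):
    """Shannon entropy H(X) of a discrete value sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(list(zip(x, y)))            # joint entropy H(X, Y)
    return 2.0 * (hx + hy - hxy) / (hx + hy)

sc = SparkContext(appName="su-relevance-filter")

# Toy column-wise data: 3 features, 6 samples, binary class labels.
features = [[0, 1, 1, 0, 1, 0],
            [1, 1, 0, 0, 1, 1],
            [0, 0, 0, 1, 1, 1]]
labels = sc.broadcast([0, 1, 1, 0, 1, 0])

# Distribute the feature columns; each partition scores its columns locally.
scores = (sc.parallelize(list(enumerate(features)))
            .mapValues(lambda col: symmetric_uncertainty(col, labels.value))
            .collect())

SU_THRESHOLD = 0.1                            # assumed cutoff for "relevant"
relevant = [i for i, su in scores if su >= SU_THRESHOLD]
print("relevant features:", relevant)
```

Broadcasting the label column once lets every worker reuse it without reshipping it per task, which is what makes scoring thousands of features embarrassingly parallel.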
