The Effect of Imbalanced Classes on Students' Academic Performance Prediction: An Evaluation Study

Osama Mohammed El-Deeb, Walid Elbadawy, Doaa Saad Elzanfaly
Copyright: © 2022 | Pages: 17
DOI: 10.4018/IJeC.304373

Abstract

Imbalanced classes pose a particular challenge in educational data mining because most datasets collected from educational records are imbalanced by nature: some classes dominate others and cause biased predictions. This paper studies the effects of imbalanced classes on the performance of seven classifiers: J48, Random Forest, k-Nearest Neighbors, Naïve Bayes, Random Tree, SVM, and Linear Regression. It also evaluates the effectiveness of the SMOTE technique for handling imbalanced data with each of these classifiers. The evaluation is carried out through a proposed early predictive model that predicts students' academic performance and recommends an appropriate department for each student in a multi-disciplinary institute. According to our results, Random Forest performs best, achieving the highest accuracy of 94.585%.

Introduction

Educational Data Mining is a data mining field that aims to derive useful information from raw data obtained from educational systems (Rawat & Malhan, 2019). This information can be used to analyze student performance more closely and so improve decision-making. One of the most challenging aspects of educational data is its distribution, which has distinctive characteristics over time. Among these characteristics is the imbalanced class distribution (Member & Fellow, 2012).

Class imbalance is quantified by the ratio of the number of instances in the majority class to the number in the minority class (Anjana & Sardana, 2017). The literature offers different techniques for handling the class imbalance problem, of which oversampling and under-sampling are the most common (Mohammed, Rawashdeh, & Abdullah, 2020). However, most of these techniques address binary class imbalance, and only a few studies address multi-class imbalance. Multi-class imbalance occurs when the target variable consists of more than two classes with unequal sample sizes (Moubayed, Injadat, Shami, & Lutfiyya, 2018) (Wang & Yao, 2012). Techniques commonly used for binary-class imbalance may become inefficient in the multi-class setting.
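To make the ratio and the idea of oversampling concrete, the following is a minimal sketch using only the Python standard library. It computes the majority-to-minority ratio for a multi-class label list and then balances the classes by plain random oversampling, that is, by duplicating existing minority rows. This is not SMOTE, which synthesizes new samples instead of duplicating them, and it is not the authors' implementation. The function names, the toy labels `A`, `B`, `C`, and the single-feature rows are all hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (works for any number
    of classes)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Pool of original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: three department labels with unequal sizes.
labels = ["A"] * 8 + ["B"] * 3 + ["C"] * 1
rows = [[float(i)] for i in range(len(labels))]

print(imbalance_ratio(labels))  # 8.0 (8 majority rows / 1 minority row)

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))      # every class now has 8 rows
```

Resampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.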

The purpose of this paper is three-fold. First, it explores different techniques for handling the imbalanced dataset and evaluates their effects on the accuracy of predicting students' academic performance. Second, it proposes a predictive model for the performance of students at an early academic stage in a multi-disciplinary institute. Third, the model recommends a study path for each student based on their performance. The main goal of predicting academic performance is to reduce the risks of student dropout, course failure, and poor graduation rates. Most current studies focus only on prediction. In this paper, the authors add a recommendation component to guide students when choosing their specialization. This recommendation is intended to help students make better decisions about their educational path and improve their performance. The proposed model is based on a real dataset gathered from the Giza Higher Institute of Management Sciences. After the class imbalance in the collected dataset was handled, the Random Forest classifier achieved the best results among the classifiers evaluated.

Machine learning (ML) is a branch of artificial intelligence (AI) that uses data to improve its performance. Machine learning algorithms are used in many fields, such as speech recognition, image classification, text recognition, and educational data mining. They play an important role in computer science because they are trained on data to make predictions and classifications. The authors use machine learning algorithms in educational data mining to predict student performance through regression and to recommend a student's specialization through classification. Accordingly, in this paper the authors train and evaluate J48, Random Forest, and Naïve Bayes classifiers to recommend student specialization, and Random Forest, Linear Regression, and K-Nearest Neighbor regressors to predict student performance.

The rest of this paper is organized as follows. Section 2 highlights the research significance. Section 3 presents related work on predicting student performance and handling imbalanced educational data. Section 4 outlines the proposed model, including the methods applied to handle imbalanced data and the classification techniques used. Section 5 reports the performance of the classifiers before and after applying a resampling technique to handle imbalanced data and improve the accuracy of predicting student performance. Finally, Section 6 presents the conclusion and future work.
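When classifier performance is compared before and after resampling, overall accuracy alone can hide poor results on a minority class. The sketch below illustrates this with overall accuracy and macro-averaged recall computed from true and predicted labels. It is an illustration under stated assumptions: the preview reports accuracy only, so the use of macro recall, the function names, and the toy labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its instances predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / n for cls, n in totals.items()}


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls, so every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical toy case: a model that always predicts the majority class.
y_true = ["A"] * 90 + ["B"] * 10
y_pred = ["A"] * 100

print(accuracy(y_true, y_pred))      # 0.9
print(macro_recall(y_true, y_pred))  # 0.5
```

In this toy case the model never identifies class `B`, yet accuracy is still 0.9, which is why per-class measures are a useful complement when classes are imbalanced.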
