An Empirical Study of Heterogeneous Cross-Project Defect Prediction Using Various Statistical Techniques

Rohit Vashisht, Syed Afzal Murtaza Rizvi
Copyright © 2021 | Pages: 17
DOI: 10.4018/IJeC.2021040104

Abstract

Cross-project defect prediction (CPDP) forecasts flaws in a target project using defect prediction models (DPMs) trained on defect data from another project. However, CPDP suffers from a prevalent constraint: the two projects must describe themselves with an identical feature set. This article focuses on heterogeneous CPDP (HCPDP) modeling, which does not require a common metric set between two applications and instead builds the DPM on metrics whose value distributions are comparable for a given pair of datasets. The paper evaluates HCPDP modeling empirically and theoretically across its three main phases: feature ranking and feature selection, metric matching, and finally, predicting defects in the target application. The experiments were conducted on 13 benchmark datasets from three open-source projects. Results show that the performance of HCPDP is closely comparable to baseline within-project defect prediction (WPDP), and that the XGBoost classification model gives the best results among the evaluated classifiers when used in conjunction with Kendall's method of correlation.

1. Introduction

The main goal of any software development model is to maintain the required quality level in the final software product or service, an activity known as Software Quality Assurance (SQA). A defect can be described as any deviation between the actual and anticipated results with respect to end-user demands under a given environment configuration. In other words, software defect prediction and software quality assurance are two complementary activities: predicting the maximum number of defects at the correct moment leads to the release of a qualitative software product.

The most critical stage of the Software Development Life Cycle (SDLC) is testing, as it absorbs a large proportion of the total project cost, so this phase should be prioritized in every software development process. Software Defect Prediction (SDP) is a key way to address this issue. Different data mining methods are used to develop SDP models from historical databases (Horgan & Mathur, 2013). This data usually consists of two components: software metrics, which provide the basis for assessing the extent to which a software feature fulfils some property, and the labeling of instances in the target application as defective or non-defective using suitable classification models (Han et al., 2011). Initially, a Defect Prediction (DP) model is developed to locate "within-project" defects by partitioning the available defect dataset into two portions, so that the DP model is trained on one portion of the dataset (the labeled instances) and tested on the other. Testing the DP model involves predicting labels, faulty or non-faulty, for the unlabelled instances (Ambros et al., 2012).
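
As a concrete illustration of this within-project setup, consider the minimal Python/scikit-learn sketch below. It is not the authors' exact configuration: the dataset file name, the "defective" label column, and the Random Forest choice are all illustrative assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Hypothetical single-project defect dataset: metric columns plus a label.
data = pd.read_csv("project_defects.csv")
X = data.drop(columns=["defective"])   # software metrics
y = data["defective"]                  # 1 = defective, 0 = non-defective

# Partition the dataset: one portion trains the DP model, the other tests it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
# Predict labels for the held-out (unlabelled) instances and report results.
print(classification_report(y_test, model.predict(X_test)))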

Cross Project Defect Prediction (CPDP) is a study area in which a software project lacking adequate local defect data can use data from other projects to build an effective and efficient defect predictor. In conventional CPDP, the cross-project data must share the same metric set as the local data before it can be used. When HCPDP is used, however, no common metrics are required between the source and target datasets. Matched metrics between the two applications are found by estimating a correlation coefficient over all possible metric combinations; the heterogeneous metrics that show a comparable distribution in their values are then used to predict flaws across the projects. For example, the response for a class (RFC, the number of methods that can be invoked by a class) and the number of distinct operands may be treated as two heterogeneous metrics that are comparably aligned in their value distributions for a given pair of project datasets.
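
A minimal sketch of this metric-matching idea follows, assuming the simplest reading of the phase: Kendall's tau is computed for every (source metric, target metric) pair, sample sizes are equalized by truncation, and pairs above a cutoff are kept. The truncation step, the 0.3 cutoff, and the function name are illustrative assumptions, not settings taken from the paper.

import numpy as np
from scipy.stats import kendalltau

def match_metrics(src: np.ndarray, tgt: np.ndarray, cutoff: float = 0.3):
    """src, tgt: (instances x metrics) arrays from two heterogeneous projects.
    Returns (src_idx, tgt_idx, tau) triples for comparably distributed pairs."""
    n = min(len(src), len(tgt))   # correlation needs equal-length samples
    pairs = []
    for i in range(src.shape[1]):
        for j in range(tgt.shape[1]):
            tau, _ = kendalltau(src[:n, i], tgt[:n, j])
            if abs(tau) >= cutoff:   # keep comparably aligned metric pairs
                pairs.append((i, j, tau))
    # Strongest correlations first; these pairs feed the cross-project DPM.
    return sorted(pairs, key=lambda p: -abs(p[2]))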

In this paper, the authors systematically explore the efficiency of distinct classifiers for both the HCPDP and Within-Project Defect Prediction (WPDP) categories of defect prediction. The following main research questions are addressed.

  • RQ1. Whether, and to what extent, is HCPDP comparable to WPDP?

  • RQ2. Which method of metric matching leads to better outcomes?

  • RQ3. Which classifier performs best in the HCPDP case?

The proposed research work is organized as follows. Section B reviews work associated with HDP, along with an initial literature survey of CPDP. Section C describes the fundamental Heterogeneous Defect Prediction (HDP) model and its modeling elements. Section D describes the datasets used and the performance parameters considered to evaluate the experimental results. Section E explains the design of the experiments. Section F presents the experimental outcomes and compares two methods of metric matching, Spearman's and Kendall's techniques of correlation; the target-project instances are then classified as clean or buggy using five frequently used classifiers: Naïve Bayes, Random Forest, Logistic Regression, Gradient Boosting, and XG (eXtreme Gradient) Boosting, as sketched below. Finally, Section G summarizes the conclusive findings.
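
For illustration, a minimal sketch of that classifier comparison over a matched metric set, using the five models named above as implemented in scikit-learn and the xgboost package. Training on the matched source metrics and scoring with the F-measure on the target project are assumptions about the experimental design, not the paper's confirmed protocol.

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

def compare(X_src, y_src, X_tgt, y_tgt):
    """Train each model on the matched source metrics, score on the target."""
    for name, clf in classifiers.items():
        clf.fit(X_src, y_src)
        print(f"{name}: F1 = {f1_score(y_tgt, clf.predict(X_tgt)):.3f}")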

2. Related Work

Melo et al. (2002) produced the first recognized CPDP work. The authors suggested a novel evaluation technique called Multivariate Adaptive Regression Splines (MARS) by collecting fault and design information from two medium-sized Java-based systems (Xpose & Jwriter). The constructed MARS model, built on the first system, ranked the classes of the second system according to the degree of their tendency to fail. MARS performed better, and was more economically feasible, than the linear regression model, but the estimated fault probabilities of the classes did not give any notion of the model's prediction quality.
