Article Preview
Top1. Introduction
The clustering is the process of grouping a set of objects into classes (Han & Kamber, 2006). The objects in the same cluster are similar to one another and are dissimilar to the objects in other clusters. k-means is one of the popular clustering methods. This algorithm divides a set of n objects in to k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. The k-means algorithm proceeds as follows. First, it randomly selects k of the objects as cluster centers. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster center. It then computes the new cluster centers as mean value of the objects in each cluster. The above process continues until the criterion function converges. Usually, the square-error criterion is used. It is calculated as follows:
(1) where
F is the sum of the square error for all objects in the data set;
p is the point in space representing a given object; and
mi is the mean of cluster
Ci. In other words, the distances between objects and their cluster centers are squared and then summed (Han & Kamber, 2006). The major drawback of
k-means algorithm is that the clustering quality is sensitive to the initial election of cluster centers and may converge to the local optimal (Abdel-Kader, 2010).
In recent years, the various metaheuristic algorithms such as genetic algorithm and tabu search are used to avoid getting trapped in local optimal solutions in solving the clustering problem. In domain of genetic algorithm, a genetically based algorithm, which is composed of split and merge algorithm proposed in Garai and Chaunhuri (2004). Laszlo and Mukherjee (2006) presented a genetic algorithm to select a good set of centers for k-means to produce near optimal partitions. Their approach uses hyper-quad trees. The main drawback of this algorithm is that it is only suitable for low-dimensional data, so Laszlo and Mukherjee (2007) presented another method that exploits region-based crossover, but unlike the hyper-quad tree approach, it scales well to higher-dimensional data sets. Hong & Kwong (2008) proposed a novel clustering algorithm that combines the steady-state genetic algorithm and the ensemble learning method. Chang et al. (2009) suggested a Genetic Algorithm with Gene Rearrangement (GAGR). In their algorithm, in order to reduce the degeneracy, a gene rearrangement of the chromosome has been used. Degeneracy occurs when different chromosomes showing the same cluster result. Moreover, a new path-based crossover operator has been presented. This operator builds a path from one chromosome to another chromosome.