1. Introduction
Large-scale microarray experiments performed under a variety of conditions, or at various stages of a biological process, have produced huge amounts of gene expression data and have posed major challenges for the field of data mining (de Souto et al., 2008; Kerr et al., 2008). These challenges include rapidly analyzing and interpreting data on thousands of genes measured under hundreds of different conditions, and assessing the biological significance of the results. Clustering is the exploratory, unsupervised process of partitioning the expression data into groups (or clusters) of genes sharing similar expression patterns (Yeung et al., 2003; Kerr et al., 2008). However, the quality of clusters can vary greatly, as can their ability to lead to biologically meaningful conclusions.
On a different note, the biological and medical literature databases are information warehouses with a vast store of useful knowledge. In fact, text analysis has been successfully applied in bioinformatics for various purposes such as identifying relevant literature for genes and proteins, connecting genes with diseases, and reconstructing gene networks (Yandell & Majoros, 2002). Hence, including the literature in the analysis of gene expression data offers an opportunity to incorporate additional functional information about the genes when defining expression clusters. In more general terms, with the availability of multiple information sources, it is a challenging problem to conduct integrated exploratory analyses with the aim of extracting more information than what is possible from only a single source.
The basic problem of learning from multiple information sources has been extensively studied by the machine learning community. In computer vision this problem is referred to as multi-modal learning. In general, there are two approaches to multi-modal learning: feature level integration and semantic integration (Wu et al., 1999). Methods that use feature level integration combine the information at the feature level and then perform the analysis in the joint feature space (Glenisson et al., 2003). On the other hand, the semantic level integration methods first build individual models based on separate information sources and then combine these models via techniques such as mutual information maximization (Becker, 1996).
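As an illustration of feature level integration, the sketch below standardizes two feature tables that describe the same genes and joins them row by row into a single joint feature space. It is a minimal example under stated assumptions, not the method of any cited work: the function names `zscore_columns` and `integrate_features`, the choice of z-score scaling, and the toy inputs are all hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale every column to zero mean and unit variance (hypothetical helper)."""
    scaled_cols = []
    for col in zip(*rows):
        mu = mean(col)
        sigma = pstdev(col) or 1.0  # constant column: avoid dividing by zero
        scaled_cols.append([(v - mu) / sigma for v in col])
    # Transpose back so that each inner list is one gene again.
    return [list(r) for r in zip(*scaled_cols)]


def integrate_features(expression, text):
    """Join two per-gene feature tables into one joint feature space.

    Assumes row i of both tables describes the same gene.
    """
    if len(expression) != len(text):
        raise ValueError("both sources must describe the same genes")
    expr_z = zscore_columns(expression)
    text_z = zscore_columns(text)
    # Feature level integration: one long feature vector per gene.
    return [e + t for e, t in zip(expr_z, text_z)]


# Toy data: 2 genes, 2 expression conditions, 1 literature-derived feature.
joint = integrate_features([[1.0, 2.0], [3.0, 4.0]], [[0.0], [1.0]])
```

Any standard clustering method could then be run on `joint`. A semantic level approach would instead cluster each table separately and combine the resulting models afterwards.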
Microarray experiments usually provide gene expression data on all the genes in a genome. Hence they are inherently “complete”. A major challenge in using other sources of data to assist the analysis of gene expression data is that those sources may not be complete, i.e., they may not provide information on all the genes in the genome.
Recent work from the machine learning community has focused on the use of background information in the form of instance-level constraints. Two types of pair-wise constraints have been proposed: positive constraints, which specify that two instances must remain in the same cluster, and negative constraints, which specify that two instances must not be placed in the same cluster. Recent examples include methods that ensured that constraints were satisfied at each iteration (Wagstaff et al., 2001), algorithms that used constraints as initial conditions (Basu et al., 2002), algorithms that learned a distance metric trained by a shortest-path algorithm (Klein et al., 2002), a convex optimization method using Mahalanobis distances (Xing et al., 2002), and semi-supervised clustering that incorporated both metric learning and the use of pair-wise constraints in a principled manner (Bilenko et al., 2004).
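The two constraint types can be made concrete with a small check of whether placing an instance in a given cluster would break any pair-wise constraint. The sketch below is illustrative only and is not taken from any of the cited algorithms: the function names, the representation of constraints as lists of index pairs, and the toy data are assumptions.

```python
def partners(point, pairs):
    """Yield the other member of every constraint pair that contains `point`."""
    for a, b in pairs:
        if a == point:
            yield b
        elif b == point:
            yield a


def violates_constraints(point, cluster, assignment, must_link, cannot_link):
    """Return True if putting `point` into `cluster` breaks a constraint.

    `assignment` maps already-placed instances to their cluster labels;
    instances that are not yet placed cannot cause a violation.
    """
    # Positive constraint: an already-placed partner sits in another cluster.
    for other in partners(point, must_link):
        if other in assignment and assignment[other] != cluster:
            return True
    # Negative constraint: an already-placed partner sits in this same cluster.
    for other in partners(point, cannot_link):
        if other in assignment and assignment[other] == cluster:
            return True
    return False


# Toy example: instances 0 and 1 are placed; instance 2 is being assigned.
assignment = {0: 0, 1: 1}
must_link = [(0, 2)]     # 0 and 2 must share a cluster
cannot_link = [(1, 2)]   # 1 and 2 must be kept apart
ok_in_cluster_0 = not violates_constraints(2, 0, assignment, must_link, cannot_link)
ok_in_cluster_1 = not violates_constraints(2, 1, assignment, must_link, cannot_link)
```

In this toy case, instance 2 can join cluster 0 but not cluster 1. A method that enforces constraints at every iteration could call a check like this before each assignment, whereas metric-learning approaches use the same pairs to reshape the distance function instead.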
While great effort has gone into developing efficient variants of constrained clustering algorithms, the role of constraint sets in these algorithms has not yet been fully studied. Recently, Wagstaff et al. (2006) and Davidson et al. (2006) attempted to link the quality of constraint sets with clustering algorithm performance. Two properties of a constraint set, inconsistency and incoherence, were shown to be strongly negatively correlated with clustering algorithm performance.