Introduction
A central challenge in biology today is the enormous amount of existing data. The data are complex and unformatted, and their volume doubles every two years. Bioinformatics is an interdisciplinary research area that combines biology and computer science. It applies computer science methods, models, and sophisticated algorithms to biological problems such as large-scale data analysis, gene annotation, pattern recognition, and many more.
One of the most common problems in biology is motif discovery: analyzing DNA sequences to find transcription factor binding sites. Motifs are usually short sequences of nucleotides and are hard to discover. Motif discovery is an NP-complete problem, and several factors make it difficult in practice. Because of biological mutations, the occurrences of a pattern cannot be expected to be exact copies of one another. The statistical noise is large, since one regulatory region contains from several hundred to several thousand nucleotides. It is also not known in advance whether a motif is present in a given sequence. Several types of motifs appear in the literature: sequential motifs, gapped motifs, structured motifs, planted motifs, and network motifs. A number of methods, algorithms, and tools have been developed in recent years to solve these problems; a complete survey of DNA motif finding algorithms, methods, and approaches is presented in (Modan, Das, & Dai, 2007). All of these methods suffer from the problem of local optima.
Many evolutionary algorithms mitigate the effect of local optima. One of them is the genetic algorithm (GA), which is widely used for the motif discovery problem. Che et al. (2005) developed an approach that predicts binding site motifs using a genetic algorithm. Congdon et al. (2005) proposed GAMI, a Genetic Algorithms approach to Motif Inference. Kaya (2007) proposed an efficient method that uses a multi-objective genetic algorithm (MOGAMOD) to discover optimal motifs in sequential data. Although GA reduces the effect of local optima, its operators are computationally expensive.
Motif discovery has also been addressed with particle swarm optimization (PSO), another optimization algorithm. Hardin and Rouchka (2005) proposed a hybrid motif discovery approach that combines PSO with the expectation-maximization (EM) algorithm, using PSO to generate a seed for EM. This method still suffers from local optima. Zhou et al. (2005) proposed a novel algorithm, IPSO-GA, which integrates an improved particle swarm optimization with a genetic algorithm to search for sequence motifs in co-expressed genes regulated by the NF-κB transcription factor. Reddy et al. (2010) adapted the features of PSO to the Planted Motif Finding Problem and designed a sequential algorithm. Lei and Ruan (2009) used a word dissimilarity graph to remap the neighborhood structure of the solution space of DNA motifs and proposed a modification of the naive PSO algorithm to accommodate discrete variables.
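To make the inexact-matching difficulty described above concrete, the following minimal sketch scores a candidate motif against a set of sequences by allowing mismatches. It is an illustration only, not the fitness function of any method cited here: the function names (`hamming`, `best_window`, `total_distance`) and the toy sequences are assumptions introduced for this example.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_window(motif: str, sequence: str) -> Tuple[int, int]:
    """Return (start, distance) of the window in `sequence` closest to `motif`.

    Ties are broken in favor of the leftmost window.
    """
    k = len(motif)
    if len(sequence) < k:
        raise ValueError("sequence is shorter than the motif")
    # Slide a window of length k over the sequence and keep the closest one.
    distance, start = min(
        (hamming(motif, sequence[i:i + k]), i)
        for i in range(len(sequence) - k + 1)
    )
    return start, distance


def total_distance(motif: str, sequences: List[str]) -> int:
    """Sum of best-window distances; lower means a better-conserved motif."""
    return sum(best_window(motif, s)[1] for s in sequences)


# Hypothetical toy data: the pattern TTGACA occurs exactly once and
# with one mutation in each of the other two sequences.
sequences = ["ACGTTGACA", "GGTTGACTT", "TTGTCAGGG"]
score = total_distance("TTGACA", sequences)
```

A search heuristic only has to propose candidate motifs and compare such scores. The difficulty is that the number of candidates grows exponentially with the motif length, which is why exhaustive search is replaced by evolutionary and swarm methods.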
In this paper we propose an alternative heuristic that has not previously been applied to this problem: we use the Artificial Bee Colony Optimization (ABC) algorithm to discover quality motifs. ABC is a recent optimization technique from the field of swarm intelligence and is very effective at solving optimization problems. It is inspired by the working principles of natural bees and uses both local and global search to find the global optimum. Many problems have been solved using ABC, including constrained optimization problems. Engineering design, structural optimization, economics, VLSI design, and allocation and location problems are just a few of the fields in which constrained optimization problems are frequently met (Karaboga & Basturk, 2007; Karaboga & Basturk, 2007). The algorithm and the meaning of its objectives are described in the relevant sections. To demonstrate the effectiveness and efficiency of our methodology, we performed experiments on five real data sets of DNA sequences.
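The combination of local and global search mentioned above can be sketched with a generic ABC loop for minimizing a continuous function. This is a minimal illustration of the commonly described employed, onlooker, and scout phases, not the motif-specific algorithm of this paper: the function name `abc_minimize`, its parameters, and the sphere test function are assumptions made for this example.

```python
import random
from typing import Callable, List, Tuple


def abc_minimize(
    objective: Callable[[List[float]], float],
    dim: int,
    bounds: Tuple[float, float],
    n_sources: int = 10,   # number of food sources (needs at least 2)
    limit: int = 20,       # failed trials before a source is abandoned
    max_cycles: int = 200,
    seed: int = 0,
) -> Tuple[List[float], float]:
    rng = random.Random(seed)
    lo, hi = bounds

    def random_source() -> List[float]:
        return [rng.uniform(lo, hi) for _ in range(dim)]

    def fitness(cost: float) -> float:
        # Usual ABC mapping from cost to a positive selection weight.
        return 1.0 / (1.0 + cost) if cost >= 0 else 1.0 + abs(cost)

    sources = [random_source() for _ in range(n_sources)]
    costs = [objective(s) for s in sources]
    trials = [0] * n_sources

    def try_neighbor(i: int) -> None:
        # Local search: perturb one coordinate of source i relative to
        # a randomly chosen different source k, then select greedily.
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        candidate = sources[i][:]
        candidate[d] += rng.uniform(-1.0, 1.0) * (sources[i][d] - sources[k][d])
        candidate[d] = min(max(candidate[d], lo), hi)
        cost = objective(candidate)
        if cost < costs[i]:
            sources[i], costs[i], trials[i] = candidate, cost, 0
        else:
            trials[i] += 1

    best, best_cost = sources[0][:], costs[0]
    for _ in range(max_cycles):
        # Employed bees: one local move per food source.
        for i in range(n_sources):
            try_neighbor(i)
        # Onlooker bees: extra moves, biased toward fitter sources.
        weights = [fitness(c) for c in costs]
        for _ in range(n_sources):
            i = rng.choices(range(n_sources), weights=weights)[0]
            try_neighbor(i)
        # Scout bee: global search by replacing one exhausted source.
        worst = max(range(n_sources), key=trials.__getitem__)
        if trials[worst] > limit:
            sources[worst] = random_source()
            costs[worst] = objective(sources[worst])
            trials[worst] = 0
        # Memorize the best source found so far.
        i = min(range(n_sources), key=costs.__getitem__)
        if costs[i] < best_cost:
            best, best_cost = sources[i][:], costs[i]
    return best, best_cost


def sphere(x: List[float]) -> float:
    """Toy objective with its global minimum of 0 at the origin."""
    return sum(v * v for v in x)


best, best_cost = abc_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

Applying the same loop to motif discovery would require a discrete encoding of candidate motifs and a motif-quality objective, neither of which is specified in this section.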