Introduction
In today’s era of big data, the scientific discovery process depends largely on integrating, managing and extracting useful data from the available literature (Borkum & Frey, 2014). The information extracted by text mining in the chemical literature consists mainly of named entities. Chemical named entity mining aims to extract information on unique chemicals, to identify the extracted chemicals by indexing them to databases and bibliographic sources, and to assign and verify relationships between chemical entities and biological processes, diseases, etc. (Eltyeb & Salim, 2014; Banville, 2006; Batchelor & Corbett, 2007).
Machine Learning (ML), the automation of processes attributed to human intelligence, in particular learning, decision making and problem solving based on learning outcomes (Russell et al., 1995; Bottou, 2014), provides tailor-made solutions for the task of named entity recognition. Recently, Conditional Random Fields (CRFs), a class of probabilistic ML methods, have contributed to major successes in Chemical Named Entity Recognition (CNER) (Klinger et al., 2008). Ambiguity in the representation of chemical entities is perhaps the most prevalent limitation of text mining applications to the chemical literature; others include the limited availability of open text corpora and the growing number of chemicals (Townsend et al., 2005; Gurulingappa et al., 2013). Figure 1 illustrates this ambiguity and shows why named entity recognition is a necessary first step toward knowledge discovery in the chemical scientific literature.
Figure 1. Different representations of the chemical named entity ‘ethanol’
Despite the large body of work applying various approaches to chemical named entity recognition, most efforts have concentrated on identifying chemical names at a generic level (e.g. chemical versus non-chemical) or a morphological level (e.g. trivial name, IUPAC name, abbreviation, formula or chemical class). To the best of the author’s knowledge, no effort has addressed identifying chemical names at the chemistry level, such as organic, inorganic, organometallic, drug, macromolecule and so forth. The primary reason is that generating annotated corpora is extremely labor intensive, and annotating corpora with multi-level information, including chemistry information, requires additional effort from domain experts.
This work involves efforts from chemistry experts in generating a suitable multi-level labelled corpus, and from machine learning experts in designing and developing a CRF-based system. The following sections describe the methods used for corpus generation and annotation, the training and evaluation of a classification model, and the benchmarking of the results against other state-of-the-art approaches.