Introduction
The automated English Document Image Classification (EDC) system is designed to categorize scanned, purely printed or purely Hand-Written (HW) English documents into mutually exclusive predefined classes. The system takes two forms: the English Printed Document Classification (EPDC) system and the English Handwritten Document Classification (EHDC) system. Both draw on the underlying concepts of pattern recognition, artificial intelligence, machine learning, text mining, and image mining. Over the last three decades, many researchers have successfully implemented automated text mining and classification systems, testing them on mono-, bi-, tri-, and multi-lingual real and synthetic documents with a variety of classifiers. Others have offered sound solutions to the problems of feature reduction, feature selection, and the curse of dimensionality. In parallel, recent years have seen many automated image recognition, identification, and mining systems, designed primarily for categorizing maps, geographical areas, drawings, and graphical and pictorial designs. Many newer image mining systems extract and process text characters, words, and lines from heterogeneous sets of multi-font, multi-size, multi-oriented, multi-colored, multi-lingual, and multi-script documents. Printed character recognition and script discrimination are already mature for non-Indic scripts such as Latin, Chinese, Japanese, and Korean, and many printed-text recognizers and processors also exist for Indic scripts such as Devanagari, Bengali, Gujarati, and Gurumukhi. Printed-text processing systems are consistently found to be simpler than their handwritten counterparts.
The complexity of handwritten text processing lies primarily in the cursive writing style, overlapped and touching characters, and the uneven height, size, and gaps among characters and words. It also depends on how smoothly and clearly the writer forms the text. Many Indian scripts additionally place a head line (shirorekha) on top of the characters, which further increases the segmentation difficulty. These observations about text classification and image mining systems motivated the authors to propose an integrated single- and multi-script document image classification system, which accepts text document images and categorizes them into predefined classes. In this way, the area of document classification coexists with the image content retrieval and recognition paradigm.
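The head-line problem mentioned above is commonly attacked with a horizontal projection profile: the shirorekha appears as the row with the densest ink in the upper band of a word image, and erasing it detaches the hanging characters for segmentation. The sketch below illustrates this idea only; the function name, the binarized NumPy-array input, and the 40% upper-band heuristic are assumptions for illustration, not the method of any particular cited system.

```python
import numpy as np

def remove_headline(binary_word, band_frac=0.4):
    """Detect and erase the shirorekha (head line) in a binarized word image.

    binary_word: 2-D array, 1 = ink, 0 = background.
    Returns (headline_row_index, image_with_headline_erased).
    """
    rows, _ = binary_word.shape
    profile = binary_word.sum(axis=1)            # horizontal projection profile
    upper = profile[: max(1, int(rows * band_frac))]
    headline_row = int(np.argmax(upper))         # densest ink row in the upper band
    cleaned = binary_word.copy()
    cleaned[headline_row, :] = 0                 # erase the head line row
    return headline_row, cleaned

# Synthetic word: a full-width head line over two vertical strokes.
img = np.zeros((10, 20), dtype=int)
img[1, :] = 1                                    # head line
img[2:9, 3] = 1                                  # stroke 1
img[2:9, 10] = 1                                 # stroke 2
row, cleaned = remove_headline(img)
```

After erasure the two strokes become disconnected components, which is exactly what makes subsequent character segmentation tractable for scripts with a shirorekha.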
These new dimensions of text document image processing comprise the major steps of preprocessing, character recognition, word recognition, and document classification, and many researchers are now paying attention to them. Puri and Singh (2018) surveyed Devanagari-scripted Hindi text document classification using Support Vector Machines (SVM) and fuzzy logic. The survey focused on the basics, importance, and survival of Hindi, its differentiation from other scripts, and then discussed in detail the existing research contributions from 1990 to date. Another contribution is a tri-layered segmentation and bi-leveled classifier based advanced, robust, and fast Hindi Printed Document Classification system using SVM and Fuzzy (HPDC-SF), which gives detailed algorithmic procedures for document classification (Puri & Singh, 2019). The HPDC-SF system categorizes unknown documents into predefined Hindi classes through the critical Task Stages (TS) of segmentation, Shirorekha-Less (SL) character extraction, SL word association, fuzzy matching, and classification. It uses Predefined Keywords (PK) in the Romanized form of Hindi characters.
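The final two task stages, fuzzy matching of recognized words against Romanized predefined keywords followed by classification, can be sketched in miniature as below. This is not the HPDC-SF algorithm itself: the keyword lists, class labels, and 0.8 threshold are invented for illustration, and `difflib.SequenceMatcher` merely stands in for whatever fuzzy similarity measure the cited system uses.

```python
from difflib import SequenceMatcher

# Hypothetical Predefined Keywords (PK) per class, in Romanized form,
# standing in for the Hindi keyword lists described in the paper.
CLASS_KEYWORDS = {
    "sports":   ["khel", "kriket", "futbal"],
    "politics": ["sarkar", "chunav", "neta"],
}

def fuzzy_score(a, b):
    """Similarity in [0, 1] between two Romanized words."""
    return SequenceMatcher(None, a, b).ratio()

def classify(document_words, threshold=0.8):
    """Vote for the class whose keywords fuzzily match the most words.

    document_words: Romanized words recovered from the earlier
    segmentation and word-association stages.
    """
    votes = {label: 0 for label in CLASS_KEYWORDS}
    for word in document_words:
        for label, keywords in CLASS_KEYWORDS.items():
            if any(fuzzy_score(word, kw) >= threshold for kw in keywords):
                votes[label] += 1
    return max(votes, key=votes.get)
```

With this setup, a word list such as `["kriket", "khell", "sarkar"]` is assigned to `"sports"`: the misrecognized `"khell"` still clears the threshold against `"khel"`, which is the point of using fuzzy rather than exact matching on OCR output.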