Applications of Feature Engineering Techniques for Text Data

Shashwati Mishra, Mrutyunjaya Panda
DOI: 10.4018/978-1-7998-6659-6.ch010

Abstract

Features play a very important role in the analysis and prediction of data because they carry the most valuable information about the data, which may be in a structured or an unstructured format. The feature engineering process is used to extract features from such data. Selection of features is one of the crucial steps in this process. Feature selection can follow four different approaches and is accordingly classified into four basic categories: the filter method, wrapper method, embedded method, and hybrid method. This chapter discusses the techniques in these four categories along with research work on feature selection.

Feature Engineering Process

The feature engineering process plays a vital role in extracting and selecting appropriate features from the input data for further analysis and prediction. It involves deciding the type of features, creating the features, verifying their effectiveness, and accordingly improving or accepting them.

Machine learning algorithms are based on various mathematical, statistical, and optimization principles. These techniques cannot be applied directly to unstructured text data. The unstructured text must therefore be converted to a structured format that is easier to analyze. The preprocessing stage performs activities such as tokenization and noise removal.
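
As a rough illustration of the noise removal activity mentioned above, the following Python sketch assumes that "noise" means HTML-like tags, URLs, and redundant whitespace. The chapter preview does not define the term, so the function name `remove_noise` and the choice of patterns are illustrative assumptions rather than the authors' method.

```python
import re

def remove_noise(text: str) -> str:
    """Strip HTML-like tags, URLs and redundant whitespace (assumed noise)."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML-like tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()

cleaned = remove_noise("<p>Visit   https://example.com for  details.</p>")
print(cleaned)  # Visit for details.
```

What counts as noise depends on the source of the text and on the downstream task, so the patterns would normally be adapted to the data at hand.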

Tokenization

The process of dividing a given input text string into pieces is called tokenization. These individual pieces, which comprise keywords, symbols, phrases, and other elements of a language, are called tokens. Some symbols, such as punctuation marks, are discarded during tokenization.
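
The description above can be sketched with a minimal regular-expression tokenizer in Python. This is only one possible approach, assuming word-level tokens and that all punctuation is discarded; the function name `tokenize` and the pattern are illustrative choices, not taken from the chapter.

```python
import re
from typing import List

def tokenize(text: str) -> List[str]:
    """Split text into word tokens, discarding punctuation marks."""
    # \w+ matches runs of letters, digits and underscores; the optional
    # group keeps simple contractions such as "isn't" in one token.
    return re.findall(r"\w+(?:'\w+)?", text)

tokens = tokenize("Feature engineering, in short, isn't trivial!")
print(tokens)
# ['Feature', 'engineering', 'in', 'short', "isn't", 'trivial']
```

Here the commas and the exclamation mark are dropped, while each word survives as a separate token. A tokenizer that needs to keep phrases or symbols as tokens would require a different pattern or a dedicated library.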
