Introduction
Text classification is a fundamental task in Natural Language Processing (NLP): it aims to assign a text to an appropriate class. Text classification has many practical applications, such as sentiment analysis (Pang & Lee, 2005) and topic categorization (Wang & Manning, 2012).
Traditional approaches extract features from the text representation and feed them to a general-purpose classifier. For instance, in the bag-of-words model, statistics on unigrams, bigrams, and higher-order n-grams are used as features (Pang et al., 2002; Wang & Manning, 2012). These traditional approaches either ignore word order entirely or restrict their focus to small word tuples, which results in inevitable information loss. Moreover, they suffer from data sparsity and high dimensionality.
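To make the bag-of-words idea concrete, the following minimal sketch (our own illustration, not taken from the paper; the function name is ours) counts unigram and bigram frequencies of a text for use as classifier features:

```python
from collections import Counter

def ngram_features(text, n_values=(1, 2)):
    """Count n-gram occurrences in a whitespace-tokenized text."""
    tokens = text.lower().split()
    feats = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

feats = ngram_features("the movie was not good")
# "not good" is captured as a bigram feature
```

Note that word order survives only within each n-gram: "not good" is caught by a bigram, but any dependency spanning more than n words is lost, which is exactly the information loss described above.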
In recent years, the development of deep neural networks has drawn attention in various NLP tasks, aided by pre-trained word embeddings (Pennington et al., 2014; Xu et al., 2016; Liu et al., 2017). Words are embedded into an implicit space in which each word is represented by a vector, and semantic relationships among words are commonly preserved through the distances between these vectors. Word embeddings not only alleviate the problem of data sparsity (Bengio et al., 2003), but have also been shown to preserve meaningful syntactic and semantic regularities (Pennington et al., 2014). With the assistance of word embeddings, deep learning approaches, especially recurrent neural networks (RNN) (Elman, 1990) and convolutional neural networks (CNN) (LeCun et al., 2003), have been introduced to learn semantic representations of text. These approaches have promoted the development of text classification (Kalchbrenner et al., 2014; Jiang et al., 2018).
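As a toy illustration of how distances in the embedding space reflect semantic relatedness, the sketch below compares cosine similarities between hand-made 3-dimensional vectors (the vectors are invented for the example; a real system would load pre-trained embeddings such as GloVe, which have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity: higher means the two words lie closer in the space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen so that related words point in similar directions.
emb = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.35],
    "table": [0.1, 0.9, 0.2],
}

# Semantically related words end up closer than unrelated ones.
assert cosine(emb["good"], emb["great"]) > cosine(emb["good"], emb["table"])
```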
An RNN is a natural way to process a text word by word, treating it as sequential data: it preserves each word's information, together with its contextual information, in a hidden state. Long Short-Term Memory (LSTM) units (Hochreiter & Schmidhuber, 1997) and the later gated recurrent units (GRU) (Chung et al., 2014) alleviate the vanishing and exploding gradient problems, which has made these two units popular building blocks. In theory, an RNN can model arbitrarily long sequences, including long-term dependencies in texts (Cho et al., 2014). Another popular neural network model, the CNN, has made impressive progress in modeling the semantic information of texts (Kalchbrenner et al., 2014; Kim, 2014). A CNN performs feature mapping by applying a one-dimensional (1D) convolution over a two-dimensional (2D) matrix formed by concatenating the embeddings of the words in a text. A 1D pooling operation is then applied to the convolutional features over the time dimension to obtain a fixed-length output vector. Thanks to its convolution operation, a CNN can capture both position-invariant and local features of texts.