A manual approach to classifying web queries is straightforward. Usually several assessors are involved in the classification process, and, to reduce subjectivity, more than one person is typically asked to classify the same query. When a consensus is not reached initially, either another element is added to ease the classification or a discussion between the adjudicators is promoted to reach one. In a study of the queries that users submit to search engines, Spink, Wolfram, Jansen, and Saracevic (2001) manually classified a sample of 2,414 queries submitted to the Excite search engine into 11 categories. Focusing on health queries submitted to search engines, Spink et al. (2004) also classified queries manually to select those related to health. Despite its popularity, manual classification is slow and tedious, and it requires the availability of one or more human classifiers. In some cases, the sheer volume of queries may even make the classification task impracticable. For these reasons, automatic methods have been proposed.
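The consensus step in the manual process described above can be sketched as follows. This is only an illustrative sketch, not the procedure used in the cited studies: the function name `aggregate_labels`, the `min_agreement` threshold, and the example queries and category names are all assumptions made for the example.

```python
from collections import Counter


def aggregate_labels(labels_by_query, min_agreement=1.0):
    """Split queries into agreed labels and queries that need adjudication.

    labels_by_query maps each query to the list of labels given by the
    assessors. min_agreement is a hypothetical threshold: the fraction of
    assessors that must share the most frequent label.
    """
    agreed, disputed = {}, []
    for query, labels in labels_by_query.items():
        # Most frequent label and the number of assessors who chose it.
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            agreed[query] = label
        else:
            # No consensus: leave the query for discussion between assessors.
            disputed.append(query)
    return agreed, disputed


# Hypothetical assessments from two assessors per query.
assessments = {
    "flu symptoms": ["health", "health"],
    "jaguar": ["animals", "autos"],
}
agreed, disputed = aggregate_labels(assessments)
```

Queries that end up in `disputed` are the ones that would go to a discussion between the adjudicators. Even with this bookkeeping automated, the labelling itself remains manual, which is why the approach does not scale to large query logs.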
In Information Retrieval (IR), several approaches have emerged for detecting topics in documents and document collections. Some are based on mathematical models, such as Latent Semantic Analysis, which uses co-occurrences of terms in the collection to reduce the dimensionality of the documents' semantic representation (Landauer, Foltz, & Laham, 1998). However, because web queries are typically short and so offer few co-occurring terms, these methods are not the most appropriate for them.
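To make the idea behind Latent Semantic Analysis concrete, the sketch below builds a term-by-document count matrix for a toy collection and reduces it with a truncated singular value decomposition. It is a minimal illustration under stated assumptions, not the setup of the cited work: the four documents, the helper names, raw counts as weights, and the choice of `k = 2` latent dimensions are all invented for the example.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return the sorted vocabulary and a term-by-document count matrix."""
    vocab = sorted({term for doc in docs for term in doc.lower().split()})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc.lower().split():
            matrix[index[term], j] += 1.0
    return vocab, matrix


def lsa_document_vectors(matrix, k):
    """Project documents onto the k strongest latent dimensions (truncated SVD)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Each row of the result is one document in the reduced space.
    return (np.diag(s[:k]) @ vt[:k, :]).T


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Hypothetical toy collection: two health documents, two travel documents.
docs = [
    "flu symptoms fever cough",
    "fever cough treatment",
    "cheap flights lisbon",
    "lisbon hotels cheap",
]
vocab, term_doc = build_term_document_matrix(docs)
doc_vectors = lsa_document_vectors(term_doc, k=2)

same_topic = cosine(doc_vectors[0], doc_vectors[1])
different_topic = cosine(doc_vectors[0], doc_vectors[2])
```

In this toy collection the documents that share terms end up close together in the reduced space, while documents from the other topic do not. The grouping depends entirely on terms co-occurring across documents, so a query of one or two words gives the decomposition very little to work with.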