Introduction
Rapid advances in Internet technologies have produced massive volumes of digitized text. As textual data keeps growing, a technology that can automatically mine useful information from text is urgently needed. Information extraction technology emerged in response to this need and has been widely used (Fiori et al., 2014). Event extraction is the most challenging task in information extraction; it aims to automatically extract the information that users are interested in from unstructured text and present it in the form of structured events (Ahn, 2006). Event extraction has given considerable impetus to the development of knowledge graph construction (Bu et al., 2021), text mining (Lyu & Liu, 2021), information retrieval (Feng et al., 2021), and related areas.
At present, event extraction can be divided into meta-event extraction and topic event extraction: a meta-event describes only a simple action or state change, whereas a topic event describes the developmental process of things. Event extraction broadly involves two subtasks, trigger extraction and event argument extraction. A trigger is a keyword that clearly expresses the occurrence of an event, and an event argument is a related description of the event, such as its time, place, or participants. Identifying the trigger allows an event to be detected and its type to be determined. Each event type has its own representation frame; each relevant entity in the sentence is checked against this frame to decide whether it is an event argument and, if so, which argument role it plays.
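The relationship between triggers, frames, and argument roles can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the trigger lexicon, the two frames, the entity types, and the names `TRIGGERS`, `FRAMES`, `Event`, and `extract_event` are hypothetical, and mapping an entity type directly to a role is a deliberate simplification of what a real extractor would learn.

```python
from dataclasses import dataclass, field

# Hypothetical trigger lexicon: trigger word -> event type (illustrative only).
TRIGGERS = {"attacked": "Attack", "married": "Marry"}

# Hypothetical frames: event type -> {entity type: argument role}.
FRAMES = {
    "Attack": {"PERSON": "Attacker", "LOCATION": "Place", "TIME": "Time"},
    "Marry": {"PERSON": "Spouse", "LOCATION": "Place", "TIME": "Time"},
}


@dataclass
class Event:
    event_type: str
    trigger: str
    arguments: list = field(default_factory=list)  # (role, entity text) pairs


def extract_event(tokens, entities):
    """Return the first event found in the sentence, or None.

    tokens:   list of words in the sentence
    entities: list of (text, entity_type) pairs already recognized in it
    """
    for token in tokens:
        # Step 1: the trigger detects the event and determines its type.
        event_type = TRIGGERS.get(token.lower())
        if event_type is None:
            continue
        # Step 2: check each entity against the frame of that event type.
        frame = FRAMES[event_type]
        event = Event(event_type, token)
        for text, entity_type in entities:
            role = frame.get(entity_type)
            if role is not None:  # entities outside the frame are not arguments
                event.arguments.append((role, text))
        return event
    return None


tokens = "Rebels attacked the town on Monday".split()
entities = [("Rebels", "PERSON"), ("the town", "LOCATION"),
            ("Monday", "TIME"), ("a report", "DOCUMENT")]
event = extract_event(tokens, entities)
```

In this sketch the entity typed `DOCUMENT` is dropped because the assumed `Attack` frame has no slot for it, which mirrors the frame-based decision of whether an entity is an event argument at all.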
Traditional meta-event extraction approaches mainly adopt pattern matching or machine learning. Pattern matching detects and extracts meta-events under the guidance of meta-event templates and performs well in specific fields. However, building meta-event templates is time-consuming and laborious, and building a general-purpose template is difficult. Machine learning approaches model meta-event extraction as a multi-class classification or sequence labeling task, with the extracted features serving as model inputs. However, training models with a supervised learning strategy requires large volumes of labeled samples, and because these samples are generally annotated by experts, they are costly to produce. When labeled samples are scarce and the categories are imbalanced, the extraction performance of the models degrades. To overcome this limitation, researchers have proposed semi-supervised learning strategies (Zhou & Li, 2010) that train models on a small number of labeled samples together with a large number of unlabeled samples. Tri-training (Zhou & Li, 2005) is a classical semi-supervised learning algorithm: it uses bootstrap sampling to train three classifiers, lets them work together, and expands the training set by repeatedly adding newly labeled samples drawn from the unlabeled set, yielding three well-performing classifiers. Because unlabeled samples are cheap and easy to obtain, training high-performance event extraction models with semi-supervised strategies is a current research hotspot.
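The tri-training loop described above can be sketched in a few lines. This is a simplified illustration rather than the full algorithm of Zhou & Li (2005): it keeps the bootstrap initialization and the rule that a classifier receives pseudo-labels on which the other two agree, but omits the error-rate conditions that the original uses to decide when pseudo-labeled samples may be added. The one-dimensional nearest-centroid classifier and all function names are assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter


def train_centroids(samples):
    """Fit a toy nearest-centroid classifier on (x, label) pairs, x a float."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def bootstrap(labeled, rng):
    """Resample with replacement, retrying until every class is present."""
    classes = {y for _, y in labeled}
    while True:
        sample = [rng.choice(labeled) for _ in labeled]
        if {y for _, y in sample} == classes:
            return sample


def tri_train(labeled, unlabeled, rounds=5, seed=0):
    rng = random.Random(seed)
    # Each of the three classifiers starts from its own bootstrap sample.
    models = [train_centroids(bootstrap(labeled, rng)) for _ in range(3)]
    for _ in range(rounds):
        new_models = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # Pseudo-label the samples on which the other two classifiers agree.
            pseudo = [(x, predict(models[j], x)) for x in unlabeled
                      if predict(models[j], x) == predict(models[k], x)]
            # Retrain classifier i on the labeled set plus the pseudo-labels.
            new_models.append(train_centroids(labeled + pseudo))
        if new_models == models:  # no classifier changed: stop early
            break
        models = new_models
    return models


def vote(models, x):
    """Final prediction by majority vote of the three classifiers."""
    return Counter(predict(m, x) for m in models).most_common(1)[0][0]


labeled = [(0.0, 0), (0.2, 0), (0.4, 0), (5.0, 1), (5.2, 1), (5.4, 1)]
unlabeled = [0.1, 0.3, 0.5, 1.0, 4.0, 4.9, 5.1, 5.3]
models = tri_train(labeled, unlabeled)
```

The pseudo-labeled set is recomputed in every round instead of being accumulated, so a label that the other two classifiers later disagree on is not carried forward.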
Compared with sentence-level meta-events, document-level topic events contain richer global semantic information, including multi-faceted meta-events, and can present the core content of the text from a global perspective. However, the descriptive information of a topic event is scattered across the text, and existing meta-event extraction approaches cannot meet the demands of topic event extraction, which is a complicated procedure. The difficulty lies in identifying all topic-related meta-events within the document and then merging and extracting them. At present, some topic event extraction work applies an event frame or ontology to represent the components of a topic event and the relations among them, and this has achieved strong results in specific fields. Nevertheless, existing topic event extraction technologies are not yet mature; intra-textual semantic understanding and cross-textual event extraction in particular need further research.