1. Introduction
In recent years, an increasing number of semantic data sources have been published on the Web. These sources are further interlinked to form Linking Open Data (LOD). Among LOD, DBpedia1 and YAGO2 are the two main data sources serving as hubs. The DBpedia project (Bizer et al., 2009) extracts structured information from Wikipedia and publishes this information on the Web; DBpedia is currently one of the largest hubs of LOD. YAGO (Suchanek, Kasneci, & Weikum, 2007) is another large and well-known semantic knowledge base (KB), derived from Wikipedia, WordNet and GeoNames. Both DBpedia and YAGO continue to evolve, and many versions of each have been published.
Due to the multilingual nature of Wikipedia, both DBpedia and YAGO contain semantic data in Chinese. However, while Wikipedia is one of the largest encyclopedias on the Web, the number of its Chinese articles is much smaller than the number of articles in English or German. Thus, DBpedia and YAGO do not contain adequate Chinese knowledge compared with the amount of knowledge they express in English. On the other hand, Hudong-Baike3 and Baidu-Baike4, two Chinese encyclopedia Web sites, each contain roughly ten times as many articles as the Chinese version of Wikipedia. Emerging projects such as Zhishi.me (Niu et al., 2011), SSCO (Hu, Shao, & Ruan, 2014) and XLore (Wang et al., 2013) try to extract structured Chinese information from a combination of Chinese encyclopedia Web sites, including Hudong-Baike, Baidu-Baike and Chinese Wikipedia. Both Zhishi.me5 and SSCO6 provide Web sites with user-friendly GUIs for user access.
Since there are so many KBs in different languages, extracted from different sources via different methods, it is natural to ask questions such as: How does the quality of a KB change as new versions of its data sets are published? Is the quality of Chinese KBs comparable to, or better than, that of their English counterparts? How is the quality of a KB extracted from multiple data sources affected by those sources? Do these KBs share similar errors?
To address the assessment requirements of comparing Web-scale extracted KBs, we focus on two quality dimensions, namely Richness and Correctness. The reason is that whether a KB is Web-scale depends on the richness of its data, and extracted data is prone to errors. To find suitable metric sets for measuring these quality dimensions, we survey the research on metrics and methodologies for LOD evaluation, as all the above KBs are inspired by the design principles of LOD. Zaveri et al. (2016) summarized 69 metrics and categorized them into 4 dimensions, namely Accessibility, Intrinsic, Contextual and Representational. The sub-dimensions of Intrinsic include Syntactic validity, Semantic accuracy, Consistency and Completeness. Our Richness dimension relates to the Completeness sub-dimension in Zaveri et al. (2016), and our Correctness dimension relates to Syntactic validity, Semantic accuracy and Consistency. However, the metrics in the metric set of a sub-dimension from Zaveri et al. (2016) are collected from different research works, and they logically overlap and interweave. Moreover, they do not share a unified representation. In another pilot study, Glenn and Dave7 listed 15 metrics to assess the quality of a data set, including Accuracy, Completeness, Typing and Currency, etc. However, they do not provide any formulas for calculating these metrics.
We provide a graph-based conceptual representation for Web-scale KBs and define metric sets for the two dimensions in a quasi-formal way. Different KBs are expressed in the same conceptual representation. This approach differs from TripleCheckMate (Kontokostas, Zaveri, Auer, & Lehmann, 2013), which is based solely on DBpedia. The conceptual representation consists of a schema graph and a data graph. The metrics are defined on these two graphs, and we focus on the metrics over the data graph because our Chinese KBs have little schema information.
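To make the idea concrete, the sketch below models a KB as a data graph of (subject, property, object) triples and computes one completeness-style metric over it. This is only an illustrative assumption of how such a representation might look in code; the function names, the `rdf:type` convention, and the specific metric are not taken from the paper's formal definitions.

```python
from collections import defaultdict

def build_data_graph(triples):
    """Index (subject, property, object) triples by subject,
    forming a simple adjacency view of the data graph."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph

def typed_entity_ratio(graph, type_property="rdf:type"):
    """A completeness-style metric (illustrative, not the paper's
    definition): the fraction of subject entities that have at
    least one type assertion."""
    if not graph:
        return 0.0
    typed = sum(
        1 for props in graph.values()
        if any(p == type_property for p, _ in props)
    )
    return typed / len(graph)

# Toy data graph: two subject entities, one of which lacks a type.
triples = [
    ("dbr:Beijing", "rdf:type", "dbo:City"),
    ("dbr:Beijing", "dbo:country", "dbr:China"),
    ("dbr:China", "dbo:capital", "dbr:Beijing"),  # no rdf:type triple
]
graph = build_data_graph(triples)
print(typed_entity_ratio(graph))  # 1 of 2 subjects typed -> 0.5
```

Because the metric is defined purely over the data graph, it can be computed uniformly for any KB expressed in this representation, regardless of how much schema information the KB provides.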