Introduction
The Semantic Web movement emerged in 2001 with the goal of endowing the current Web with metadata, thereby evolving it into a Web of Data that is more accessible to computers (Polleres & Huynh, 2009; Shadbolt et al., 2006). Currently, we are witnessing an increasing popularity of the Web of Data, chiefly in the context of Linked Open Data, a successful initiative that consists of a number of principles to publish, connect, and query data on the Web (Bizer et al., 2009a). As a consequence of this popularity, there is a large variety of web sources covering several domains, such as government, life sciences, geography, media, libraries, or scholarly publications (Heath & Bizer, 2011). Furthermore, these sources offer their data in the RDF language and can be queried using the SPARQL query language (Antoniou & van Harmelen, 2008).
Scientists are currently using the Web of Data as a large database to answer structured queries from users (Polleres & Huynh, 2009). As a result, one of the main challenges they face in this context is scalability, i.e., processing data at Web scale, which is usually referred to as Big Data (Bizer et al., 2011). A further challenge is not only to implement scalable solutions that deal with this amount of data, but also to cope with the steady growth in the number of sources in the Web of Data; e.g., in the domain of Linked Open Data, there were roughly 12 such sources in 2007 and, as of the time of writing this article, there are 226 (LOD Cloud, 2012).
Ontological models are used to provide schema semantics to RDF data. These models comprise types, data properties, and object properties, each of which is identified by a URI (Antoniou & van Harmelen, 2008). Ontological models are shared and developed with the consensus of one or more communities (Rivero et al., 2013b), which define a number of inherent constraints over the models, such as subtypes, the domains and/or ranges of a property, or subproperties.
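As a rough illustration of the elements just listed, the following Python sketch encodes a tiny ontological model as a set of (subject, predicate, object) triples. The "ex:" names are hypothetical and do not come from any model discussed in this article; prefixed names stand in for full URIs, and the helper function supertypes is an assumption introduced only for this example.

```python
# Minimal sketch: an ontological model written as plain RDF-style triples.
# All "ex:" identifiers are hypothetical; prefixed names abbreviate full URIs.
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"

model = {
    # Types
    ("ex:Person", RDF_TYPE, "owl:Class"),
    ("ex:Author", RDF_TYPE, "owl:Class"),
    ("ex:Paper", RDF_TYPE, "owl:Class"),
    # Subtype constraint
    ("ex:Author", SUBCLASS, "ex:Person"),
    # Data property with a domain
    ("ex:name", RDF_TYPE, "owl:DatatypeProperty"),
    ("ex:name", DOMAIN, "ex:Person"),
    # Object property with a domain and a range
    ("ex:writes", RDF_TYPE, "owl:ObjectProperty"),
    ("ex:writes", DOMAIN, "ex:Author"),
    ("ex:writes", RANGE, "ex:Paper"),
}


def supertypes(model, cls):
    """Return every type reachable from cls by following rdfs:subClassOf."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in model:
            if s == current and p == SUBCLASS and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen
```

Calling supertypes(model, "ex:Author") follows the single subtype statement and yields the set containing "ex:Person"; a type with no declared supertypes yields an empty set.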
In traditional information systems that comprise a back-end database, developers first need to create a data model according to the user requirements, which is later populated. By contrast, in the Web of Data, data can exist without an explicit model, since data on the Web typically existed first and models were added later. Moreover, several models may exist for the same set of data. As a result, in the context of the Web of Data, we cannot usually rely on existing ontological models to understand RDF data since there might be a gap between the models and the data, i.e., the data and the model are usually devised in isolation, without taking each other into account (Glimm et al., 2012). Furthermore, RDF data may not satisfy a particular ontological model related to these data, even though satisfying it is mandatory to perform a number of tasks, such as data integration (Makris et al., 2012), data exchange (Rivero et al., 2013c), data warehousing (Glorio et al., 2012), or ontology evolution (Flouris et al., 2008). Consequently, current information integration techniques can benefit from the discovery of conceptual models (Rivero et al., 2013a).
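To make the notion of data not satisfying a model more concrete, the sketch below checks RDF-style triples against the domain constraints of a model. It is only an illustration under stated assumptions: the "ex:" names and the function domain_violations are hypothetical, and rdfs:domain is treated as a constraint checked against explicitly asserted types, in line with the view of constraints taken in this article, rather than as an RDFS inference rule.

```python
# Minimal sketch: detect data triples that violate a model's domain constraints.
# All "ex:" identifiers and the function name are hypothetical.
RDF_TYPE = "rdf:type"
DOMAIN = "rdfs:domain"

model = {
    ("ex:writes", DOMAIN, "ex:Author"),
}

data = {
    ("ex:alice", RDF_TYPE, "ex:Author"),
    ("ex:alice", "ex:writes", "ex:paper1"),
    ("ex:bob", "ex:writes", "ex:paper2"),  # ex:bob has no asserted type
}


def domain_violations(model, data):
    """List (subject, property, required type) for every unmet domain constraint."""
    # Collect the declared domain(s) of each property in the model.
    domains = {}
    for s, p, o in model:
        if p == DOMAIN:
            domains.setdefault(s, set()).add(o)

    # Collect the explicitly asserted types of each subject in the data.
    types = {}
    for s, p, o in data:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    # A triple violates a domain if its subject lacks the required type.
    violations = []
    for s, p, o in sorted(data):
        for required in sorted(domains.get(p, ())):
            if required not in types.get(s, set()):
                violations.append((s, p, required))
    return violations
```

With the sample data, the only violation reported is for ex:bob, who uses ex:writes without being asserted as an ex:Author. This sketch ignores subtypes; a fuller check would also accept subjects whose asserted type is a subtype of the required one.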
To illustrate that this gap between ontological models and RDF data is not negligible in practice, we provide two real-world examples based on current models and data (see Arenas et al. (2014) for an in-depth discussion of this topic). The examples are as follows: