1. Introduction
With the dissemination of the Resource Description Framework (RDF) and the SPARQL query language, the number of organizations that publish data on the Web in RDF is growing, and the total amount of published RDF data is increasing at a staggering rate. RDF data and SPARQL queries have been used in a wide range of tasks, such as semantic stream processing (Barbieri et al., 2009; Chun et al., 2017), spatiotemporal query processing (Hu et al., 2015; Jaziri et al., 2015; Eom et al., 2017), and the analysis of ontological models (Rivero et al., 2015).
Processing SPARQL queries against a large volume of RDF data is a challenging task. Most conventional methods center on developing scalable RDF query engines. RDF stores such as Jena (Carroll et al., 2004), RDF-3X (Neumann et al., 2010), 3store (Harris et al., 2003), Hexastore (Weiss et al., 2008), SW-Store (Abadi et al., 2009), and Sesame (Broekstra et al., 2002) use a centralized approach. As the amount of RDF data grows, it becomes harder to store and process on a single machine. Another line of work stores RDF data in a distributed relational database system and converts SPARQL queries into equivalent SQL queries (Husain et al., 2011). RDF stores such as SHARD (Rohloff et al., 2010) and HadoopRDF (Husain et al., 2011) use a distributed computing system to store RDF data across numerous machines (Huang et al., 2011; Picalausa et al., 2012). Most of these systems use Hadoop to execute joins between subsets of RDF data. The shuffling stage and the heavy network usage between the Map and Reduce stages of a MapReduce framework degrade the performance of conventional Reduce-side joins. To overcome this degradation, MAPSIN (Schätzle et al., 2012) and RDFChain (Choi et al., 2013) introduce Map-side joins to SPARQL query processing. However, heavy network usage remains an issue when several jobs are needed to execute a query, so TriAD (Gurajada et al., 2014) adopts join-ahead pruning via graph summarization to discard triples before join operations.
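To make the Reduce-side versus Map-side distinction concrete, the following is a minimal single-process sketch, not the code of any of the cited systems. It joins two hypothetical triple patterns (`?x advisor ?y . ?y worksAt ?z`) over an illustrative triple list; all names and data are assumptions for the example.

```python
from collections import defaultdict

# Illustrative RDF triples (subject, predicate, object).
triples = [
    ("alice", "advisor", "bob"),
    ("carol", "advisor", "bob"),
    ("bob", "worksAt", "MIT"),
]

def reduce_side_join(triples):
    """Reduce-side style: both patterns emit (join key, payload); the
    framework shuffles all intermediate results over the network and
    groups them by key before the join happens in the reducers."""
    buckets = defaultdict(lambda: ([], []))      # simulated shuffle + group-by-key
    for s, p, o in triples:
        if p == "advisor":
            buckets[o][0].append(s)              # join key is ?y (the object)
        elif p == "worksAt":
            buckets[s][1].append(o)              # join key is ?y (the subject)
    return [(x, y, z) for y, (xs, zs) in buckets.items()
            for x in xs for z in zs]

def map_side_join(triples):
    """Map-side style (the idea behind MAPSIN/RDFChain): one side of the
    join is available as a lookup structure, so each mapper joins its
    input locally and no intermediate results are shuffled."""
    works_at = {s: o for s, p, o in triples if p == "worksAt"}   # lookup side
    return [(s, o, works_at[o]) for s, p, o in triples
            if p == "advisor" and o in works_at]

print(sorted(reduce_side_join(triples)) == sorted(map_side_join(triples)))  # True
```

Both functions produce the same bindings for (?x, ?y, ?z); the difference is that the Map-side variant never materializes the per-key buckets that a real MapReduce framework would shuffle across the network.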
Overall, the following issues exist with the conventional methods:
- The shuffling phase and the network overhead between the Map and Reduce phases of a MapReduce framework. Shuffling generates heavy traffic because of the large volume of intermediate results;
- The performance cost of saving intermediate results to disk. When moving from a Map phase to a Reduce phase, the intermediate results must be written to disk before the Reduce phase begins, incurring many disk I/Os. Disk I/O is a major overhead that should be avoided;
- The growing number of jobs needed to execute a query. A complex or long-running query may require multiple jobs to complete, and when those jobs are chained sequentially, both of the issues above recur at every job boundary.
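The three issues above compound: each additional job in a chained query plan repeats the spill-to-disk and shuffle costs and also re-reads the previous job's output. A back-of-the-envelope sketch, with purely illustrative record counts and sizes (not measurements from any cited system):

```python
def chained_job_io(num_jobs, intermediate_records, bytes_per_record=100):
    """Rough count of bytes crossing disk for a query plan of num_jobs
    chained MapReduce jobs. Assumed cost model: every job spills its map
    output to local disk and writes its final result to the distributed
    file system, and every job after the first re-reads its predecessor's
    output from disk."""
    per_job = (
        intermediate_records * bytes_per_record    # map output spilled to local disk
        + intermediate_records * bytes_per_record  # reduce output written to storage
    )
    rereads = (num_jobs - 1) * intermediate_records * bytes_per_record
    return num_jobs * per_job + rereads

one_job = chained_job_io(1, 1_000_000)
three_jobs = chained_job_io(3, 1_000_000)
print(three_jobs / one_job)  # prints 4.0: disk traffic grows faster than linearly in jobs
```

Under this (assumed) cost model, a three-job plan incurs four times the disk traffic of a single-job plan, which is why the systems discussed above try either to eliminate the shuffle (Map-side joins) or to shrink the intermediate results before joining (join-ahead pruning).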