DIAERESIS: Knowledge Graph Partitioning for Efficient Query Answering

The rapid explosion of linked data demands effective and efficient storage, management, and querying methods. Apache Spark is one of the most widely used engines for big data processing, with more and more systems adopting it for efficient query answering. Existing approaches exploiting Spark for querying RDF data adopt partitioning techniques for reducing the data that need to be accessed in order to improve efficiency. However, simplistic methods for data partitioning fail to minimize data access during query answering and thus to effectively improve query efficiency. In this demonstration, we present DIAERESIS, a novel platform that exploits a summary-based partitioning strategy, significantly reducing data access and, as such, improving query-answering efficiency. DIAERESIS first identifies the top-k most important schema nodes and distributes the other schema nodes to the centroid they mostly depend on. Then, it allocates the corresponding instance nodes to the schema nodes they are instantiated under, creating vertical sub-partitions and indexes. We allow conference participants to actively identify the impact of our partitioning methodology on data distribution and replication, data accessed for query answering, and query answering efficiency. Further, we contrast our approach with existing partitioning approaches adopted by state-of-the-art systems in the domain, providing a deep understanding of the challenges in the area.


INTRODUCTION
The prevalence of Linked Open Data, and the explosion of available information on the Web, have led to an enormous amount of widely available RDF datasets [2]. To store, manage, and query these ever-increasing RDF data, many distributed big-data processing engines have been developed. Apache Spark is such an engine, and there is renewed interest in using it for efficient query answering over RDF data [1]. The platform uses in-memory data structures that can be used to store RDF data, offering increased efficiency and enabling effective, distributed query answering.
The problem. The data layout plays an important role in efficient query answering in a distributed environment. The obvious way of using Spark for RDF query answering is to store all triples as a single large file in HDFS, which natively divides the file into blocks distributed across the computational nodes. Using this approach, query answering usually needs to access a large volume of data to identify the required information. This results in poor query-answering performance.
The elusive solution: simplified partitioning schemes. As this problem has already been recognized by the research community, many approaches have been developed trying to minimize data access when answering SPARQL queries [1]. To achieve this, most Spark-based RDF query answering approaches exploit a simplistic partitioning of triples (e.g., creating a partition for every predicate) and employ various indexing (e.g., bloom filters) and join pre-computation techniques. For example, SPARQLGX [3] uses a vertical partitioning strategy, S2RDF [6] implements an extended version of the classic vertical partitioning technique, called ExtVP, that pre-computes semi-joins, and WORQ [5] uses vertical partitioning tables and is based on bloom joins using bloom filters. However, although the aforementioned partitioning techniques are successful in optimizing certain fragments or categories of SPARQL queries, they fail to have a wider impact on all query categories, resulting in poor overall performance improvement for query answering.
Our solution. To address these problems, we introduce DIAERESIS [7], showing how to effectively partition data, balancing data distribution among partitions and reducing the size of the data accessed for query answering, thus drastically improving query answering efficiency. The core idea is to identify important schema nodes as centroids, distribute the other nodes to the centroid on which they mostly depend, and assign the instance nodes to the corresponding schema nodes. Finally, a vertical sub-partitioning step further minimizes the accessed data during query answering, with appropriate indexes enabling the rapid identification of the specific data to be loaded at query time.
Demonstration. In this demonstration, we present the choices made within the DIAERESIS system for partitioning, sub-partitioning, and indexing. We allow conference participants to experiment with various datasets and query workloads and identify the benefits of the proposed solution. We experimentally show that as the number of partitions in DIAERESIS grows, replication increases; however, on average, individual sub-partitions contain less data (proved also theoretically in the main paper [7]). Finally, we contrast our approach with other state-of-the-art vertical and hybrid partitioning strategies that use different kinds of indexes. We explore and demonstrate the trade-offs between data distribution, replication, and query efficiency in DIAERESIS and other systems. A video presenting various aspects of our demonstration is available online1.

Partitioning
The DIAERESIS Partitioner undertakes the task of partitioning the input RDF dataset, initially into first-level partitions, then into vertical partitions, and finally constructing the corresponding indexes to be used for query answering. Specifically, the Partitioner uses the Dependency Aware Partitioning (DAP) algorithm to construct the first-level partitions, focusing on the structure of the data graph and the dependence between the nodes.

2.1.1 Dependency Aware Partitioning Algorithm. The Dependency Aware Partitioning (DAP) algorithm, given an RDF dataset and a number of partitions k, splits the input dataset into k partitions. It uses betweenness centrality along with the number of instances to identify the importance of each node, and uses this importance to identify centroids. Then, the Dependence measure is used for assigning nodes to centroids. It considers that infrequent connections between two classes are more informative than frequent ones, along with the importance of the classes and their distance, in order to identify the dependent nodes.
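The centroid selection and dependence-based assignment can be sketched as follows. This is a minimal illustrative sketch, not the actual DIAERESIS implementation: the importance scores (e.g., betweenness centrality weighted by the number of instances) and the pairwise dependence scores are assumed to be precomputed, and the data shapes are assumptions.

```python
def dap_partition(schema_nodes, importance, dependence, k):
    """Sketch of Dependency Aware Partitioning (DAP).

    importance: dict node -> precomputed importance score.
    dependence: dict (node, centroid) -> precomputed dependence score.
    Returns a dict centroid -> set of schema nodes assigned to it.
    """
    # Step 1: pick the k most important schema nodes as centroids.
    centroids = sorted(schema_nodes, key=lambda n: importance[n],
                       reverse=True)[:k]
    partitions = {c: {c} for c in centroids}
    # Step 2: assign every remaining node to the centroid
    # on which it most depends.
    for node in schema_nodes:
        if node in centroids:
            continue
        best = max(centroids, key=lambda c: dependence.get((node, c), 0.0))
        partitions[best].add(node)
    return partitions
```

The instance nodes are then allocated to the partitions of the schema nodes under which they are instantiated.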
Depending on the characteristics of the individual dataset (e.g., it might be the case that most of the instances fall under just a few schema nodes), data might be accumulated into one partition, leading to data access overhead at query answering, as large fragments of data would have to be examined. Thus, DAP tries to achieve a balanced data distribution, reducing data access while maintaining a low replication factor.

Example 2.1. Consider the graph shown in Figure 2, which presents a fragment of the LUBM ontology. Assume now that we would like to partition the dataset into three partitions (k = 3). The first step is to select the three most important schema nodes (the ones in boldface) and then to assign to each centroid the schema nodes that depend on it. The resulting partitioning scheme is shown in the figure.

2.1.2 Vertical Partitioning. Besides first-level partitioning, the DIAERESIS Partitioner also implements vertical sub-partitioning to further reduce the size of the data touched. Thus, it splits the triples of each partition produced by the DAP algorithm into multiple vertical partitions, one per predicate. Each vertical partition contains the subjects and the objects for a single predicate, enabling at query time a more fine-grained selection of data that are usually queried together. The vertical partitions are stored as Parquet files in HDFS (see Figure 1). A direct effect of this choice is that when looking for a specific predicate, we do not need to access the entire data of the first-level partition storing this predicate, but only the specific vertical partition with the related predicate. Two theorems prove that as the number of partitions in DIAERESIS grows, replication increases; however, on average, individual sub-partitions contain less data (see [7]).
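The vertical sub-partitioning step amounts to grouping the triples of each first-level partition by predicate, keeping only the (subject, object) pairs. A minimal in-memory sketch follows; DIAERESIS itself persists each sub-partition as a Parquet file in HDFS via Spark, and the data shapes here are illustrative assumptions.

```python
def vertical_subpartition(first_level_partitions):
    """Split each first-level partition into one vertical partition
    per predicate, keeping only (subject, object) pairs.

    first_level_partitions: dict partition_id -> list of (s, p, o) triples.
    Returns dict partition_id -> {predicate: [(s, o), ...]}.
    """
    layout = {}
    for pid, triples in first_level_partitions.items():
        vps = {}
        for s, p, o in triples:
            # One vertical partition per predicate.
            vps.setdefault(p, []).append((s, o))
        layout[pid] = vps
    return layout
```

With this layout, a query touching a single predicate loads only that predicate's sub-partition instead of the whole first-level partition.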
2.1.3 Indexing. Next, in order to speed up the query evaluation process, we generate appropriate indexes, so that the necessary sub-partitions are directly located during query execution. Specifically, as our partitioning approach is based on the schema of the dataset and data is partitioned based on the schema nodes, we index for each schema node the first-level partitions to which it is assigned (Class Index), as well as the vertical partitions to which it belongs (VP Index). For each instance, we also index the schema nodes under which it is instantiated (Instance Index). The VP Index is used in the case of a query with unbound predicates, in order to identify which vertical partitions should be loaded, avoiding searching all of them in a first-level partition.
Example 2.2. Figure 3 presents example indexes for our running example. Assuming that we have five instances in our dataset, the Instance Index, shown in the figure (left), records for each instance the schema node to which it belongs. Further, the Class Index records for each schema node the first-level partitions to which it belongs, as besides the one to which it is primarily assigned, it might also be allocated to other partitions. Finally, the VP Index contains the vertical partitions in which the schema nodes are stored (for each first-level partition). For example, the schema node Organization (along with its instances) is located in Partition-2, and specifically its instances are located in the vertical partitions affiliatedOf, orgPublication, and rdfs:subClassOf.
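Using the index contents of the running example, an unbound-predicate lookup can be sketched as follows. The dictionary shapes are an illustrative assumption; only the index contents (Organization in Partition-2 with its three vertical partitions) come from the example above.

```python
# Illustrative index contents, mirroring the running example (Figure 3).
instance_index = {"FORTH": ["Organization"]}
class_index = {"Organization": [2]}
vp_index = {(2, "Organization"):
            ["affiliatedOf", "orgPublication", "rdfs:subClassOf"]}

def vps_for_unbound_predicate(instance):
    """For a triple pattern with an unbound predicate, find which
    vertical partitions may contain triples about `instance`,
    without scanning every vertical partition of the first-level
    partition."""
    result = []
    for node in instance_index[instance]:        # instance -> schema nodes
        for pid in class_index[node]:            # schema node -> partitions
            result.extend(vp_index[(pid, node)]) # partition -> VPs
    return result
```

A pattern such as (FORTH ?p ?x) would thus load only the three listed vertical partitions of Partition-2.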

Query Processor
In this section, we focus on the query processor module, implemented on top of Spark. An input SPARQL query is parsed and then translated into an SQL query. To achieve this, first, the Query Processor detects the first-level and vertical partitions that should be accessed for each triple pattern in the query, creating a Query Index. This procedure is called Partition Discovery. Then, this Query Index is used by the Query Translation procedure to construct the final SQL query. Our approach translates the SPARQL query into SQL in order to benefit from the Spark SQL interface and its native optimizer, which we further enhance to offer better results.

Partition Discovery.
In the partition discovery module, we create an index of the partitions that should be accessed for answering the input query, called the Query Index. Specifically, we detect the first-level partitions and the corresponding vertical partitions that include information to be used for processing each triple pattern of the query, exploiting the available indexes.
The corresponding algorithm takes as input a query, the indexes (presented in Section 2.1.3), and statistics on the size of the first-level partitions estimated during the partitioning procedure, and returns an index of the partitions (first-level and vertical) that should be used for each triple pattern.
Example 2.3. The creation of the Query Index for a query is a three-step process, depicted in Figure 4. On the left side of the figure, we can see the four triple patterns of the query. The first step is to map every triple pattern to its corresponding schema nodes. If a triple pattern contains an instance, then the Instance Index is used to identify the corresponding schema nodes. Next, by using the Class Index (Figure 3), we find for each schema node the partitions in which it is located (Partition IDs in Figure 4). Then, we select the smallest partition in terms of size for each schema node, based on statistics collected for the various partitions. For example, for the second triple pattern (FORTH orgPublication ?y), we only keep partition 2, since it is smaller than partition 3. For each of the selected partitions, we finally identify the vertical partitions that should be accessed, based on the predicates of the corresponding triple patterns. In the case of an unbound predicate, such as in the third triple pattern of the query (FORTH ?p ?x) in Figure 4, the VP Index is used to identify the vertical partitions in which this triple pattern could be located, based on its first-level partition (Partition ID: 2). The resulting Query Index for our running example is depicted in Figure 4.
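The three-step Partition Discovery procedure can be sketched as follows. This is a simplified sketch: each triple pattern is assumed to have already been resolved to its schema node (via the Instance Index where needed), and the data shapes are illustrative assumptions rather than the actual DIAERESIS structures.

```python
def build_query_index(patterns, class_index, vp_index, partition_size):
    """Sketch of Partition Discovery.

    patterns: list of (schema_node, predicate_or_None) per triple pattern,
              where None marks an unbound predicate.
    Returns the Query Index: one (first-level partition, vertical
    partitions) entry per triple pattern.
    """
    query_index = []
    for node, predicate in patterns:
        # Keep only the smallest candidate first-level partition,
        # based on size statistics from the partitioning phase.
        pid = min(class_index[node], key=lambda p: partition_size[p])
        if predicate is not None:
            vps = [predicate]                # bound predicate: one VP
        else:
            vps = vp_index[(pid, node)]      # unbound: consult VP Index
        query_index.append((pid, vps))
    return query_index
```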

Query Translation & Optimization.
In order to produce the final SQL query, each triple pattern is translated into one SQL sub-query. Afterward, all sub-queries are joined using their common variables. For each sub-query, the combination of the first-level partition with the vertical partition(s), based on the Query Index, is used as the table name in the "FROM" clause of the SQL query. Finally, in order to optimize query execution, we have implemented a query optimization procedure, exploiting statistics recorded during the partitioning phase, to push joins on the smallest tables - in terms of rows - to be executed first, further boosting the performance of our engine.
Example 2.4. Figure 4 shows the query processor module in action. The input of the translation procedure is the Query Index. Each triple pattern is translated into an SQL sub-query, based on the corresponding information for the first-level and vertical partitions (SQL Sub-Queries in Figure 4) that should be accessed. The name of the table of each SQL sub-query is the concatenation of the first-level and the vertical partition. In the case of an unbound predicate, such as the third triple pattern, the sub-query asks for more than one table, based on the vertical partitions that exist in the Query Index for the specific triple pattern. Finally, sub-queries are reordered by the DIAERESIS optimizer, which pushes joins on the smallest tables to be executed first - in our example, p3_type is first joined with p2_orgPublication.
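The translation and reordering logic can be sketched as follows. The pN_predicate table-naming convention follows the p3_type / p2_orgPublication example above; the exact SQL shape and column names are illustrative assumptions.

```python
def translate(query_index, table_rows):
    """Sketch of Query Translation and optimization.

    query_index: [(first_level_partition_id, [vertical_partition, ...])].
    table_rows: row-count statistics per table, from the partitioning phase.
    Returns SQL sub-queries ordered so joins on the smallest tables
    run first.
    """
    subqueries = []
    for pid, vps in query_index:
        # Table name = first-level partition + vertical partition.
        tables = ["p%d_%s" % (pid, vp) for vp in vps]
        # An unbound predicate yields several tables (one per VP).
        sql = " UNION ALL ".join("SELECT s, o FROM %s" % t for t in tables)
        subqueries.append((min(table_rows[t] for t in tables), sql))
    # Push joins on the smallest tables first.
    subqueries.sort(key=lambda entry: entry[0])
    return [sql for _, sql in subqueries]
```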

DEMONSTRATION
The purpose of the demonstration is primarily to understand the design choices behind DIAERESIS and how decisions on partitioning strategy and indexing affect efficiency and data access for query answering, as well as data replication. The code of the DIAERESIS system, along with the datasets and workloads, is available in our GitHub repository2.
We are going to use one real-world RDF dataset and one synthetic benchmark. DBpedia v3.8 occupies 29.1GB of storage; for the demonstration, we use a set of 112 BGP queries generated by the FEASIBLE benchmark from real query logs. LUBM [4] is a widely used synthetic benchmark for evaluating semantic web repositories, modeling information about universities and providing 14 queries. For the demonstration, we will use three LUBM versions (30.1GB, 46.4GB, and 223.2GB), enabling us to examine the behavior of the system as data grows.
The demonstration will proceed in the following steps: (i) DIAERESIS Partitioning. The demonstration will start by presenting DIAERESIS, explaining the various choices made for data partitioning, sub-partitioning, and indexing. The user will be able to select the number of partitions to be created, and information will be presented on the distribution of the data across the various partitions, the overall replication factor, as well as the space occupied by the individual vertical sub-partitions. As such, the user will be able to understand the novelty of our partitioning algorithm and how the data layout affects the above dimensions.
(ii) DIAERESIS Query Answering. In this step, we will explain how query answering is executed and the impact of the number of partitions on query efficiency. Various query workloads will be used, and the query execution time and the size of the data accessed will be reported, both for individual queries and for the whole workload. This will enable us to directly contrast the number of partitions, the induced overall storage overhead, and the reduction of the space in vertical sub-partitions, showing that query efficiency is subsequently improved.
(iii) Alternative partitioning schemes. Then, other partitioning schemes will be explained, including vertical partitioning and variations such as extended vertical partitioning. As the partitioning scheme dictates the indexing scheme that should be available for efficient query answering, indexing options will be discussed as well. In this stage, DIAERESIS will be contrasted with SPARQLGX [3], S2RDF [6], and WORQ [5], using the same datasets and query workloads. As already mentioned, SPARQLGX uses a vertical partitioning strategy, S2RDF implements an extended version of the classic vertical partitioning technique, called ExtVP, that pre-computes semi-joins, and WORQ uses vertical partitioning tables and is based on bloom joins using bloom filters. The information about the data accessed, execution time, and storage overhead for all systems will be presented and discussed.

Figure 1
Figure 1 presents an overview of the DIAERESIS architecture, along with its internal components. Starting from the left side of the figure, the input RDF dataset is fed to the DIAERESIS Partitioner in order to partition it. For each one of the generated first-level partitions, vertical partitions are created and stored in HDFS. Along with the partitions and vertical partitions, the necessary indexes are produced as well. Based on the available partitioning scheme, the DIAERESIS Query Processor receives and executes input SPARQL queries, exploiting the available indexes. In the sequel, we will analyze the building blocks of the system.

Figure 3 :
Figure 3: Instance, Class, and VP indexes for our running example.