Adaptive Distributed RDF Graph Fragmentation and Allocation based on Query Workload
Abstract: As the volume of Resource Description Framework (RDF) data continues to grow, a distributed RDF database system is needed to manage it. Such a system commonly partitions the RDF data into parts, called fragments, which are then distributed across sites. The distribution design thus comprises two steps: fragmentation and allocation. In this study, we use the query workload to guide fragmentation and allocation, with the aim of reducing the communication cost during SPARQL query processing. Specifically, we adaptively maintain a set of frequent access patterns (FAPs) that reflect the characteristics of the workload while ensuring data integrity and the approximation ratio. Based on these FAPs, we propose three fragmentation strategies, namely vertical, horizontal and mixed fragmentation, to divide RDF graphs while meeting different query processing objectives. After fragmentation, we discuss how to allocate the fragments to sites while keeping the fragments balanced. Finally, we discuss how to process queries based on the results of fragmentation and allocation. Experiments over large RDF datasets confirm the superior performance of our proposed solutions.
Existing system: As RDF datasets now exceed the capacity of a single machine, designing a distributed RDF database system is essential. In a distributed RDF system, given an RDF graph G, the first issue is "data fragmentation and allocation": partitioning G into several parts, called fragments, and then distributing (allocating) them among different sites. An important concern during data fragmentation and allocation is how to reduce the communication cost between sites during distributed query evaluation. Because evaluating a SPARQL query is a subgraph (homomorphism) matching problem that exhibits strong locality, we propose a local pattern-based fragmentation strategy in this study.

Proposed system: In this study, we investigate an efficient distributed SPARQL query engine. The focus of this work is "data fragmentation and allocation" for RDF repositories. As discussed in Section 1, the proposed fragmentation is based on mined frequent patterns. In practice, the number of patterns is often much larger than the number of sites. Hence, drawing on the lessons of distributed relational database design, we separate the distribution design into fragmentation and allocation to better manage the complexity of the problem.

Advantages: The vertical fragmentation (VF) and horizontal fragmentation (HF) strategies are designed for different purposes. VF exploits the locality of queries to improve throughput, whereas HF reduces the query response time by parallelizing subqueries. In practice, these two goals often need to be balanced, so we also design a mixed fragmentation (MF) strategy.

Disadvantages: We formalize the FAP selection problem (Section 4.1) and prove that it is NP-hard. We therefore propose a heuristic algorithm that guarantees data integrity and the approximation ratio (Theorem 2). Furthermore, to adapt continuously to changes in the workload, we introduce an incremental technique for workload-guided data fragmentation (see Section 5). Our algorithms achieve satisfactory performance (see the experiments in Section 9).

Modules:

FAP selection and maintenance: Given an FAP, we build a fragment by collecting all of its matches in the RDF graph, which reduces the communication cost. Because selecting all FAPs could incur a high space cost owing to data replication, FAP selection must trade off performance gain against space cost.

FAP Selection: Consider a frequent pattern p. If a fragment is generated from the graph induced by the matches of p, then evaluating any query that contains p can be sped up by using this fragment. The more queries an FAP hits, the greater the gain during query processing. Thus, one factor in defining the Benefit (Definition 7) of selecting a pattern p is the number of queries that p hits. Furthermore, owing to the high space cost, it is not necessary to generate fragments for all frequent patterns.
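The trade-off between query hits and space cost can be illustrated with a small greedy sketch. This is not the paper's heuristic and does not reproduce Definition 7; it assumes a hypothetical benefit score (newly hit queries per unit of fragment size) and a hypothetical storage budget. The names Pattern, hits, size, and select_patterns are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    hits: frozenset  # ids of workload queries that contain this pattern
    size: int        # assumed space cost of its fragment (must be > 0)


def select_patterns(patterns, budget):
    """Greedily pick patterns by newly hit queries per unit of space.

    Illustrative stand-in for a benefit-driven selection; the scoring
    rule and the budget are assumptions, not the paper's definitions.
    """
    chosen, covered, used = [], set(), 0
    remaining = list(patterns)
    while remaining:
        best, best_score = None, 0.0
        for p in remaining:
            if used + p.size > budget:
                continue  # fragment would not fit in the remaining space
            score = len(p.hits - covered) / p.size
            if score > best_score:
                best, best_score = p, score
        if best is None:
            break  # nothing fits, or nothing hits a new query
        chosen.append(best)
        covered |= best.hits
        used += best.size
        remaining.remove(best)
    return chosen


# Toy workload with queries 1..4 and three candidate patterns.
candidates = [
    Pattern("a", frozenset({1, 2, 3}), 3),
    Pattern("b", frozenset({2, 3}), 4),
    Pattern("c", frozenset({4}), 2),
]
selected = [p.name for p in select_patterns(candidates, budget=10)]
print(selected)
```

In this toy workload, pattern b is never chosen because every query it hits is already covered by a, which mirrors the observation that fragments are not needed for all frequent patterns.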
To see why some frequent patterns are redundant, suppose a pattern p is frequent. Its child pattern p′ (i.e., p′ ⊆ p) is then also frequent by the Apriori property. Let Q′ and Q denote the sets of all queries in the workload that contain p′ and p, respectively. If Q′ is similar to Q, i.e., Q′ ∩ Q covers most of Q′, then a SPARQL query that contains p′ (the child pattern) has a high probability of containing p (the parent pattern) as well. In that case, we only need to maintain the fragment for the larger pattern p to save space, because the larger pattern helps answer a larger subquery of the queries in Q.

DFS Coding: DFS coding translates a graph into a unique edge sequence by performing a depth-first search (DFS). Each vertex is subscripted by its discovery time in the DFS. The forward edges are the edges in the DFS tree, and the backward edges are the remaining ones. The backward edges are ordered as follows. Given a vertex v, all of its backward edges appear after the forward edge pointing to v. Given a vertex vi and two of its backward edges vi→vj and vi→vk, if j < k, then edge vi→vj appears before edge vi→vk. The complete edge sequence formed in this way is called a DFS code. More details on DFS coding can be found in the cited literature, and an example is provided in Appendix 3.

Vertical Fragmentation: For VF, we put the homomorphism matches of the same FAP into the same fragment. As a query graph often contains only a few FAPs, only the sites that store relevant fragments need to be accessed to find matches. Filtering out irrelevant fragments improves query performance. Furthermore, sites that do not store relevant fragments can evaluate other queries in parallel, which improves the total throughput of the system. In summary, the VF strategy exploits the locality of SPARQL queries to improve both query response time and throughput. The experimental results in Section 9 confirm this argument.
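The VF idea described above can be sketched as follows. This is a minimal illustration, assuming that the matches of each FAP have already been computed by some subgraph matcher and that each fragment is allocated to exactly one site; the function names, the FAP labels, and the site identifiers are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def build_vertical_fragments(matches):
    """Group matched triples by the FAP they match (one fragment per FAP).

    `matches` is an iterable of (fap_id, triples) pairs, where `triples`
    is a list of (subject, predicate, object) tuples for one match.
    """
    fragments = defaultdict(set)
    for fap_id, triples in matches:
        fragments[fap_id].update(triples)
    return dict(fragments)


def relevant_sites(query_faps, allocation):
    """Return the sites that store a fragment for some FAP in the query.

    `allocation` maps each FAP id to the (assumed single) site holding it.
    """
    return {allocation[f] for f in query_faps if f in allocation}


# Precomputed matches for two hypothetical FAPs.
matches = [
    ("author_of", [("alice", "wrote", "book1"), ("book1", "type", "Book")]),
    ("author_of", [("bob", "wrote", "book2"), ("book2", "type", "Book")]),
    ("lives_in", [("alice", "livesIn", "paris")]),
]
fragments = build_vertical_fragments(matches)
allocation = {"author_of": "site1", "lives_in": "site2"}

# A query containing only the lives_in FAP touches a single site;
# the other site stays free to evaluate other queries in parallel.
needed = relevant_sites({"lives_in"}, allocation)
idle = set(allocation.values()) - needed
print(needed, idle)
```

Keeping one fragment per FAP makes the routing step a simple lookup, which is what allows sites without relevant fragments to be skipped.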