J. Comp. & Math. Sci. Vol.3 (6), 563-569 (2012)
A Study of Query Optimization in Non-Centralized Environments UPENDRA NATH TRIPATHI1 and HIMANSHU PANDEY2 1
Department of Computer Science Department of Mathematics & Statistics DDU Gorakhpur University, Gorakhpur, U. P., INDIA. 2
(Received on: October 7, 2012) ABSTRACT In the present study, we have gone through issues and techniques related to optimizing queries in non- centralized environments. Among all parallel architectures, the shared nothing and the shared memory paradigms have emerged as the most viable ones for database query processing. The main difference between distributed and parallel database is only in terms of geographic situation. In this study we have focused on Method- Structure Space and the Planner module. Keywords: Distributed Database, Parallel Database, Query Processing, Query Optimization.
1. INTRODUCTION We will study about query optimization techniques and developments in the area of non centralized environment e.g. parallel databases and distributed databases separately and will try to give some inferences from that. 2.1 Parallel Databases Highly parallel database systems are beginning to displace traditional mainframe computers for the largest database and transaction processing task. Boral H. and D. DeWitt presented a study about predicting
the demise of database machines. Most database machine research had focused on specialized, often trendy, hardware such as CCD memories, bubble memories and optical disks. None of these technologies fulfilled their promises; so there was a sense that conventional CPUs, electronic RAM and moving-head magnetic disks would dominate the scene for many years to come. While predictions were fairly accurate about the future of hardware, the critics were certainly wrong about the overall future of parallel database systems. Parallelism in database can be achieved through several mechanisms. First of all data can be fortified across multiple disks after parallel I/O,
Journal of Computer and Mathematical Sciences Vol. 3, Issue 6, 31 December, 2012 Pages (557-663)
564
Upendra Nath Tripathi, et al., J. Comp. & Math. Sci. Vol.3 (6), 563-569 (2012)
Furthermore, relational operations e.g. sort, join, aggregation can be executed in parallel and data can be also partitioned and each processor can work independently on its own partition. Different queries can be run in parallel with each other and in case of confliction, concurrency control takes care of it. Parallelism in terms of I/O parallelism reduces the time required to retrieve relations from disk by partitioning and relations may be stored on multiple disks. Disk partitioning for I/O parallelisms can be done through Round Robin partitioning, Hash partitioning and/or Range partitioning. Round robin technique is best suited for
sequential scan of entire relation on each query and since all disks have almost an equal number of tuples, retrieval work is thus well balanced between disks. Further, Hash partitioning is good for sequential access and point queries on partitioning attribute and it is not suitable for range queries. Range partitioning provides data clustering by partitioning attribute value and it is good for sequential access and good for point queries on partitioning attribute because only one disk needs to be accessed. Parallel database can be understand through following figures:
Central Database
Node 1 Instance 1 Process set 1
Node 2 Instance 2 Process set 2
Node n Instance n Process set n
Fig. 1.1 Using a Single Database in Parallel Mode
Database 1
Node 1 Instance 1 Process set 1
Database 2
Node 2 Instance 2 Process set 2
Database n
Node n Instance n Process set n
Fig. 1.2 Using Exclusive Database Journal of Computer and Mathematical Sciences Vol. 3, Issue 6, 31 December, 2012 Pages (557-663)
Upendra Nath Tripathi, et al., J. Comp. & Math. Sci. Vol.3 (6), 563-569 (2012)
Among all parallel architecture, the shared-nothing and the shared-memory paradigms have emerged as the most viable ones for database query processing. Thus, query optimization research has concentrated on these two. The processing choices that either of these paradigms offers represent a huge increase over the alternatives offered by method-structure space module in a sequential environment. Query optimization in parallel databases is significantly more complex than query optimization in sequential databases because cost models may be more complicated, since we must take into account partitioning costs and issues such as skew and resource contention. Scheduling of execution tree in parallel system must decide that how to parallelize each operation and how many processors to use for it and what operations to pipeline, what operations to execute independently in parallel and what operations to execute sequentially, one after the other. In parallel system, determining the amount of resources to allocate for each operation is a problem e.g. allocating more processors than optimal can result in high communication overhead. Further in this query optimization, long pipelines should be avoided as the final operation may wait a lot for inputs, while holding precious resources. Thus for this type of system, heuristics are needed because it is tough to decide from the number of parallel evaluation plans from which to choose from is much larger than the number of sequential evaluation plans. For parallel query optimization two alternative heuristics can be proposed by heuristics: (i) No pipelining and inter-operation pipelining, just parallelize every
565
operation across all processors. Finding of best plan can be much easier now with the use of standard optimization technique but with new cost model. In Vokano parallel database, exchangeoperator model can be useful in which exchange operator is introduced into query plans to partition and distribute tuples. In this, each operation works independently on local data on each processor, in parallel with other copies of the operation. (ii) First of all choose most efficient sequential plan and then choose how best to parallelize the operations in that plan, this can explore pipelined parallelism as an option. In query optimization for a parallel system, choosing a good physical organization (partitioning technique) is important to speed up queries. Besides the above, some issues must be considered in the design of parallel system to optimize queries: *
Parallel loading of data from external sources is needed in order to handle large volumes of incoming data.
*
Resilience to failure of some processors or disks because (i) Probability of some disk or processor failing is higher in a parallel system. (ii) Operation (perhaps with degraded performance) should be possible in spite of failure, and (iii) Redundancy achieved by storing extra copy of every data item at another processor.
Journal of Computer and Mathematical Sciences Vol. 3, Issue 6, 31 December, 2012 Pages (557-663)
566
Upendra Nath Tripathi, et al., J. Comp. & Math. Sci. Vol.3 (6), 563-569 (2012)
*
On-line reorganization of data and schema must be supported e.g. index construction on TB databases can take hours or days even on a parallel system.
*
There must be a support for on-line repartitioning and schema changes.
There has been much work in the database literature on designing query processing and optimization algorithms that can effectively exploit multi-processor parallelism. Apers et al. given a study about the trade offs between response time optimization and total work optimization. Here response time optimization is, that minimize the total time taken to execute a query and total work optimization is, that minimize the total work across all processors. In this context, Ganguly et al. formalize these trade offs and shown that optimizing for response time is much difficult than optimizing for total work, since the response time metric does not obey the principle of optimality. To make the parallel query optimization problem tractable, Hong and Stone broker present a two-phase approach that separates join order optimization from parallel scheduling issues. In this study they gave two-stage approach in which given a query, one first identities the optimal sequential plan for it using conventional techniques and then identifies the optimal parallelization/scheduling of that plan. Various techniques have been proposed for the second stage, but none of them claims to provide a complete and optimal answer to the scheduling operation, which remains an open research problem. In the segmented execution model, one considers only schedules that process memory-resident right-deep segments of
query plans one-at-a-time. Shekita et al. combined this model with a novel heuristic search strategy with good results for sharedmemory. In contrast to all the search-space reduction heuristics, Lanzelotte et al. dealt with both deep and bushy trees, considering schedules with independent parallelism, where all the pipelines in an execution are divided into phases, pipelines in the same phases are executed in parallel, and each phase start only after the previous phase ended. 2.2 Distributed Databases The difference between distributed and parallel DBMSs is that the former are formed by a collection of independent, semiautonomous processing sites that are connect via a network that could be spread over a large geographic area, whereas the latter are individual systems controlling multiple processors that one in the same location, usually in the same machine room. Several commercial systems are offering distributed versions of their products e.g. DB2, Informix, Sybase, Oracle, SQL server etc. A distributed database system is a collection of logically related database that co-operates in a transparent manner where transparent implies that each user within the system may access all of the data within all of the databases as if they were a single database. There should be location independence i.e. as the user is unware of where the data is located, it is also possible to move the data from physical location to another without affecting the user. Structure of a Distributed Database Management System (DDBMS) could be drawn as in following figure:
Journal of Computer and Mathematical Sciences Vol. 3, Issue 6, 31 December, 2012 Pages (557-663)
Upendra Nath Tripathi, et al., J. Comp. & Math. Sci. Vol.3 (6), 563-569 (2012)
567
Fig 1.3 Structure of Distributed DBMS
DDBMS facility reduces communication overhead as most data access is local, less expensive and performance is better. Instead of one server handling the full database, we have a collection of machines handling the same databases that improves processing power. If a server fails, then the only part of the system that is affected is the relevant local sites and the rest of the system remains functional and available. In DDBMS, it is easier to accommodate increasing the size of the global (logical) database and local autonomy can be achieved also in distributed database because the database is brought
nearer to users, this can effect a cultural change as it allows potentially greater control over local data. The cost model in DBMS assigns an estimated cost to any partial or complete plan in the search space. It also determines the estimated size of the data stream for output for every operator in the plan. In a distributed execution environment, there are two different time consumption estimates to be considered just like in parallel database system: total time and response time. The first one is the sum of time consumed by each processor, regardless of concurrency, while the latter considers that operations
Journal of Computer and Mathematical Sciences Vol. 3, Issue 6, 31 December, 2012 Pages (557-663)
568
Upendra Nath Tripathi, et al., J. Comp. & Math. Sci. Vol.3 (6), 563-569 (2012)
may be carried out concurrently. For a distributed database, in a query involving a multi-site join and, possible databases with replicated files, the DDBMS must decide where to access the data and how to proceed with the join. It can be three step process: i. Query Decomposition ii. Data Localization iii. Global optimization. Thus, other than the necessary extensions of the cost model module, the main differences between centralized and distributed query optimization are in the method-structure space module, which offers additional processing strategies and opportunities for transmitting data for processing at multiple sites. In early distributed systems, where the network cost was dominating every other cost, a key idea has been using semi joins for processing in order to only transmit tuples that would certainly contribute to join results. (P.A. Bernstein et al., L.F. Mackert el al.) An extension of that idea is using Bloom filters, which are bit vectors that approximate join columns and are transferred across sites to determine which tuples might participate in a join so that only these may be transmitted (L.F. Mackert et al.). 3. CONCLUSION In this study we have gone through several work that have been done in the area of query optimization for non centralized environments viz. parallel databases as well as distributed databases and try to draw some inferences from them. Further, we have given some suggestions also to improve query optimizations in both type of
non centralized environments viz. parallel and distributed database. REFERENCES 1. Alexander, W., et al., "Process and Dataflow Control in Distributed DataIntensive Systems," Proc. ACM SIGMOD Conf., Chicago, June 1988. October, (1983). 2. Boral, H. and D. DeWitt, "Database Machines: An Idea Whose Time has passed? A Critique of the Future of Database Machines," Proceedings of the 1983 Workshop on Database (1983). 3. DeWitt, D.J., "DIRECT - A Multiprocessor Organization for Supporting Relational Database Management Systems," IEEE Transactions on Computers, June, (1979). 4. DeWitt, D. J., Katz, R., Olken, F., Shapiro, D., Stonebraker, M. and D. Wood,"Implementation Techniques for Main Memory Database Systems", Proceedings of the 1984 SIGMOD Conference, Boston, MA, June, (1984). 5. E. Shekita, H. young and K.L. Tan, “Multi- join optimization for symmetric multiprocesors�, Proc of 19th Intl. VLDB Conf., pages 479-492, Dublin , Ireland, (1993). 6. Graefe, G., and K. Ward, "Dynamic Query Evaluation Plans", Proceedings of the 1989 SIGMOn Conference, Portland, OR, June (1989). 7. Graefe, G., and K. Ward, "Dynamic Query Evaluation Plans", Proceedings of the 1989 SIGMOn Conference, Portland, OR, June (1989). 8. Hsiao, H. I. and D. J. DeWitt, "Chained Declustering: A New Availability
Journal of Computer and Mathematical Sciences Vol. 3, Issue 6, 31 December, 2012 Pages (557-663)
Upendra Nath Tripathi, et al., J. Comp. & Math. Sci. Vol.3 (6), 563-569 (2012)
9.
10.
11.
12.
13.
14.
Strategy for Multiprocessor Database Machines", Proceedings of the 6th International Conference on Data Engineering, Feb. (1990). Lawrie, D.H., "Access and Alignment of Data in an Array Processor", IEEE Transactions on Computers, V. 24.12, Dec. (1975). L.F. Mackert and G.M. Lohman. “ R* validation and performance evaluation for distributed queries” , Proc of 12th Intl. VLDB Conf, pages 149-159, Japan (1986). Livny, M., S. Khoshafian, and H. Boral, "Multi-Disk Management Algorithms", Proceedings of the 1987 SIGMETRICS Conference, Banff, Alberta, Canada, May, (1987). P.A. Bernstein, N.Goodman, E.Wong, C.L. Reeve, and J.B. Rothnie. “Query Processing in a system for distributed databases”, ACM TODS, 6(4), pages 602-625, (1981). P. Apers, A. Hevner, and S. Yao. Optimization algorithms for distributed queries. IEEE Trans. Software Eng., 9(1), (1983). R. Lanzelotte, P. Valduriez, and M. Zait, “On the effectiveness of optimization search strategies for parallel execution
15.
16.
17.
18.
19.
20.
569
space”. Proc. Of 19th Intl. VLDB Conf, pages 493-504, Ireland (1993). S. Ganguly, W. Hasan, and R. Krishnamurthy. Query optimization for parallel execution. In SIGMOD, (1992). Selinger, P. G., et al., "Access Path Selection in a Relational Database Management System," Proceedings of the 1979 SIGMOD Conference, Boston, MA., May (1979). Stonebraker, M., "Muffin: A Distributed Database Machine," ERL Technical Report UCB/ERL M79/28, University of California at Berkeley, May (1979). Stonebraker, M., "The Case for Shared Nothing," Database Engineering, Vol. 9, No.1, 1986. Wei Hong and Michael Stonebraker. Optimization of parallel query execution plans in xprs. In PDIS, (1991). Wolf, J.L., Dias, D.M., and P.S. Yu, "An Effective Algorithm for Parallelizing SortMerge Joins in the Presence of Data Skew," 2nd International Symposium on Databases in Parallel and Distributed Systems, Dublin, Ireland, July, (1990). Zeller, H.J. and J. Gray, "Adaptive Hash Joins for a Multiprogramming Environment", Proceedings of the 1990 VLDB Conference, Australia, August (1990).
Journal of Computer and Mathematical Sciences Vol. 3, Issue 6, 31 December, 2012 Pages (557-663)