IJBSTR REVIEW PAPER VOL 1 [ISSUE 7] JULY 2013
ISSN 2320 – 6020
A Survey on Dynamic Replication Strategies for Improving Response Time in Data Grids

Ashish Kumar Singh1, Shashank Srivastava2 and Udai Shanker3
Department of Computer Science & Engineering, Madan Mohan Malaviya Engineering College, Gorakhpur-273 010
Email: ashi001.ipec@gmail.com1, shashank07oct@gmail.com2 and udaigkp@gmail.com3

ABSTRACT: Replication is the process of creating exact copies of data to improve data availability, and it is well suited to distributed database systems. In a large-scale system such as a data grid, dynamic replication helps reduce both bandwidth consumption and access latency, and can thereby improve response time. Different replication strategies can be defined depending on when, where, and how replicas are created and destroyed. A data grid, a prime example of a distributed database system, is a distributed collection of storage and computational resources that is not bounded within a geophysical location. Any replication scheme for a data grid must account for geographical and temporal locality.

KEYWORDS: Data grid, Replication, DDBS, Scalability.

Introduction

In recent years, distributed databases have become an important area of information processing, and their importance continues to grow rapidly. In a distributed database, sites are interconnected through a network, and managing data in this fully interconnected environment requires a method that prevents data availability problems from arising. Such an interconnected environment forms a data grid: a collection of huge amounts of data located at multiple sites, where each site can exercise its own administrative control over who may access the data. Data replication is a method that addresses the problems of accessing data from a server by improving both performance and data availability. There are two principal replication schemes: active replication and passive replication. In a replicated environment, copies of data are hosted by multiple sites, and increasing the number of replicas enhances system performance by improving the locality of data. A data grid exhibits both temporal and geographical locality: temporal locality means that files popular in the past will be accessed more in the future, while geographical locality means that files recently accessed by a client are likely to be accessed by nearby clients.
The challenges in file replication include availability, reliability, cost, scalability, throughput, network traffic, response time, and autonomous operation. The purpose of a distributed file system is to allow users of physically distributed computers to share data and storage resources through a common file system. The main advantages of replication are [5]:

1. Improved availability: in case of a failure of a node, the system can obtain the data from another site that holds a replica.

2. Improved performance: since the data is replicated among several nodes, a user can obtain data from the nearest node, or from the node that is best in terms of workload.

Data replication is therefore very attractive for increasing system throughput and providing fault tolerance. However, keeping data copies consistent remains a challenge.

LITERATURE REVIEW OF REPLICATION ALGORITHMS
Replication involves specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. The replication process can be complex and time-consuming depending on the size and number of the distributed databases, and it can require substantial time and computing resources [14]. Several algorithms have been proposed by different authors to address the problems of the replication process; they are surveyed below.

Dynamic Group Protocol [1]

In this paper, published in 1992, the authors developed the Dynamic Group (DG) protocol, which adapts itself to changes in site availability and network connectivity, allowing it to tolerate n - 2 successive replica failures.
The DG protocol is designed to operate in distributed environments in which some of the sites contain full replicas of data objects; these sites can fail and can be prevented from exchanging messages by failures of the communication subnet. The protocol achieves both fast access and high data availability. It organizes the replicas of a data object into small groups of equal size that are not necessarily disjoint; these groups correspond to the columns of the grid protocol, and the set of all groups for a given data object constitutes the quorum set Q for that object. When the failure of a site is detected, the groups are dynamically rearranged to protect the availability of the replicated data object against subsequent failures. The number of groups may therefore decrease as sites fail, but the number of replicas in each group remains constant. The protocol defines two rules for maintaining quorums:

Rule 1 (Write Rule). The quorum for a write operation consists of one complete group of live replicas from the quorum set Q and one live replica from every other group in Q.

Rule 2 (Read Rule). The quorum for a read operation consists of either one complete group of live replicas or one live replica from every group in Q.

The DG protocol is an efficient replication control protocol for managing replicated data objects that have more than five replicas. Like the grid protocol, it requires only O(√n) messages per access to enforce mutual consistency among n replicas. It differs from the grid protocol by reorganizing itself every time it detects a change in the number of available sites or in the connectivity of the network. As a result, it can tolerate n - 2 successive replica failures and provides data availability comparable to that of the dynamic linear voting protocol. As future work, more remains to be done to evaluate the impact of simultaneous failures and network partitions, and to devise grouping strategies that minimize the likelihood that any such event could disable access to the replicated data.
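To make the two quorum rules concrete, the following is a minimal Python sketch, assuming groups are represented as sets of replica identifiers and that liveness information is already available; the function names and data layout are illustrative, not taken from the paper.

```python
# A minimal sketch of the Dynamic Group protocol's quorum rules [1].
# Group formation and the dynamic reorganization step are omitted;
# Q is the quorum set (a list of groups), live is the set of live replicas.

def has_read_quorum(Q, live):
    """Read rule: one complete live group, or one live replica from every group."""
    one_complete_group = any(group <= live for group in Q)
    one_from_every_group = all(group & live for group in Q)
    return one_complete_group or one_from_every_group

def has_write_quorum(Q, live):
    """Write rule: one complete live group plus one live replica in every other group."""
    for g in Q:
        if g <= live and all(other & live for other in Q if other is not g):
            return True
    return False

# Example: three groups of two replicas over sites 1..6.
Q = [{1, 2}, {3, 4}, {5, 6}]
print(has_read_quorum(Q, {1, 2}))         # True: group {1, 2} is fully live
print(has_write_quorum(Q, {1, 2, 3, 5}))  # True: {1, 2} complete, one live in each other group
print(has_write_quorum(Q, {1, 3, 5}))     # False: no complete live group
```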
A Pure Lazy Technique for Scalable Transaction Processing in Replicated Databases [2]

In this 2005 paper, the authors present a pure lazy replication technique. Replica synchronization can be classified into two categories: eager and lazy. In eager synchronization, all copies of a data item are updated by a single transaction, which becomes a problem when any one replica is unavailable; for this reason, the authors chose lazy synchronization. Each update transaction executes at a primary site, while each read-only transaction executes at a secondary site. The performance of the proposed algorithm is compared with that of the BLOCK algorithm and of algorithm ALG-1SR, which provides only global serializability (1SR) and not strong session 1SR. ALG-1SR provides no session guarantees and simply routes all update transactions to a primary site and all read-only transactions to a secondary site; it never blocks transactions itself, though they may be blocked by the local concurrency control at their execution site, and it is an implementation of the DAG(WT) protocol [6]. The technique in this paper, a pure lazy approach, employs lexicographically ordered vectors to avoid transaction inversions and allows scalability of the primary database through partitioning and replication. The authors studied the performance of the algorithm and found that its cost is almost the same as that of 1SR, which does not prevent transaction inversions. They conclude that the solution is a viable technique for achieving scalability while preventing transaction inversions in lazy replicated database systems.
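The core idea of avoiding transaction inversions is that a secondary site must not serve a read-only transaction until it has applied every update the session has already observed. The sketch below illustrates that freshness check in a simplified, componentwise form; it is not the paper's exact lexicographic-vector construction, and all names and structures here are assumptions for illustration.

```python
# A minimal sketch of session-freshness checking in the spirit of [2].
# Each vector maps a partition id to the last update sequence number
# applied (at the secondary) or observed (by the session).

def dominates(applied, seen):
    """True if the secondary has applied everything the session has seen."""
    return all(applied.get(part, 0) >= seq for part, seq in seen.items())

class Secondary:
    def __init__(self):
        self.applied = {}  # partition id -> last update sequence applied here

    def can_serve(self, session_seen):
        # A read-only transaction waits (or is redirected) until this
        # replica is at least as fresh as the session's view.
        return dominates(self.applied, session_seen)

s = Secondary()
s.applied = {"p1": 5, "p2": 3}
print(s.can_serve({"p1": 5}))           # True: replica is fresh enough
print(s.can_serve({"p1": 6, "p2": 3}))  # False: session saw a newer p1 update
```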
Dynamic Replica Management for Data Grids [3]

In this 2010 paper, the authors present DRCPS (Dynamic Replica Creation, Placement, and Selection), an algorithm covering the dynamic creation and placement of replicas that can automatically maintain data according to its status. Replica creation decides which file is to be replicated and how many replicas are to be created, and it also reduces unnecessary replication. Replica placement deals with placing replicas in appropriate locations so as to reduce the placement cost; two factors are considered for placement, the response time and the bandwidth. Using this algorithm, the mean job execution time can be minimized and network usage is effective. It is implemented using OptorSim, a data grid simulator developed by the European Data Grid project. The authors use dynamic replication to conserve bandwidth in a hierarchical environment. The replica selection procedure decides whether to select the replica from the master site, from the region header, or from a neighbouring site, and the decision is based on the Weighted Euclidean Distance. As future work, the authors plan to consider additional parameters for placing the replicas, to extend replica selection with parameters such as security, and to support site dynamism so that sites can join and leave the grid at any time.
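The selection step can be sketched as below: each candidate site (master, region header, or neighbour) is scored by a weighted Euclidean distance over the two factors the paper names, response time and bandwidth. The weights, the inversion of bandwidth so that lower scores are better, and the candidate records are assumptions of this sketch, not details from the paper.

```python
# A minimal sketch of Weighted-Euclidean-Distance replica selection,
# in the spirit of DRCPS [3]. Lower distance is better.
import math

def weighted_distance(candidate, weights):
    # Bandwidth is inverted so that high bandwidth lowers the distance
    # (an assumption of this sketch).
    rt, bw = candidate["response_time"], candidate["bandwidth"]
    return math.sqrt(weights["rt"] * rt ** 2 + weights["bw"] * (1.0 / bw) ** 2)

def select_replica(candidates, weights):
    """Pick the master, region header, or neighbour with the smallest distance."""
    return min(candidates, key=lambda c: weighted_distance(c, weights))

candidates = [
    {"site": "master",        "response_time": 120.0, "bandwidth": 100.0},
    {"site": "region_header", "response_time": 40.0,  "bandwidth": 50.0},
    {"site": "neighbour",     "response_time": 25.0,  "bandwidth": 10.0},
]
print(select_replica(candidates, {"rt": 1.0, "bw": 1.0})["site"])
```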
Research on Replication in Unstructured P2P Networks [4]

The research in this paper is based on subdividable areas. The authors propose a new mechanism called the Junction Replication Method (JRM) for unstructured, decentralized P2P networks, which reflect the most common P2P networks in real environments. A request threshold is set for each node: if a normal node's requests exceed that threshold, a replica is sent to it directly, preventing nodes from becoming overloaded. Improvements are still needed. First, the security of JRM must be considered, as a node's location is easily exposed. Second, as computing hardware develops, the gap between super nodes and normal nodes shrinks, so methods are needed to spread replicas evenly and decrease the pressure on super nodes; better load balancing across the whole network is the ultimate goal.

The FLARE Strategy [9]

In this 2012 paper, the authors present a dynamic data replication strategy for Video-on-Demand (VoD) servers, built on a proposed hybrid storage architecture for a large-scale VoD system. In this strategy, customers are divided into regional networks, and each regional network has one or more server clusters known as units. In the VoD server architecture, the videos are primarily stored on hard disks, and replicas of the popular videos are stored on flash SSDs. In FLARE, read requests for the first ten minutes of popular videos that have been replicated are served by flash SSD. The number of HDD servers in the hybrid storage system is much larger than the number of flash SSD servers, and the flash SSD servers and HDD servers together form a cluster. Each flash SSD is of a fixed size; the SSDs are arranged in an array in an SSD node, and various SSD parameters, such as page size, number of disks used, and number of planes, can be modified. The performance of FLARE was evaluated by comparing it with the conventional disk striping strategy for VoD servers. It is observed that an increase in the number of requests greatly improves performance, and the gap between the total time taken to serve the requests by FLARE and by sequentially striped HDD nodes widens as the number of nodes increases. In particular, FLARE improved the performance of the I/O system by up to 23% compared with the traditional striped HDD system, and it consistently performs better than the conventional HDD disk striping technique. In the future, this work can be extended to show that energy consumption can also be optimized by using flash SSD nodes alongside HDD nodes.
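The routing decision at the heart of FLARE can be sketched as follows: requests that fall within the first ten minutes of a popular, replicated video go to a flash SSD node, and everything else is served from HDD. The data structures and function names are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of FLARE-style request routing [9].

PREFIX_SECONDS = 10 * 60  # the first ten minutes of popular videos live on flash SSD

def route_request(video_id, offset_seconds, ssd_replicated):
    """ssd_replicated: set of videos whose ten-minute prefix is replicated on SSD."""
    if video_id in ssd_replicated and offset_seconds < PREFIX_SECONDS:
        return "ssd"
    return "hdd"

ssd_replicated = {"popular_movie"}
print(route_request("popular_movie", 120, ssd_replicated))  # ssd: within the prefix
print(route_request("popular_movie", 900, ssd_replicated))  # hdd: past the prefix
print(route_request("rare_movie", 120, ssd_replicated))     # hdd: not replicated
```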
Impact of Peer-to-Peer Communication on Real-Time Performance of File Replication Algorithms [11]

In this study, the performance of four replication algorithms, two from the literature and two new ones, is evaluated. For this evaluation, a process-oriented, discrete-event-driven simulator was developed, and a detailed set of simulation studies was conducted using it; the results are presented to elaborate on the real-time performance of these replication algorithms. These initial yet detailed results on the impact of peer-to-peer communication on replication algorithms, in terms of real-time grid performance, motivate the development of more sophisticated replication algorithms that make better use of grid resources, which is a topic for future research.

PHFS (Predictive Hierarchical Fast Spread) [10]

In this paper, the authors design a replication technique, PHFS, that decreases the latency of data access. It is an extension of Fast Spread, presented by Ranganathan et al. [12]. The algorithm uses predictive techniques to forecast the future usage of files and then pre-replicates them in a hierarchical data grid along the path from source to client. It works in two phases: in phase one, it builds file access logs by collecting access information from all over the system; in phase two, it applies data mining techniques such as clustering and association rule mining to extract useful knowledge, such as clusters of files that are accessed together or the most frequent sequential access patterns. From this knowledge a predictive working set (PWS) is formed: whenever a client requests a file, PHFS finds the PWS of that file and replicates all members of the PWS, along with the requested file, on the path from source to client. The PHFS method is most suitable for applications in which clients keep working in the same context for a long period and client requests are not random, making it well suited to scientific applications in which researchers work on a common project.
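The pre-replication step of PHFS can be sketched as below: on a request, the file and its predictive working set are placed on every tier from source to client. The PWS table, path, and store here are illustrative assumptions; in PHFS the PWS would be mined from the access logs in phase two.

```python
# A minimal sketch of PHFS-style pre-replication along the source-to-client path [10].

pws = {  # predictive working sets, assumed mined via clustering / association rules
    "geometry.dat": {"mesh.dat", "boundary.dat"},
}

def replicate_on_path(requested, path, store):
    """Place the requested file and its PWS members on every node along the path."""
    files = {requested} | pws.get(requested, set())
    for node in path:
        store.setdefault(node, set()).update(files)

store = {}
replicate_on_path("geometry.dat", ["root", "region", "site", "client"], store)
print(store["region"])  # {'geometry.dat', 'mesh.dat', 'boundary.dat'}
```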
LALW (Latest Access Largest Weight) [13]

In this 2008 paper, the authors present a dynamic weighted data replication scheme for data grids. The proposed algorithm, LALW, selects a popular file for replication, calculates a suitable number of copies that satisfies the current requirements of the network, and selects the grid sites on which to replicate. By setting a different weight for each data access record, the importance of each record is differentiated: records from the recent past have higher weight, indicating higher reference value, while records from the distant past have lower reference value. The algorithm is based on a hierarchical architecture with a dynamic replication policymaker (DRP) at the centre of the structure, which is solely responsible for making the policy for replicating data on suitable sites. A cluster header manages the site information within each cluster, and the policymaker collects information about accessed files from all cluster headers. Each site maintains a detailed record for each file, stored in the form <timestamp, FileId, ClusterId>; each cluster header also maintains a record in the form <FileId, ClusterId, Number>; and the DRP maintains a record in the form <FileId, Number>. At constant time intervals, the policymaker gathers file information from all cluster headers. Information gathered in different time intervals carries different weights, to distinguish the importance of historical records; the rule for setting weights uses the concept of half-life. Each cluster thus has a cluster header maintaining local information, while the DRP maintains global information. The performance of the algorithm is measured using the grid simulator OptorSim, which provides a modular framework for simulating a real data grid environment. The simulation results show that LALW achieves an average job execution time similar to that of the LFU optimizer, but excels in terms of effective network usage. As future work, the authors intend to further reduce the job execution time, with two factors to consider. One is the length of a time interval: if it is too short, the information about data access history is insufficient; if it is too long, the information may be outdated and useless. The other is the base of the exponential decay: if the base is larger than 1 but smaller than 2, the weights decay more slowly, so the data access history contributes more strongly to identifying the popular files.
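The half-life weighting can be sketched as below: access counts from older intervals are discounted exponentially, so recent records dominate the popularity ranking. The interval data and the choice of decay base are illustrative assumptions of this sketch.

```python
# A minimal sketch of LALW-style weighted access counting [13].

def weighted_popularity(interval_counts, base=2.0):
    """interval_counts: per-file access counts per interval, newest first."""
    totals = {}
    for age, counts in enumerate(interval_counts):
        weight = base ** (-age)  # newest interval: weight 1, then 1/2, 1/4, ...
        for file_id, n in counts.items():
            totals[file_id] = totals.get(file_id, 0.0) + weight * n
    return max(totals, key=totals.get)  # the popular file to replicate

history = [
    {"a": 3, "b": 14},  # current interval
    {"a": 9, "b": 1},   # one interval ago, weight 1/2
    {"a": 20},          # two intervals ago, weight 1/4
]
print(weighted_popularity(history))  # 'b': recent accesses outweigh old ones
```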
Dynamic Replication Based on Availability and Popularity [5]

In this paper, the authors propose a dynamic replication algorithm based on two parameters, the availability and the popularity of data in the data grid, following a hierarchical architecture. In the hierarchical topology used, the root binds together the different clusters, so the longest distance between two clusters consists of two hops. The root holds a list of the replicas that exist in the system: if a request arrives at a cluster and cannot be satisfied within it, the Cluster Head (CH) forwards the request to the root, which in turn directs it to a cluster that holds a copy of the data. Availability and popularity are handled through the concept of a primary copy, using a Boolean variable D(i): if D(i) = False, the replica was created by a cluster head and cannot be deleted; if D(i) = True, the replica was created by a node and can be deleted. The node with the smallest degree of responsibility and good stability is the best responsible node, while the node with the greatest number of accesses to a given piece of data is the best client. If a node receives a request to store a new replica, it reacts according to the type of replication: replication based on availability targets the best responsible node, while replication based on popularity targets the best client node. To validate the proposed approach, the authors developed the FTSIM simulator in Java and verified through it that the approach achieves a good response time in comparison with no replication. The conclusion is that the system must guarantee a minimum level of data availability and then improve availability according to the demand for the data. As future work, the approach can be extended by introducing intelligence into the decision-making and by integrating agents within each cluster to implement cooperative distributed intelligence among the agents.
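The primary-copy flag D(i) described above can be sketched as follows: replicas created by a cluster head (D(i) = False) are protected, while replicas created by ordinary nodes (D(i) = True) may be evicted. The class and method names are illustrative assumptions.

```python
# A minimal sketch of the D(i) deletability flag from [5].

class Replica:
    def __init__(self, file_id, created_by_cluster_head):
        self.file_id = file_id
        # D(i) = False -> created by the cluster head, cannot be deleted;
        # D(i) = True  -> created by a node, eligible for deletion.
        self.deletable = not created_by_cluster_head

def evict_candidates(replicas):
    """Only node-created replicas may be dropped to make room."""
    return [r for r in replicas if r.deletable]

replicas = [Replica("f1", created_by_cluster_head=True),
            Replica("f2", created_by_cluster_head=False)]
print([r.file_id for r in evict_candidates(replicas)])  # ['f2']
```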
Comparison of Dynamic Replication Strategies

Table 1 compares the dynamic replication algorithms discussed in the sections above on the basis of different parameters: availability, reliability, scalability, type of management, network traffic, response time, and throughput. Different algorithms perform well on different parameters; among them, the LALW algorithm still leaves considerable room for improving system throughput, so from a research point of view LALW is a good candidate for further work.
Table 1: Comparison of the different dynamic replication algorithms

CONCLUSION

In this paper, we have reviewed different file replication algorithms on the basis of different parameters. The work of each algorithm has been discussed, along with its planned future work and the simulation results reported for the parameters its authors chose. The algorithms have been compared with respect to various parameters such as availability, reliability, scalability, throughput, network traffic, and response time. In our future work, we intend to increase the throughput, decrease the response time, and improve the availability, scalability, and reliability of large distributed databases, and to carry this work onto the data grid.
REFERENCES

1. Jehan-François Pâris and Perry Kim Sloope, "Dynamic Management of Highly Replicated Data," IEEE, 1060-9857/92, 1992.

2. Khuzaima Daudjee and Kenneth Salem, "A Pure Lazy Technique for Scalable Transaction Processing in Replicated Databases," in Proceedings of the 11th International Conference on Parallel and Distributed Systems (ICPADS'05), IEEE, 2005.

3. K. Sashi and Antony Selvadoss Thanamani, "Dynamic replication in a data grid using a Modified BHR Region Based Algorithm," Future Generation Computer Systems, 27(2): 202-210, 2011.

4. H. Sato, S. Matsuoka, and T. Endo, "File clustering based replication algorithm in a grid environment," in Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009. ISBN 978-0-7695-3622-4.

5. Bakhta Meroufel and Ghalem Belalem, "Dynamic Replication Based on Availability and Popularity in the Presence of Failures," Journal of Information Processing Systems (JIPS), Vol. 8, No. 2, pp. 263-278, June 2012.

6. Y. Breitbart, R. Komondoor, R. Rastogi, S. Seshadri, and A. Silberschatz, "Update propagation protocols for replicated databases," in Proc. SIGMOD, pp. 97-108, 1999.

7. G. F. Hughes, J. F. Murray, and K. Kreutz-Delgado, "Improved disk drive failure warnings," IEEE Transactions on Reliability, 51(3): 350-357, 2002.

8. S. Venugopal, R. Buyya, and K. Ramamohanarao, "A Taxonomy of Data Grids for Distributed Data Sharing, Management, and Processing," ACM Computing Surveys, 38(1): 1-53, 2006.

9. Ramya Manjunath and Tao Xie, "Dynamic Data Replication on Flash SSD Assisted Video-on-Demand Servers," IEEE, 2012.

10. L. M. Khanli, A. Isazadeh, and T. N. Shishavan, "PHFS: A dynamic replication method to decrease access latency in the multi-tier data grid," Future Generation Computer Systems, 2010.

11. M. Atanak and A. Dogan, "Impact of peer-to-peer communication on real-time performance of file replication algorithms," in Proceedings of the 23rd International Symposium on Computer and Information Sciences, 27-29 Oct. 2008.

12. K. Ranganathan and I. Foster, "Design and evaluation of dynamic replication strategies for a high performance data grid," in International Conference on Computing in High Energy and Nuclear Physics, 2001.

13. Ruay-Shiung Chang, Hui-Ping Chang, and Yun-Ting Wang, "A Dynamic Weighted Data Replication Strategy in Data Grids," pp. 414-421, IEEE, 2008.

14. http://en.wikipedia.org/wiki/Distributed_database

15. A. Tamir, "An O(pn^2) algorithm for the p-median and related problems on tree graphs," Operations Research Letters, vol. 19, pp. 59-64, 1996.

16. W. Hoschek, F. J. Jaen-Martinez, A. Samar, H. Stockinger, and K. Stockinger, "Data management in an international data grid project," in Proceedings of the GRID Workshop, pp. 77-90, 2000.