International Journal of Computer & Organization Trends – Volume 8 Number 1 – May 2014
Efficient Query Processing for Distributed Hash Tree with Less Maintenance
C. Indumathi, M.Tech IT, Prist University, India
ISSN: 2249-2593 | http://www.ijcotjournal.org

Abstract— Hash Tree is a query-efficient yet low-maintenance indexing scheme over Distributed Hash Tables (DHTs). Two novel techniques contribute to its superior performance: a clever naming mechanism and a tree summarization strategy. We present two enhancements: an extensible technique for indexing unbounded data domains, and a double-naming strategy for improving system load balance. Compared with state-of-the-art indexing schemes, Hash Tree saves maintenance cost and substantially improves query performance in terms of both response time and bandwidth consumption.

Keywords— Hashing, query efficient, less maintenance

1. INTRODUCTION

The Distributed Hash Table (DHT) is a widely used building block for scalable Peer-to-Peer (P2P) systems. It provides a simple lookup service: given a key, one can efficiently locate the peer node storing the key. By employing consistent hashing and carefully designed overlays, DHTs exhibit several advantages that fit the P2P context:

Scalability and efficiency: In a typical DHT of N peers, the lookup latency is O(log N) hops, with each peer maintaining only a small set of "neighbors."

Robustness: DHTs are resilient to the network dynamics and node failures that are common in large-scale P2P networks.

Load balancing: Load balance in DHTs can be achieved efficiently thanks to uniform hashing.

While DHTs are popular for building various P2P applications, such as large-scale data storage, content distribution, and scalable multicast/anycast services, they are extremely poor at supporting critical queries. This is primarily because data locality, which is crucial to processing such critical queries, is destroyed by the uniform hashing employed in DHTs. Two issues are critical to the performance of an over-DHT indexing scheme: query efficiency and index maintenance cost. In conventional applications where queries are more frequent than data updates, achieving query efficiency is considered the first priority. However, in P2P systems, peer joins and departures usually result in data insertions into and deletions from the system, and the peer join/departure rate can be as high as the query rate. Such data updates incur constant index updates. Thus, the cost of index maintenance becomes a non-negligible factor in evaluating system performance. This perspective, however, is not realized in existing over-DHT indexing schemes.
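To make the DHT lookup service above concrete, the following sketch shows consistent hashing: peers and keys are hashed onto the same identifier ring, and a key is stored at the first peer clockwise from its hash. This is an illustrative sketch only; the peer names and the choice of SHA-1 are assumptions, not details from this paper.

```python
import hashlib
from bisect import bisect_right

def h(s: str) -> int:
    # Hash peer names and data keys onto the same 160-bit identifier ring.
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, peers):
        # The sorted list of (peer_hash, peer_name) pairs forms the ring.
        self.ring = sorted((h(p), p) for p in peers)

    def lookup(self, key: str) -> str:
        # The key lives on the first peer whose hash follows the key's
        # hash clockwise (wrapping around to the smallest peer hash).
        i = bisect_right([ph for ph, _ in self.ring], h(key))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["peerA", "peerB", "peerC", "peerD"])
print(ring.lookup("song.mp3"))  # deterministically one of the four peers
```

Because keys and peers are hashed uniformly, each peer owns a roughly equal arc of the ring, which is the load-balancing property noted above; the same uniformity is what destroys data locality for range and k-NN queries.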
More specifically, in a distributed context each peer maintains a local view of the global index; to achieve better query performance, the common idea is to enlarge the local view and let each peer know more about the global index. In DST, the index structure remains static and is made known globally. However, this static design inherently goes against the dynamic nature of P2P systems and easily leads to load imbalance.

2. SYSTEM ANALYSIS

2.1 EXISTING SYSTEM

A DHT provides a simple lookup service: given a key, one can efficiently locate the peer node storing the key. DHTs are popular for building various P2P applications such as large-scale data storage, content distribution, and scalable multicast/anycast services. DHT indexing schemes are simple to design and implement. However, since the hashing technique employed in DHTs destroys data locality, it is not a trivial task to support complex queries in a DHT-based P2P system. In DHTs, the cost of index maintenance becomes a non-negligible factor in evaluating system performance.

2.1.1 DRAWBACKS IN EXISTING SYSTEM

DHTs do not support critical queries efficiently. Two issues are critical to the performance of an over-DHT indexing scheme: query efficiency and index maintenance cost.

2.2 PROPOSED SYSTEM

In the proposed scheme, we introduce Hash Tree, a low-maintenance yet query-efficient scheme for data indexing over DHTs. Two novel techniques contribute to its superior performance: a naming mechanism that gracefully distributes the index structure over the underlying DHT, and a tree summarization mechanism that offers each peer a scalable local view without incurring extra maintenance cost. Hash Tree can support various critical queries with near-optimal performance.

2.2.1 ADVANTAGES IN PROPOSED SYSTEM

We address both query efficiency and maintenance efficiency for data indexing over DHTs. Hash Tree requires no modification of the underlying DHT and hence possesses the virtues of simplicity and adaptability. Hash Tree is both query efficient and low maintenance; it is the first over-DHT indexing scheme with such flexibility, saving maintenance cost and substantially improving query performance in terms of both response time and bandwidth consumption.

3. PROJECT DESCRIPTION

3.1 PROBLEM DEFINITION

We address the challenging problem of how to efficiently support critical query processing, with high query efficiency and low maintenance cost, in existing DHT-based P2P systems. To
tackle this problem, we propose Hash Tree, an efficient-query, low-maintenance indexing scheme over DHTs. Hash Tree employs a novel naming mechanism and a tree summarization strategy for the graceful distribution of its index structure.

3.2 OVERVIEW

As noted in the introduction, the DHT lookup service scales to N peers with O(log N) lookup hops, is robust to the churn of large-scale P2P networks, and balances load through uniform hashing. Yet DHT-based applications remain extremely poor at supporting critical queries such as range queries and k-nearest-neighbor (k-NN) queries, because the uniform hashing destroys the data locality on which such queries rely. An over-DHT indexing scheme must therefore balance query efficiency against index maintenance cost: queries may dominate in conventional workloads, but in P2P systems peer joins and departures translate into continual data insertions and deletions, so index updates are constant and their cost is non-negligible.

To improve query performance, the common idea is for each peer to enlarge its local view of the global index. In DST, the index structure remains static and is made known globally; this static design goes against the dynamic nature of P2P systems and easily leads to load imbalance. As an alternative, RST allows for dynamic tree
growth/contraction and further employs a broadcasting mechanism to maintain its global view. However, its index maintenance cost is prohibitively high, as a single node split causes a broadcast to all other nodes, which may render the whole P2P system unscalable.

Like other over-DHT indexing schemes, the main challenge for Hash Tree is to find a mapping from data keys to DHT keys such that data locality is preserved with minimal maintenance cost. Fig. 1 gives an overview of the mapping operation. First, Hash Tree employs a space partition tree to index data keys. Then, after the partition tree is decomposed and summarized in a data structure called a leaf bucket, Hash Tree uses a novel naming function to map leaf buckets to DHT keys. In the following, we explain these two procedures in detail.

Fig. 1. System architecture of the indexing scheme.

3.3 MODULE DESCRIPTION

3.3.1 MODULE 1: DATA STORING

Hash Tree is proposed to support critical queries over existing DHTs, while exact-match queries can be answered directly and efficiently by the existing DHT infrastructure.

In the index, a data unit is called a record, and each record is identified by a data key. We assume that the data keys to be indexed fall into a bounded one-dimensional space. To assign records to the underlying DHT, each data record is associated with another key, called the DHT key. In a naive indexing scheme, one may set the DHT key directly to the data key. However, this would destroy data locality, as mentioned earlier, and lead to inefficient support for critical queries. Thus, similar to other over-DHT
indexing schemes, Hash Tree must map each data key to a DHT key while preserving data locality at minimal maintenance cost.

SPACE PARTITION TREE: As the name implies, the space partition tree (or simply the partition tree) recursively partitions the data space into two equal-sized subspaces until each subspace contains fewer than a split threshold of data keys. Only leaf nodes store data records (or just data entries with pointers to the actual records). Basically, the space partition tree is a binary tree with the structural properties listed below:

Double root. Unlike a conventional binary tree, the space partition tree has two roots. The additional root, termed the virtual root, is a virtual node above the ordinary one.

Completeness. Every tree node, except the virtual root and the leaf nodes, has two children; that is, every internal node has two children.

Fig. 2. An example of a space partition tree.

LOCAL TREE SUMMARIZATION: Recall that data records are stored in leaf nodes; we need to map only leaf nodes to the underlying DHT. On the other hand, a bare leaf node lacks knowledge of the overall tree structure, which, as we will see, is critical to critical query processing. Thus, we propose a distributed data structure, termed the leaf bucket, to store data records and summarize the partition tree's structural information. Each leaf bucket corresponds to a leaf node in the tree. According to the completeness property of the partition tree, all branch nodes must exist in the tree. Some branch nodes may contain a subtree, called a neighboring subtree. The structures of these neighboring subtrees are unknown in the current local tree, but are maintained by some other leaves' local trees. From a global viewpoint, the local trees of all leaves together guarantee the partition tree's integrity. In other words, the leaf buckets collectively maintain the tree's
structural information. Thus, the remaining issue is how to map each leaf bucket, as an atomic unit, to a DHT key, which is achieved by a novel naming function.

NAMING FUNCTION: After the partition tree is decomposed and summarized into leaf buckets, Hash Tree uses a novel naming function to map leaf buckets to DHT keys.

3.3.2 MODULE 2: INDEXING SCHEME AND QUERYING

QUERYING

Implementation of the Range Query Algorithm: A range query is a common database operation that retrieves all records whose value lies between a lower and an upper boundary. Range queries are unusual because it is generally not known in advance how many entries a range query will return, or whether it will return any at all. Many other queries, such as the ten most senior employees or the newest employee, can be processed more efficiently because there is an upper bound on the number of results they will return. A query that returns exactly one result is sometimes called a singleton.

Implementation of the Max/Min Query Algorithm: Minimax is a decision rule used in decision theory, game theory, statistics, and philosophy for minimizing the possible loss while maximizing the potential gain. Alternatively, it can be thought of as maximizing the minimum gain.

INDEXING SCHEME

LSH-Based Indexing: Locality-sensitive hashing (LSH) functions are functions that make closer objects collide with a higher probability than far-apart objects. An LSH index is constructed with L separate hash tables, each of which is built from k hash functions chosen independently and uniformly at random (with replacement) from a given LSH family H.
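As an illustration of the L-tables/k-functions construction just described, the sketch below builds a toy LSH index for binary vectors using bit sampling, a classic LSH family for Hamming distance. The family choice, parameters, and data are illustrative assumptions, not details from this paper.

```python
import random

def make_bit_sampling_lsh(dim: int, k: int, L: int, seed: int = 0):
    # Each of the L tables uses k bit positions sampled uniformly at
    # random (with replacement) from the bit-sampling LSH family.
    rng = random.Random(seed)
    return [[rng.randrange(dim) for _ in range(k)] for _ in range(L)]

def signature(vec, positions):
    # Concatenated k-bit signature: the bucket key within one table.
    return tuple(vec[p] for p in positions)

def build_index(data, tables):
    index = [dict() for _ in tables]
    for vid, vec in enumerate(data):
        for t, positions in enumerate(tables):
            index[t].setdefault(signature(vec, positions), set()).add(vid)
    return index

def query(index, tables, vec):
    # Union of candidates colliding with the query in any table; near
    # vectors collide with higher probability than far-apart ones.
    cands = set()
    for t, positions in enumerate(tables):
        cands |= index[t].get(signature(vec, positions), set())
    return cands

data = [(0,0,0,0,0,0,0,0), (0,0,0,0,0,0,0,1), (1,1,1,1,1,1,1,1)]
tables = make_bit_sampling_lsh(dim=8, k=3, L=4)
idx = build_index(data, tables)
# Always contains id 0; id 1 with high probability; never id 2,
# since id 2 differs from the query in every bit position.
print(query(idx, tables, (0,0,0,0,0,0,0,0)))
```

Raising k makes each table more selective (fewer false candidates), while raising L recovers recall by giving near neighbors more chances to collide.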
Fig. 3. The novel naming function maps leaf buckets to DHT keys.

Implementation of the K-NN Query Algorithm: K-nearest-neighbor search identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors. K-nearest-neighbor graphs are graphs in which every point is connected to its k nearest neighbors.

3.3.3 MODULE 3: LOOKUP

A fundamental service in Hash Tree is the lookup operation: given a data key, a lookup returns the corresponding DHT key. Essentially, this means finding the label of the leaf bucket that covers the data key, upon which we can apply the naming function to obtain the DHT key.
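A minimal sketch of this lookup chain follows. The concrete details are illustrative assumptions, since the paper does not spell the naming function out: leaf buckets are labeled by their binary root-to-leaf path over the normalized [0, 1) data space, the label determines the subspace the leaf covers, and hashing the label yields the DHT key.

```python
import hashlib

# Illustrative leaf labels: binary root-to-leaf paths in a partition
# tree over [0, 1). Label "01" covers the second quarter [0.25, 0.5).
LEAVES = ["00", "01", "10", "11"]

def covers(label: str, key: float) -> bool:
    # Walk the label: '0' keeps the lower half, '1' the upper half.
    lo, hi = 0.0, 1.0
    for bit in label:
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if bit == "0" else (mid, hi)
    return lo <= key < hi

def naming_function(label: str) -> str:
    # Map the leaf-bucket label to a DHT key by uniform hashing.
    return hashlib.sha1(label.encode()).hexdigest()

def lookup(key: float) -> str:
    # Find the leaf bucket covering the data key, then apply the
    # naming function to obtain its DHT key.
    label = next(l for l in LEAVES if covers(l, key))
    return naming_function(label)

print(lookup(0.3))  # the DHT key of leaf "01", which covers [0.25, 0.5)
```

Note that only the label, not the data key itself, is hashed: nearby data keys share a covering leaf and hence the same DHT key, which is how locality survives the uniform hashing.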
3.3.4 MODULE 4: MAINTENANCE

Unlike overlay-dependent indexes, which must update their structures upon network changes caused by system dynamics (i.e., peer joins/departures/failures), the Hash Tree index only needs to handle data updates, leaving network structure changes to the underlying DHT.

Data Insertion and Leaf Split: Inserting a data record involves a lookup and a possible leaf split. In Hash Tree, when a leaf splits, one leaf bucket stays on the current peer, denoted the local leaf, while the other, denoted the remote leaf, is pushed out to some other peer. The local leaf is not pushed out and consumes no bandwidth. We call this nice property incremental leaf split.

Data Deletion and Leaf Merge: To remove a data key, Hash Tree, as with insertion, first performs a lookup to locate the leaf bucket. It then executes a local deletion operation to remove the corresponding record. Data deletion may further lead to a merge of leaf buckets if the number of records (called the load, for brevity) falls below a threshold.

4. SYSTEM IMPLEMENTATION

As summarized in the introduction, DHTs offer scalable, robust, load-balanced lookups, but uniform hashing destroys data locality, so critical queries such as range and k-NN queries are poorly supported. An over-DHT indexing scheme must deliver query efficiency without excessive index maintenance cost, since in P2P systems peer joins and departures usually result in data insertions and
deletions to/from the system; the peer join/departure rate can be as high as the query rate, and such data updates incur constant index updates. The resulting maintenance cost is not addressed by existing over-DHT indexing schemes: they improve query efficiency by sacrificing index maintenance cost as a trade-off, with each peer enlarging its local view of the global index. In DST, the index structure is static and globally known, which conflicts with P2P dynamics and leads to load imbalance.

We propose Hash Tree, an efficient-query, low-maintenance indexing scheme over DHTs. Two novel techniques contribute to its superior performance: a clever naming mechanism and a tree summarization strategy. Hash Tree differs from PHT, a representative over-DHT indexing scheme, in the following respects: both PHT and Hash Tree are based on the idea of space partitioning, but while PHT maps its index structure onto the DHT in a straightforward manner, Hash Tree leverages a clever naming function, which significantly lowers the maintenance cost and improves DHT-lookup performance.

In this section, we further present two extensions to the index: how to index unbounded data domains, and how to improve peer load balance.

Extensible Indexing: The basic index deals with a bounded data domain (i.e., the normalized [0, 1] space), which requires a priori knowledge of the indexed data. However, in many applications such knowledge cannot be obtained in advance, and the data domain may even change over time.

Improvement of Peer Load Balance: In general, DHTs offer load balance quite efficiently, yet not that effectively. Specifically, if the imbalance ratio denotes the ratio of the heaviest load to the average load over the peers in the P2P network, DHTs only bound the imbalance ratio at O(log N) with high probability [5], [1].

Compared with the state-of-the-art indexing schemes PHT and DST, Hash Tree saves 50%-75% of maintenance cost and substantially improves query performance in terms of both response time and bandwidth consumption.
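The gap between the O(log N) imbalance of plain hashing and what multiple-choice placement achieves can be seen in a small balls-into-bins simulation. The sketch below is purely illustrative and is not the paper's double-naming protocol: it contrasts single-choice placement (plain uniform hashing) with the "power of two choices" rule of placing each item on the less loaded of two random candidate peers.

```python
import random

def max_load(n_items, n_peers, choices, seed=42):
    # Place each item on the least-loaded of `choices` random peers;
    # choices=1 models plain uniform hashing, choices=2 models PoTC.
    rng = random.Random(seed)
    load = [0] * n_peers
    for _ in range(n_items):
        candidates = [rng.randrange(n_peers) for _ in range(choices)]
        best = min(candidates, key=lambda p: load[p])
        load[best] += 1
    return max(load)

n = 10_000
one = max_load(n, n, choices=1)  # heaviest peer under single-choice hashing
two = max_load(n, n, choices=2)  # typically far smaller under two choices
print(one, two)
```

With n items on n peers, the single-choice maximum load grows like log n / log log n, while two choices bring it down to roughly log log n, which is the qualitative improvement the double-naming strategy targets.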
5. CONCLUSION AND FUTURE ENHANCEMENT

5.1 CONCLUSION

This paper proposed Hash Tree for efficient data indexing over DHTs. Hash Tree differs from PHT, a representative over-DHT indexing scheme, in the following respects. Both PHT and Hash Tree are based on the idea of space partitioning. While PHT maps its index structure onto the DHT in a straightforward manner, Hash Tree leverages a clever naming function, which significantly lowers the maintenance cost and improves DHT-lookup performance. Hash Tree employs local tree summarization to provide each bucket with a local view. This local view is essentially helpful for distributed query processing but, unlike PHT's sequential leaf link, requires no extra maintenance cost. In PHT, all leaf nodes and internal nodes are mapped to the DHT space, whereas in Hash Tree only leaf nodes are mapped. Processing a range query in PHT has to go through all internal nodes of the subtree in addition to the leaf nodes in the queried range, which at least doubles the search cost. Thanks to the novel naming function, Hash Tree can determine the leftmost/rightmost leaf node under a subtree in O(1) lookups; as such, min/max queries can be efficiently supported. Hash Tree can be naturally extended to index unbounded data domains and can accommodate a double-naming strategy to improve peer load balance. By comparison, PHT (and other existing over-DHT schemes) supports data indexing only over bounded domains and achieves better load balance only by modifying the DHT. In experimental comparison with the state-of-the-art indexing techniques PHT and DST, Hash Tree saves 50-75 percent of index maintenance cost and supports more efficient lookup operations. Moreover, the results show that Hash Tree has much better query performance in terms of both bandwidth consumption and response time. As an over-DHT scheme, Hash Tree is adaptable to generic DHTs and can be easily implemented and deployed in any DHT-based P2P system.

5.2 FUTURE ENHANCEMENTS

In this section, we further present two extensions to the index: how to index unbounded data domains, and how to improve peer load balance.

5.2.1 Extensible Indexing

The basic index deals with a bounded data domain (i.e., the normalized [0, 1] space), which requires a priori knowledge of the indexed data. However, in many applications such knowledge cannot be obtained in advance, and the data domain may even change over time. For example, if we want to index the publication dates of MP3 files in a P2P file-sharing application, the data domain for publication dates
is not fixed. We therefore propose E-Hash Tree, an extensible variant that supports data indexing over unbounded data domains.

5.2.2 Improvement of Peer Load Balance

In general, DHTs offer load balance quite efficiently, yet not that effectively. Specifically, if the imbalance ratio denotes the ratio of the heaviest load to the average load over the peers in the P2P network, DHTs only bound the imbalance ratio at O(log N) with high probability [5], [1]. This bound is considerably large for large-scale P2P networks. We propose a double-naming strategy as an improvement for balancing peer load. The double-naming strategy naturally adapts the "power of two choices" (PoTC) scheme [61], which bounds the imbalance ratio at O(log log N).

6. REFERENCES

1. A.I.T. Rowstron and P. Druschel, "Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems," Proc. Middleware, pp. 329-350, 2001.
2. B.Y. Zhao, J. Kubiatowicz, and A.D. Joseph, "Tapestry: A Fault-Tolerant Wide Area Application Infrastructure," Computer Comm. Rev., vol. 32, no. 1, p. 81, 2002.
3. D.R. Karger, E. Lehman, F.T. Leighton, R. Panigrahy, M.S. Levine, and D. Lewin, "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web," Proc. Symp. Theory of Computing (STOC), pp. 654-663, 1997.
4. S.C. Rhea, B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy, S. Shenker, I. Stoica, and H. Yu, "OpenDHT: A Public DHT Service and Its Uses," Proc. Conf. Applications, Technologies, Architectures, and Protocols for Computer Comm. (SIGCOMM), pp. 73-84, 2005.
5. A.I.T. Rowstron and P. Druschel, "Storage Management and Caching in PAST, a Large-Scale, Persistent Peer-to-Peer Storage Utility," Proc. Symp. Operating Systems Principles (SOSP), pp. 188-201, 2001.
6. J. Kubiatowicz, D. Bindel, Y. Chen, S.E. Czerwinski, P.R. Eaton, D. Geels, R. Gummadi, S.C. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B.Y. Zhao, "OceanStore: An Architecture for Global-Scale Persistent Storage," Proc. Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 190-201, 2000.
7. F. Dabek, M.F. Kaashoek, D.R. Karger, R. Morris, and I. Stoica, "Wide Area Cooperative Storage with CFS," Proc. Symp. Operating Systems Principles (SOSP), 2001.