RANKING DISTRIBUTED PROBABILISTIC DATA
http://www2.cs.fsu.edu/~jestes
Feifei Li Ke Yi Jeffrey Jestes
Presented by MELİH SÖZDİNLER 2009800075
Outline
● Introduction
● Previous Work
● Paper Contribution and Motivation
● Proposed Methods
● Experiments
● Conclusion
Introduction
● Ranking queries are important in many settings, and those settings need fast solutions
● Often we are interested only in the top results, not the whole result set
● Communication can grow large because the number of possible worlds over the tuples can be exponential
● Sending the whole list of tuples increases both the computation cost and the communication cost
Introduction
● Consider sensor networks
● The important concerns are energy saving and latency
● Sensors have little processing power but can do basic operations
● The network relies on the nodes that stay alive
● Consider an emergency situation such as a fire
● A manager queries the top-k temperatures from all sensors
● Think how little time we have while the fire is ongoing
● We need to be fast and decentralized
Introduction
● Sensor networks are not deterministic, and there are many similar examples of such uncertain worlds
● So we need ranking over probabilistic data (uncertain query processing)
● This brings challenges that will be discussed after the previous work
Previous Work
● Top-k queries have received considerable attention
● They use a scoring scheme
● As shown in the figure, tuples t1, t2, ..., tn have (value, score) pairs; each tuple is essentially a random variable, taking value v_{i,j} with probability p_{i,j}
Previous Work
● [Cormode, Li, Yi, 09] proved that the Expected Ranks definition satisfies all of the ranking-query properties below, while no other definition does
Previous Work
● The work in this paper builds on probabilistic databases:
● The MayBMS project at Cornell University (sourceforge.net)
● The MystiQ project at the University of Washington
● The Orion project at Purdue University
● The Trio project at Stanford University
● There are also several ranking methods
● Recently, more complex correlations among tuples have been handled using Monte Carlo simulations and Bayesian networks
Paper Contribution and Motivation
● Studies ranking queries over distributed probabilistic data
● Communication-efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites, at low computation overhead
● Existing distributed methods are communication-expensive, so an approach based on tuples' local ranks is introduced
● On top of the local-rank approach, an approximate ranking scheme is given
Proposed Methods
● Uncertainty Model and Expected Rank
● Distributed Probabilistic Data Model
● Ranking in that model
● Sorted Access on Local Ranks
● Sorted Access on Expected Scores
Uncertainty Model and Expected Rank
● Uncertain values depend on random variables
(Figure: example ranks r1, r2, r3; the final ranking is (t1, t2, t3))
Expected Rank within Centralized Approach
● An O(N log N) algorithm exists for a database with N tuples
The total probability mass is 3.0 = 3 × 1.0 = (p_{1,1} + p_{1,2}) + (p_{2,1} + p_{2,2}) + p_{3,1} = 1.0 + 1.0 + 1.0
● The distributed problem is to overcome the communication cost of delivering tuples from the sites while assembling the top-k tuples among them
r(t1) = 0 × 0.8 + 0.2 × (2.8 − 0.8) = 0.4
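As a concrete (and deliberately naive, quadratic) rendering of the expected-rank definition under the attribute-level model, where each tuple t_i independently takes value v_{i,j} with probability p_{i,j}, the sketch below computes r(t_i) = Σ_{j≠i} Pr[X_j > X_i]. The O(N log N) algorithm replaces these pairwise sums with sorted values and prefix sums; the function names are illustrative, not taken from the paper.

```python
def pr_greater(X, Y):
    # Pr[X > Y] for two independent discrete score distributions,
    # each given as a list of (value, probability) pairs.
    return sum(px * py for vx, px in X for vy, py in Y if vx > vy)

def expected_ranks(tuples):
    # Expected rank of t_i: r(t_i) = sum over j != i of Pr[X_j > X_i].
    n = len(tuples)
    return [sum(pr_greater(tuples[j], tuples[i]) for j in range(n) if j != i)
            for i in range(n)]
```

For example, with t1 always scoring 10 and t2 always scoring 5, the expected ranks are 0 and 1, matching the deterministic case.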
Distributed Probabilistic Data Model
● We assume m distributed sites
● Consider again the sensor nodes in a forest, for the event of a forest fire
● Each node stores temperature-measurement tuples in its database, with scores given by random variables
● Collecting the whole database at a central server and computing the top-k there would let the fire devastate the forest, because of the communication overhead and the processing time needed to compute the top-k tuples
Distributed Probabilistic Data Model
● In the distributed case there are m sites, each with its own tuples and score attributes
● The union of these sites' databases, combined in the appropriate manner, forms a conceptual database D
Ranking in the Distributed Probabilistic Data Model
● Two frameworks are proposed for ranking queries over distributed probabilistic data:
● Sorted Access on Expected Scores
● Sorted Access on Local Ranks
Sorted Access on Local Ranks
● This method lets each site answer the query individually and then combines the results
● For the problem in this paper, each site computes the local ranks of its own tuples; the problem then reduces to ranking within each local database D_i at sites s_1 to s_m
● Then, what is the idea?
Sorted Access on Local Ranks
● Each site sorts its tuples by their local ranks, and a centralized server H can access the sorted lists at all m sites
● H maintains a priority queue L of size m in which each site s_i has a representative local rank value and the tuple id corresponding to that local rank, i.e., a triple ⟨i, j, r(t_{i,j}, D_i)⟩
● Then the server computes the global rank of the fetched tuple using the decomposition below:
r(t, D) = Σ_{y=1..m} r(t, D_y), where r(t, D_y) = Σ_{t' ∈ D_y, t' ≠ t} Pr[X_{t'} > X_t] (when y ≠ i, the site t came from, the sum simply runs over all of D_y)
Sorted Access on Local Ranks
(Figures: an example execution of A-LR over sites i = 1, 2, 3, showing the rank contributions gathered from the other sites y ≠ i in each round)
Sorted Access on Local Ranks
● We can safely terminate whenever the largest global rank in the top-k queue is at most the smallest local rank in the representative queue
● The whole procedure above is called a round
● The algorithm is called A-LR
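A minimal sketch of the A-LR round structure described above, assuming each site has pre-sorted its tuples by local rank and that a caller-supplied `global_rank(i, tid)` callback stands in for the per-round gathering of rank contributions from the other sites (the paper's actual message protocol is more detailed):

```python
import heapq

def a_lr_sketch(sites, k, global_rank):
    # sites[i]: list of (local_rank, tuple_id), sorted ascending by local rank.
    # global_rank(i, tid): assumed callback returning the tuple's global rank.
    # Representative queue L holds one (local_rank, site, position) per site.
    L = [(site[0][0], i, 0) for i, site in enumerate(sites) if site]
    heapq.heapify(L)
    topk = []  # max-heap (negated ranks) of the best k candidates seen so far

    while L:
        lrank, i, pos = heapq.heappop(L)
        # Safe termination: any unseen tuple has global rank >= its local
        # rank >= lrank, so nothing unseen can displace the current top-k.
        if len(topk) == k and -topk[0][0] <= lrank:
            break
        tid = sites[i][pos][1]
        g = global_rank(i, tid)            # one round of communication
        heapq.heappush(topk, (-g, tid))
        if len(topk) > k:
            heapq.heappop(topk)            # drop the worst candidate
        if pos + 1 < len(sites[i]):        # site s_i sends its next representative
            heapq.heappush(L, (sites[i][pos + 1][0], i, pos + 1))

    return sorted((-neg, tid) for neg, tid in topk)
```

The termination test is safe because a tuple's global rank is at least its local rank, so no unseen tuple can beat the current top-k once the popped representative's local rank reaches the k-th best global rank.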
Sorted Access on Expected Scores
● Now the only question is: when may we safely terminate and be certain we have the global top-k?
● To solve the termination problem, two tools are used:
● Markov Inequality (algorithm A-Markov)
● Linear Programming (algorithm A-LP)
Markov Inequality and Linear Programming
● At round i of the sorted-access procedure, let r_i^+ denote an upper bound on the ranks of the tuples in the current top-k queue, and r_i^- a lower bound on the rank of any unseen tuple
● To terminate safely, we need r_i^+ ≤ r_i^-
● To obtain this lower bound, the authors propose:
● Markov Inequality
● Linear Programming
Markov Inequality
● The Markov inequality gives an estimate of the rank lower bound for any unseen tuple at round i, by bounding the terms of r(t, D_i) with the inequality
● However, the authors conclude that this bound is sometimes loose and therefore not accurate
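One way Markov's inequality can yield such a rank lower bound is sketched below. This is a reconstruction of the idea, not the paper's exact derivation: every unseen tuple has expected score E[X_t] < τ, so for any threshold v > τ, Markov gives Pr[X_t ≥ v] ≤ τ/v, hence Pr[X_t < v] ≥ 1 − τ/v, and under independence Pr[X_j > X_t] ≥ Pr[X_j ≥ v] · (1 − τ/v).

```python
def markov_rank_lower_bound(tau, seen_dists, thresholds):
    # Hedged sketch: lower-bound the local rank r(t, Di) of ANY unseen
    # tuple t with E[X_t] < tau (scores assumed nonnegative).
    # seen_dists: discrete distributions [(value, prob), ...] of seen tuples.
    def pr_at_least(dist, v):
        # Pr[X_j >= v] for one seen tuple's score distribution
        return sum(p for val, p in dist if val >= v)

    best = 0.0
    for v in thresholds:                       # candidate thresholds v > tau
        if v > tau:
            # Pr[X_j > X_t] >= Pr[X_j >= v] * Pr[X_t < v]
            #               >= Pr[X_j >= v] * (1 - tau / v)   (Markov)
            bound = (1 - tau / v) * sum(pr_at_least(d, v) for d in seen_dists)
            best = max(best, bound)
    return best
```

With a single seen tuple that always scores 10 and τ = 1, the bound at v = 10 is 0.9: any unseen tuple is beaten by that tuple at least 90% of the time. Such bounds are valid but can indeed be loose, which motivates the LP approach next.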
Linear Programming
● Proposed to obtain a tighter estimate of the lower bound
● Any unseen tuple t must have E[X] < τ
● We want as tight an r_i^- as possible, found by computing the smallest possible r^-(t, D_i) at each site
● The idea is to construct, at each site s_i, the best possible X for an unseen tuple t, i.e., the one attaining the smallest possible local rank at s_i
● X may take arbitrary values v_l as its possible scores, some of which do not exist in the value universe U_i at site s_i
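Based on the description above, the per-site LP can plausibly be written as follows (a reconstruction from these slides, not copied from the paper). Let q(v) = Σ_{t_j ∈ D_i} Pr[X_j > v], and let x_l be the probability that the adversarial unseen tuple X assigns to candidate value v_l; the smallest possible local rank at site s_i is then

```latex
\begin{aligned}
\min_{x}\quad & \sum_{l} x_l \, q(v_l)
  && \text{(expected local rank of } X \text{ at } s_i\text{)}\\
\text{s.t.}\quad & \sum_{l} x_l = 1, \qquad
  \sum_{l} x_l \, v_l \le \tau, \qquad x_l \ge 0 .
\end{aligned}
```

The second constraint encodes E[X] ≤ τ (the strict inequality E[X] < τ is relaxed to ≤ so that the problem is a standard LP).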
Linear Programming Problem
● We are talking about small sensor nodes
● These nodes have limited batteries and limited computation power
● So solving an LP is a challenge for each node
● The proposed solution is to approximate q(v_i) by q*(v_i)
● The basic idea is to reduce the computation cost of calculating the q(v_i) values: instead of computing them exactly, send approximations of these values to the main server
Solution to the Linear Programming Problem
● Can be solved using dynamic programming; details are omitted
Experiments
(Table: experiment parameters and the communication cost measure)
Experiments
● 3 real datasets:
● Movie dataset from the MystiQ project, containing 56,000 records
● Temperature dataset collected from 54 sensors at the Intel Research Berkeley lab, containing 64,000 records
● Chlorine dataset from the EPANET project, containing 67,000 records
● 1 synthetic dataset:
● Synthetic Gaussian: each record's score attribute draws its value from a Gaussian distribution with standard deviation in [1, 1000] and mean in [5σ, 100000]
● Tested 4 algorithms: A-BF, A-LP, A-LR, A-ALP
Experiments
(Figures: communication cost vs. number of top-k tuples; number of rounds vs. number of top-k tuples)
Experiments
● Effect of m, the number of sites
Experiments
● We do not discuss latency in this presentation
● The parameter β is for reducing latency: instead of tuple-by-tuple processing at each round, process β tuples in one round
Conclusion: Pros
● The authors proposed and tested 3 different methods for computing the global top-k tuples, which had not previously been covered for distributed probabilistic environments
● Judging by the tested databases and the resulting figures, algorithm A-LP appears to be the best
● The method suits specific applications such as sensor networks
● They arrive at a solution even though their methods are modified for the sake of cost savings
Conclusion: Cons
● Future work:
● Incremental calculation of top-k tuples
● Extending the framework with different ranking semantics
● Weak points:
● Top-k tuple calculation should be easy for sparse databases
● Communication cost remains a problem; sensor networks are not easy networks in terms of energy consumption and communication
● The approach should be explained more thoroughly for real sensor networks if a concrete application is being proposed
Possible Extensions and Suggestions
● Define the notion of a site precisely. If sites are nodes in a sensor network, the mechanism may fail with 1000 nodes under consideration; for a forest-fire detection network this number may not be too pessimistic
● If there are too many sites, a clustering structure may be suggested, or a hierarchical structure, i.e., applying the proposed methods at each layer of the hierarchy
● Rather than solving an LP, which seems to perform best, lightweight heuristic methods could be used to obtain local solutions on the sensors
● There could be robust mechanisms that return approximations of the local ranks. Think again of a forest fire: would we prefer a perfectly sorted top-k of temperatures, or an almost-top-k answer that arrives sooner?