Continuously Distinct Sampling over Centralized and Distributed High Speed Data Streams


Abstract: Distinct sampling is fundamental for computing statistics (e.g., the age and gender distribution of distinct users accessing a particular website) depending on the set of distinct keys (e.g., user IDs) in a large and high speed data stream such as a sequence of key-update pairs. However, the major shortcoming of existing methods is their high computational cost incurred by determining whether each incoming key in the data stream is currently in the set of sampled keys and keeping track of sampled keys’ update aggregations. To solve this challenge, we develop a new method RPE (random projection and eviction) that uses a list of buckets to continuously sample distinct keys and their update aggregations. RPE processes each key-update pair with small and nearly constant time complexity O(1). Besides centralized data streams, we also develop a novel method DRPE to deal with distributed data streams consisting of key-update pairs observed at multiple distributed sites. We conduct extensive experiments on real-world datasets, and the results demonstrate that RPE and DRPE reduce the memory, computational, and message costs of state-of-the-art methods by several times.


Existing system: Gibbons developed a distinct sampling method that assigns each key a random level and keeps track of keys whose level is not less than the current sampling level, which is incremented by one each time the sample budget k is reached. Chung and Tirthapura developed a method to randomly sample distinct keys from distributed streams. However, the drawback of these existing methods is that they need to determine whether each incoming key in the data stream is currently in the set of sampled keys. This results in a large computational complexity of O(k) for updating each incoming element. The computational complexity can be reduced by using a priority queue or a hash table. However, besides requiring large extra memory space, a priority-queue/hash-table implementation may take a long time to process an element in the worst case (e.g., when resolving hash collisions), and therefore fails to keep up with high speed data streams. Proposed system: To solve the above challenge, we develop a new method RPE (random projection and eviction) to sample distinct keys from high speed streams. RPE randomly projects keys in the data stream of interest into k buckets. In addition, it assigns each key a random rank. At any time, each bucket keeps track of the key with the smallest rank among all keys that have occurred so far and are hashed into it. For the Turnstile model, RPE also keeps track of sampled keys' update aggregations. It processes each element in the data stream with small and deterministic time complexity O(1). RPE can easily be used to monitor the data stream over time, because at any time it maintains a set of k distinct keys randomly sampled from the set of keys that have occurred so far. Besides centralized data streams, we also develop a novel method DRPE that extends RPE to handle distributed data streams with a small message cost. For each site i, fewer than k ln ni keys are expected to be sent from site i to the coordinator, where ni is the number of distinct keys that have occurred at site i.
For each site and the coordinator, the sampling time (per element) and memory space complexities are O(1) and O(k), respectively. We perform extensive experiments on real-world datasets, and the experimental results demonstrate that our methods RPE and DRPE outperform state-of-the-art methods by several times in terms of computational, memory, and message costs.
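The bucket scheme described above can be illustrated with a minimal Python sketch. The class name, the hash salts, and the use of a keyed blake2b hash are my own choices for illustration; the paper's actual hash constructions may differ:

```python
import hashlib

def _hash(s, salt):
    # 64-bit hash of a key under a salt; stands in for the method's random hash functions
    h = hashlib.blake2b(f"{salt}:{s}".encode(), digest_size=8).digest()
    return int.from_bytes(h, "big")

class RPE:
    """Sketch of random projection and eviction: k buckets, each retaining
    the minimum-rank key hashed into it, plus that key's update aggregation."""
    def __init__(self, k):
        self.k = k
        self.buckets = [None] * k   # each slot: [key, rank, aggregate] or None

    def process(self, key, update=1):
        b = _hash(key, "bucket") % self.k   # random projection into a bucket
        r = _hash(key, "rank")              # random rank assigned to the key
        slot = self.buckets[b]
        if slot is None or r < slot[1]:
            # evict the current occupant; the aggregation restarts for the new key
            self.buckets[b] = [key, r, update]
        elif slot[0] == key:
            slot[2] += update               # same sampled key: aggregate its updates
        # otherwise the element is dropped; constant work in every case

    def sample(self):
        # current set of sampled distinct keys with their update aggregations
        return [(s[0], s[2]) for s in self.buckets if s is not None]
```

Each element costs a constant number of hash evaluations and a single bucket access, matching the O(1) per-element bound claimed above.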


Advantages: RPE can easily be used to monitor the data stream over time, because at any time it maintains a set of k distinct keys randomly sampled from the set of keys that have occurred so far. Besides centralized data streams, we also develop a novel method DRPE that extends RPE to handle distributed data streams with a small message cost. For each site i, fewer than k ln ni keys are expected to be sent from site i to the coordinator, where ni is the number of distinct keys that have occurred at site i. In the time-series model, each element is a key (e.g., an IP address) a(j) ∈ Ω, where Ω is the key space. Thus, Π is a sequence of keys, and the elements in stream Π are not necessarily distinct. The time-series model can be used for applications such as counting active IP addresses. We demonstrate that RPE can be used as a building block of applications such as key and update aggregation distribution estimations. Disadvantages: The disadvantage of sketch methods is their lack of flexibility. For example, sketch methods are not able to estimate metrics over a specified subset of the data stream (e.g., the gender and age distribution of users connecting to a particular website). Moreover, a specific sketch method is usually effective only for estimating one particular metric, and the compact data digest it generates cannot be used for computing other metrics. Compared to sketch methods, random sampling is more flexible and has been widely used and repeatedly proven to be a powerful technique. The problem formulation is presented in Section 2. Sections 3 and 4 present our distinct sampling algorithms for centralized and distributed data streams, respectively. The performance evaluation and testing results are presented in Section 5. Section 6 summarizes related work. Concluding remarks then follow. Modules: Extend HRS to distributed Turnstile streams:


To the best of our knowledge, there exists no method for continuously sampling distinct keys from Turnstile streams and computing their update aggregations. To handle Turnstile streams, we compare our method DRPE with the following extension of HRS (for simplicity, we call it DHRS): each site samples distinct keys over time using HRS, and the coordinator maintains a copy of the keys sampled from each site and keeps track of sampled keys' update aggregations. In detail, whenever site i observes an element (a, u) whose key a is one of the k occurred keys with the minimum ranks, site i sends (a, u) to the coordinator, regardless of whether key a has previously occurred at site i. We find that our method DRPE likewise outperforms DHRS. Data stream sketching: Sketching is another effective approach: it generates a compact summary of the data stream according to some probabilistic model, and then infers the metric of interest from the data summary. A variety of effective sketch methods have been developed for monitoring high speed network traffic, including heavy hitter detection, membership queries, persistent frequent user detection, spread measurement, persistent spread measurement, stream intersection mining, anomaly detection, flow size distribution estimation, and traffic entropy estimation. The shortcoming of these sketch methods is their lack of flexibility. That is, each sketch method is effective for computing one specific metric, and the compact summary it generates cannot be used for estimating other metrics. Prior work compares the Fisher information of sampling and sketch methods for estimating flow size distributions, and observes that sketch methods do not always outperform sampling methods in terms of estimation accuracy. Data stream: A wide variety of real-world systems such as computer networks and telephone networks generate large data streams at high speed.
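The site-side forwarding rule of the DHRS baseline described above can be sketched as follows. The `Site` class, the `rank` helper, and the `send` callback are illustrative names of my own, under the assumption that all sites share one hash function for ranks; they are not the paper's actual interface:

```python
import hashlib
import heapq

def rank(key):
    # deterministic pseudo-random rank; stands in for the shared hash function
    return int.from_bytes(hashlib.blake2b(key.encode(), digest_size=8).digest(), "big")

class Site:
    """Sketch of the DHRS site logic: keep the k occurred keys with minimum
    ranks, and forward (key, update) to the coordinator whenever the observed
    element's key is currently among them."""
    def __init__(self, k, send):
        self.k = k
        self.send = send    # callback delivering a message to the coordinator
        self.heap = []      # max-heap via negated ranks: the k smallest ranks seen
        self.kept = set()

    def observe(self, key, update):
        r = rank(key)
        if key not in self.kept:
            if len(self.heap) < self.k:
                heapq.heappush(self.heap, (-r, key))
                self.kept.add(key)
            elif r < -self.heap[0][0]:
                # new key beats the largest kept rank: evict that key
                _, evicted = heapq.heapreplace(self.heap, (-r, key))
                self.kept.discard(evicted)
                self.kept.add(key)
        if key in self.kept:
            self.send((key, update))   # coordinator aggregates updates per key
```

Note that every element whose key is currently sampled triggers a message, which is the message overhead DRPE is designed to reduce.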
Due to limited resources (e.g., computation, memory, and storage), it is prohibitive to entirely collect and store these large data streams for many applications such as network traffic monitoring. For example, routers have a fast but limited and expensive static RAM (SRAM). Moreover, for routers running at line-rates of 10-100 Gbps per port over 10-100 ports, processing each packet allows only a very short time and a limited number of memory accesses (typically just one read-modify-write). In addition, systems may generate data streams at distributed sites. Therefore, it is expensive to collect the entire data stream at each site and process distributed streams in a centralized manner. To reduce data dimensionality, a variety of effective sketch methods have been developed for applications such as heavy hitter detection and flow size distribution estimation. The disadvantage of sketch methods is their lack of flexibility. For example, sketch methods are not able to estimate metrics over a specified subset of the data stream (e.g., the gender and age distribution of users connecting to a particular website), a specific sketch method is usually effective only for estimating one particular metric, and the compact data digest it generates cannot be used for computing other metrics. Distinct sampling: In this paper, we study two problems: 1) how to quickly sample a number of distinct keys (e.g., IP addresses) uniformly at random from a time-series stream represented as a sequence of keys? 2) how to quickly sample a number of distinct keys and keep track of their update aggregations (e.g., the number of packets) from a Turnstile stream represented as a sequence of elements, each of which consists of a key (e.g., an IP address) and an update (e.g., the number of bytes in the packet)? Solving these two problems is fundamental for answering queries that depend on the set of distinct keys in the data stream, e.g., "what is the age and gender distribution of distinct users using a particular mobile application?", or "what is the traffic volume distribution of these distinct users on a particular mobile application?" To randomly sample distinct keys from the data stream, Sunter developed the first method, which assigns each key a random hash value and keeps track of the k occurred keys with the minimum hash values, where k is the sampling budget specified in advance.
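The k-minimum-hash scheme attributed to Sunter above can be sketched as follows. This is a hedged illustration (function names and the blake2b hash are my choices); note the per-element membership check and eviction bookkeeping, which is exactly the overhead RPE is designed to avoid:

```python
import hashlib
import heapq

def hash_val(key):
    # deterministic 64-bit hash value assigned to each key
    return int.from_bytes(hashlib.blake2b(key.encode(), digest_size=8).digest(), "big")

def bottom_k(stream, k):
    """Keep the k distinct keys with the smallest hash values.

    The `members` set makes the membership check cheap here; without such an
    auxiliary structure, scanning the sample costs O(k) per element, as noted
    for the existing methods above."""
    heap, members = [], set()   # max-heap over negated hash values
    for key in stream:
        if key in members:
            continue            # key already sampled
        h = hash_val(key)
        if len(heap) < k:
            heapq.heappush(heap, (-h, key))
            members.add(key)
        elif h < -heap[0][0]:
            # smaller hash than the largest kept one: evict that key
            _, out = heapq.heapreplace(heap, (-h, key))
            members.discard(out)
            members.add(key)
    return members
```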
Gibbons developed a distinct sampling method that assigns each key a random level and keeps track of keys whose level is not less than the current sampling level, which is incremented by one each time the sample budget k is reached. Chung and Tirthapura developed a method to randomly sample distinct keys from distributed streams. However, the drawback of these existing methods is that they need to determine whether each incoming key in the data stream is currently in the set of sampled keys.
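The level-based scheme attributed to Gibbons above admits a similar sketch. Here the random level is taken as the number of trailing zero bits of a hash, a common way to realize a geometrically distributed level; the paper's exact construction may differ:

```python
import hashlib

def level(key):
    # geometric random level: number of trailing zero bits of the key's hash
    h = int.from_bytes(hashlib.blake2b(key.encode(), digest_size=8).digest(), "big")
    lvl = 0
    while h & 1 == 0 and lvl < 63:
        lvl += 1
        h >>= 1
    return lvl

def gibbons_sample(stream, k):
    """Sketch of level-based distinct sampling: keep keys whose level is at
    least the current sampling level; when more than k keys are kept, raise
    the level and evict keys that fall below it."""
    cur = 0
    kept = set()
    for key in stream:
        if key in kept:
            continue                 # the per-element membership check
        if level(key) >= cur:
            kept.add(key)
            while len(kept) > k:     # budget exceeded: raise the sampling level
                cur += 1
                kept = {a for a in kept if level(a) >= cur}
    return kept, cur
```

Raising the level roughly halves the expected number of surviving keys, which keeps the sample near the budget k.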

