XML Streams for Probabilistic Databases by Melih Sözdinler

Computer Engineering, Boğaziçi University, Bebek, Istanbul 34342, Turkey

Submitted to CMPE 521 Midterm II

by Melih Sözdinler

XML Streams for Probabilistic Databases

Taflan Gündem, Melih Sözdinler

January 19, 2010

Abstract

Databases often need to be application specific, and they must operate in harsh environments where the data is stochastic and queries demand fast processing and suitable storage structures. The database design may change in such environments. XML streams are attractive because of their well-defined structure and the variety of available implementations and standards, but they too become hard to handle when the aggregated data is probabilistic. In this paper we propose a methodology for streamed, probabilistic XML databases. We describe the storage structure and the query processing methodology, and we outline sample applications for both a single-site and a parallel environment.

Introduction

Database systems have been studied from many angles, and each individual topic in the title of this paper is well covered in the literature. Although the subtopics are well studied, little effort has gone into their combination, namely XML stream databases over probabilistic data. In this section we introduce the well-known building blocks and give the intuition behind combining them.

Probabilistic databases are a special type of database in which the data is uncertain. If you receive a tuple p1 at time t, its values come with associated probabilities that sum to 1.0, so for a single tuple you can easily inspect the relationship between its values. With hundreds of tuples this is no longer feasible: the tuples vary in size and the number of possible representations grows exponentially, so you can no longer look at the data and infer relations by hand. Because of this rapid growth, database management systems have been modified to handle such queries. Several probabilistic databases have been proposed, among them MayBMS [3] from Cornell University, MystiQ [11] from the University of Washington, and Orion [9] from Purdue University. These systems show how probabilistic data can be managed.

XML streams are also well studied in the literature. To handle them, filtering methods have been proposed; XFilter, YFilter [5], Aurora [1], and CQL [4] are common filtering and query handling approaches. Requirement specifications for stream processing are given in [10]. Briefly, these rules call for a stream environment that copes with changes in the stream securely and quickly, and that processes and responds instantaneously. Because the amount of data in a stream database is huge, the database must keep up with the incoming stream, and that involves trade-offs: to gain one capability you will probably have to sacrifice another.
For instance, suppose you want the exact average temperature at a specific time across the application sites. Because streaming data is produced at hundreds of sites every second, your database stores only a percentage of the temperature readings, so you will probably have to give up the exact calculation; you may even find no data at all for a given time period. If you change the application to store everything, you face another problem: obtaining the results. Since readings arrive every second from hundreds of sites, after a while your database must handle millions and then billions of temperature values. These are the main trade-offs in a streaming environment. The aim is to store and exploit as much as you can without insisting on the exact answer to every query.

Probabilistic databases can be built on well-known models such as the relational model. In essence, the membership of an item in the database is a probabilistic event, and whether a tuple answers a query is also a probabilistic event. Probabilistic XML data is treated in ProTDB [8], which maintains XML databases with complete query and storage engines.

In this paper we present a concept for a database system that copes with both a probabilistic and a streaming environment. Specifically, we choose XML streams because of their simplicity and the well-studied applications proposed for them. We also cover how to set up a paradigm or an application that executes the queries in parallel.
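To make the notion of uncertain tuples and the exponential growth of their representations concrete, the following is a minimal sketch in Python. It assumes that each tuple is a set of mutually exclusive alternatives whose probabilities sum to 1.0 and that tuples are independent of each other; the readings, the probabilities, and all function names are hypothetical and serve only as illustration.

```python
from itertools import product
from math import prod, isclose

# Hypothetical uncertain tuples: each maps a candidate value to its probability.
# The readings and probabilities are made up for illustration.
uncertain_tuples = [
    {"21C": 0.7, "22C": 0.3},
    {"18C": 0.5, "19C": 0.4, "20C": 0.1},
    {"25C": 0.9, "26C": 0.1},
]

def validate(tuples):
    """Check that the alternatives of every tuple sum to 1.0."""
    return all(isclose(sum(t.values()), 1.0) for t in tuples)

def count_worlds(tuples):
    """Number of possible worlds: exactly one alternative is chosen per tuple."""
    return prod(len(t) for t in tuples)

def enumerate_worlds(tuples):
    """Yield (world, probability) pairs, assuming the tuples are independent."""
    for combo in product(*(t.items() for t in tuples)):
        values = tuple(value for value, _ in combo)
        probability = prod(p for _, p in combo)
        yield values, probability

worlds = list(enumerate_worlds(uncertain_tuples))
total_probability = sum(p for _, p in worlds)
```

Three tuples with two, three, and two alternatives already yield twelve possible worlds, and every additional tuple multiplies that count, which is why explicit enumeration does not scale to hundreds of tuples.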

Methods

Figure 1: Schema for XML stream data with the store-them-all methodology (unlimited storage)

We divide the Methods section into several parts. First, we consider that probabilistic XML stream data arrives at some rate and that the database management system (DBMS) stores all of it, assuming unlimited storage. In this case the storage structure is important, since we want to respond to the query owner quickly. To achieve that, we can take the application specification and the queried data into account. Aurora [1] proposes a two-level storage structure. The idea is simple: if the DBMS considers data to be historical, it stores it in the secondary structure. A B-tree is an appropriate, efficient data structure for historical data. However, deciding which data counts as historical is an issue in this kind of two-level storage system. Since we know that queried data is fresh, we can be sure that it stays in primary storage until the query is no longer used.

To keep track of this, we can use adaptive caching. For each query we first look in the cache. If the data is not there, we must fetch it from the secondary storage structure, which is costly even though B-trees are used. This fallback guards against the case where data is missing from the cache. The cache size need not be restricted, but we should avoid very large caches, because searching them for queries consumes unnecessary time. The whole scheme is shown in Figure 1.

Because the environment is probabilistic, each item of a tuple has an associated probability value. Since the query types vary with the probabilistic data, we may need to handle specific types of queries. We also need a parser and a query evaluator for probabilistic XML data. Efficient methods are proposed in ProTDB [8], which addresses the challenges of probabilistic XML data as they derive from the original challenges. The XML structure is a tree of attributes. When a query involves multiple attributes, the query evaluator must produce the overall probability of returning a tuple from the XML data. By importing the ProTDB methods, our system can calculate this overall probability. This completes the system proposed in Figure 1.
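The overall probability of a query that involves several attributes can be sketched as follows. This is a simplified illustration in the spirit of ProTDB [8], not its actual algorithm: we assume that each element carries a hypothetical `prob` attribute that is conditional on its parent existing, that siblings are independent, and that a missing `prob` means probability 1. The document fragment and the function name are our own assumptions.

```python
import xml.etree.ElementTree as ET
from math import prod

# Hypothetical probabilistic XML fragment; the `prob` attribute name and all
# values are assumptions for this sketch. A missing `prob` means probability 1.
DOC = """
<reading prob="0.9">
  <temperature prob="0.8">21</temperature>
  <humidity prob="0.5">40</humidity>
</reading>
"""

def joint_probability(root, targets):
    """Probability that all target elements exist together.

    Each `prob` is read as conditional on the parent existing, and siblings
    are assumed independent, so the joint probability is the product over the
    union of the root-to-target paths.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    needed = set()
    for node in targets:
        # Walk up to the root, stopping early at an already collected ancestor.
        while node is not None and node not in needed:
            needed.add(node)
            node = parent_of.get(node)
    return prod(float(n.get("prob", "1.0")) for n in needed)

root = ET.fromstring(DOC)
temperature = root.find("temperature")
humidity = root.find("humidity")

p_temperature = joint_probability(root, [temperature])       # 0.9 * 0.8
p_both = joint_probability(root, [temperature, humidity])    # 0.9 * 0.8 * 0.5
```

Shared ancestors are counted once, so a query over both attributes multiplies the probability of `reading` a single time rather than once per attribute.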

Figure 2: Schema for XML stream data with Top-k query advancing in parallel

The assumption of unlimited storage is not realistic in some cases. You would have to keep supplying your system with extra storage drives. The approach would also fail when the stream rates change over time: a DBMS in store-all mode may not be able to respond to rapid changes in the environment. The solution is then to extract the important and relevant information and to have the database store only that information. Ranking probabilistic data by probability and answering Top-k queries are well studied in the literature, and in this setting the execution can also be parallelized.

In [6], the authors propose that the user obtain sampling vectors of the tuples instead of all of them. By its nature this method may not work in parallel, but it relieves the database compared with storing all the data in the DBMS. However, if we take a sensor network as the application, there are distributed sites, each site samples its own distribution, and each then reports to the central DBMS, which combines the related distributions. A similar approach is studied in [7], where Top-k tuple information is collected from the sites. The collection is carried out in a distributed fashion at each site, and the central DBMS queries these parallel sites to obtain the final ranking for the whole system. This approach deals with the probabilistic side of our problem.

Streaming does occur in some sensor node applications. For instance, a network of sensors in a factory must be up at all times at many sites and must collect data whose content is streamed to the sensor nodes. Assume that XML streams with several attributes are processed at each node. An individual sensor node may run queries by evaluating all the probabilities associated with the corresponding attributes of the tuples.
The ranking of the calculated values of the tuples may be stored in a stack on the sensor node, and this stack may also be queried by the DBMS. As in [7], the central DBMS has the privilege to collect these local ranking stacks and derive the global ranking of tuples. This makes the operation distributed: the local processing power of the sensor nodes computes the local ranking of tuples, and the processing power of the central DBMS computes the global Top-k ranking. The whole scheme is given in Figure 2. Given the distributed workload, we can think of it as some procedures of a computer being carried out by low-powered processors at another level of the CPU-cache-RAM-HDD hierarchy. We do not claim that it is strictly parallel, but the workload is distributed among the sites.

A possible application area that is both probabilistic and streamed is sensor networks. Since a sensor network may contain thousands of nodes, centralized structures called sinks become the data collection points of their sites. Each site may itself have hundreds of nodes. In this extreme environment a site may be overwhelmed by data that arrives from several directions as a stream and is mainly probabilistic. The scheme in Figure 2 can handle sensor networks. Each data collection point, or sink, collects the stream data, constructs an XML file from it with attributes related to its sensing capabilities, and ranks the incoming queries according to the overall probability calculation of ProTDB [8]. The central server, which has access to these sites or sinks, can then collect the necessary rankings and calculate the global ranking of tuples and queries. In this way, with query semantics or tuple semantics, the database server can respond to specific queries or specific tuples using its Top-k stack.

Most databases have some semantic understanding of queries, and in this way we can evaluate the data and its importance. If a data item is ranked at a site, it should be stored for upcoming queries. If it is not, the application does not suffer from its absence, since the application would not need it. Furthermore, the ranking of tuples is also important: in data-centric sensor networks, more conceptual queries involving one or many sites are possible. Obtaining the global rank of tuples helps us evaluate the phenomenon occurring around the site(s). The Top-k query and Top-k tuple approaches may also be applied to our first approach, store-all.
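The division of labor between the sites and the central DBMS can be sketched as below. This is a minimal illustration under strong assumptions: every tuple has already been reduced to a single score (for example its overall probability), so the global Top-k is contained in the union of the local Top-k lists. The ranking semantics studied in [7] are more involved; the site names, tuple identifiers, and scores here are hypothetical.

```python
import heapq
from itertools import islice

def local_top_k(tuples, k):
    """Run at a site: rank (tuple_id, score) pairs and keep the k best."""
    return heapq.nlargest(k, tuples, key=lambda t: t[1])

def global_top_k(local_rankings, k):
    """Run at the central DBMS: merge the sorted per-site rankings."""
    merged = heapq.merge(*local_rankings, key=lambda t: t[1], reverse=True)
    return list(islice(merged, k))

# Hypothetical per-site tuples with precomputed scores.
sites = {
    "site_a": [("a1", 0.91), ("a2", 0.40), ("a3", 0.75)],
    "site_b": [("b1", 0.85), ("b2", 0.95), ("b3", 0.10)],
    "site_c": [("c1", 0.60), ("c2", 0.55)],
}

k = 3
local_rankings = [local_top_k(tuples, k) for tuples in sites.values()]
top = global_top_k(local_rankings, k)
```

Each site ships at most k entries, so the central DBMS merges at most k entries per site instead of the full streams.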
In the case of Top-k queries, the cache should contain the data that is frequently queried during query processing. Since historical data is stored in the secondary structure, rare queries may invoke the secondary structure. With an appropriate cache size, after a while no data should need to be fetched from secondary storage, under the assumption that queries are deterministic. In the Top-k query case, k should be chosen per application to determine the cache size. Furthermore, ranking the Top-k tuples benefits the store-all approach when the queries are specifically interested in the highest-scored tuples. Since the cache stores these tuples, the response time decreases as well. In essence, ranking queries and ranking tuples are similar paradigms that yield better response times, depending on the query pattern and the application.
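The interplay between the cache and the secondary structure can be sketched as follows. This is a minimal illustration, not a description of Aurora [1]: a plain dictionary stands in for the B-tree of historical data, and least-recently-used eviction is assumed as one simple caching policy, since the text does not fix one. The class name, its methods, and the sample keys are hypothetical.

```python
from collections import OrderedDict

class TwoLevelStore:
    """Bounded cache (primary storage) in front of a store-all secondary structure."""

    def __init__(self, cache_size):
        self.cache_size = cache_size   # e.g. chosen from k in the Top-k case
        self.cache = OrderedDict()     # least recently used entry comes first
        self.secondary = {}            # stand-in for the B-tree of historical data
        self.secondary_reads = 0       # counts the costly fallbacks

    def insert(self, key, value):
        """Store-all: every arriving item is written to secondary storage."""
        self.secondary[key] = value

    def query(self, key):
        """Answer from the cache if possible, otherwise fall back and cache."""
        if key in self.cache:
            self.cache.move_to_end(key)        # mark as recently used
            return self.cache[key]
        self.secondary_reads += 1
        value = self.secondary[key]
        self.cache[key] = value
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)     # evict the least recently used
        return value

store = TwoLevelStore(cache_size=2)
for i in range(1, 6):
    store.insert(f"t{i}", 20 + i)

# A fixed set of repeated queries that fits into the cache.
answers = [store.query(key) for key in ["t1", "t2", "t1", "t2", "t1"]]
```

In this run only the first access to each key touches the secondary structure; the repeated queries are served from the cache, which mirrors the claim that a suitably sized cache eventually removes secondary reads for a fixed query set.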

Conclusion and Future Work

We present three aspects of probabilistic XML stream databases by proposing hybrid approaches built on specific previous work [6–8]. Beyond these approaches, another hybrid approach could use filtering. YFilter [5] combines all path queries into a single nondeterministic automaton, and this kind of filtering would be helpful in our hybrid approaches. That could be another motivation for our problem. We believe that most of the proposed building blocks exist in the literature, but none of them addresses our problem exactly. We believe that our combined approaches form preliminary research on the problem and that they will attract the attention of researchers, given the growing interest in probabilistic databases.

References

[1] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2):120–139, August 2003.
[2] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2):120–139, 2003.
[3] L. Antova, T. Jansen, C. Koch, and D. Olteanu. Fast and simple relational processing of uncertain data. In ICDE ’08: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 983–992, Washington, DC, USA, 2008. IEEE Computer Society.
[4] A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: semantic foundations and query execution. The VLDB Journal, 15(2):121–142, June 2006.
[5] Y. Diao, M. Altinel, M. J. Franklin, H. Zhang, and P. Fischer. Path sharing and predicate evaluation for high-performance XML filtering. ACM Trans. Database Syst., 28(4):467–516, December 2003.
[6] T. Ge, S. Zdonik, and S. Madden. Top-k queries on uncertain data: on score distribution and typical answers. In SIGMOD ’09: Proceedings of the 35th SIGMOD international conference on Management of data, pages 375–388, New York, NY, USA, 2009. ACM.
[7] F. Li, K. Yi, and J. Jestes. Ranking distributed probabilistic data. In SIGMOD ’09: Proceedings of the 35th SIGMOD international conference on Management of data, pages 361–374, New York, NY, USA, 2009. ACM.
[8] A. Nierman and H. V. Jagadish. ProTDB: probabilistic data in XML. In VLDB ’02: Proceedings of the 28th international conference on Very Large Data Bases, pages 646–657. VLDB Endowment, 2002.
[9] S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. Hambrusch, and R. Shah. Orion 2.0: native support for uncertain data. In SIGMOD ’08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1239–1242, New York, NY, USA, 2008. ACM.
[10] M. Stonebraker, U. Çetintemel, and S. Zdonik. The 8 requirements of real-time stream processing. SIGMOD Rec., 34(4):42–47, 2005.
[11] D. Suciu and N. Dalvi. Foundations of probabilistic answers to queries. In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 963–963, New York, NY, USA, 2005. ACM.