

Zongmin Ma and Li Yan (Eds.) Soft Computing in XML Data Management


Studies in Fuzziness and Soft Computing, Volume 255

Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute, Polish Academy of Sciences
ul. Newelska 6, 01-447 Warsaw, Poland
E-mail: kacprzyk@ibspan.waw.pl

Further volumes of this series can be found on our homepage: springer.com

Vol. 238. Atanu Sengupta, Tapan Kumar Pal: Fuzzy Preference Ordering of Interval Numbers in Decision Problems, 2009. ISBN 978-3-540-89914-3
Vol. 239. Baoding Liu: Theory and Practice of Uncertain Programming, 2009. ISBN 978-3-540-89483-4
Vol. 240. Asli Celikyilmaz, I. Burhan Türksen: Modeling Uncertainty with Fuzzy Logic, 2009. ISBN 978-3-540-89923-5
Vol. 241. Jacek Kluska: Analytical Methods in Fuzzy Modeling and Control, 2009. ISBN 978-3-540-89926-6
Vol. 242. Yaochu Jin, Lipo Wang: Fuzzy Systems in Bioinformatics and Computational Biology, 2009. ISBN 978-3-540-89967-9
Vol. 243. Rudolf Seising (Ed.): Views on Fuzzy Sets and Systems from Different Perspectives, 2009. ISBN 978-3-540-93801-9
Vol. 244. Xiaodong Liu, Witold Pedrycz: Axiomatic Fuzzy Set Theory and Its Applications, 2009. ISBN 978-3-642-00401-8
Vol. 245. Xuzhu Wang, Da Ruan, Etienne E. Kerre: Mathematics of Fuzziness – Basic Issues, 2009. ISBN 978-3-540-78310-7
Vol. 246. Piedad Brox, Iluminada Castillo, Santiago Sánchez Solano: Fuzzy Logic-Based Algorithms for Video De-Interlacing, 2010. ISBN 978-3-642-10694-1
Vol. 247. Michael Glykas: Fuzzy Cognitive Maps, 2010. ISBN 978-3-642-03219-6
Vol. 248. Bing-Yuan Cao: Optimal Models and Methods with Fuzzy Quantities, 2010. ISBN 978-3-642-10710-8
Vol. 249. Bernadette Bouchon-Meunier, Luis Magdalena, Manuel Ojeda-Aciego, José-Luis Verdegay, Ronald R. Yager (Eds.): Foundations of Reasoning under Uncertainty, 2010. ISBN 978-3-642-10726-9
Vol. 250. Xiaoxia Huang: Portfolio Analysis, 2010. ISBN 978-3-642-11213-3
Vol. 251. George A. Anastassiou: Fuzzy Mathematics: Approximation Theory, 2010. ISBN 978-3-642-11219-5
Vol. 252. Cengiz Kahraman, Mesut Yavuz (Eds.): Production Engineering and Management under Fuzziness, 2010. ISBN 978-3-642-12051-0
Vol. 253. Badredine Arfi: Linguistic Fuzzy Logic Methods in Social Sciences, 2010. ISBN 978-3-642-13342-8
Vol. 254. Weldon A. Lodwick, Janusz Kacprzyk (Eds.): Fuzzy Optimization, 2010. ISBN 978-3-642-13934-5
Vol. 255. Zongmin Ma, Li Yan (Eds.): Soft Computing in XML Data Management, 2010. ISBN 978-3-642-14009-9


Zongmin Ma and Li Yan (Eds.)

Soft Computing in XML Data Management
Intelligent Systems from Decision Making to Data Mining, Web Intelligence and Computer Vision


Editors

Zongmin Ma
College of Information Science and Engineering
Northeastern University
3-11 Wenhua Road, Shenyang, Liaoning 110819, China
E-mail: zongmin_ma@yahoo.com

Li Yan
School of Software
Northeastern University
3-11 Wenhua Road, Shenyang, Liaoning 110819, China

ISBN 978-3-642-14009-9

e-ISBN 978-3-642-14010-5

DOI 10.1007/978-3-642-14010-5 Studies in Fuzziness and Soft Computing

ISSN 1434-9922

Library of Congress Control Number: 2010929475

© 2010 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

springer.com


Preface

As the de facto standard for data representation and exchange over the Web, XML (Extensible Markup Language) allows the easy development of applications that exchange data over the Web. This creates a set of data management requirements involving XML. XML and related standards have been extensively applied in many business, service, and multimedia applications, and as a result a large volume of data is managed today directly in XML format. With the wide and deep use of XML in diverse application domains, particular data management needs emerge in concrete applications that challenge current XML technology, much as specialized database models and systems were developed so that databases could manage diverse data well.

In data- and knowledge-intensive application systems, one of these challenges is the need to handle imprecise and uncertain information in XML data management by applying fuzzy logic, probability theory, and, more generally, soft computing. Currently, two kinds of situations can be roughly identified in soft computing for XML data management: applying soft computing to the intelligent processing of classical XML data, and applying soft computing to the representation and processing of imprecise and uncertain XML data. For the former, soft computing can be used for flexible querying of XML documents as well as for XML data mining, XML duplicate detection, and so on. For the latter, it is crucial for Web-based intelligent information systems to explicitly represent and process imprecise and uncertain XML data with soft computing, because XML has been extensively applied in many application domains that involve a great deal of imprecision and vagueness. Imprecise and uncertain data can be found, for example, in the integration of data sources and in data generated by non-traditional means (e.g., automatic information extraction and data acquisition by sensors and RFID).

XML is also an important component of the Semantic Web framework, and the Semantic Web provides Web data with well-defined meaning, enabling computers and people to work better in cooperation. Soft computing has been a crucial means of implementing machine intelligence; therefore, it cannot be ignored in bridging the gap between human-understandable soft logic and machine-readable hard logic. We believe that soft computing can play an important and positive role in XML data management, and the research and development of soft computing in XML data management is currently attracting increased attention.



This book covers in great depth the fast-growing topic of techniques, tools, and applications of soft computing in XML data management, showing how XML data management tasks (such as modeling, querying, and integration) can be addressed with a soft computing focus. The book aims to provide a single account of current studies in soft computing approaches to XML data management. Its objective is to provide state-of-the-art information to researchers, practitioners, and graduate students of Web intelligence, and at the same time to serve information technology professionals faced with non-traditional applications that make the application of conventional approaches difficult or impossible.

The book, which consists of twelve chapters, is organized into three major sections. The first section, containing the first four chapters, discusses the issues of uncertainty in XML. The next four chapters, covering the flexibility in XML data management supported by soft computing, comprise the second section. The third section focuses on the developments and applications of soft computing in XML data management in the final four chapters.

Chapter 1 proposes a general XML Schema definition for representing and managing fuzzy information in XML documents. Different aspects of fuzzy information are represented by starting from proposals coming from the classical database context, whose datatype classifications are extended and integrated into a complete and general approach for representing fuzzy information in XML documents using XML Schema. In particular, a fuzzy XML Schema Definition is described, taking into account the fuzzy datatypes and elements needed to fully represent fuzzy information.

Chapter 2 aims to satisfy the need to model complex objects with imprecision and uncertainty in the fuzzy XML model and the fuzzy nested relational database model. After presenting the fuzzy DTD model and the fuzzy nested relational database model based on possibility distributions, a formal approach is developed to map a fuzzy DTD model to a fuzzy nested relational database schema.

Chapter 3 describes a fuzzy XML schema to represent an implementation of a fuzzy relational database that allows for similarity relations and fuzzy sets. A flat translation algorithm is provided to translate the fuzzy database implementation into a fuzzy XML document that conforms to the suggested fuzzy XML schema. The proposed algorithm is implemented within VIREX, and a demonstrating example illustrates the power of VIREX in converting fuzzy relational data into fuzzy XML.

Chapter 4 aims at automatically integrating data sources, using very simple knowledge rules to rule out most of the nonsense possibilities, combined with storing the remaining possibilities as uncertainty in the database and resolving them during querying by means of user feedback. For this purpose, the chapter introduces this "good is good enough" integration approach and explains the uncertainty model used to capture the remaining integration possibilities. It is shown that with this strategy the time necessary to integrate documents decreases drastically, while the accuracy of the integrated document increases over time.



Chapter 5 focuses on the retrieval of XML data from heterogeneous multiple sources and proposes a new approach enabling the retrieval of meaningful answers from different sources by exploiting vague querying and approximate join techniques. It essentially consists of first applying transformations to the original query to obtain relaxed versions of it, each matching the schema adopted at a single source; then using the relaxed queries to retrieve partial answers from each source; and finally combining them using information about the retrieved objects. The approach is experimentally validated and has proved effective in a P2P setting.

Chapter 6 presents a fuzzy-set-based extension to XQuery which allows users to express preferences on XML documents and retrieves documents discriminated by their satisfaction degree. This extension consists of the new xs:truth built-in data type, intended to represent gradual truth degrees, as well as the xml:truth attribute to handle satisfaction degrees in nodes of fuzzy XQuery expressions. The XQuery language is extended to declare fuzzy terms and use them in query expressions. Additionally, several kinds of expressions, such as FLWOR expressions, are fuzzified. An evaluation mechanism is presented to avoid superfluous calculation of truth degrees.

Chapter 7 describes the design and implementation of a fuzzy nested querying system for XML databases. The relevant research is outlined and examined to decide on the most fitting solution that incorporates fuzziness into a user interface intended to be attractive to naive users. The findings are applied in the implementation of a prototype which covers the intended scope of a demonstration of fuzzy nested querying. This prototype is integrated into VIREX (a user-friendly system allowing users to view and use relational data as XML) and includes an easy-to-use graphical interface that allows the user to apply fuzziness in order to search XML documents more easily.

Chapter 8 focuses on fuzzy duplicate detection in XML data, a crucial task in many applications such as data cleaning and data integration. Along two main dimensions, effectiveness and efficiency, four algorithms that have been proposed for XML fuzzy duplicate detection are described and analyzed for comparison purposes. A comparative experimental evaluation performed on both artificial and real-world data is also presented, showing the performance of the four algorithms.

Chapter 9 proposes a machine-readable fuzzy-EPC representation in XML, based on the EPC Markup Language (EPML), to conceptually represent fuzzy business process models. It reports on the design of the Fuzzy-EPC compliant schema and shows the major syntactical extensions. A realistic example (sales order checks) is sketched, showing that Fuzzy-EPML is able to serve as an adequate interchange format for fuzzy business process models.

Chapter 10 aims to design and develop an XML-based framework to represent and merge the statistical information of clinical trials in XML documents. This framework considers any valid clinical trial, including trials with partial information, and merges statistical information automatically, with the potential to add a component that extracts clinical trials information automatically. A method is developed to analyze inconsistencies among a collection of clinical trials and, if necessary, to exclude any trials that are deemed to be ineligible.



Moreover, two sets of clinical trials, on Type 2 diabetes and on neurocognitive outcomes after off-pump versus on-pump coronary revascularisation, are used to illustrate the framework.

Chapter 11 presents the main characteristics of a new fuzzy database, Aliança (Alliance). The system is the union of fuzzy logic techniques, a relational database management system, and a fuzzy meta-knowledge base defined in XML. Aliança accepts a wide range of data types, including all information already handled by traditional databases, and incorporates different forms of representing fuzzy data. The system uses XML to represent meta-knowledge, which makes the structure of imprecise information easy to maintain and understand. Aliança is also designed to allow easy upgrading of traditional database systems. The Aliança fuzzy database architecture brings the interaction with databases closer to the way humans usually reason.

Chapter 12 presents SUNRISE (System for Unified Network Routing, Indexing and Semantic Exploration) for XML data sharing. Aiming at semantic interoperability in heterogeneous networks, SUNRISE is a PDMS (Peer Data Management System) infrastructure which leverages the semantic approximations originating from the heterogeneity of schemas for an effective and efficient organization and exploration of the network. SUNRISE implements soft computing techniques which cluster peers into Semantic Overlay Networks according to their contents and promote the routing of queries towards the semantically best directions in the network.

Acknowledgements

We wish to thank all of the authors for their insights and excellent contributions to this book, and we would like to acknowledge the help of all involved in the collation and review process of the book. Thanks go to all those who provided constructive and comprehensive reviews, and to Janusz Kacprzyk, the series editor of Studies in Fuzziness and Soft Computing, and Thomas Ditzinger, the senior editor of Applied Sciences and Engineering of Springer-Verlag, for their support in the preparation of this volume. The idea of editing this volume stems from our initial research work, which is supported by the National Natural Science Foundation of China (60873010), the Fundamental Research Funds for the Central Universities (N090504005 & N090604012), and the Program for New Century Excellent Talents in University (NCET-05-0288).

Northeastern University, China
April 2010

Zongmin Ma
Li Yan


Contents

Part I: Uncertainty in XML

An XML Schema for Managing Fuzzy Documents . . . . . . . . . . . 3
Barbara Oliboni, Gabriele Pozzani

Formal Translation from Fuzzy XML to Fuzzy Nested Relational Database Schema . . . . . . . . . . . 35
Li Yan, Jian Liu, Z.M. Ma

Human Centric Data Representation: From Fuzzy Relational Databases into Fuzzy XML . . . . . . . . . . . 55
Keivan Kianmehr, Tansel Özyer, Anthony Lo, Jamal Jida, Alnaar Jiwani, Yasin Alimohamed, Krista Spence, Reda Alhajj

Data Integration Using Uncertain XML . . . . . . . . . . . 79
Ander de Keijzer

Part II: Flexibility in XML Data Management

Exploiting Vague Queries to Collect Data from Heterogeneous XML Sources . . . . . . . . . . . 107
Bettina Fazzinga

Fuzzy XQuery . . . . . . . . . . . 133
Marlene Goncalves, Leonid Tineo

Attractive Interface for XML: Convincing Naive Users to Go Online . . . . . . . . . . . 165
Keivan Kianmehr, Jamal Jida, Allan Chan, Nancy Situ, Kim Wong, Reda Alhajj, Jon Rokne, Ken Barker

An Overview of XML Duplicate Detection Algorithms . . . . . . . . . . . 193
Pável Calado, Melanie Herschel, Luís Leitão

Part III: Developments and Applications

Fuzzy-EPC Markup Language: XML Based Interchange Formats for Fuzzy Process Models . . . . . . . . . . . 227
Oliver Thomas, Thorsten Dollmann

An XML Based Framework for Merging Incomplete and Inconsistent Statistical Information from Clinical Trials . . . . . . . . . . . 259
Jianbing Ma, Weiru Liu, Anthony Hunter, Weiya Zhang

Aliança: A Proposal for a Fuzzy Database Architecture Incorporating XML . . . . . . . . . . . 291
Raquel D. Rodrigues, Adriano J. de O. Cruz, Rafael T. Cavalcanti

Leveraging Semantic Approximations in Heterogeneous XML Data Sharing Networks: The SUNRISE Approach . . . . . . . . . . . 315
Federica Mandreoli, Riccardo Martoglia, Wilma Penzo, Simona Sassatelli, Giorgio Villani

Author Index . . . . . . . . . . . 351



Part I: Uncertainty in XML



An XML Schema for Managing Fuzzy Documents

Barbara Oliboni and Gabriele Pozzani
Department of Computer Science, University of Verona, Italy
e-mail: barbara.oliboni@univr.it, gabriele.pozzani@univr.it

Abstract. Topics related to fuzzy data have been investigated in the classical database research field and, in recent years, have become interesting in the XML data context as well. In this work, we consider issues related to the representation and management of fuzzy data in XML documents. We propose to represent different aspects of fuzzy information by starting from proposals coming from the classical database context. We extend and integrate their datatype classifications in order to propose a complete and general approach for representing fuzzy information in XML documents by using XML Schema. In particular, we describe a fuzzy XML Schema Definition taking into account the fuzzy datatypes and elements needed to fully represent fuzzy information.

1 Introduction

Issues related to the representation, processing, and management of information in a flexible way appear in several research areas (e.g., artificial intelligence, databases and information systems, data mining, and knowledge representation). Requirements related to fuzziness come from the observation that human reasoning is not as exact and precise as computation usually is: humans do not follow precise, invariable rules. Moreover, in some applications data come with errors or are inherently imprecise, since their values are subjective (e.g., values representing customer satisfaction degrees). Thus, it has been natural for researchers to try to incorporate flexible features in software. Hence, several proposals deal with




problems related to the representation and processing of imprecise data. Many of them start from the theories formulated by Zadeh [36], who formalized notions related to fuzziness and uncertain data representation by presenting a theory of fuzzy sets, possibility theory, and similarity relations. These notions are the basic ones used in many proposals related to the representation of imprecise data in classical databases, making them more flexible [13, 21, 25, 24, 27, 28, 40, 41]. As an example, fuzzy databases allow one to represent the uncertainty of physical measures or subjective human preferences. On the other hand, fuzzy processing of data allows one to answer queries by returning not only exactly matching data but also data similar to the requested ones. In this way, the system is able to get around errors in query formulation coming from user misunderstanding or from incomplete information representation.

Among all proposals about fuzzy databases, we consider the GEFRED model [21], which is based on generalized fuzzy domains and relations and allows one to represent possibility distributions, similarity relations, linguistic labels, and all other fuzzy concepts and datatypes. The GEFRED model was extended by Galindo et al. [13] to define a complete database system capable of managing fuzzy information. To extend the GEFRED model, they define a fuzzy ER conceptual model, a fuzzy relational database, and an extended SQL language (FSQL) able to manage fuzzy data. In this work, we consider the model proposed by Galindo et al., and in particular their fuzzy data type classification, as a starting point for classifying the data types needed to represent fuzzy information in XML documents.

Since XML has established itself as a standard for representing and exchanging information on the net, topics related to the modeling of fuzzy data are very interesting in the XML data context as well. A few proposals in the literature deal with the representation of fuzzy information in XML documents [14, 19, 20, 26] by considering different aspects. In our proposal, we adopt the data type classification defined in [13] for the relational database context and adapt it to the XML data context. In order to manage data types, differently from other related approaches, we choose to use XML Schema [32] instead of DTD [23]. DTD is included in the XML 1.0 standard [23], and thus it is widely used and supported in applications. However, DTD has some limitations: it does not support newer XML features (e.g., namespaces), it lacks expressivity, and it uses a non-XML syntax to describe the grammar. All these limitations are overcome by XML Schema [32]. XML Schema can be used to express a set of rules to which an XML document must conform in order to be considered "valid" (with respect to that schema), and it provides an object-oriented approach to the definition of XML elements and datatypes. Moreover, it is compatible with other XML technologies like Web services, XQuery (for XML document querying), and XSLT (for XML document presentation). Thus, we propose a general approach for representing fuzzy information in XML documents by using XML Schema. We describe a fuzzy XML Schema definition taking into account the fuzzy data types and elements needed to fully represent fuzzy information.



Our proposal of an XML Schema able to represent fuzzy data can be used by any organization or system managing uncertain data. Such users may need to exchange fuzzy information among different subsystems, locally or over the net, and the use of fuzzy XML documents may be a good solution. Moreover, fuzzy XML documents can be used by these systems as a storage method for collected fuzzy data. Since currently no DBMS implements fuzzy capabilities, and developing a fuzzy extension for an existing DBMS may require too much effort, fuzzy XML documents can be a simple way to store and manage fuzzy information, as already happens for classical data. Our proposal can help in organizing these data by providing a common and complete reference Schema for representing fuzzy data.

This work is structured as follows: in Section 2 we present some background notions useful to better understand the context of this proposal. In Section 3 we present our proposal of an XML Schema definition introducing the new fuzzy datatypes and elements needed to represent fuzzy information in an XML document. In Section 4 we give an example of an XML document satisfying the proposed Schema, considering information managed by a weather station. In Section 5 we further extend the proposed Schema to allow the representation of some information useful during the fuzzy processing of an XML document. Some examples of this fuzzy processing information are illustrated in Section 6. In Section 7 we discuss how a classical XML document can be changed in order to comply with our fuzzy XML Schema proposal and be able to represent fuzzy data. In Section 8 we give a brief description of other approaches presented in the literature about the representation and querying of fuzzy XML documents. Finally, in Section 9 we sketch some conclusions and future research directions.

2 Background

In this section we briefly report some background notions on fuzziness, on relational databases dealing with fuzzy data, and on XML.

Several proposals deal with the representation of uncertain data in databases. The relational approach [6, 7, 8] introduced the NULL value in order to represent unknown attribute values (i.e., no value is applicable or all values in the domain are possible), which leads to a three-valued logic. Later on, for example in the Umano-Fukami model [27, 28], the NULL value was further differentiated by introducing the fuzzy values UNKNOWN, UNDEFINED, and NULL. UNKNOWN means that any value in the domain is possible, UNDEFINED means that none of the values in the domain is possible, and NULL (which differs from the null pointer) means that we do not know anything; in other words, the value may be either undefined or unknown. However, more systematic approaches to fuzzy databases started from the notion of fuzzy set and other related notions.

The definition of fuzzy set, introduced by Zadeh in [36], is based on the classical notion of set and extends it to introduce flexibility. In the classical definition, a set S



on a domain D is defined by a boolean function μ : D → {0, 1} that tells us whether an object in D belongs (1) or does not belong (0) to S; μ is called the membership function of S. The membership function associated with a fuzzy set F is a function μF : D → [0, 1] valued in the real unit interval [0, 1]. Thus, in a fuzzy set, each object in D belongs to the set with a certain degree, i.e., each object is associated with a membership degree.

In 1971 Zadeh introduced the notion of similarity relation [37]: given a set of objects, a similarity relation defines the similarity degree between any pair of objects, i.e., how similar two objects are to each other. By using similarity relations, users can retrieve not only a requested object but also the similar ones, introducing fuzziness into queries. The use of similarity relations inside the relational model was introduced in the Buckles-Petry model [4] to bring fuzzy capabilities to relational databases. Moreover, in [38], Zadeh extended fuzzy set theory by introducing possibility theory, an alternative to probability theory. This notion was further extended by Dubois and Prade in [11] and subsequent work. A possibility distribution is based on the relationship between the linguistic variable and fuzzy set notions; it is determined by the question "Is x A?", where A is a fuzzy set on domain X and x is a variable on X. The use of possibility theory in the relational model was introduced in three main models: the Prade-Testemale model [25, 24], the Umano-Fukami model [27, 28], and the Zemankova-Kandel model [40, 41].

All the above fuzzy approaches and models have been joined in the GEFRED model of Medina, Pons, and Vila [21]. The GEFRED model is based on generalized fuzzy domains and relations, which extend classical domains and relations and allow one to represent possibility distributions, similarity relations, linguistic labels, and other fuzzy concepts and datatypes. The GEFRED model was extended by Galindo et al. [13] by defining a complete database system able to manage fuzzy information. Extending the GEFRED model, they define a fuzzy ER conceptual model, a fuzzy relational database, and an extended SQL language (FSQL) capable of managing fuzzy data. In particular, they define new fuzzy datatypes that allow one to store fuzzy values in database tables, and fuzzy degrees which allow one to incorporate other uncertainty information with several meanings. Moreover, they store some meta-data about fuzzy objects in auxiliary tables called the Fuzzy Metaknowledge Base (FMB). In this work, we start from the GEFRED model to define a suitable approach for representing fuzzy information in XML documents.

XML (eXtensible Markup Language) [23] is a markup language introduced as a simplified subset of SGML (Standard Generalized Markup Language) [16] by the World Wide Web Consortium (W3C) [29]. XML is the de facto standard for describing and exchanging data between different systems and applications over the Internet. XML is extensible because it supports user-defined elements and datatypes. The grammar for tags in an XML document is defined in a DTD (Document Type Definition) [23] to which the XML document must refer, and the elements in an XML document related to a given DTD must respect the DTD itself.



DTD is included in the XML 1.0 standard, and thus it is widely used and supported in applications. However, DTD has some limitations: it does not support newer XML features (e.g., namespaces), it lacks expressivity, and it uses a non-XML syntax to describe the grammar. All these limitations are overcome by XML Schema [32] (also called XML Schema Definition, XSD). XML Schema can be used to express a set of rules to which an XML document must conform in order to be considered "valid" (with respect to that schema). XML Schema provides an object-oriented approach to the definition of XML elements and datatypes. Moreover, it is compatible with other XML technologies like Web services, XQuery (for querying XML documents) [31], and XSLT (for presenting XML documents) [33].

Our proposal deals with the representation of fuzzy data in XML documents, is based on the extended version of the GEFRED model proposed by Galindo et al. [13], and uses XML Schema.

3 XML Schemata for Fuzzy Information

In this section we propose a fuzzy XML Schema Definition containing the new fuzzy datatypes and elements needed to represent fuzzy information, according to the extended GEFRED relational data model [13]. In particular, we define appropriate XML schemata for fuzzy datatypes and degrees and for the related auxiliary information stored in the Fuzzy Metaknowledge Base (FMB).

The definition of an XML Schema may be divided into several related schemata. Each Schema may refer to other schemata by introducing a different namespace for each of them. Namespaces allow one to refer to and use objects defined in different schemata by specifying their locations. Moreover, namespaces allow one to distinguish between different elements with the same name but with different definitions, locations, and semantics. A different XML Schema corresponds to each namespace, so that the system can retrieve the correct definition for each element. Fig. 1 depicts the relationships among the XML schemata constituting the proposed overall schema; each line represents a reference from one Schema to another. Note that the Schema base.xsd is defined just once but is referred to by all other second-level schemata.

In XML documents, data are represented in a structured way, and their structure is defined by the related XML schemata. For example, if we consider an XML document obtained from a database, its XML Schema may define that tuples are represented in elements called record and are arranged in an element named after the table. In this work we focus only on the structure of fuzzy information, supposing the user already has a general XML Schema defining the structure of the other, crisp parts of the document. In the following sections we analyse all parts of the XML Schema we propose for managing fuzzy information.



Fig. 1 Reference relations among the proposed XML schemata: FleXchema.xsd refers to FuzzyOrdType.xsd, FuzzyNonOrdSimType.xsd, FuzzyNonOrdType.xsd, degrees.xsd, FMB.xsd, and processing.xsd, each of which in turn refers to base.xsd

3.1 The Root Schema

FleXchema.xsd is the main file of the proposed schema. It defines the general structure of the fuzzy datatypes, of the FMB (see Section 3.7), and of the processing information (see Section 5), recalling definitions given in several different files that we will analyse in the following sections. First of all, we introduce the definitions of the four fuzzy datatypes that our XML Schema proposal allows one to represent:

1. classical crisp (non-fuzzy) data marked to be processed with fuzzy operations, represented by the datatype ClassicType;

<xs:complexType name="ClassicType">
  <xs:sequence>
    <xs:any namespace="http://www.w3.org/2001/XMLSchema"
            minOccurs="1" maxOccurs="1" />
  </xs:sequence>
  <xs:attribute name="info" type="xs:IDREF" use="required" />
  <xs:attribute name="type" type="xs:string" fixed="T1" use="required" />
</xs:complexType>



2. imprecise data over an ordered underlying domain, represented by the datatype FuzzyOrdType (see Section 3.3);

<xs:complexType name="FuzzyOrdType">
  <xs:sequence>
    <xs:any namespace="http://stars.sci.univr.it/FuzzyOrdType"
            minOccurs="1" maxOccurs="1" />
  </xs:sequence>
  <xs:attribute name="info" type="xs:IDREF" use="required" />
  <xs:attribute name="type" type="xs:string" fixed="T2" use="required"/>
</xs:complexType>

3. imprecise data over a discrete nonordered domain, related by a similarity relation, represented by the datatype FuzzyNonOrdSimType (see Section 3.4);

<xs:complexType name="FuzzyNonOrdSimType">
  <xs:sequence>
    <xs:any minOccurs="1" maxOccurs="1"
            namespace="http://stars.sci.univr.it/FuzzyNonOrdSimType"/>
  </xs:sequence>
  <xs:attribute name="info" type="xs:IDREF" use="required" />
  <xs:attribute name="type" type="xs:string" fixed="T3" use="required"/>
</xs:complexType>

4. imprecise data over a discrete nonordered domain, not related by a similarity relation, represented by the datatype FuzzyNonOrdType (see Section 3.5).

<xs:complexType name="FuzzyNonOrdType">
  <xs:sequence>
    <xs:any namespace="http://stars.sci.univr.it/FuzzyNonOrdType"
            minOccurs="1" maxOccurs="1" />
  </xs:sequence>
  <xs:attribute name="info" type="xs:IDREF" use="required" />
  <xs:attribute name="type" type="xs:string" fixed="T4" use="required"/>
</xs:complexType>

Each datatype is defined as an XML complexType with two required attributes. The first attribute (info) is an IDREF referring to the element in the FMB part of the document containing the meta-information (see Section 3.7) about the fuzzy object of interest. The second attribute (type) is a fixed string encoding the datatype of the considered element. The possible codings we define for the datatypes are:

• T1 for the ClassicType datatype;
• T2 for the FuzzyOrdType datatype;
• T3 for the FuzzyNonOrdSimType datatype;
• T4 for the FuzzyNonOrdType datatype.



This fixed attribute allows us to distinguish between the different fuzzy classes of datatypes. Some fuzzy datatypes (e.g., possdistr, null, unknown) are defined in several classes, and we may need a way to distinguish them in order to process them in different ways. Finally, each datatype contains a subelement representing the actual fuzzy data. These subelements are defined by using the any XML element, and each one allows one to insert an element selected from a referenced namespace. Each namespace is defined in another, external XML Schema. In particular, the any subelement in ClassicType refers to the basic XML Schema provided by the W3C [29]. In this way, it is possible to specify any value of the classical crisp datatypes (e.g., strings, integers, timestamps). The subelements in the other three datatypes refer to namespaces defined in the XML schemata proposed by us and explained in the following sections.

To better understand how these definitions may be used, let us consider the following example. It represents a classical crisp datum containing the name of a customer, where type="T1" means that the name is crisp data, and info="ABC" means that the related meta-information is contained in the FMB element with ID ABC.

<name type="T1" info="ABC"> John </name>

Up to now, we have defined datatypes able to represent the structure of fuzzy information. Finally, the main Schema introduces elements defining the structure of the new, particular parts of a fuzzy XML document: the FMB and the processing information. The FMB is a sequence of (in some cases optional) elements, each one describing a different kind of meta-information (see Section 3.7). Meta-information includes label definitions, default margins for approximate values, and similarity relations.

<xs:element name="FMB">
  <xs:complexType>
    <xs:sequence minOccurs="0" maxOccurs="1">
      <xs:element ref="xsfmb:fcl" minOccurs="1" maxOccurs="1"/>
      <xs:element ref="xsfmb:labelDefs" minOccurs="0" maxOccurs="1"/>
      <xs:element ref="xsfmb:fam" minOccurs="0" maxOccurs="1"/>
      <xs:element ref="xsfmb:simRelDefs" minOccurs="0" maxOccurs="1"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Finally, the root Schema file, FleXchema.xsd, defines the processInfo element. It is a sequence of (optional) qualifier and quantifier definitions. We will describe their definition and usage in Section 5; in particular, we will see that they are useful during fuzzy information processing.



<xs:element name="processInfo" minOccurs="0" maxOccurs="1">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="xsproc:qualifiers" minOccurs="0" maxOccurs="1"/>
      <xs:element ref="xsproc:quantifiers" minOccurs="0" maxOccurs="1"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

3.2 Basic Datatypes

In the base namespace, four basic datatypes needed in all other namespaces are defined. The simpleType probType represents the type of a possibilistic value; hence it is defined as a decimal value in the range [0, 1].

<xs:simpleType name="probType">
  <xs:restriction base="xs:decimal">
    <xs:minInclusive value="0"/>
    <xs:maxInclusive value="1"/>
  </xs:restriction>
</xs:simpleType>

The datatype labelRefType represents a reference to the ID of a label definition contained in the FMB. It is essentially a renaming of the IDREF datatype (defined by the W3C), given in order to clarify the meaning of some attributes used in other XML schemata.

<xs:complexType name="labelRefType">
  <xs:attribute name="label_id" type="xs:IDREF"/>
</xs:complexType>

The datatype ftype is the set of integer values in the range [1, 7]. It is used in the FMB definition to keep information about the fuzzy type of a fuzzy object (see Section 3.7).

<xs:simpleType name="ftype">
  <xs:restriction base="xs:positiveInteger">
    <xs:minInclusive value="1"/>
    <xs:maxInclusive value="7"/>
  </xs:restriction>
</xs:simpleType>

Finally, the datatype any defines a shorthand for the any element defined by the W3C, referring to any element and type already defined in the W3C namespace.



<xs:complexType name="any">
  <xs:sequence>
    <xs:any namespace="http://www.w3.org/2001/XMLSchema"
            minOccurs="1" maxOccurs="1" />
  </xs:sequence>
</xs:complexType>

3.3 Fuzzy Data over an Ordered Domain

The FuzzyOrdType.xsd file contains the definition of the fuzzy datatypes representing imprecise data over an ordered underlying domain. As happens in most systems allowing null values, the null value can be compared with any other type of data. The same happens in our proposal, where the values unknown, undefined, and null are defined both on ordered underlying domains and on non-ordered underlying domains. Hence, their definitions are present in this namespace, defining fuzzy ordered datatypes, and in the following ones, defining fuzzy non-ordered datatypes. The duplication of these definitions is needed because in some cases we have to process these special values differently depending on their datatype class (i.e., on the underlying domain).

<xs:element name="unknown" />
<xs:element name="undefined" />
<xs:element name="null" />

For the same reason, FuzzyOrdType also allows one to introduce any crisp data (on an ordered domain).

<xs:element name="crisp" type="xsb:any" />

The namespace with prefix xsb refers to the XML Schema base.xsd reported in the previous section. Fuzzy data over an ordered domain can include the following elements (instance sketches for some of them are given after this list):

• Linguistic labels. The use of a label consists of an IDREF to its definition. This definition, given as a name and possibly a trapezoidal form, is reported in the FMB part of the XML document (see Section 3.7). The choice to use IDREFs, storing label definitions in the FMB, reduces data redundancy in XML documents but, on the other hand, requires more complex data processing when querying XML data.

<xs:element name="label" type="xsb:labelRefType"/>

• Trapezoidal values. Trapezoidal values allow us to represent continuous possibility distributions defined by four decimal values [α, β, γ, δ] (see Fig. 2). Values between β and γ have possibility degree equal to one, values less than or equal to α and greater than or equal to δ have possibility degree equal to zero, and values in the ranges [α, β] and [γ, δ] have possibility degrees defined respectively by the lines connecting the two values (the corresponding membership function is written out after this list). We will see that labels also have a trapezoidal definition; however, trapezoidal values allow us to define a trapezoidal distribution without having a label for it. Note that trapezoidal distributions are a general case of interval values and triangular distributions.

Fig. 2 Continuous possibility distributions on an ordered domain: (a) trapezoidal distribution over [α, β, γ, δ]; (b) interval [lb, ub]; (c) triangular distribution with central value d and margin

<xs:element name="trapezoidal">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="alpha" type="xs:decimal"/>
      <xs:element name="beta" type="xs:decimal"/>
      <xs:element name="gamma" type="xs:decimal"/>
      <xs:element name="delta" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

• Intervals. Intervals are special cases of trapezoidal values where α = β and γ = δ. They are thus defined by two decimal values, named lb and ub in the Schema, such that all values in the range [lb, ub] (see Fig. 2(b)) have possibility degree equal to one, while all other values have possibility degree equal to zero.

<xs:element name="interval">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="lb" type="xs:decimal" />
      <xs:element name="ub" type="xs:decimal" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

• Approximate values. Approximate values represent triangular possibility distributions. They are defined by a central value d and a margin value around it (see Fig. 2(c)). Hence, a triangular distribution is a special case of a trapezoidal one where β = γ and where α and δ are equidistant from the central value. Only the value d has possibility degree equal to one, and all values outside the range [d − margin, d + margin] have possibility degree equal to zero. In an approximate value the margin can be omitted; in this case we use the default margin stored in the FMB (see Section 3.7).



<xs:element name="approxvalue">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="d" type="xs:decimal" />
      <xs:element name="margin" type="xs:decimal" minOccurs="0" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

• Possibility distributions. The XML element possdistr allows one to define a discrete possibility distribution, represented as a set (finite, but with unbounded maximum cardinality) of pairs (p, d), meaning that value d has possibility degree equal to p. We do not wrap each pair inside an ad-hoc element because pairs can be recognized correctly by reading elements two-by-two. The d value may be of any datatype on an ordered domain; however, the system must check that all values inside the same possibility distribution have the same type. Possibility degrees p have the type probType defined in the base namespace, whose possible values, we recall, are in the range [0, 1].

<xs:element name="possdistr">
  <xs:complexType>
    <xs:sequence maxOccurs="unbounded" minOccurs="1">
      <xs:element name="p" type="xsb:probType"/>
      <xs:element name="d" type="xsb:any"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
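To make the list above more concrete, we sketch a few instances here; the t2 namespace prefix follows the weather-station example of Section 4, while the numeric values and the label ID WARM are our own assumptions. The trapezoidal shape described above corresponds to the membership function

\[
\mu_{[\alpha,\beta,\gamma,\delta]}(x) =
\begin{cases}
0 & \text{if } x \le \alpha \text{ or } x \ge \delta,\\
(x-\alpha)/(\beta-\alpha) & \text{if } \alpha < x < \beta,\\
1 & \text{if } \beta \le x \le \gamma,\\
(\delta-x)/(\delta-\gamma) & \text{if } \gamma < x < \delta.
\end{cases}
\]

A label reference, an approximate value ("about 25, give or take 2"), and a discrete possibility distribution in which 26 is fully possible and 24 is possible with degree 0.7 could then be written as:

<!-- label ID WARM is hypothetical; it would be defined in the FMB (Section 3.7) -->
<t2:label label_id="WARM"/>

<t2:approxvalue>
  <t2:d>25</t2:d>
  <t2:margin>2</t2:margin>
</t2:approxvalue>

<t2:possdistr>
  <t2:p>0.7</t2:p> <t2:d>24</t2:d>
  <t2:p>1</t2:p> <t2:d>26</t2:d>
</t2:possdistr>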

3.4 Fuzzy Data over a Nonordered Domain with Similarity Relations

The datatype FuzzyNonOrdSimType defines the possible values of fuzzy objects over a nonordered domain. As we said in the previous section, the possible values of this datatype include the unknown, undefined, and null values, defined exactly as for the ones on an ordered domain. This datatype allows one to define possibility distributions composed of pairs (p, d), where d is a label whose possibility degree is p. The d XML element is defined as a reference to a label whose definition is contained in the FMB. Note that, since the underlying domain is nonordered, these labels do not have a trapezoidal definition (this constraint must be checked by the system). Moreover, the values (represented by labels) are related by a similarity relation. For this reason, the XML element possdistr in this Schema also has a required IDREF attribute (simRel) referring to a similarity relation defined in the FMB.



<xs:element name="possdistr">
  <xs:complexType>
    <xs:sequence maxOccurs="unbounded" minOccurs="1">
      <xs:element name="p" type="xsb:probType"/>
      <xs:element name="d" type="xsb:labelRefType"/>
    </xs:sequence>
    <xs:attribute name="simRel" type="xs:IDREF" use="required" />
  </xs:complexType>
</xs:element>

3.5 Fuzzy Data over a Nonordered Domain without Similarity Relations

The datatype FuzzyNonOrdType is very similar to the previous one, FuzzyNonOrdSimType. It represents fuzzy values over a nonordered domain, including the unknown, undefined, and null values, and possibility distributions. However, in contrast to the previous datatype, in this case the values in a possibility distribution are not related by a similarity relation. For this reason, the element possdistr does not include an attribute referring to a similarity relation definition in the FMB. Hence, possibility distributions are defined just on labels without a trapezoidal definition; the use of these labels depends only on the application and its semantics.

<xs:element name="possdistr">
  <xs:complexType>
    <xs:sequence maxOccurs="unbounded" minOccurs="1">
      <xs:element name="p" type="xsb:probType"/>
      <xs:element name="d" type="xsb:labelRefType"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
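A minimal instance sketch (our own; we assume a t4 prefix bound to this namespace, and hypothetical label IDs G and B that would be defined in the FMB): a subjective judgement that is fully possibly "good" and possibly "bad" with degree 0.4 could be written as:

<t4:possdistr>
  <t4:p>1</t4:p> <t4:d label_id="G"/>
  <t4:p>0.4</t4:p> <t4:d label_id="B"/>
</t4:possdistr>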

3.6 Fuzzy Degrees

Another way to incorporate uncertainty in classical databases consists in the use of degrees. The most common use of a degree is the membership degree associated with each tuple instance, which says how much the instance belongs to the tuple. However, other kinds of degrees have been proposed in the literature: for example, the tuple degree may represent the fulfillment degree of a condition [21], the importance degree [2], the possibility degree, or the uncertainty degree [28]. Each fuzzy data model makes a different choice in the interpretation of degrees. In [13], Galindo et al. classify the degrees with respect to their use instead of with respect to their meaning. A first classification distinguishes between associated and nonassociated degrees: the former apply their value to one or more attributes, while the latter (FuzzyNonAssDegree) represent imprecise information without associating it to another attribute. Moreover, Galindo et al. classify the associated degrees



in degrees associated to one attribute, to a set of attributes, and to a whole tuple (FuzzyInstDegree). Since a degree associated to one attribute is a particular case of a degree associated to a set of attributes where the set is a singleton, we chose to represent only the latter (FuzzyAttrDegree). Thus, our Schema allows the definition of three kinds of degrees: FuzzyAttrDegree, FuzzyInstDegree, and FuzzyNonAssDegree.

• FuzzyAttrDegree introduces fuzzy degrees associated to one or more attributes of an entity instance. They are defined as an extension of the probType introduced in the base namespace, so they include a possibility value (in the range [0, 1]). Moreover, in order to keep information about the attributes to which a degree is associated, the degree has an IDREFS attribute (refTo) that refers to the IDs of these elements. These ID references refer to the FMB definition of the elements (see Section 3.7); in order to retrieve the actual values to which the degree is associated, we must find the sibling elements of the degree in the tuple that have the same IDREF. Note that this query is supported by XPath [30]. Finally, each associated degree includes an info IDREF attribute referring to the meta-information in the FMB about its definition.

<xs:complexType name="fuzzyAttrDegree">
  <xs:simpleContent>
    <xs:extension base="xsb:probType">
      <xs:attribute name="refTo" type="xs:IDREFS" use="required" />
      <xs:attribute name="info" type="xs:IDREF" use="required" />
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>

• FuzzyInstDegree represents degrees associated to the whole instance of an entity; thus they do not need to refer to anything and are simply reported as a child of the instance with which they are associated. Their definition is equal to the one for degrees associated to attributes, but without the refTo attribute (a small usage sketch is given after this list).

<xs:complexType name="fuzzyInstDegree">
  <xs:simpleContent>
    <xs:extension base="xsb:probType">
      <xs:attribute name="info" type="xs:IDREF" use="required" />
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>

• FuzzyNonAssDegree represents degrees that are associated neither to attributes nor to an instance. They are reported inside instances of an entity, but their meaning is not fixed in advance; it can be specified by the user in the string attribute meaning. Like the other kinds of degrees, non-associated degrees include a possibility value F and an info IDREF attribute needed to retrieve the meta-information about the degree in the FMB. The choice to include the meaning inside degrees, instead of inside their meta-information, allows the user to retrieve the meaning of a degree more easily, reducing the data processing complexity.

<xs:complexType name="fuzzyNonAssDegree">
  <xs:sequence>
    <xs:element name="F" type="xsb:probType"/>
    <xs:element name="meaning" type="xs:string"/>
  </xs:sequence>
  <xs:attribute name="info" type="xs:IDREF" use="required" />
</xs:complexType>

Table 1 ftype encoding

  ftype   fuzzy object
  1       classical crisp data
  2       fuzzy data over an ordered domain
  3       fuzzy data over a nonordered domain with similarity relation
  4       fuzzy data over a nonordered domain without similarity relation
  5       degree associated to attributes
  6       instance degree
  7       non-associated degree
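As a usage sketch (our own; the element name membership and the FMB reference D2 are hypothetical), an instance degree simply annotates, as a child element, the whole instance that contains it; Section 4 shows attribute-associated and non-associated degrees in action:

<!-- membership degree of 0.8 for the enclosing record; names are assumed -->
<membership info="D2">0.8</membership>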





3.7 The Fuzzy Metaknowledge Base The Fuzzy Metaknowledge Base (FMB) of an XML document contains the metainformation about all fuzzy objects defined and used in the document. The main FMB information are contained in the fcl (fuzzy column list) element that reports basic and common information about all elements that may contain fuzzy data. Information about each fuzzy object are contained in an fc (fuzzy column) element inside fcl. Among these information we note: • len reports the max lenght for possibility distributions in such element (it is valid only for elements which type includes possibility distributions); • ftype reports the type (from 1 to 7) of the fuzzy object (see Table 1); • com is an user comment; • um specifies the unit measure; • sym specifies for FuzzyNonOrdSimilarityType data whether they use a symmetric or an asymmetric similarity relation. Elements com, um, and sym are optional.


18

B. Oliboni and G. Pozzani

<xs:element name="fcl"> <xs:complexType> <xs:sequence minOccurs="1" maxOccurs="unbounded"> <xs:element ref="fc"/> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="fc"> <xs:complexType> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="ftype" type="xsb:ftype"/> <xs:element name="len" type="xs:positiveInteger" minOccurs="0"/> <xs:element name="com" type="xs:string" minOccurs="0" /> <xs:element name="um" type="xs:string" minOccurs="0" /> <xs:element name="sym" type="xs:boolean" minOccurs="0" /> </xs:sequence> <xs:attribute name="id" type="xs:ID"/> </xs:complexType> </xs:element>

Since these are the main elements, they have an ID that identifies the fuzzy object. As we explained in the previous sections, any fuzzy element has an IDREF to the ID associated to its auxiliary information. These IDs are also used in other auxiliary elements to give further type-specific information. For example the user may specify the default margin for approximate values. The margins are stored in elements of type fam (fuzzy approximate much) together with the value much that defines the minimum distance needed to consider two values to be very different. <xs:element name="fam"> <xs:complexType> <xs:sequence> <xs:element name="margin" type="xs:nonNegativeInteger"/> <xs:element name="much" type="xs:positiveInteger"/> </xs:sequence> <xs:attribute name="id" type="xs:IDREF"/> </xs:complexType> </xs:element>

The FMB contains also the definition of similarity relations used in the XML document. Definitions of all similarity relations are wrapped in the simRelDefs element. Inside it, each similarity relation is contained in one simRel element having an id attribute that identifies univocally the relation inside the document and a name. A similarity relation is defined by a set of triples (sim), each one composed by two IDREFs (fid1 and fid2) refering to the two related labels and a value (degree), in range [0, 1], that specifies the similarity degree between them. Obviously, labels may appear in several similarity relations, and two labels may be related with different degrees in different similarity relations.


An XML Schema for Managing Fuzzy Documents

19

<xs:element name="simRelDefs"> <xs:complexType> <xs:sequence minOccurs="1" maxOccurs="unbounded"> <xs:element ref="simRel" /> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="simRel"> <xs:complexType> <xs:sequence minOccurs="1" maxOccurs="unbounded"> <xs:element ref="sim" /> </xs:sequence> <xs:attribute name="id" type="xs:ID" /> <xs:attribute name="name" type="xs:string" /> </xs:complexType> </xs:element> <xs:element name="sim"> <xs:complexType> <xs:sequence> <xs:element name="fid1" type="xs:IDREF" /> <xs:element name="fid2" type="xs:IDREF" /> <xs:element name="degree" type="xsb:probType" /> </xs:sequence> </xs:complexType> </xs:element>

Finally, labelDefs stores label definitions, each one inside a labelinfo element. Each label has an ID, used to refer to this label, a name (required) and a trapezoidal definition made up of four decimal subelements. <xs:element name="labelDefs"> <xs:complexType> <xs:sequence minOccurs="1" maxOccurs="unbounded"> <xs:element ref="labelinfo" /> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="labelinfo"> <xs:complexType> <xs:sequence> <xs:element name="name" type="xs:string" /> <xs:sequence minOccurs="0"> <xs:element name="alpha" type="xs:decimal" /> <xs:element name="beta" type="xs:decimal" /> <xs:element name="gamma" type="xs:decimal" /> <xs:element name="delta" type="xs:decimal" /> </xs:sequence> </xs:sequence> <xs:attribute name="label_id" type="xs:ID" /> </xs:complexType> </xs:element>


20

B. Oliboni and G. Pozzani

Note that the trapezoidal distribution is required only for labels defined over ordered domains. However, this constraint (as any other one) must be checked by the system since it cannot be expressed directly in the XML Schema.

4 Example

In this section we give a simple example of an XML document satisfying the proposed XML Schema, considering information managed by a weather station. The document represents tomorrow's forecast, in particular the temperature and the weather at different times of the day. Each forecast is contained in a record element. The time referred to in a record is classical information, but it is represented by a fuzzy element, marking it to be processed by fuzzy querying. The temperature is a numerical datum represented by a FuzzyOrdType element (because it is based on an ordered domain), while possible weathers are represented by a FuzzyNonOrdSimType element because they are based on a nonordered domain. We associate a degree (accuracy) with the temperature, representing the accuracy of the forecasted temperature. Moreover, at each time several forecasts are calculated by using different meteorological models (e.g., LAM and GCM [22]). Thus, in each record a degree (precision) represents the precision of the forecast calculated by the model at the considered time. In this work we focused only on the description of the new elements enabling the representation of fuzzy information in XML documents. However, each document also has other classical elements, and it must have its own schema. The XML Schema for the considered example has to define the elements tomorrowForecast (containing all records), record, and so on, possibly referring to the proposed fuzzy elements. The following listing reports the definition of the record element in the Schema associated with the document for the weather station. We see that fuzzy objects have types referring to the proposed ones.

<xs:element name="record">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="model" type="xs:string"/>
      <xs:element name="time" type="fuzzy:ClassicType"/>
      <xs:element name="temp" type="fuzzy:FuzzyOrdType"/>
      <xs:element name="accuracy" type="dgr:fuzzyAttrDegree"/>
      <xs:element name="weather" type="fuzzy:FuzzyNonOrdSimType"/>
      <xs:element name="precision" type="dgr:fuzzyNonAssDegree"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The following document portion reports a record about the 5 o'clock forecast calculated by the LAM model. The temperature is unknown, i.e., every value is possible (hence, its accuracy is one), while the weather is undefined. The precision element has value zero, due to the lack of information about temperature and weather. Note that, since this degree is not associated with any attribute or instance (i.e., it has type FuzzyNonAssDegree), it also contains its own meaning.



<record>
  <model>LAM</model>
  <time type="T1" info="T0">
    <hm>05:00:00</hm>
  </time>
  <temp type="T2" info="Te1">
    <t2:unknown/>
  </temp>
  <accuracy refTo="Te1" info="D1">1</accuracy>
  <weather type="T3" info="W1">
    <t3:undefined/>
  </weather>
  <precision info="P1">
    <dgr:F>0</dgr:F>
    <dgr:meaning>model forecast precision</dgr:meaning>
  </precision>
</record>

At the same time, the GCM model may report the temperature by a trapezoidal distribution [24, 25, 26, 27] with an accuracy of 0.9, while the possible weather is represented by a possibility distribution based on a similarity relation SR1. In the example, with a possibility of 80% tomorrow the weather will be sunny (referred to by the label "S"), while with a possibility of 30% it will be cloudy (referred to by the label "C"). Recall that label and similarity relation definitions are contained in the FMB.

<record>
  <model>GCM</model>
  <time type="T1" info="T0">
    <hm>05:00:00</hm>
  </time>
  <temp type="T2" info="Te1">
    <t2:trapezoidal>
      <t2:alpha>24</t2:alpha>
      <t2:beta>25</t2:beta>
      <t2:gamma>26</t2:gamma>
      <t2:delta>27</t2:delta>
    </t2:trapezoidal>
  </temp>
  <accuracy refTo="Te1" info="D1">0.9</accuracy>
  <weather type="T3" info="W1">
    <t3:possdistr simRel="SR1">
      <t3:p>1</t3:p>
      <t3:d label_id="S"></t3:d>
    </t3:possdistr>
  </weather>
  <precision info="P1">
    <dgr:F>0.86</dgr:F>
    <dgr:meaning>model forecast precision</dgr:meaning>
  </precision>
</record>

In other cases, the temperature may also be represented by an approximate value 28 ± 0.5.



<temp type="T2" info="Te1"> <t2:approxvalue> <t2:d>28</t2:d> <t2:margin>0.5</t2:margin> </t2:approxvalue> </temp>

Otherwise, temperature may be represented by a label with a trapezoidal definition (contained in the FMB).

<temp type="T2" info="Te1">
  <t2:label label_id="k4"/>
</temp>

The FMB portion of the XML document reports auxiliary information about fuzzy elements. As said in Section 3.7, the fc element contains the main basic information about them. For example, the fc element for the temperature may be the following one:

<fmb:fc id="Te1">
  <fmb:name>temp</fmb:name>
  <fmb:ftype>2</fmb:ftype>
  <fmb:com>the expected temperature</fmb:com>
  <fmb:um>Celsius degrees</fmb:um>
</fmb:fc>

where Te1 is the unique ID identifying the temp fuzzy object. Hence, it is used inside the document to link data with auxiliary information and vice versa. In the FMB we can also retrieve the definitions of the labels with IDs S (representing sunny weather), C (representing cloudy weather), and k4 (representing a possible value for the temperature).

<fmb:labelDefs>
  <fmb:labelinfo label_id="S">
    <fmb:name>sunny</fmb:name>
  </fmb:labelinfo>
  <fmb:labelinfo label_id="C">
    <fmb:name>cloudy</fmb:name>
  </fmb:labelinfo>
  <fmb:labelinfo label_id="k4">
    <fmb:name>temperature4</fmb:name>
    <fmb:alpha>27.5</fmb:alpha>
    <fmb:beta>29</fmb:beta>
    <fmb:gamma>30</fmb:gamma>
    <fmb:delta>30.5</fmb:delta>
  </fmb:labelinfo>
</fmb:labelDefs>



Labels representing sunny and cloudy weather are defined over a nonordered domain; thus they are pure linguistic labels and do not have a trapezoidal definition. The label used to represent a temperature is also defined by a trapezoidal distribution. Labels S and C are also related by a similarity relation defined inside a simRel element. This similarity relation is identified by the ID SR1 and also has a name. Inside each sim element we may retrieve a pair of objects and their similarity degree. In the reported example, sunny and cloudy are similar with a degree of 0.3.

<fmb:simRelDefs>
  <fmb:simRel id="SR1" name="SimilarityRelation1">
    <fmb:sim>
      <fmb:fid1>S</fmb:fid1>
      <fmb:fid2>C</fmb:fid2>
      <fmb:degree>0.3</fmb:degree>
    </fmb:sim>
  </fmb:simRel>
</fmb:simRelDefs>

Finally, the FMB contains information about the default margin for approximate values representing temperatures. Moreover, the threshold necessary to consider two temperatures very different is defined. In the example these two parameters have values 1 and 5, respectively.

<fmb:fam id="Te1">
  <fmb:margin>1</fmb:margin>
  <fmb:much>5</fmb:much>
</fmb:fam>

5 Fuzzy Information for Processing Documents

As seen in Section 8, some approaches to fuzzy databases (including those in the XML context) extend query languages by introducing fuzzy features into them. A possible way to incorporate fuzziness in queries is to define quantifiers and qualifiers. In this section we present our proposal for representing them in an XML document. Moreover, we continue the example from the previous section, presenting the definitions of some quantifiers and qualifiers about weather information.

5.1 Representing Fuzzy Quantifiers and Qualifiers

A qualifier is a fuzzy constant in the context of a particular attribute or degree. It is similar to a linguistic label, but it is used in queries in order to set linguistic thresholds and to make queries more understandable. Moreover, qualifiers allow one to tune queries simply by modifying their definitions. Qualifier definitions are all wrapped together in the qualifiers element. Inside it, a single qualifier definition is reported in a qualDef element. Each qualifier



has: an id attribute that identifies it, a name that represents the qualifier in queries, and a value in the range [0, 1].

<xs:element name="qualifiers">
  <xs:complexType>
    <xs:sequence minOccurs="1" maxOccurs="unbounded">
      <xs:element ref="qualDef"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="qualDef">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="qualifier" type="xsb:probType"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:ID"/>
  </xs:complexType>
</xs:element>

Fuzzy quantifiers [17, 18, 34, 39] are linguistic labels that allow us to represent uncertain quantities. They may be used in queries in order to provide the approximate number of elements fulfilling a given condition. Quantifiers may be absolute or relative. The former express quantities over the total number of objects in a set (e.g., "approximately between 25 and 35", "close to 0"); hence, absolute quantifiers range over R. The latter represent the proportion between the total number of objects in a set and the number of objects in that set complying with the stated condition; in other words, relative quantifiers measure the fulfillment quantity of a certain condition (e.g., "the majority", "about half of"). For this reason relative quantifiers are valued in the range [0, 1]. Absolute and relative quantifiers may be represented in the same form by using a trapezoidal representation [α, β, γ, δ] and keeping information about their type. Another classification divides quantifiers into those based on product and those based on sum. Moreover, they may have zero, one, or two arguments. A general definition of fuzzy quantifiers with respect to their arguments and operations is the following (a worked instance is given after the list):

• quantifiers without arguments are defined simply by their trapezoidal distribution [α, β, γ, δ];
• quantifiers with one argument x:
  – based on product: [x · α, x · β, x · γ, x · δ];
  – based on sum: [x + α, x + β, x + γ, x + δ];
• quantifiers with two arguments x and y:
  – based on product: [x · α, x · β, y · γ, y · δ];
  – based on sum: [x + α, x + β, y + γ, y + δ].
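To make the classification concrete, here is a small worked instance of the one-argument case; the base distribution and the argument value are invented for illustration.

\[
[\alpha, \beta, \gamma, \delta] = [0.5,\ 0.7,\ 1,\ 1], \qquad x = 0.8
\]
\[
\text{based on product: } [x \cdot \alpha,\ x \cdot \beta,\ x \cdot \gamma,\ x \cdot \delta] = [0.40,\ 0.56,\ 0.80,\ 0.80]
\]
\[
\text{based on sum: } [x + \alpha,\ x + \beta,\ x + \gamma,\ x + \delta] = [1.3,\ 1.5,\ 1.8,\ 1.8]
\]

In the sum-based case the resulting distribution exceeds 1; for a relative quantifier this is exactly the situation handled by the intersection with [0, 1] discussed below.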



Note that, in some cases, a relative quantifier may not lie inside the range [0, 1]. This problem can be addressed by considering only the intersection of the trapezoidal distribution associated with the quantifier with the interval [0, 1]. In our Schema proposal, all this information about a quantifier definition is contained in a quantDef element. Each quantifier is internally identified by a unique id, while it is used by referring to its name. Moreover, a quantifier definition has the following subelements:

• args ∈ {0, 1, 2} specifies the number of arguments;
• AR specifies whether the quantifier is absolute (A) or relative (R);
• SP specifies whether the quantifier is based on sum (S) or product (P); when the quantifier has no arguments, a '-' is provided.

Finally, all kinds of quantifiers have a trapezoidal definition provided by the four elements alpha, beta, gamma, and delta.

<xs:element name="quantDef">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="args">
        <xs:simpleType>
          <xs:restriction base="xs:nonNegativeInteger">
            <xs:minInclusive value="0"/>
            <xs:maxInclusive value="2"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="AR">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="A"/>
            <xs:enumeration value="R"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="SP">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="S"/>
            <xs:enumeration value="P"/>
            <xs:enumeration value="-"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
      <xs:element name="alpha" type="xs:decimal"/>
      <xs:element name="beta" type="xs:decimal"/>
      <xs:element name="gamma" type="xs:decimal"/>
      <xs:element name="delta" type="xs:decimal"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:ID"/>
  </xs:complexType>
</xs:element>



Although quantifiers and qualifiers are information used during the processing phase of XML documents, and are not really data, it may be useful to represent them inside documents. In fact, the processing phase is a very important issue for fuzzy databases and information. Consider cases in which XML documents are exchanged among several users. In these cases, it may be interesting to exchange processing information as well, in order to share not only data but also semantics and processing operators. In this way, different users can query a document and obtain the same results. However, a user remains free to use his own qualifier and quantifier definitions instead of those in the document.

6 An Example of Quantifiers and Qualifiers

Continuing the example about information managed by a weather station presented in Section 4, we may define some quantifiers and qualifiers. Their definitions are reported in the last part of an XML document. The forecasted temperature is fuzzy data and may be processed by fuzzy queries. Hence, an absolute quantifier Hot, without arguments and defined by the distribution [30, 35, 72, 72] (expressed in Celsius degrees), may be used in queries to classify temperatures overlapping it as "hot".

<proc:quantifiers>
  <proc:quantDef id="H13">
    <proc:name>Hot</proc:name>
    <proc:args>0</proc:args>
    <proc:AR>A</proc:AR>
    <proc:SP>-</proc:SP>
    <proc:alpha>30</proc:alpha>
    <proc:beta>35</proc:beta>
    <proc:gamma>72</proc:gamma>
    <proc:delta>72</proc:delta>
  </proc:quantDef>
</proc:quantifiers>

On the other hand, we may define a qualifier High, with value 0.8, that may be used as a threshold in queries about temperature. It may be used to constrain query results to comply with the query condition with a fulfillment degree greater than 80%.

<proc:qualifiers>
  <proc:qualDef id="H12">
    <proc:name>High</proc:name>
    <proc:qualifier>0.8</proc:qualifier>
  </proc:qualDef>
</proc:qualifiers>

Note that, in fuzzy queries, quantifiers and qualifiers may be used together in order to constrain results. Considering, for example, the queries about temperature cited above, we may retrieve the records whose temperature is Hot with a High fulfillment degree (i.e., the temperature overlaps for at least 80% the trapezoidal distribution defining the quantifier Hot).

7 Incorporating Fuzziness in Classical XML Documents

In this section we show how classical XML documents, and their schemata, can be modified to integrate our fuzzy XML Schema. This modification allows uncertain data to be represented in addition to the classical data already represented. The first step of this integration consists of modifying the Schema of the original document to use the fuzzy datatypes we defined. In particular, the Schema of the original document must declare new namespaces importing our proposed Schema. The namespace declarations can be done with definitions similar to the following ones:

xmlns:fuzzy="first-2"
xmlns:degree="degrees"
...
<xs:import namespace="first-2" schemaLocation="./first-2.xsd" />
<xs:import namespace="degrees" schemaLocation="./degrees.xsd" />
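Putting these declarations in context, the enclosing xs:schema element of the original document might look as follows; this is only a sketch, assuming the proposed schema files are named first-2.xsd and degrees.xsd as in the snippet above.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:fuzzy="first-2"
           xmlns:degree="degrees">
  <!-- import the proposed fuzzy datatypes and degree definitions -->
  <xs:import namespace="first-2" schemaLocation="./first-2.xsd" />
  <xs:import namespace="degrees" schemaLocation="./degrees.xsd" />
  <!-- the original element declarations follow, now free to use fuzzy:* and degree:* types -->
</xs:schema>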

Then, the designer must decide which data must be represented with a fuzzy datatype and which kind of domain, ordered or nonordered, the data of interest range over. Once the domains have been decided, each original element must be redefined, changing its type to one of the proposed fuzzy types. Data over an ordered domain must be declared with type FuzzyOrdType; data over a nonordered domain with an associated similarity relation must be declared with type FuzzyNonOrdSimType; and, finally, data over a nonordered domain without an associated similarity relation must be declared with type FuzzyNonOrdType. For instance, let us consider an XML element age representing the age of a person. The original definition of this element may be something like:

<xs:element name="age" type="xs:integer" />

On the other hand, one possible fuzzy definition of it may be:

<xs:element name="age" type="fuzzy:FuzzyOrdType" />

After this change, the age can be represented by using any kind of element defined for the datatype FuzzyOrdType, e.g., an interval, a trapezoidal distribution, an approximate value, and so on (see Section 3.3). Similar changes must be made for all the other elements with which the designer wants to be able to represent fuzzy information. The changes differ only in the fuzzy datatype the designer needs to use: classicType, FuzzyOrdType, FuzzyNonOrdSimType, or FuzzyNonOrdType. The second step of the translation of a classical XML document into its fuzzy version consists of the modification of the document itself. Of course, the usage of elements whose definition has been changed must be updated according to their new definition.



Continuing the example introduced above, the usage of the age element changes from:

<age>32</age>

to something like:

<age type="T2" info="a1">
  <t2:interval>
    <t2:lb>31</t2:lb>
    <t2:ub>34</t2:ub>
  </t2:interval>
</age>

Note that the transition from a classical XML document to a fuzzy one based on our Schema allows one not only to change the definition of the elements to a fuzzy-compliant version, but also to enrich the XML document by using degrees, quantifiers, and qualifiers.
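For instance, following the pattern of the weather-station example in Section 4, the fuzzy age element could be enriched with a degree expressing the reliability of the datum. The sketch below is purely illustrative: the reliability element (declared, like accuracy in Section 4, with type dgr:fuzzyAttrDegree) and the identifiers a1 and D2 are our own hypothetical names, not part of the original document.

<age type="T2" info="a1">
  <t2:interval>
    <t2:lb>31</t2:lb>
    <t2:ub>34</t2:ub>
  </t2:interval>
</age>
<!-- a degree associated with the age value above; D2 is a hypothetical ID -->
<reliability refTo="a1" info="D2">0.9</reliability>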

8 Related Work and Discussion

In this section we briefly describe other proposals presented in the literature related to the representation and querying of fuzzy information in XML documents. Fuzzy features may be incorporated into databases and XML data in two main ways. The former represents fuzzy information directly in the data, e.g., by extending the data model with fuzzy datatypes. The latter obtains fuzzy information by processing crisp data with query languages extended with uncertain operators. Considering general and complete systems for fuzzy data management, these two ideas are orthogonal and can be combined, yielding three approaches to fuzzy databases:

1. crisp querying of fuzzy information;
2. fuzzy querying of crisp information;
3. fuzzy querying of fuzzy information.

The proposal by Galindo et al. [13] is based on the last approach. They define new datatypes in order to represent fuzzy information and, at the same time, extend the SQL [12] query language with fuzzy operators and capabilities. In the following sections we introduce related work on the representation of fuzzy XML data (Section 8.1) and on fuzzy querying of XML documents (Section 8.2).

8.1 Representing XML Fuzzy Data

In [14], Gaurav et al. incorporated fuzzy data in XML documents by extending the XML schemata associated with these documents. They observed that fuzziness may be incorporated in the values and in the structure of XML elements. Hence, they extended



the definition of values and elements by introducing special elements representing possibility distributions and similarity relations. Possibility distributions may be introduced through the two elements <fuzzyValue> and <fuzzyDegree>. The first allows the specification of the possibility degree associated with a classical value, while the second allows the specification of the possibility with which a sub-element belongs to its parent element. The Schema proposed by Gaurav et al. permits the introduction of similarity relations by using the new element <SimilarityRelation>, which defines the pairs composing the similarity relation. The <SimilarityRelationRef> attribute may be used to refer to an already defined similarity relation. Differently from our proposal, they do not allow the use of linguistic labels and generic degrees; thus the example described in Section 4 cannot be fully implemented by using the approach proposed in [14]. The impossibility of defining linguistic labels prevents Gaurav et al. from defining trapezoidal distributions (note that trapezoidal distributions can also represent triangular distributions and intervals) with a unique name that can be referred to in several points of a document. Thus, when a trapezoidal distribution is used several times inside a document, the proposal of Gaurav et al. must specify the distribution itself each time. Conversely, our solution permits associating a name (i.e., a linguistic label) with a trapezoidal distribution in order to refer to it by that name instead of by specifying the distribution values. This approach allows us to reuse distribution definitions, reducing document size. Gaurav et al. also do not allow the representation of fuzzy degrees. Thus, they cannot associate fuzzy information with classical data. For instance, they cannot represent fuzzy information similar to the accuracy of a forecasted temperature or the precision of a whole forecast, as reported in the example in Section 4. We note that all fuzzy constructs proposed by Gaurav et al. have a corresponding representation in our Schema. A similarity relation, defined by Gaurav et al. through the <SimilarityRelation> element, is defined in our proposal in the FMB simRel element and is referred to by specifying its IDREF inside the possdistr element of datatype FuzzyNonOrdSimType. The elements <fuzzyValue> and <fuzzyDegree> defined by Gaurav et al. represent possibility distributions and tuple degrees, respectively. Possibility distributions can be represented, in our proposal, by defining a possibility distribution possdistr as specified in the FuzzyOrdType, FuzzyNonOrdSimType, and FuzzyNonOrdType datatypes. Tuple degrees are represented in our proposal through degrees associated with a whole tuple, by using FuzzyInstDegree. In [20, 19], Ma et al. defined a model for representing fuzzy information by modifying the DTD associated with an XML document. In particular, they modified the DTD by wrapping the original element definitions inside the new element <Val poss="">, which associates with the current element its possibility degree. The new element <Dist>, composed of one or more <Val> elements, allows one to define a possibility distribution in an XML document. Moreover, Ma et al. defined two types of distribution: disjunctive and conjunctive. The former represents a set of possible values of which only one is actually true at any moment; the latter represents a set of fuzzy values, each true with a different degree at any moment.



In [35], Ma et al. extend their previous work in order to incorporate fuzziness in XML documents by using XML Schema. Hence, they define the <Val> and <Dist> elements also using XML Schema and then explain how classical schemata can be modified to incorporate their new fuzzy objects. However, notice that Ma et al. introduce neither similarity relations, nor linguistic labels, nor other fuzzy datatypes. Regarding the impossibility for Ma et al. to use linguistic labels, remarks similar to those reported about [14] apply. Moreover, since Ma et al.'s proposal is not able to represent similarity relations, they cannot represent data similar to the weather situation reported in Section 4. We note that the similarity of two values cannot be inferred if these values are not numerical; thus our proposal can actually represent more information than [35]. The constructs introduced by Ma et al., <Dist> and <Val>, correspond to, and can be represented by, the possibility distributions on ordered or nonordered domains defined in our proposal. An approach similar to those reported in [20, 19], based on an extension of DTDs, is used in [26]. Turowski et al. introduced new appropriate DTDs defining the elements that allow one to represent discrete fuzzy sets (which can represent possibility distributions), continuous fuzzy sets, and linguistic variables that can be associated with fuzzy sets. However, they do not allow the use of similarity relations or degrees. On the other hand, using fuzzy sets and variables, they also define the DTDs needed to implement a fuzzy inference system able to infer the truths implied by some given facts, by using user-defined rules. Since Turowski et al.'s proposal cannot represent similarity relations, it suffers from limitations similar to those reported for the previously discussed approaches. On the other hand, we note that they can also represent generic continuous fuzzy sets, which cannot be represented by our proposal. Special cases of continuous fuzzy sets are trapezoidal and triangular distributions, and intervals. Our proposal can represent these distributions, while it cannot explicitly represent distributions with a generic trend. However, these distributions can be interpolated from discrete ones, as Turowski et al. do. In our proposal the distinction between discrete and continuous distributions is implicitly defined by the semantics of the data, while in [26] it is explicitly specified. In our proposal we allow the user to represent all the aspects related to fuzzy information. In particular, we define all the fuzzy datatypes (e.g., possibility distributions, approximate values, intervals), fuzzy degrees (with several meanings), and labels already proposed separately in several works in the literature. Moreover, we define XML schemata instead of DTDs to overcome the limitations due to the use of DTDs.

8.2 Fuzzy Querying of XML Documents Several proposals in literature deal with fuzzy querying of XML documents. In [3, 5, 9, 10], Campi et al. propose an extension for the XPath query language [30] by adding new constructions in order to introduce fuzzy querying capabilities. XPath language is based on path expressions able to state the structure and the value of elements required by the user. With respect to path expressions,



Campi et al. take into account two kinds of fuzziness: fuzziness on structure and fuzziness on values. With respect to the first, users can submit queries without specifying precisely the structure of the XML document and of the required elements; with respect to values, queries do not look only for exact value matches but also for similar values. These features are introduced by defining new fuzzy path predicates (e.g., NEAR, ABOUT, and BESIDES). Fuzzy predicates allow one to search for elements, attributes, and values similar to those actually required. For example, the expression /proceedings/article[@year NEAR 2009] retrieves article elements, children of a proceedings element, whose attribute year has a value close to 2009. On the other hand, the user may retrieve article elements that are close descendants of proceedings by using the expression /proceedings{/NEAR}/article. Fuzzy predicates can be partially satisfied by XML elements to several degrees. Hence, unlike classical XPath queries, fuzzy queries return a ranked set of nodes. The ranks associated with elements represent the similarity of the returned elements to the ones required by the query. Moreover, Campi et al. define a method allowing one to choose how the ranks for a query are calculated. Users may associate with each part of a query a variable whose value represents the degree of satisfaction of the conditions, and may define how the ranks must be calculated by combining the values bound to these variables. Finally, the proposal of Campi et al. allows users to use fuzzy quantifiers (e.g., tall) and qualifiers (e.g., very) inside predicates (e.g., height = very tall). A very similar approach to fuzzy querying is proposed by Goncalves and Tineo [15]. Using a different approach, Amer-Yahia et al. [1] do not extend XPath expressions with new predicates and operators, but introduce fuzziness by query relaxation. They define four operations (e.g., axis generalization and leaf deletion) on the structure of queries that, given a query, produce a relaxed version of it (i.e., a query containing the original one). Relaxations broaden the scope of the path expressions provided in the original query. A ranking strategy associates a penalty with each modification applied to a query through a relaxation operation. Penalties are then used to calculate how well the retrieved elements satisfy the original query. Note that, in all the proposals about fuzzy querying in the literature, query results are sets of ranked elements, where ranks represent the fulfillment degrees of the retrieved elements with respect to the query conditions.

9 Conclusion

In this work we proposed a general XML Schema definition for representing fuzzy information in XML documents. In our proposal, we represent different aspects of fuzzy information by adapting a datatype classification already proposed for the relational database context, and by integrating different kinds of fuzzy information to compose a complete definition. For future work we plan to start from documents valid with respect to the XML Schema proposed in this paper and to study topics related to the querying and



retrieval of fuzzy information. As explained in Section 8, fuzzy information can be queried by using fuzzy or crisp query languages. We note that the starting point of our future research will be different from the one assumed by previous works presented in the literature (see Section 8.2). Unlike other approaches, our work will be based on fuzzy information rather than crisp information. This difference will lead, in our opinion, to fewer modifications of existing query languages. As a matter of fact, since fuzzy capabilities are already incorporated in the document schema, query languages can exploit the structure of documents without the need for ad-hoc syntax constructs and features. In this case, we do not need to enrich the query language but rather the query engine (i.e., the part of the system responsible for interpreting and executing queries). On the other hand, some features (e.g., qualifier and quantifier usage) will require small modifications to the query language. Thus, first of all, future work in this direction must determine which features must be incorporated in a query language (e.g., XPath) for fuzzy XML documents and which others need only a particular interpretation by the query engine. After that, an extended query language with the desired fuzzy capabilities will be designed. Another possible research direction concerns how fuzzy XML documents may be used for XML Schema versioning. Considering XML documents that are instances of different versions of a given XML Schema, fuzzy XML may be used to represent the uncertainty associated with the information contained in the documents. Moreover, considering different versions of an XML Schema, our proposal may be used to represent the uncertainty associated with the elements and attributes used in the versions. Finally, fuzzy XML may represent the uncertainty associated with the operations, and with the sequences of operations, that can be used to obtain a new version of an XML Schema from other ones.

References

1. Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: FleXPath: Flexible structure and full-text querying for XML. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, pp. 83–94. ACM Press, New York (2004)
2. Bosc, P., Pivert, O.: Flexible queries in relational databases – the example of the division operator. Theoretical Computer Science 171 (1997)
3. Braga, D., Campi, A., Damiani, E., Pasi, G., Lanzi, P.: FXPath: Flexible querying of XML documents. In: Proceedings of EuroFuse 2002 (2002)
4. Buckles, B.P., Petry, F.E.: A fuzzy representation of data for relational databases. Fuzzy Sets and Systems 7(3), 213–226 (1982)
5. Campi, A., Guinea, S., Spoletini, P.: A fuzzy extension for the XPath query language. In: Larsen, H.L., Pasi, G., Ortiz-Arroyo, D., Andreasen, T., Christiansen, H. (eds.) FQAS 2006. LNCS (LNAI), vol. 4027, pp. 210–221. Springer, Heidelberg (2006)
6. Codd, E.F.: A relational model of data for large shared data banks. Communications of the ACM 13 (1970)
7. Codd, E.F.: Extending the database relational model to capture more meaning. ACM Transactions on Database Systems 4(4), 397–434 (1979)
8. Codd, E.F.: The relational model for database management. Addison-Wesley Longman Publishing Co. Inc., Boston (1990)



9. Damiani, E., Marrara, S., Pasi, G.: FuzzyXPath: Using fuzzy logic and IR features to approximately query XML documents. In: Melin, P., Castillo, O., Aguilar, L.T., Kacprzyk, J., Pedrycz, W. (eds.) IFSA 2007. LNCS (LNAI), vol. 4529, pp. 199–208. Springer, Heidelberg (2007)
10. Damiani, E., Marrara, S., Pasi, G.: A flexible extension of XPath to improve XML querying. In: Myaeng, S.H., Oard, D.W., Sebastiani, F., Chua, T.S., Leong, M.K. (eds.) Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, July 20–24, pp. 849–850. ACM, New York (2008)
11. Dubois, D., Prade, H.: Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum Press, New York (1988)
12. Elmasri, R.A., Navathe, S.B.: Fundamentals of Database Systems. Addison-Wesley Longman Publishing Co. Inc., Boston (1999)
13. Galindo, J., Urrutia, A., Piattini, M.: Fuzzy Databases: Modeling, Design, and Implementation. IGI Publishing (2006)
14. Gaurav, A., Alhajj, R.: Incorporating fuzziness in XML and mapping fuzzy relational data into fuzzy XML. In: Haddad, H. (ed.) Proceedings of the 2006 ACM Symposium on Applied Computing, pp. 456–460. ACM, New York (2006)
15. Goncalves, M., Tineo, L.: A new step towards flexible XQuery. Avances en Sistemas e Informática 4, 27–34 (2007)
16. ISO: ISO 8879:1986: Information processing — Text and office systems — Standard Generalized Markup Language, SGML (1986), http://www.iso.ch/cate/d16387.html
17. Liu, Y., Kerre, E.E.: An overview of fuzzy quantifiers (I): Interpretations. Fuzzy Sets and Systems 95(1), 1–21 (1998)
18. Liu, Y., Kerre, E.E.: An overview of fuzzy quantifiers (II): Reasoning and applications. Fuzzy Sets and Systems 95(2), 135–146 (1998)
19. Ma, Z.: Fuzzy Database Modeling with XML (The Kluwer International Series on Advances in Database Systems). Springer-Verlag New York, Inc. (2005)
20. Ma, Z.M., Yan, L.: Fuzzy XML data modeling with the UML and relational data models. Data & Knowledge Engineering 63(3), 972–996 (2007)
21. Medina, J.M., Pons, O., Vila, M.A.: GEFRED: A generalized model of fuzzy relational databases. Information Sciences 76(1-2), 87–109 (1994)
22. Nebeker, F.: Calculating the Weather: Meteorology in the 20th Century. International Geophysics Series, vol. 60. Academic Press, London (1995)
23. Paoli, J., Bray, T., Sperberg-McQueen, C.M., Yergeau, F., Maler, E.: Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C Recommendation (2006), http://www.w3.org/TR/2006/REC-xml-20060816
24. Prade, H.: Lipski's approach to incomplete information databases restated and generalized in the setting of Zadeh's possibility theory. Information Systems 9(1), 27–42 (1984)
25. Prade, H., Testemale, C.: Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences 34, 115–143 (1984)
26. Turowski, K., Weng, U.: Representing and processing fuzzy information – an XML-based approach. Knowledge-Based Systems 15(1-2), 67–75 (2002)
27. Umano, M.: FREEDOM-O: A fuzzy database system. In: Gupta, M.M., Sanchez, E. (eds.) Fuzzy Information and Decision Processes, pp. 339–349. North-Holland, Amsterdam (1982)
28. Umano, M., Fukami, S.: Fuzzy relational algebra for possibility-distribution-fuzzy-relational model of fuzzy data. Journal of Intelligent Information Systems 3(1), 7–27 (1994)



29. W3C: World Wide Web Consortium (1994), http://www.w3.org/
30. XML Path Language (XPath) Version 1.0. W3C Recommendation (1999), http://www.w3c.org/TR/xpath
31. XQuery 1.0: An XML Query Language. W3C Recommendation (2007), http://www.w3.org/TR/xquery/
32. XSD: XML Schema Definition (2004), http://www.w3.org/XML/Schema
33. XSL Transformations (XSLT). W3C Recommendation (1999), http://www.w3.org/TR/xslt
34. Yager, R.R.: Quantified propositions in a linguistic logic. International Journal of Man-Machine Studies 19(2), 195–227 (1983)
35. Yan, L., Ma, Z.M., Liu, J.: Fuzzy data modeling based on XML Schema. In: Proceedings of the 2009 ACM Symposium on Applied Computing (SAC), Honolulu, Hawaii, USA, March 9–12, pp. 1563–1567. ACM, New York (2009)
36. Zadeh, L.A.: Fuzzy sets. Information and Control 8(3), 338–353 (1965)
37. Zadeh, L.A.: Similarity relations and fuzzy orderings. Information Sciences 3, 177–200 (1971)
38. Zadeh, L.A.: Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems 1, 3–28 (1978)
39. Zadeh, L.A.: A computational approach to fuzzy quantifiers in natural language. Computers and Mathematics with Applications 9(1), 149–184 (1983)
40. Zemankova, M., Kandel, A.: Fuzzy Relational Databases — A Key to Expert Systems. Verlag TÜV Rheinland (1984)
41. Zemankova, M., Kandel, A.: Implementing imprecision in information systems. Information Sciences 37(1-3), 107–141 (1985)


Formal Translation from Fuzzy XML to Fuzzy Nested Relational Database Schema

Li Yan, Jian Liu, and Z.M. Ma

Abstract. XML has become the de-facto standard for information representation and exchange over the Web. In addition, imprecise and uncertain data are inherent in the real world. Although fuzzy information has been extensively investigated in the context of the relational model, the classical relational database model and its fuzzy extensions to date do not satisfy the need of modeling complex objects with imprecision and uncertainty, especially when the fuzzy relational databases are created by mapping fuzzy conceptual data models and the fuzzy XML data model. Based on possibility distributions, this chapter concentrates on fuzzy information modeling in the fuzzy XML model and the fuzzy nested relational database model. In particular, a formal approach to mapping a fuzzy DTD model to a fuzzy nested relational database (FNRDB) schema is developed.

1 Introduction

With the rapid development of the Internet, the requirement of managing information based on the Web has attracted much attention from both academia and industry. XML is widely regarded as the next step in the evolution of the World Wide Web and has become the de-facto standard. It aims at enhancing content on the World Wide Web. XML and related standards are flexible, allowing the easy development of applications which exchange data over the Web, such as e-commerce (EC) and supply chain management (SCM). However, this flexibility makes it challenging to develop an XML management system. To

Li Yan
School of Software, Northeastern University, Shenyang, 110819, China

Jian Liu
School of Information Science & Engineering, Northeastern University, Shenyang, 110819, China

Z.M. Ma
School of Information Science & Engineering, Northeastern University, Shenyang, 110819, China
e-mail: mazongmin@ise.neu.edu.cn

Z. Ma & L. Yan (Eds.): Soft Computing in XML Data Management, STUDFUZZ 255, pp. 35–54. springerlink.com © Springer-Verlag Berlin Heidelberg 2010



manage XML data, it is necessary to integrate XML and databases [3]. Various databases, including relational, object-oriented, and object-relational databases, have been used for mapping to and from XML documents. At the same time, some data are inherently imprecise and uncertain, since their values are subjective in real-world applications. For example, considering values representing the satisfaction degree for a film, different persons may have different satisfaction degrees. Information fuzziness has also been investigated in the context of EC and SCM [25, 30, 31], and it has been shown that fuzzy set theory is very useful in Web-based business intelligence. Fuzzy information has been extensively investigated in the context of the relational model [6, 24, 26, 28]. However, the classical relational database model and its fuzzy extensions do not satisfy the need of modeling complex objects with imprecision and uncertainty. The requirements of modeling complex objects together with information imprecision and uncertainty can be found in many application domains (e.g., multimedia applications) and have challenged current database technology [2, 7]. In order to model uncertain data and complex-valued attributes as well as complex relationships among objects, current efforts have concentrated on conceptual data models [15, 16, 21, 33], the fuzzy nested relational data model (also known as an NF2 data model) [34], and fuzzy object-oriented databases [4, 10, 12, 13, 20]. There are also efforts to conceptually design fuzzy databases using fuzzy conceptual data models [15, 16, 21, 33]. More recently, fuzzy object-relational databases have been proposed [9], which combine the characteristics of fuzzy relational databases and fuzzy object-oriented databases. One can refer to [17, 18] for recent surveys of these fuzzy data models. Although fuzzy values have been employed to model and handle imprecise information in databases since Zadeh introduced the theory of fuzzy sets [35], relatively little work has been carried out on extending XML towards the representation of imprecise and uncertain concepts. Abiteboul et al. [1] provide a model for XML documents and DTDs and a representation system for XML with incomplete information. Representations of probabilistic data in XML are proposed in other previous research papers, such as [14, 22, 27, 29]. Without presenting an XML representation model, the data fuzziness in XML documents is discussed directly according to the fuzzy relational databases in [11], where simple mappings from fuzzy relational databases to fuzzy XML documents are also provided. Oliboni and Pozzani [23] propose an XML Schema definition for representing fuzzy information, adopting the datatype classification for the XML data context. A fuzzy XML data model based on XML DTD is proposed in [19], in which the mappings of the fuzzy XML DTD (Document Type Definition) from the fuzzy UML data model and to the fuzzy relational database schema are discussed, respectively. In [32], a fuzzy XML data model based on XML Schema is developed. The classical relational database model and its fuzzy extensions do not satisfy the need of modeling complex objects with imprecision and uncertainty. This is also true when the fuzzy relational databases are created by mapping fuzzy conceptual data models and the fuzzy XML data model. Being an extension of the relational data model, the NF2 database model is able to handle complex-valued attributes and may be better



suited to some complex applications such as office automation systems, information retrieval systems, and expert database systems [34]. In [8], a fuzzy NF2 database model is proposed for managing uncertainties in images. Based on possibility distributions, this chapter concentrates on fuzzy information modeling in the fuzzy XML model and the fuzzy nested relational database model. In particular, a formal approach to mapping a fuzzy DTD model to a fuzzy nested relational database (FNRDB) schema is developed. The remainder of this chapter is organized as follows. Section 2 discusses fuzzy sets and possibility distributions. The fuzzy XML data model and fuzzy nested relational databases are introduced in Section 3. In Section 4, the approaches to mapping the fuzzy XML model to the fuzzy nested relational schema are developed. Section 5 concludes this chapter.

2 Fuzzy Sets and Possibility Distributions

Different models have been proposed to handle different categories of data quality (or lack thereof). Five basic kinds of imperfection have been identified in [5]: inconsistency, imprecision, vagueness, uncertainty, and ambiguity. Instead of giving definitions of imperfect information, we herewith explain their meanings. Inconsistency is a kind of semantic conflict, meaning that the same aspect of the real world is irreconcilably represented more than once in a database or in several different databases. For example, the age of George is stored as 34 and 37 simultaneously. Information inconsistency usually comes from information integration. Intuitively, imprecision and vagueness are relevant to the content of an attribute value: a choice must be made from a given range (interval or set) of values without knowing which one to choose. In general, vague information is represented by linguistic values. Assume, for example, that we do not know exactly the ages of two persons named Michael and John, and only know that the age of Michael may be 18, 19, 20, or 21, and that the age of John is old. Then the information about Michael's age is imprecise, denoted by a set of values {18, 19, 20, 21}, while the information about John's age is vague, denoted by a linguistic value, "old". Uncertainty is related to the degree of truth of an attribute value: with uncertainty, we can apportion some, but not all, of our belief to a given value or a group of values. For example, the possibility that the age of Chris is 35 right now may be 98%. The random uncertainty, described using probability theory, is not considered in this chapter. Ambiguity means that some elements of the model lack complete semantics, leading to several possible interpretations. Generally, several different kinds of imperfection can co-exist with respect to the same piece of information. For example, the age of Michael is a set of values {18, 19, 20, 21} whose possibilities are 70%, 95%, 98%, and 85%, respectively. Imprecision, uncertainty, and vagueness are three major types of imperfect information and can be modeled with fuzzy sets [35] and possibility theory [36].



Many of the existing approaches dealing with imprecision and uncertainty are based on the theory of fuzzy sets. The concept of fuzzy sets was originally introduced by Zadeh [35]. Let U be a universe of discourse and F be a fuzzy set in U. A membership function μF : U → [0, 1] is defined for F, where μF(u), for each u ∈ U, denotes the membership degree of u in the fuzzy set F. Thus, the fuzzy set F is described as follows:

F = {μF(u1)/u1, μF(u2)/u2, ..., μF(un)/un}

The fuzzy set F consists of some elements, just like a conventional set. But, unlike a conventional set, each element of F may or may not belong to F, having a membership degree in F which needs to be explicitly indicated. So in F, an element (say ui) is associated with its membership degree (say μF(ui)), and they occur together in the form μF(ui)/ui. When the membership degrees with which all elements of F belong to F are exactly 1, the fuzzy set F reduces to a conventional one. When the membership degree μF(u) above is interpreted as a measure of the possibility that a variable X has the value u, where X takes values in U, a fuzzy value is described by a possibility distribution πX [36]:

πX = {πX(u1)/u1, πX(u2)/u2, ..., πX(un)/un}

Here, πX(ui), ui ∈ U, denotes the possibility that ui is true. Let πX be the possibility distribution representation for the fuzzy value of a variable X. It means that the value of X is fuzzy: X may take one value from the possible values u1, u2, ..., un, and each value (say ui) is associated with its possibility degree (say πX(ui)).

Definition: A fuzzy set F of the universe of discourse U is convex if and only if for all u1, u2 in U, μF(λu1 + (1 − λ)u2) ≥ min(μF(u1), μF(u2)), where λ ∈ [0, 1].

Definition: A fuzzy set F of the universe of discourse U is called a normal fuzzy set if ∃u ∈ U, μF(u) = 1.

Definition: A fuzzy datum is a fuzzy subset of the universe of discourse U that is both convex and normal.
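To tie this notation to the example of Section 2, the imperfect information about Michael's age (values 18–21 with possibilities 70%, 95%, 98%, and 85%) can be written as a possibility distribution:

\[
\pi_{Age(Michael)} = \{0.70/18,\ 0.95/19,\ 0.98/20,\ 0.85/21\}
\]

Here, for instance, \(\pi_{Age(Michael)}(19) = 0.95\) reads: the possibility that Michael is 19 years old is 0.95.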

3 Representation of Fuzzy Data in XML and Nested Relational Databases

This section focuses on fuzzy data modeling in the XML data model and the nested relational model. First we introduce some notions and notations of the fuzzy XML model proposed in [19], and then we present an extension of the extended possibility-based fuzzy nested relational databases.



3.1 Fuzzy XML Model

There are two kinds of fuzziness in XML documents: the first is fuzziness in elements (we use membership degrees associated with such elements); the second is fuzziness in the attribute values of elements (we use possibility distributions to represent such values). Note that, for the latter, there exist two types of possibility distribution (i.e., disjunctive and conjunctive possibility distributions), and they may occur in child elements with or without further child elements in the ancestor-descendant chain. Fig. 1 gives a fragment of an XML document with fuzzy information, which appeared in [19].

<universities>
  <university UName="Oakland University">
    <Val Poss="0.8">
      <department DName="Computer Science and Engineering">
        <employee FID="85431095">
          <Dist type="disjunctive">
            <Val Poss="0.8">
              <fname>Frank Yager</fname>
              <position>Associate Professor</position>
              <office>B1024</office>
              <course>Advances in Database Systems</course>
            </Val>
            <Val Poss="0.6">
              <fname>Frank Yager</fname>
              <position>Professor</position>
              <office>B1024</office>
              <course>Advances in Database Systems</course>
            </Val>
          </Dist>
        </employee>
        <student SID="96421027">
          <sname>Tom Smith</sname>
          <age>
            <Dist type="disjunctive">
              <Val Poss="0.4">23</Val>
              <Val Poss="0.6">25</Val>
              <Val Poss="0.8">27</Val>
              <Val Poss="1.0">29</Val>
              <Val Poss="1.0">30</Val>
              <Val Poss="1.0">31</Val>
              <Val Poss="0.8">33</Val>
              <Val Poss="0.6">35</Val>
              <Val Poss="0.4">37</Val>
            </Dist>
          </age>
          <sex>Male</sex>
          <email>
            <Dist type="conjunctive">
              <Val Poss="0.60">TSmith@yahoo.com</Val>
              <Val Poss="0.85">Tom_Smith@yahoo.com</Val>
              <Val Poss="0.85">Tom_Smith@hotmail.com</Val>
              <Val Poss="0.55">TSmith@hotmail.com</Val>
              <Val Poss="0.45">TSmith@msn.com</Val>
            </Dist>
          </email>
        </student>
      </department>
    </Val>
  </university>
  <university UName="Wayne State University">
  </university>
</universities>

Fig. 1 A Fragment of an XML Document with Fuzzy Data

The example above talks about the universities in an area of a given city, say Detroit, Michigan, in the USA. Wayne State University is located in downtown Detroit, and the possibility that it is included in the universities in Detroit is 1. Oakland University, however, is located in a nearby county of Michigan, named Oakland. Whether Oakland University is included in the universities in Detroit depends on how the area of Detroit is defined: the Greater Detroit Area or only the city of Detroit. Assume that this is unknown, so the possibility that Oakland University is included in the universities in Detroit is assigned 0.8. Also suppose that an employee, Frank Yager, at Oakland University is in the process of being promoted. The possibility that he is an associate professor, teaches a course called Advances in Database Systems, and occupies the office called B1024 is 0.8. The possibility that he is a professor, teaches a course called Advances in Database Systems, and occupies the office called B1024 is 0.6. A student, Tom Smith, has fuzzy values in the attributes age and email, which are represented by a disjunctive possibility distribution and a conjunctive possibility distribution, respectively. The basic data structure of the fuzzy XML data model is the data tree. In the following, we introduce some important concepts used in our proposed fuzzy XML model.



Definition: Let V be a finite set (of vertices), E ⊆ V × V be a set (of edges), and l : E → Γ be a mapping from edges to a set Γ of strings called labels. The triple G = (V, E, l) is an edge-labeled directed graph.

Based on the data tree, we introduce the definition of the fuzzy XML data tree.

Definition: A fuzzy XML data tree F is a 6-tuple F = (V, ψ, l, τ, κ, δ), where

• V = {V1, …, Vn} is a finite set of vertices.
• ψ ⊂ {(Vi, Vj) | Vi, Vj ∈ V}, and (V, ψ) is a directed tree.
• l : V → (L ∪ {null}), where L is a set of labels. For each object v ∈ V and each label ∇ ∈ L, l(v, ∇) specifies the set of objects that may be children of v with label ∇.
• τ : V → T, where T is a set of types.
• κ is a mapping which constrains the number of children with a given label. κ associates with each object v ∈ V and each label ∇ ∈ L an integer-valued interval function κ(v, ∇) = [min, max], where min ≥ 0 and max ≥ min. We use κ to represent the lower and upper bounds.
• δ is a mapping from the set of objects v ∈ V to local possibility functions. It defines the possibility that a set of children of an object exists, given that the parent object exists.

Definition: Suppose F = (V, ψ, l, τ, κ, δ) and f' = (V', ψ', l', τ', κ', δ') are two fuzzy data trees. f' is a sub-tree of F, written f' ∝ F, when

• V' ⊆ V and ψ' = ψ ∩ (V' × V');
• if i ∈ V' and (j, i) ∈ ψ, then j ∈ V';
• l' and τ' denote the restrictions of l and τ to the nodes in V', respectively;
• κ' ∈ κ.

Definition: Let the fuzzy data trees f1 = (V1, ψ1, l1, τ1, κ1, δ1) and f2 = (V2, ψ2, l2, τ2, κ2, δ2) be sub-trees of F = (V, ψ, l, τ, κ, δ). f1 and f2 are isomorphic (written f1 ≌ f2) when

• V1 ∪ V2 ⊆ V, ψ1 ∪ ψ2 ⊆ ψ, and τ1 ∪ τ2 ⊆ τ;
• there is a one-to-one mapping ξl : l1 → l2 such that, for every label, ξl(l1) = l2.

Theorem: A fuzzy data tree F and its sub-tree f' are isomorphic.

The above theorem follows from the analysis of the two preceding definitions and is quite straightforward. Several fuzzy constructs have been introduced for fuzzy XML data modeling. In order to accommodate these fuzzy constructs, it is clear that the DTD of the source XML document should be correspondingly modified. Next, we focus on DTD modification for fuzzy XML data modeling. First we define the Val element as follows:

<!ELEMENT Val (#PCDATA | original-definition)>
<!ATTLIST Val Poss CDATA "1.0">

Then we define the Dist element as follows:



<!ELEMENT Dist (Val+)>
<!ATTLIST Dist type (disjunctive | conjunctive) "disjunctive">

Now we modify the element definitions in the classical DTD so that all of the elements can use possibility distributions (Dist). For a leaf element which only contains text or #PCDATA, say leafElement, its definition in the DTD is changed from

<!ELEMENT leafElement (#PCDATA)>

to

<!ELEMENT leafElement (#PCDATA | Dist)>.

That is, leaf element leafElement may be a crisp one (e.g., sname of student in Fig. 1), and then could be defined as

<!ELEMENT leafElement (#PCDATA)>.

Also, it is possible that leaf element leafElement is a fuzzy one, taking a value represented by a possibility distribution (e.g., age of student in Fig. 1). Then it may be defined as

<!ELEMENT leafElement (Dist)>.

Furthermore, we have the following definitions.

<!ELEMENT Dist (Val+)>
<!ATTLIST Dist type (disjunctive | conjunctive) "disjunctive">
<!ELEMENT Val (#PCDATA)>
<!ATTLIST Val Poss CDATA "1.0">

For the non-leaf element, say nonleafElement, first we should change the element definition from

<!ELEMENT nonleafElement (original-definition)>

to

<!ELEMENT nonleafElement (original-definition | Val+ | Dist)>

and then add

<!ELEMENT Val (original-definition)>

That is, the non-leaf element nonleafElement may be crisp (e.g., student in Fig. 1) and then may be defined as

<!ELEMENT nonleafElement (original-definition)>

When the non-leaf element nonleafElement is a fuzzy one, we differentiate two situations: first, the element takes a value connected with a possibility degree (e.g., university in Fig. 1); second, the element takes a set of values, each connected with a possibility degree (e.g., employee in Fig. 1). The former element is defined as follows.

<!ELEMENT nonleafElement (Val+)>
<!ELEMENT Val (original-definition)>
<!ATTLIST Val Poss CDATA "1.0">



The latter element is defined as

<!ELEMENT nonleafElement (Dist)>
<!ELEMENT Dist (Val+)>
<!ATTLIST Dist type (disjunctive | conjunctive) "disjunctive">
<!ELEMENT Val (original-definition)>
<!ATTLIST Val Poss CDATA "1.0">

The DTD of the XML document in Fig. 1 is then shown in Fig. 2.

<!ELEMENT universities (university*)>
<!ELEMENT university (Val+)>
<!ATTLIST university UName IDREF #REQUIRED>
<!ELEMENT Val (department*)>
<!ATTLIST Val Poss CDATA "1.0">
<!ELEMENT department (employee*, student*)>
<!ATTLIST department DName IDREF #REQUIRED>
<!ELEMENT employee (Dist)>
<!ATTLIST employee FID IDREF #REQUIRED>
<!ELEMENT Val (fname?, position?, office?, course?)>
<!ATTLIST Val Poss CDATA "1.0">
<!ELEMENT student (sname?, age?, sex?, email?)>
<!ATTLIST student SID IDREF #REQUIRED>
<!ELEMENT fname (#PCDATA)>
<!ELEMENT position (#PCDATA)>
<!ELEMENT office (#PCDATA)>
<!ELEMENT course (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT age (Dist)>
<!ELEMENT Dist (Val+)>
<!ATTLIST Dist type (disjunctive)>
<!ELEMENT sex (#PCDATA)>
<!ELEMENT email (Dist)>
<!ELEMENT Dist (Val+)>
<!ATTLIST Dist type (conjunctive)>
<!ELEMENT Val (#PCDATA)>
<!ATTLIST Val Poss CDATA "1.0">

Fig. 2 The DTD of the Fuzzy XML Document in Fig. 1

3.2 Fuzzy Nested Relational Model

A fuzzy NF2 relational schema is a set of attributes (A1, A2, ..., An, pM) with domains D1, D2, ..., Dn, D0, respectively, where Di (1 ≤ i ≤ n) can be one of the following:



(1) The set of atomic values. Each element ai ∈ Di is a typical simple crisp attribute value.
(2) The set of null values, denoted ndom, where a null value may be unk, inap, nin, or onul.
(3) The set of fuzzy subsets. The corresponding attribute value is extended possibility-based fuzzy data.
(4) The power set of the set in (1). The corresponding attribute value, say ai, is a multivalued one of the form {ai1, ai2, ..., aik}.
(5) The set of relation values. The corresponding attribute value, say ai, is a tuple of the form <ai1, ai2, ..., aim> which is an element of Di1 × Di2 × ... × Dim (m > 1 and 1 ≤ i ≤ n), where each Dij (1 ≤ j ≤ m) may be a domain in (1), (2), (3), or (4), or even the set of relation values.

The domain D0 is a set of atomic values, and each value is a crisp one from the range [0, 1], representing the possibility degree that the corresponding tuple is true in the NF2 relation. In this chapter, we assume that the possibilities of all tuples are precisely one. Then for an attribute Ai ∈ R (1 ≤ i ≤ n), its attribute domain is formally represented as follows:

τi = dom | ndom | fdom | sdom | <B1 : τi1, B2 : τi2, …, Bm : τim>

where B1, B2, …, Bm are attributes. A relational instance r over the fuzzy NF2 schema (A1 : τ1, A2 : τ2, ..., An : τn) is a subset of the Cartesian product τ1 × τ2 × ... × τn. A tuple in r of the form <a1, a2, ..., an> consists of n components. Each component ai (1 ≤ i ≤ n) may be an atomic value, null value, set value, fuzzy value, or another tuple. An example of a fuzzy NF2 relation is shown in Table 1. It can be seen that Tank_Id and Start_Date are crisp atomic-valued attributes, Tank_body is a relation-valued attribute, and Responsibility is a set-valued attribute. In the attribute Tank_body, the two component attributes Volume and Capacity are fuzzy ones.

Table 1 Pressured air tank relation

Tank_Id | Tank_body                                             | Start_Date | Responsibility
        | Body_Id | Material | Volume        | Capacity        |            |
TA1     | BO01    | Alloy    | about 2.5e+03 | about 1.0e+06   | 01/12/99   | John
TA2     | BO02    | Steel    | about 2.5e+04 | about 1.0e+07   | 28/03/00   | {Tom, Mary}

In the following, we focus on the fuzzy nested relational algebraic operations. We will start by introducing some important concepts used in our operations [21].

Definition. Let U = {u1, u2, …, un} be a universe of discourse and let πA and πB be two fuzzy data on U based on possibility distributions. The semantic inclusion



degree of πA and πB, SID (πA, πB), which means the degree to which πA semantically includes πB, is then defined as follows:

SID (πA, πB) = ∑_{i=1}^{n} min (πB(ui), πA(ui)) / ∑_{i=1}^{n} πB(ui), where ui ∈ U

Definition. Let πA and πB be two fuzzy data and SID (πA, πB) be the degree to which πA semantically includes πB. The semantic equivalence degree of πA and πB, SE (πA, πB), denoting the degree to which πA and πB are equivalent to each other, is defined as follows:

SE (πA, πB) = min (SID (πA, πB), SID (πB, πA))

Two fuzzy data πA and πB are considered β-redundant if and only if SE (πA, πB) ≥ β. For two crisp data, atomic or set-valued, their equivalence degree is one if they are equal to each other, where two identical set-valued data are considered equal. Consequently, the notion of equivalence degree of structured attribute values can be extended to the tuples in fuzzy nested relations to assess tuple redundancies. Informally, any two tuples in a nested relation are redundant if, for each pair of corresponding attribute values, the equivalence degree is greater than or equal to the threshold value. If a pair of corresponding attribute values is simple, the equivalence degree is the one defined above for two values; for two values of structured attributes, the equivalence degree is the one extended to structured attributes. Two redundant tuples t and t′ are written t ≡ t′.

Union and Difference. Let r and s be two union-compatible fuzzy nested relations. Then
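As a small numeric illustration (the distributions are ours, chosen only to exercise the definitions): let πA = {0.8/23, 1.0/24} and πB = {0.6/23, 0.5/24}. Then SID (πA, πB) = (min (0.6, 0.8) + min (0.5, 1.0)) / (0.6 + 0.5) = 1.1/1.1 = 1.0 and SID (πB, πA) = (0.6 + 0.5) / (0.8 + 1.0) ≈ 0.61, so SE (πA, πB) = min (1.0, 0.61) ≈ 0.61. With a threshold of, say, β = 0.5, the two fuzzy data would be considered β-redundant.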

r ∪ s = min ({t | t ∈ r ∨ t ∈ s}) and r − s = {t | t ∈ r ∧ (∀v ∈ s) ¬(t ≡ v)}

Here, the operation min() means to remove the fuzzy redundant tuples in r and s. Of course, the threshold value should be provided for this purpose.

Cartesian Product. Let r and s be two fuzzy nested relations on schemas R and S, respectively. Then r × s is a fuzzy nested relation with the schema R ∪ S. The formal definition of the Cartesian product operation is as follows:

r × s = {t | t (R) ∈ r ∧ t (S) ∈ s}

Projection. Let r be a fuzzy nested relation on the schema R and S ⊂ R. Then the projection of r on the schema S is formally defined as follows:

ΠS (r) = min ({t | (∃v ∈ r) (t = v (S))})

Here, an attribute in S may be of the form B.C, in which B is a structured attribute and C is its component attribute. As with the union operation, the projection operation also needs to remove fuzzy redundant tuples from the result relation after the operation.



Selection. In classical relational databases, the selection condition is of the form X θ Y, where X is an attribute, Y is an attribute or a constant value, and θ ∈ {=, ≠, >, ≥, <, ≤}. In order to implement fuzzy queries over fuzzy relational databases, "θ" should be fuzzy, denoted ≈, ≉, ≻, ≺, ⪰, and ⪯. In addition, X is only a simple attribute or the simple component attribute of a structured attribute, but Y may be one of the following.

(a) A constant, crisp or fuzzy;
(b) A simple attribute;
(c) The simple component attribute of a structured attribute, having the form A.B, where A is a structured attribute and B is its simple component attribute.

Assume that there is a resemblance relation on the universe of discourse and α is the threshold on it. Then the fuzzy comparison operations are defined as follows:

(1) X ≈ Y iff SEα (X, Y) ≥ β, where β is a selected cut (the same applies below).
(2) X ≉ Y iff SEα (X, Y) < β.
(3) X ≻ Y iff X ≉ Y and min (Supp (X)) > min (Supp (Y)).
(4) X ⪰ Y iff X ≈ Y or X ≻ Y.
(5) X ≺ Y iff X ≉ Y and min (Supp (X)) < min (Supp (Y)).
(6) X ⪯ Y iff X ≈ Y or X ≺ Y.

Depending on Y, the following situations can be identified for the selection condition X θ Y. Let X be the attribute Ai : τi in a fuzzy nested relation.

(1) Ai θ c, where c is a crisp constant. According to τi, the definition of Ai θ c is as follows:
- if τi is dom, Ai θ c is a traditional comparison and θ ∈ {=, ≠, >, <, ≥, ≤};
- if τi is fdom, Ai θ c is a fuzzy comparison and θ ∈ {≈, ≉, ≻, ≺, ⪰, ⪯};
- if τi is ndom, Ai θ c is a null comparison and is regarded as a special fuzzy comparison;
- if τi is sdom, Ai θ c is an element-set comparison. Then Ai θ c holds if c and any element in the value of Ai of a tuple satisfy "θ".
(2) Ai θ f, where f is a fuzzy value.
- if τi is dom, fdom, or ndom, Ai θ f is a fuzzy comparison and θ ∈ {≈, ≉, ≻, ≺, ⪰, ⪯};
- if τi is sdom, Ai θ f is a fuzzy set comparison. Then Ai θ f holds if f and any element in the value of Ai of a tuple satisfy the fuzzy "θ", where θ ∈ {≈, ≉, ≻, ≺, ⪰, ⪯}.
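Continuing the numeric illustration above (again with invented distributions and β = 0.5): for πA = {0.8/23, 1.0/24} and πC = {1.0/20, 0.5/21}, the two supports are disjoint, so SE (πA, πC) = 0 < β and hence πA ≉ πC; since min (Supp (πA)) = 23 > min (Supp (πC)) = 20, we obtain πA ≻ πC.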

(3) Ai θ Aj, where Aj : τj is a simple attribute and i ≠ j.
- if τi and τj are both dom, Ai θ Aj is a traditional comparison;
- if τi and τj are dom and fdom, fdom and fdom, or ndom and fdom, Ai θ Aj is a fuzzy comparison;



- if τi and τj are dom and ndom, Ai θ Aj is a null comparison;
- if τi and τj are dom and sdom, Ai θ Aj is an element-set comparison;
- if τi and τj are fdom and sdom, Ai θ Aj is a fuzzy set comparison;
- if τi and τj are both ndom, Ai θ Aj is a null-null comparison; then Ai θ Aj holds if they have the same null values on the same universe of discourse;
- if τi and τj are ndom and sdom, Ai θ Aj is a null-set comparison and is regarded as a special element-set comparison;
- if τi and τj are both sdom, Ai θ Aj is a set-set comparison and is regarded as a special element-set comparison.
(4) Ai θ Aj.B, where Aj is a structured attribute (i ≠ j) and B is a simple attribute. The situations are the same as those in case (3) above.

In fuzzy nested relational databases, the selection condition is similar to that in fuzzy relational databases, except that the attribute may be of the form B.C, where B is a structured attribute and C is its component attribute. Let Q be a predicate denoting the selection condition. The selection operation for a fuzzy nested relation r is defined as follows:

σQ (r) = {t | t ∈ r ∧ Q (t)}

In addition to the traditional relational operations, two restructuring operations, called Nest and Unnest (also called Pack and Unpack, or Merge and Unmerge, in the literature), are crucial in fuzzy nested relational databases. The Nest operator produces a nested relation with structured attributes. The Unnest operator is used to flatten a nested relation: it takes a nested relation on a set of attributes and disaggregates it, creating a "flatter" structure.

Nest Operation. Let r be a fuzzy nested relation with the schema R = {A1, A2, …, Ai, …, Ak, …, An}, where 1 ≤ i, k ≤ n. Now Y = {Ai, …, Ak} is merged into a structured attribute B and a new fuzzy nested relation s is formed, whose schema is S = {A1, A2, …, Ai-1, B, Ak+1, …, An}. The following notation is used to represent the Nest operation:

s (S) = ΓY → B (r (R)) = {ω [(R − Y) ∪ B] | (∃u) (∀v) (u ∈ r ∧ v ∈ r ∧ SE (u [R − Y], v [R − Y]) < β ∧ ω [R − Y] = u [R − Y] ∧ ω [B] = u [Y]) ∨ (∀u) (∀v) (u ∈ r ∧ v ∈ r ∧ SE (u [R − Y], v [R − Y]) ≥ β ∧ ω [R − Y] = u [R − Y] ∪f v [R − Y] ∧ ω [B] = u [Y] ∪ v [Y])}

It can be seen that in the process of the Nest operation from attribute set Y to B, multiple tuples in r which are fuzzily equivalent on the attribute set R − Y are merged to form one tuple of s. This merging is completed separately on the attribute sets R − Y and Y. On R − Y, the fuzzy union ∪f is used, and for an attribute C ∈ R − Y, the value of C in the created tuple is an atomic value, crisp or fuzzy. The value of a component attribute B.C (C ∈ Y) in the created tuple, however, is a set value, and the common union is used.



Another restructuring operation, called Unnest, is an inverse of Nest under certain conditions. For a classical nested relation, this condition is that the nested relation is in Partitioned Normal Form (PNF). A relation is in PNF if and only if (a) a subset of its simple attributes forms a relation key and (b) every sub-relation is in PNF.

Unnest Operation. Let s be a fuzzy nested relation with the schema S = {A1, A2, …, Ai-1, B, Ak+1, …, An}, where B is a structured attribute and B : {Ai, …, Ak}. The Unnest operation produces a new fuzzy nested relation r, whose schema is R = {A1, A2, …, Ai-1, Ai, …, Ak, Ak+1, …, An}, i.e., R = (S − {B}) ∪ {Ai, …, Ak}. The following notation is used to represent the Unnest operation:

r (R) = ΞB (s (S)) = {t [(R − B) ∪ {Ai, …, Ak}] | (∀u) (u ∈ s ∧ t [R − B] = u [R − B] ∧ t [Ai … Ak] ∈ u [B])}
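For intuition, consider the relation in Table 1 (our own illustration of the two operators): unnesting the structured attribute Tank_body yields a flat relation over (Tank_Id, Body_Id, Material, Volume, Capacity, Start_Date, Responsibility), with one tuple per component tuple of each Tank_body value; conversely, applying Γ{Body_Id, Material, Volume, Capacity} → Tank_body to that flat relation merges the tuples that are fuzzily equivalent on the remaining attributes and reconstructs the nested relation of Table 1.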

4 Mapping Fuzzy XML DTD to Fuzzy Nested Relational Schema

This section presents the transformation of the fuzzy XML DTD to the fuzzy nested relational database model. Here we need a fuzzy DTD tree created from the hierarchical fuzzy XML DTD. During the creation, we first construct a DTD tree by parsing the given fuzzy DTD, and then map the DTD tree into the fuzzy nested relational schema.

In the following, we introduce how to create a fuzzy DTD tree from the hierarchical fuzzy XML DTD. Generally, the nodes in a fuzzy DTD tree are elements and attributes, in which each element appears exactly once in the tree, while attributes appear as many times as they appear in the DTD. The element nodes can be further classified into two kinds, that is, leaf element nodes and nonleaf element nodes. Thus in the DTD tree we have three kinds of nodes: attribute nodes, leaf element nodes, and nonleaf element nodes. There exists a special nonleaf element node in the DTD tree, namely the root node. We also need to identify those attribute nodes whose corresponding attributes are associated with ID #REQUIRED or IDREF #REQUIRED in the DTD. We call these attribute nodes key attribute nodes. In addition, different from the classical DTD tree, the fuzzy DTD tree contains some new attribute and element types, namely the attribute Poss and the elements Val and Dist. A fuzzy DTD tree can then be constructed when parsing the given DTD by the following processing:

(a) Take the first nonleaf element r of the given hierarchical DTD and create a DTD tree rooted at r. r's children come from the attributes and elements connected with r. Here, the key attribute(s) should become the primary key attribute(s) of the created DTD tree.
(b) Take each nonleaf element s that is a child of r in the given hierarchical DTD and create a DTD sub-tree rooted at s. We apply the processing given in (a) to treat s's children.



(c) For the other nonleaf elements in the given hierarchical DTD, apply the same processing given in (b) until all nonleaf elements are transformed.
(d) Stitch all the generated sub-trees together to construct the fuzzy DTD tree.

According to (Ma and Yan, 2007), in the fuzzy DTD tree, in addition to (key) attribute nodes, leaf element nodes, and nonleaf element nodes, there are three special kinds of nodes: Poss attribute nodes, Val element nodes, and Dist element nodes. The Dist element nodes, created from Dist elements, are used to indicate the type of a possibility distribution, being disjunctive or conjunctive. In addition, each Dist element node has a Val element node as its child node and a nonleaf element node as its parent node. We can also identify four kinds of Val element nodes as follows:

(a) They do not have any child node except the Poss attribute nodes (type-1).
(b) They only have leaf element nodes as their child nodes except the Poss attribute nodes (type-2).
(c) They only have nonleaf element nodes as their child nodes except the Poss attribute nodes (type-3).
(d) They have leaf element nodes as well as nonleaf element nodes as their child nodes except the Poss attribute nodes (type-4).

In the transformation of the fuzzy DTD tree to the fuzzy nested relational model, the Poss attribute nodes, Val element nodes, and Dist element nodes in the fuzzy DTD tree do not take part in composing the created relational schema; they only determine the model of the created fuzzy relational databases. The transformation proceeds as follows:

(a) Take the root node of the given fuzzy DTD tree and create a relational table. Its attributes first come from the attribute nodes and leaf element nodes connected with the root node. Here, the key attribute node(s) should become the primary key attribute(s) of the created table. Then determine if the root node has any Val element nodes or Dist element nodes as its child nodes. If yes, we need to further determine the type of each Val element node (we can ignore Dist element nodes because each Dist element node must have a Val element node as its only child node).
(i) If it is a Val element node of type-2, all of the leaf element nodes connected with the Val element node become attributes of the created relational table. An additional attribute is also added to the created relational table, representing the possibility degree of the tuples.
(ii) If it is a Val element node of type-3, and the Val element's children (except the Poss attribute nodes), namely nonleaf element nodes, only have leaf element nodes as their children, a relation-valued attribute and an additional attribute representing the possibility degree of the tuples are added to the created relational table. If it is a Val element node of type-3, and the Val element's nonleaf element children themselves have nonleaf element nodes as their children, only an additional attribute is added to the created relational table, representing the possibility degree of the tuples, and we leave those nonleaf element nodes for further treatment in (b).



(iii) If it is a Val element node of type-4, we do the same as in (ii) for the leaf element nodes and for the nonleaf element nodes that only have leaf element nodes as their children, and leave the nonleaf element nodes that have nonleaf elements as their children for further treatment in (b). It is impossible for Val element nodes of type-1 to arise at the root node.
(b) For each nonleaf element node connected with the root node that has nonleaf elements as its children, create a separate relational table. Its attributes come from the attribute nodes and leaf element nodes connected with this nonleaf element node; its primary key attribute(s) come from the key attribute node(s), and foreign key attribute(s) are created for reference. Furthermore, determine if this nonleaf element node has any Val element nodes or Dist element nodes as its child nodes, and identify the type of these nodes, if any. We again apply the processing given in (i)-(iii) of (a) to treat the Val element nodes of type-2, type-3, and type-4. For Val element nodes of type-1, each of them should become an attribute of the relational table created from the parent node of the current nonleaf element. Note that this attribute is one that may take fuzzy values.
(c) For the other nonleaf element nodes in the fuzzy DTD tree, apply the same processing given in (b) until all nonleaf element nodes are transformed.

Note that in practical requirements the number of nesting levels is usually less than two; as a consequence, considering both easy maintenance of the data in the tables and the normalization of the relational schema, in this chapter we create one relational table for each nonleaf element that only has leaf element nodes as its children.

Next, we use an example to illustrate the transformation of the fuzzy XML DTD to the fuzzy nested relational database model. Fig. 3 shows the fuzzy DTD tree created from the hierarchical fuzzy XML DTD of Fig. 2. During the transformation, we first create a university table, in which UName is the primary key attribute, and the leaf element node address connected with university becomes a simple attribute of the university table. For the Val element, according to (ii), we create an additional attribute to represent the possibility degree of the tuples. For the nonleaf element department, we find that it has two children, employee and student, both of which are nonleaf elements; according to (b), we create another table department, where ref is the foreign key attribute referring to the university table. Because the children of employee and student are leaf elements, we generate two relation-valued attributes employee and student in the



department table. In particular, the employee attribute contains five attributes: EID, ename, position, office, and an additional attribute representing the possibility degree of the tuples. The student attribute contains four attributes: SID, sname, sex, and age. After the transformation, the fuzzy DTD tree in Fig. 3 is mapped into the fuzzy nested relational schema shown in Fig. 4, in which the attribute age is one that may take fuzzy values.

Fig. 3 A simple fuzzy DTD tree


Fig. 4 The fuzzy nested relational schema created from the fuzzy DTD tree in Fig. 3:

university (UName, address, pD)
department (DName, location, ref, employee (EID, ename, position, office, pD), student (SID, sname, sex, age))

5 Conclusion

With the rapid development of the Internet, the requirement of managing Web-based information has attracted much attention from both academia and industry. XML is widely regarded as the next step in the evolution of the World Wide Web and has become the de facto standard for data exchange. This creates a new set of data management requirements involving XML, such as the need to store and query XML documents. On the other hand, fuzzy sets and possibility distributions have been extensively applied to deal with information imprecision and uncertainty in practical applications, and fuzzy database modeling is receiving increasing attention for intelligent data processing. Our focus in this chapter has been the modeling of fuzzy information in the fuzzy XML model and the fuzzy nested relational database model. In order to efficiently manage and translate such complex objects with imprecision and uncertainty, we investigated the construction of the fuzzy DTD tree from the hierarchical XML DTD and developed a formal approach to mapping a fuzzy DTD model to a fuzzy nested relational database (FNRDB) schema. Future work will concentrate on optimizing queries by using the corresponding algebraic operations.

Acknowledgment

The work is supported by the National Natural Science Foundation of China (60873010) and the Fundamental Research Funds for the Central Universities (N090504005, N090604012 and N090104001), and in part by the Program for New Century Excellent Talents in University (NCET-05-0288) and the MOE Funds for Doctoral Programs (20050145024).

References

1. Abiteboul, S., Segoufin, L., Vianu, V.: Representing and Querying XML with Incomplete Information. In: Proc. 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 150–161 (2001)



2. Aygun, R.S., Yazici, A.: Modeling and Management of Fuzzy Information in Multimedia Database Applications. Multimedia Tools and Applications 24(1), 29–56 (2004)
3. Bertino, E., Catania, B.: Integrating XML and Databases. IEEE Internet Computing, 84–88 (July-August 2001)
4. Bordogna, G., Pasi, G., Lucarella, D.: A Fuzzy Object-Oriented Data Model for Managing Vague and Uncertain Information. International Journal of Intelligent Systems 14, 623–651 (1999)
5. Bosc, P., Prade, H.: An Introduction to Fuzzy Set and Possibility Theory Based Approaches to the Treatment of Uncertainty and Imprecision in Database Management Systems. In: Proceedings of the Second Workshop on Uncertainty Management in Information Systems: From Needs to Solutions (1993)
6. Buckles, B.P., Petry, F.E.: A Fuzzy Representation of Data for Relational Database. Fuzzy Sets and Systems 7(3), 213–226 (1982)
7. Chamorro-Martínez, J., Medina, J.M., Barranco, C.D., Galán-Perales, E., Soto-Hidalgo, J.M.: Retrieving Images in Fuzzy Object-Relational Databases Using Dominant Color Descriptors. Fuzzy Sets and Systems 158(3), 312–324 (2007)
8. Chianese, A., Picariello, A., Sansone, L., Sapino, M.L.: Managing Uncertainties in Image Databases: A Fuzzy Approach. Multimedia Tools and Applications 23, 237–252 (2004)
9. Cuevas, L., Marín, N., Pons, O., Vila, M.A.: pg4DB: A Fuzzy Object-Relational System. Fuzzy Sets and Systems 159(12), 1500–1514 (2008)
10. Dubois, D., Prade, H., Rossazza, J.P.: Vagueness, Typicality, and Uncertainty in Class Hierarchies. International Journal of Intelligent Systems 6, 167–183 (1991)
11. Gaurav, A., Alhajj, R.: Incorporating Fuzziness in XML and Mapping Fuzzy Relational Data into Fuzzy XML. In: Proceedings of the 2006 ACM Symposium on Applied Computing, pp. 456–460 (2006)
12. George, R., Srikanth, R., Petry, F.E., Buckles, B.P.: Uncertainty Management Issues in the Object-Oriented Data Model. IEEE Transactions on Fuzzy Systems 4(2), 179–192 (1996)
13. Gyseghem, N.V., Caluwe, R.D.: Imprecision and Uncertainty in the UFO Database Model. Journal of the American Society for Information Science 49(3), 236–252 (1998)
14. Hung, E., Getoor, L., Subrahmanian, V.S.: PXML: A Probabilistic Semistructured Data Model and Algebra. In: Proc. 19th International Conference on Data Engineering (ICDE 2003), pp. 467–478 (2003)
15. Ma, Z.M.: A Conceptual Design Methodology for Fuzzy Relational Databases. Journal of Database Management 16(2), 66–83 (2005)
16. Ma, Z.M., Shen, D.: Modeling Fuzzy Information in the IF2O and Object-Oriented Data Models. Journal of Intelligent & Fuzzy Systems 17(6), 597–612 (2006)
17. Ma, Z.M., Yan, L.: A Literature Overview of Fuzzy Conceptual Data Modeling. Journal of Information Science and Engineering 26(2), 427–441 (2010)
18. Ma, Z.M., Yan, L.: A Literature Overview of Fuzzy Database Models. Journal of Information Science and Engineering 24(1), 189–202 (2008)
19. Ma, Z.M., Yan, L.: Fuzzy XML Data Modeling with the UML and Relational Data Models. Data & Knowledge Engineering 63, 972–996 (2007)
20. Ma, Z.M., Zhang, W.J., Ma, W.Y.: Extending Object-Oriented Databases for Fuzzy Information Modeling. Information Systems 29(5), 421–435 (2004)



21. Ma, Z.M., Zhang, W.J., Ma, W.Y., Chen, G.Q.: Conceptual Design of Fuzzy Object-Oriented Databases Using Extended Entity-Relationship Model. International Journal of Intelligent Systems 16, 697–711 (2001)
22. Nierman, A., Jagadish, H.V.: ProTDB: Probabilistic Data in XML. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 646–657 (2002)
23. Oliboni, B., Pozzani, G.: Representing Fuzzy Information by Using XML Schema. In: Proceedings of the 19th International Conference on Database and Expert Systems Application, pp. 683–687 (2008)
24. Prade, H., Testemale, C.: Generalizing Database Relational Algebra for the Treatment of Incomplete or Uncertain Information and Vague Queries. Information Sciences 34, 115–143 (1984)
25. Petrovic, D., Roy, R., Petrovic, R.: Supply Chain Modeling Using Fuzzy Sets. International Journal of Production Economics 59, 443–453 (1999)
26. Raju, K.V.S.V.N., Majumdar, K.: Fuzzy Functional Dependencies and Lossless Join Decomposition of Fuzzy Relational Database Systems. ACM Transactions on Database Systems 13(2), 129–166 (1988)
27. Senellart, P., Abiteboul, S.: On the Complexity of Managing Probabilistic XML Data. In: Proc. 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 283–292 (2007)
28. Umano, M., Fukami, S.: Fuzzy Relational Algebra for Possibility-Distribution-Fuzzy-Relational Model of Fuzzy Data. Journal of Intelligent Information Systems 3, 7–27 (1994)
29. Van Keulen, M., De Keijzer, A., Alink, W.: A Probabilistic XML Approach to Data Integration. In: Proceedings of the 2005 International Conference on Data Engineering, pp. 459–470 (2005)
30. Yager, R.R.: Targeted e-Commerce Marketing Using Fuzzy Intelligent Agents. IEEE Intelligent Systems 15(6), 42–45 (2000)
31. Yager, R.R., Pasi, G.: Product Category Description for Web-Shopping in e-Commerce. International Journal of Intelligent Systems 16, 1009–1021 (2001)
32. Yan, L., Ma, Z.M., Liu, J.: Fuzzy Data Modeling Based on XML Schema. In: Proceedings of the 2009 ACM International Symposium on Applied Computing, Hawaii, USA, March 8-12, pp. 1563–1567 (2009)
33. Yazici, A., Buckles, B.P., Petry, F.E.: Handling Complex and Uncertain Information in the ExIFO and NF2 Data Models. IEEE Transactions on Fuzzy Systems 7(6), 659–676 (1999)
34. Yazici, A., Soysal, A., Buckles, B.P., Petry, F.E.: Uncertainty in a Nested Relational Database Model. Data & Knowledge Engineering 30(3), 275–301 (1999)
35. Zadeh, L.A.: Fuzzy Sets. Information and Control 8(3), 338–353 (1965)
36. Zadeh, L.A.: Fuzzy Sets as a Basis for a Theory of Possibility. Fuzzy Sets and Systems 1(1), 3–28 (1978)


Human Centric Data Representation: From Fuzzy Relational Databases into Fuzzy XML

Keivan Kianmehr, Tansel Özyer, Anthony Lo, Jamal Jida, Alnaar Jiwani, Yasin Alimohamed, Krista Spence, and Reda Alhajj

Abstract. The Extensible Markup Language (XML) is emerging as the dominant data format for data exchange between applications. Many translation techniques have been devised to publish large amounts of existing conventional relational data in XML format. There also exists a need to be able to represent imprecise data in both relational databases and XML. This paper describes a fuzzy XML schema model for representing a fuzzy relational database in XML format. It also outlines a translation algorithm to include fuzzy relations and similarity matrices with their associated conventional relations. We also describe how the proposed fuzzy XML schema and the conversion process from fuzzy relational data into fuzzy XML have been integrated into VIREX, a prototype with a powerful visual tool for transforming relational data into XML.

Keywords: Fuzzy XML, Fuzzy Relational Database, Schema Conversion, Visual Interface.

Keivan Kianmehr, Anthony Lo, Alnaar Jiwani, Yasin Alimohamed, Krista Spence: Computer Science Department, University of Calgary, Calgary, Alberta, Canada
Tansel Özyer: Department of Computer Engineering, TOBB ETU, Economic and Technology University, Sogutozu Cad. No:43, 06560 Ankara, Turkey
Jamal Jida: Department of Informatics, Faculty of Sciences III, Lebanese University, Tripoli, Lebanon
Reda Alhajj: Computer Science Department, University of Calgary, Calgary, Alberta, Canada; Department of Computer Science, Global University, Beirut, Lebanon; e-mail: alhajj@ucalgary.ca

1 Introduction

In the past two decades, there has been extensive research examining how imprecise and uncertain data can be represented in databases, given that such data is pervasive in most real-world applications [5]. Examples of imprecise data include subjective opinions and judgments in areas such as personnel evaluation, policy preferences and economic forecasting [4]. A particular vein of research that is immediately applicable is how the conventional relational model can be extended to incorporate this fuzzy data.

Another highly researched area focuses on how relational data can be represented in the Extensible Markup Language (XML), e.g., [7, 8, 10, 11]. Furthermore, Lee et al [12] and Turowski and Weng [22] describe examples of XML representation for fuzzy data modeling. However, they do not describe how to incorporate the fuzzy XML with data from conventional relations. The approach described by Lee et al applies to the object-oriented paradigm, and not simple relational data. Turowski's approach is more general; however, it does not utilize the currently accepted technique of XML schema in defining the XML document class (it instead uses the DTD format to represent the XML structure).

In 1995, Buckles and Petry [3] noted that commercial industry had failed to pick up on the large amount of research performed in the area of fuzzy databases. To our knowledge, there is no commercial implementation of a fuzzy relational or fuzzy object-oriented database. We hypothesize that this may be because of the significant investment that companies have made towards the conventional relational database model, and because many of the proposed fuzzy relational database models do not fit well within the existing conventional model. Many of the proposed models work with non-atomic values, e.g., [3, 4, 5, 19, 24], or attributes that can contain values of different types, e.g., [3, 15, 19], which do not fit into the current constraints implied by conventional relational databases. Because of this, we propose ways of incorporating fuzzy data within the conventional relational model so that a fuzzy relational database can be implemented within an existing conventional framework.

This paper presents a novel approach to incorporate fuzziness in the XML model. We will introduce a schema for representing the FRDB structure in XML. We will also detail a translation technique that converts an instance



of a FRDB model to an XML document conforming to the schema. The fuzziness is incorporated within the source data in the underlying relational model by specifying fuzzy values for attributes intended to be fuzzy. This technique is reflected in the XML generation and schema conversion models, where fuzzy and crisp data should be equally considered. Fuzziness in the query interface is handled by extending the condition part of the query interface to allow specifying conditions that address fuzzy attributes. The proposed approach for converting fuzzy relational data into fuzzy XML has been implemented as part of the VIREX (VIsual RElational to XML) system [13, 14].

The motivation of this work is not only to come up with a generic representation using a standardized language to represent fuzzy information; it is also to allow easy transformation of fuzzy information stored in a fuzzy relational database into XML format. Most works in the literature focus on only one of these objectives and thus fail to provide a complete solution for automated fuzzy relational to XML querying and transformation.

The outline of this paper is as follows. Section 2 presents an overview of the representation of fuzzy data in XML. Section 3 covers translation techniques between relational databases and XML. In Section 4, we propose an XML schema to represent the FRDB structure, including fuzzy relations and similarity matrices; we also describe the algorithm for translation from the fuzzy relational database to the fuzzy XML; and we finally demonstrate the conversion process by describing an example implementation of a fuzzy relational database and emphasizing how it is realized in VIREX. Section 5 concludes the paper.

2 XML Schemas and Fuzzy Data in XML

XML is an excellent method of transmitting data between software applications. In order for an application to interpret the XML, certain constraints must be placed on an XML document. This can be accomplished by describing classes of XML documents through XML schemas or Document Type Definitions (DTDs). The application can then use the specified XML schema or DTD to parse the information contained in a specific XML document.

2.1 XML Schemas The DTD is the declarative part of an XML document, and it defines the XML elements and their attributes, including constraints on how they are structured. However, with the advent of the W3C XML Schema, DTDs are becoming a thing of the past [12, 21]. An XML schema defines the structure of an XML document instance. Unlike DTDs, XML schemas allow for strong data typing, modularization, and reuse. The XML schema specification allows a developer to define new data types (using the <complexType> tag), and also uses built-in data types



provided by the specification. The developer can also define the structure of an XML document instance and constrain its contents. In addition, the XML schema language supports inheritance so that developers do not have to start from scratch when defining a new schema. These features of the W3C XML schema specification allow for schemas that are effective in defining and constraining attributes and element values in an XML document [12].
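For instance, a minimal sketch of such a user-defined type (our own illustration, not part of the schema proposed later in this paper) could declare a string value carrying a possibility degree:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="FuzzyValueType">
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <!-- possibility degree attached to the value -->
        <xs:attribute name="Poss" type="xs:decimal" default="1.0"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:schema>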

2.2 Representing Fuzzy Relational Data in XML

There has already been some research completed on representing fuzzy data in XML. The fuzzy object-oriented modeling technique (FOOM) schema proposed by Lee et al [12] is one such approach. This method builds upon object-oriented modeling (OOM) to also capture requirements that are imprecise in nature and therefore 'fuzzy'. The FOOM schema defines a class of XML document that can describe fuzzy sets, fuzzy attributes, fuzzy rules, and fuzzy associations. This method is useful in representing data contained in object-oriented databases. However, it is too specific in terms of its object-oriented nature to be applied directly to relational databases.

Another more general approach is proposed by Turowski et al [22]. The method described in their study is aimed at creating a common interchange format for fuzzy information using XML to reduce integration problems with collaborating fuzzy applications. XML tags with a standardized meaning are used to encapsulate fuzzy information. A formal syntax for important fuzzy data types is also introduced. This technique of using XML to represent fuzzy information is general enough to be built upon to apply to relational databases. However, it uses DTDs, rather than the currently accepted method of XML schemas, to define and constrain the information held in an XML document. It will be beneficial to extend this approach to define the XML document class for holding data from fuzzy relational databases with an XML schema rather than a DTD.

In [25], Yan et al. propose a fuzzy data modeling approach based on XML Schema. Based on possibility distribution theory, they identify two kinds of fuzziness in XML: the first type is fuzziness in elements, where membership degrees are associated with such elements; the second is fuzziness in the attribute values of elements, where possibility distributions are used to represent such values. For the latter, there exist two types of possibility distribution (i.e., disjunctive and conjunctive possibility distributions), and they may occur in child elements with or without further child elements in the ancestor-descendant chain. A possibility attribute, denoted 'Poss', is introduced first, which takes a value in [0, 1]. This possibility attribute is applied together with a fuzzy construct called 'Val' to specify the possibility of a given element existing in the XML document. Based on the tag pair <Val Poss=...> and </Val>, a possibility distribution for an element can be expressed. In addition, possibility distributions can be used to express fuzzy element values.



For this purpose, another fuzzy construct, called 'Dist', is introduced to specify a possibility distribution. Typically a 'Dist' element has multiple 'Val' elements as children, each with an associated possibility. Since there are two types of possibility distribution, the 'Dist' construct should indicate the type of a possibility distribution, being disjunctive or conjunctive.

Oliboni and Pozzani propose a general XML Schema definition for representing fuzzy information [17]. The XML Schema definition includes a set of new fuzzy data types and elements needed to represent fuzzy information. The fuzzy data types are classified into four categories (named classicType, fuzzyOrdType, fuzzyNonOrdSimilarityType and fuzzyNonOrdType) and can be processed in different ways. The classicType is the classical non-fuzzy (crisp) data type that can be processed with the fuzzy operations. The fuzzyOrdType covers imprecise data over an ordered underlying domain. It allows for both crisp and fuzzy data representations. The possible fuzzy data types of fuzzyOrdType are intervals, approximate values, linguistic labels, and trapezoidal and possibility distributions. The fuzzyNonOrdSimilarityType and fuzzyNonOrdType data type classes represent imprecise data over a discrete non-ordered domain, but in fuzzyNonOrdSimilarityType a similarity measure is used to relate labels, while in fuzzyNonOrdType no similarity relation between the labels is used. The fuzzyNonOrdType thus defines a fuzzy data type which is more generic than the fuzzyNonOrdSimilarityType. The above data type classification is employed to represent different aspects of fuzzy information by adapting already proposed techniques for fuzzy representation in the context of the relational database, and by integrating different kinds of fuzzy information to compose a complete definition.

3 Converting Relational Databases to XML

Since XML has become universally adopted as one of the main formats for information exchange and representation on the Internet, the need for data in XML format has dramatically increased [11]. Most of the information, however, is stored and maintained in relational databases, so applications generally convert data into XML for exchange purposes. The benefits of the conversion are cross-platform independence and the ability to re-map XML data into target applications/databases from anywhere in the world. This situation is routinely found in business settings, where most data resides in relational databases and there is a need to transfer such data over the Internet where other departments/clients may access it.

To date, there have been numerous methods for translating relational data into XML documents. DB2XML, XML-DBMS, XML Extender from IBM, SilkRoute, and XPERANTO all require users to specify mappings from relational models to XML. In XML-DBMS, a template-driven mapping language is provided to specify the mappings. A language such as the XML Extender Transform Language or DAD is used to stipulate the mapping in XML



Extender. SilkRoute provides RXL, a declarative query language, for viewing relational data in XML. XPERANTO uses the XML query language for viewing relational data in XML [11]. All the above tools require input from the user and are built specifically for the conversion from relational databases to XML documents.

Flat Translation (FT), Nesting-based Translation (NeT) and Constraint-based Translation (CoT) are three additional approaches, analyzed further next. FT is the most straightforward approach, where: 1) tables in a database are mapped to XML elements and; 2) columns in each table are mapped to attributes (in attribute-oriented mode) or elements (in element-oriented mode) in the XML. The attribute-oriented and element-oriented modes are analogous except that the element-oriented mode adds unnecessary ordering semantics to the resulting schema [10, 11]. Since the XML represents the "flat" relational tuples faithfully, this method is called Flat Translation. FT is a simple and effective translation algorithm. FT translates the "flat" relational model to a "flat" XML model in almost a one-to-one manner [10, 11]. Hence, for every tuple in a table, one corresponding element is generated in XML. A shortcoming of this approach is that it does not use a number of basic "non-flat" features (such as "∗", "+") provided by XML for data modeling.

Nesting-based Translation (NeT) was brought about to remedy the problems found in the FT algorithm. This algorithm derives nested structures from a flat relational model by the use of the nest operator, so that the resulting XML model is more intuitive and accurate (less data redundancy) than otherwise [10, 11]. The drawback of this approach is that it can only be applied to one table at a time, so it cannot depict the overall picture of a relational schema where multiple tables are interconnected with each other through dependencies.

The Constraint-based Translation (CoT) algorithm uses semantic constraints (principally inclusion dependencies) during the translation to produce a more intuitive XML model for the entire relation. CoT considers inclusion dependencies during the translation and merges multiple interconnected tables into a coherent and hierarchical parent-child structure in the final XML model [10]. This approach provides a good XML model, but more research is needed to determine an efficient implementation [11].

The work of Shanmugasundaram et al [20] allows users to create XML data by using SQL to query the relational database and send the query output to some XML constructor function, which needs to be defined for each XML document. The constructors seem to be fairly simple from the example given in their paper. Visual SQL-X [18] is a graphical tool for generating XML documents from relational databases. It provides a user interface for users to edit a query, which can then be executed to generate XML documents. Although their interface helps users manage the textual query, the method is not as intuitive as visual query construction. In addition, users need to learn a new query language in order to use Visual SQL-X. BBQ [16] is a system with strong support for interactive query



formulation. A tree is used in the interface to represent the tree construct of the DOM object. Users can perform simple operations visually on the tree in order to query, join, and filter data from the source. The XML data sources are queried and the results are presented as new XML documents and a DTD. The functionality supported in BBQ and VIREX is quite similar, except that VIREX has a more flexible user interface, focuses on XML document generation from relational data, and uses a document schema instead of a DTD.

Finally, VIREX [13, 14] is a more flexible and user-friendly approach to handle the querying of relational data to produce XML documents. The basic structure of VIREX is shown in Figure 1. VIREX allows users to create queries interactively for data stored in relational databases and to transform the results into XML format. Most of the steps involved in the manual relational-to-XML transformation process have been automated within VIREX. Using an easy-to-use interactive diagram with minimum keyboard input, users are allowed to specify views by filtering unwanted data and also to specify a desired structure (nested or flat) for the results. The manipulated diagram (illustrated in Figure 2) is very similar to an entity relationship diagram, with which most database users are familiar, and can be easily understood by end-users as it summarizes the database structure. VIREX supports several querying operators in an interactive way. These operations include selection, projection, nesting, union, and ordering. In addition, simple schema evolution operations as well as materialized views are supported by VIREX. When fuzziness is incorporated in the underlying relational data, VIREX produces a corresponding fuzzy XML schema and fuzzy XML document(s). Querying fuzzy data is possible by using the extended condition box of the visual interface, which allows specifying fuzzy terms in the query predicate. In the next section, we describe the process; then we use an example to illustrate the different aspects.

4 Mapping Fuzzy Relational Database into Fuzzy XML

In this section, we express the motivation for our work and describe in detail our implementation of a fuzzy relational database, the XML schema structure, and the algorithm to convert database content into an XML document conforming to the schema.

4.1 Database Structure

We chose to create a hybrid-type fuzzy relational database that incorporates both similarity relations [4], to represent fuzzy equality, and possibility relations, which could be used to translate crisp data based on a number of linguistic terms [3, 15] or to represent a possibility distribution [19]. Any



Fig. 1 VIREX system architecture

Fig. 2 Interactive Diagram for a sample database



attribute in a relation may have an associated fuzzy relation and/or an associated similarity relation. Each of these can be joined into a query to retrieve information based on imprecise conditions. The results of these queries themselves can then be considered a sort of fuzzy relation that has all the attributes requested by the query as well as an attribute that describes each tuple's membership in the relation [15].

Table 1 An instance of a Student relation

FName   LName    Avg Marks  Attitude
Jeremy  Scott    A          Unhappy
Jenny   Wong     A          Negative
George  Yuzwak   C          Positive
Jose    Sanchez  B          Cheerful

Table 2 Similarity relation for the Attitude attribute of the Student relation (Table 1)

          Unhappy  Negative  Positive  Cheerful
Unhappy   1        0.8       0.2       0
Negative  0.8      1         0         0
Positive  0.2      0         1         0.95
Cheerful  0        0         0.95      1

Table 3 Example 'Student' relation

STUDENTID  FNAME      LNAME    ATTENDANCE  AVG_MARKS  ATTITUDE   ADVISOR
1          Jeremy     Scott    0.56        3.60       Unhappy    1
2          Jenny      Wong     0.98        3.87       Motivated  5
3          George     Yuzwak   0.80        2.74       Lazy       3
4          Jose       Sanchez  0.9         3.20       Cheerful   1
5          Elizabeth  Reichs   0.35        1.87       Lazy       1

An example relation, 'Student', is illustrated in Table 3 (an extension of the relation in Table 1). We suppose that the information stored in the STUDENTID, FNAME, LNAME and ADVISOR columns is crisp. The data in ATTENDANCE and AVG_MARKS is also crisp, but fuzzy relations based on linguistic terms are defined on each. The values within the domain of ATTITUDE have an associated similarity relation defined to provide fuzzy equivalence.

4.1.1 Similarity Relations

In our fuzzy relational database model, we allow any column to have an associated similarity relation [4], which assigns all elements in the domain a degree of similarity to all other elements in the domain. Normally, this is


Table 4 A portion of the similarity relation SM_STUDENT_ATTITUDE

VALUE1    VALUE2     MATCH
Positive  Positive   1.00
Positive  Negative   0.00
Positive  Cheerful   0.95
Positive  Unhappy    0.00
Positive  Lazy       0.00
Positive  Motivated  0.40
Negative  Positive   0.00
Negative  Negative   1.00

Fig. 3 Fuzzy sets over AVG_MARKS

visually represented in a matrix, but constructing this matrix in a relation by naming attributes after each domain element is very inflexible and difficult to modify if one wanted to add another element to the domain. Instead, we flatten the matrix. Every similarity matrix is named under the following convention: 'SM_TABLENAME_COLNAME', where TABLENAME and COLNAME are the relation and attribute the similarity matrix applies to, respectively. Within the similarity relation there are three attributes: VALUE1, VALUE2, and MATCH. VALUE1 and VALUE2 hold the combination of domain values and will be assigned a type according to the type of the attribute being compared. MATCH contains the result of the similarity relation (s(x, y) [4]) for the pair described in VALUE1 and VALUE2, and so will contain a value on the unit interval [0, 1] (see Section 2.2.1). Table 4 contains part of the similarity relation defined on the attribute ATTITUDE in the STUDENT relation (Table 3), given the domain of ATTITUDE as {Positive, Negative, Motivated, Cheerful, Unhappy, Lazy}. To see the matrix representation of similar data, refer to Table 2.


4.1.2 Fuzzy Relations

Our model also allows any attribute to have an associated fuzzy relation, which can contain the data for a number of fuzzy sets [27] defined over the domain of the attribute. Each fuzzy set is identified by a linguistic term. If the set is discrete, the fuzzy relation will contain each ordered pair within the set. However, if the set is defined by a continuous function, the fuzzy relation will contain points along the graph of the function that can be interpolated to find the exact value of the membership function (see Figure 3).

Each fuzzy relation is named under the following convention: ‘FR TABLENAME COLNAME’, where TABLENAME and COLNAME are the relation and attribute the fuzzy relation applies to, respectively. Within the fuzzy relation there are three attributes: LINGUISTIC TERM, COLUMN VALUE, and MEMBERSHIP. LINGUISTIC TERM is the word that describes the meaning of the fuzzy set. COLUMN VALUE and MEMBERSHIP can be interpreted as the (x, y) values of a point on the graph of the fuzzy set. COLUMN VALUE is the same type as the attribute this relation applies to, and MEMBERSHIP is the result of the membership function that maps the COLUMN VALUE to the unit interval. Table 5 contains a sample fuzzy relation on the AVG MARKS attribute in the Student relation described in Table 3, defining the continuous fuzzy sets ‘Excellent’, ‘Poor’ and ‘Typical’. Figure 3 contains the graphical representations of these sets.



4.2 XML Schema Structure

The schema structure we developed for representing our fuzzy relational database in XML provides a direct relationship between the database and the resulting XML document. This implementation allows for an XML representation of the data that is simple to interpret and query.

Fig. 4 XML Schema - Top View

The schema defines the outermost element of the XML document as the Database element. A Database element contains a name attribute used to indicate the name of the fuzzy relational database and a sequence of Table elements representing each relation in the database. A Table element also has a name attribute that will be set to the name of the table. This outer structure of the XML schema is represented in Figure 4.

The schema further defines a Table element as being composed of the database content related to each relation. This includes the relation's row and column data (database records) and any fuzzy relations or similarity relations associated with its attributes. Figure 5 illustrates how record information is stored in the XML document. The schema defines a Table element as a complex type composed of Row elements (the Table element is also composed of SimilarityMatrix and FuzzyRelation elements, which are discussed later in this section). A Row element is composed of Column elements. Each Row element in the XML document holds a record, whose column values are stored as the value for each Column element. Column elements are also described by the name, type, and nullable attributes.

The database structure described in Section 4 stores similarity and fuzzy relations as separate tables. To keep the corresponding XML document simple, our XML schema stores an attribute's fuzzy data along with the table


Fig. 5 Definition of the Row Element for Storing Table Records

Fig. 6 Definition of the SimilarityMatrix Element



Fig. 7 Definition of the FuzzyRelation Element

that contains the attribute. Thus, the schema defines a Table element as a complex type containing Row elements for each record (see Figure 5), SimilarityMatrix elements for similarity relations (see Figure 6), and FuzzyRelation elements for fuzzy relations (see Figure 7). Figure 6 depicts the schema definition of a SimilarityMatrix element, which represents a similarity relation [4] on an attribute. Each SimilarityMatrix element is described by the assocColumn and Type attributes that are set to



the name and data type of the column that the relation is associated with. The definition also describes the SimilarityMatrix element as being composed of a number of CrossRef elements, each of which contains Value1, Value2, and Match element. By following this definition, the XML document can describe the similarity (Match) between two possible values (Value1 and Value2) of a database attribute. Figure 7 depicts the schema definition of a FuzzyRelation element. Our representation of fuzzy relations using linguistic terms is similar to the method described by Turowski and Weng [22]. Each FuzzyRelation element is described by the assocColumn and Type attributes that are set to the name and data type of the column that the relation is associated with. The definition also describes the FuzzyRelation element as being composed of a number of LinguisticTerm elements. A LinguisticTerm element contains one FuzzySet element consisting of a number of Point elements. Each Point element is described by an x value and membership element. By following this definition, the XML document can describe a point’s membership (based on the x value) to the set (FuzzySet) described by a linguistic term (LinguisticTerm). By following our schema, an XML document can store all of the information contained in a fuzzy relational database. This information will be easily accessed since the XML document follows a logical structure that is easy to both read and query. For a summary of the schema, refer to Figure 8.

4.3 Fuzzy Relational to XML Conversion Algorithm

The fuzzy relational to XML conversion algorithm we chose to integrate as part of VIREX following our schema structure can be outlined as follows:

1. Add the Database start tag to the XML document with the name attribute set to the name of the database. Set the xsi:schemaLocation to point to the location of the XML Schema this document is to adhere to.
2. Retrieve all table names from the database.
3. For each table:
   a. Add the Table start tag to the XML document with the name attribute set to the table name.
   b. Query the database to get the column data (name, type, nullable value) for the current table and then query for the row data using the column names.
   c. For each row:
      i. Add the Row start tag to the XML document.
      ii. For each column:
         A. Add the Column start tag to the XML document and set the name, type, and nullable attributes to their corresponding values (from part b).
         B. Set the value of the Column element to the value retrieved for the current row/column.




         C. Add the Column end tag to the XML document.
      iii. Add the Row end tag to the XML document.



   d. Query the database to get all similarity matrix data for the current table. A similarity matrix belonging to a table is identified by appending 'SM' to the name of the table, followed by the name of the matrix (Section 4).
   e. For each similarity matrix:
      i. Add the SimilarityMatrix start tag to the XML document and set the assocColumn and type attributes to their corresponding values (from part d).
      ii. For each cross reference:
         A. Add the CrossRef start tag to the XML document.
         B. Add the Value1, Value2, and Match elements and set their corresponding values (from part d).
         C. Add the CrossRef end tag to the XML document.
      iii. Add the SimilarityMatrix end tag to the XML document.
   f. Query the database to get all fuzzy relations for the current table. A fuzzy relation belonging to a table is identified by appending 'FR' to the name of the table, followed by the name of the relation (Section 4).
   g. For each fuzzy relation:
      i. Add the FuzzyRelation start tag to the XML document.
      ii. For each linguistic term:
         A. Add the LinguisticTerm start tag to the XML document and set the term attribute to its corresponding value (from part f).
         B. Add the FuzzySet start tag to the XML document.
         C. For each point in the fuzzy set:
            • Add the Point start tag to the XML document.
            • Add the x value and membership elements and set their corresponding values.
            • Add the Point end tag to the XML document.
         D. Add the FuzzySet end tag to the XML document.
         E. Add the LinguisticTerm end tag to the XML document.
      iii. Add the FuzzyRelation end tag to the XML document.
   h. Add the Table end tag to the XML document.
4. Add the Database end tag to the XML document.

Fig. 8 XML schema summary for the example Student table
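The loop structure of steps 1–3 may be easier to see in code. The following is a minimal sketch, not VIREX's actual implementation: it uses Python with SQLite in place of Cloudscape/JDBC, builds the document with ElementTree rather than writing tags by hand, and omits steps d–g (similarity matrices and fuzzy relations) as well as the type and nullable metadata.

    import sqlite3
    import xml.etree.ElementTree as ET

    def relational_to_xml(conn, db_name, tables):
        """Sketch of steps 1-4: one Database element, one Table element per
        relation, one Row element per record, one Column element per value."""
        database = ET.Element("Database", name=db_name)          # step 1
        for table in tables:                                     # step 3
            tbl = ET.SubElement(database, "Table", name=table)   # step 3a
            cur = conn.execute(f"SELECT * FROM {table}")         # step 3b
            col_names = [d[0] for d in cur.description]
            for record in cur:                                   # step 3c
                row = ET.SubElement(tbl, "Row")
                for name, value in zip(col_names, record):
                    col = ET.SubElement(row, "Column", name=name)
                    col.text = str(value)
        return ET.ElementTree(database)                          # step 4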

4.4 Illustrative Fuzzy Relational Implementation and the Corresponding XML Schema

After defining the structure of our fuzzy relational database, XML schema, and the conversion algorithm, we implemented a real-life instance to demonstrate the power of VIREX in converting a fuzzy relational database into fuzzy XML. We implemented the fuzzy relational database using Cloudscape V5.1, which is the DBMS provided by IBM WebSphere Studio Application Developer V5.1.2 (WSAD). We created a fuzzy relational database based on the Student example described in Section 4.



Fig. 9 The example database schema; only the student table and all its attributes are selected

VIREX connects to the relational database and derives a corresponding visual diagram, which is displayed on the screen; the diagram that corresponds to the implemented example database is shown in Figure 9. VIREX allows the conversion of all or part of the fuzzy relational database into fuzzy XML; this is made possible by allowing the user to specify (by ticking inside the small boxes next to each relation/attribute name) from the relational model the elements to be converted into XML. The user selects the tables and attributes to be converted into XML. For the example diagram displayed in Figure 9, only the Student table and all its attributes have been selected. Here it is worth noting that the fuzziness is hidden inside the actual relational database at the backend and nothing related to such information is reflected into the displayed visual diagram. Rather, the fuzziness is considered during the conversion into XML, as evident in Figure 10, which reflects the fuzziness as described next. After specifying the tables and the attributes to be converted into XML, we run the conversion process of VIREX by specifying on the screen (shown in Figure 11) the fuzzy relational database to be converted into fuzzy XML. VIREX starts the conversion process by connecting to the specified database using a JDBC connection to create the resulting fuzzy XML document, which conforms to the XML schema outlined in Section 4. The XML document created to represent the example Student table is given in Figure 10; the XML schema is summarized in Figure 8 (some data was removed to simplify the example). Figure 10 illustrates how VIREX considers the available



Fig. 10 Student Table – XML Representation


Fig. 11 VIREX Screen for Specifying Fuzzy Relational to Fuzzy XML Conversion

fuzziness in the conversion process. Both the similarity matrix and fuzzy relations are included as part of the produced XML schema. In particular, the similarity matrix for the ATTITUDE attribute and the fuzzy relations for the two attributes ATTENDANCE and MARKS are derived by VIREX and are partially displayed in Figure 10. A similarity matrix mostly reflects the information related to a discrete membership function, while a fuzzy relation represents points along the curve that corresponds to a continuous fuzzy membership function. Finally, the output fuzzy XML is displayed on the screen and may be stored in a file to be named by the user.
It is worth noting that an XML representation of the fuzzy relational database has considerable practical value. One particular example for our Student database would be formatting the data in the fuzzy relational database to be viewed through an Internet application. For example, an Internet application may allow students to sign on to view their current standing in a course (or overall). It is straightforward to parse an XML document using the Extensible Stylesheet Language (XSL) to create formatted HTML. Various charts could be created from the Student data, including the fuzzy and similarity relations. Querying the database to get this information and then creating HTML would be far more complicated than applying XSL to an XML document. Other practical uses for generating an XML representation of a fuzzy relational database would be transmitting data between applications running on different platforms or DBMSs, and sending data over the Internet.
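As a sketch of that XSL route, the following applies a minimal stylesheet with the third-party lxml library; the stylesheet, the file name student.xml, and the flat HTML table layout are our own assumptions, not part of VIREX.

    from lxml import etree

    # Turn every Row of the converted document into an HTML table row.
    stylesheet = etree.XML(b"""\
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <html><body><table>
          <xsl:for-each select="//Row">
            <tr>
              <xsl:for-each select="Column">
                <td><xsl:value-of select="."/></td>
              </xsl:for-each>
            </tr>
          </xsl:for-each>
        </table></body></html>
      </xsl:template>
    </xsl:stylesheet>
    """)

    transform = etree.XSLT(stylesheet)
    html = transform(etree.parse("student.xml"))
    print(etree.tostring(html, pretty_print=True).decode())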



5 Conclusions and Future Work

XML schemas have replaced DTDs as the new standard for setting constraints on XML documents. As such, we have described a fuzzy XML schema to represent an implementation of a fuzzy relational database that allows for similarity relations and fuzzy sets. We have also provided a flat translation algorithm to translate from the fuzzy database implementation to a fuzzy XML document that conforms to the suggested fuzzy XML schema. The proposed algorithm has been implemented within VIREX, and a demonstrative example has been included to illustrate the power of VIREX in converting fuzzy relational data into fuzzy XML.
Currently, we are investigating and working on the following extensions to the presented approach. The fuzzy database model and the fuzzy XML schema are to be expanded to incorporate other sorts of fuzziness such as fuzzy rules [12], fuzzy integrity constraints [19, 24] and non-atomic data values [3, 4, 5, 19, 24]. As well, the flat translation techniques used to convert from the fuzzy relational database to an XML document conforming to the proposed XML schema are to be optimized. Whether nested conversion techniques might apply to a fuzzy relational database will also be investigated. For this aspect, we will try to benefit from and extend the already functional implementation of VIREX for producing nested XML without fuzziness. As we have successfully produced flat fuzzy XML, we anticipate the process of producing nested fuzzy XML to be a straightforward extension of the existing nested implementation of VIREX.

References

1. Anvari, M., Rose, G.F.: Fuzzy Relational Databases. In: Bezdek (ed.) Analysis of Fuzzy Information, vol. II. CRC Press, Boca Raton (1987)
2. Bosc, P., Galibourg, M., Hamon, G.: Fuzzy querying with SQL: extensions and implementation aspects. Fuzzy Sets and Systems 28, 333–349 (1988)
3. Buckles, B.P., Petry, F.E.: Fuzzy Databases in the New Era. In: Proceedings of ACM Symposium on Applied Computing, pp. 497–502 (1995)
4. Buckles, B.P., Petry, F.E.: A Fuzzy Representation of Data for Relational Databases. Fuzzy Sets and Systems 7, 213–226 (1982)
5. Dey, D., Sumit, S.: A Probabilistic Relational Model and Algebra. ACM Transactions on Database Systems 21, 339–369 (1996)
6. Duta, A., Barker, K., Alhajj, R.: Converting Relationships to XML Nested Structures. Journal of Information and Organizational Sciences 28, 1–2 (2004)
7. Fernandez, M., Tan, W.-C., Suciu, D.: SilkRoute: Trading between Relations and XML. In: Proceedings of the International Conference on World Wide Web, Amsterdam (May 2000)
8. Fong, J., Pang, F., Bloor, C.: Converting Relational Database into XML Document. In: Proceedings of the International Workshop on Electronic Business Hubs, September 2001, pp. 61–65 (2001)



9. Jang, J., Sun, C.: Neuro-Fuzzy Modeling and Control. Proceedings of the IEEE 83, 378–406 (1995)
10. Lee, D., Mani, M., Chiu, F., Chu, W.W.: Schema Conversion Methods between XML and Relational Models. In: Knowledge Transformation for the Semantic Web (2003)
11. Lee, D., Mani, M., Chiu, F., Chu, W.W.: NeT & CoT: translating relational schemas to XML schemas using semantic constraints. In: Proceedings of ACM International Conference on Information and Knowledge Management (2002)
12. Lee, J., Fanjiang, Y., Kuo, J., Lin, Y.: Modeling Imprecise Requirements with XML. Fuzzy Systems 2, 861–866 (2002)
13. Lo, A., Alhajj, R., Barker, K.: Flexible User Interface for Converting Relational Data into XML. In: Proceedings of the International Conference on Flexible Query Answering Systems, June 2004. Springer, Lyon (2004)
14. Lo, A., Alhajj, R., Barker, K.: VIREX: Visual relational to XML conversion tool. Journal of Visual Languages and Computing 17(1), 25–45 (2006)
15. Medina, J.M., Pons, O., Vila, M.A.: GEFRED: A Generalized Model of Fuzzy Relational Databases Version 1.1. Information Sciences (1994)
16. Munroe, K.D., Papakonstantinou, Y.: BBQ: A visual interface for browsing and querying of XML. In: Proceedings of IFIP Working Conf. on Visual Database Systems, pp. 277–296 (2000)
17. Oliboni, B., Pozzani, G.: Representing Fuzzy Information by Using XML Schema. In: Proceedings of the 19th International Conference on Database and Expert Systems Application, DEXA, pp. 683–687. IEEE Computer Society, Washington (2008)
18. Orsini, R., Pagotto, M.: Visual SQL-X: A Graphical Tool for Producing XML Documents from Relational Databases. In: Proceedings of the International Conference on World Wide Web, Hong Kong (2001)
19. Raju, K.V., Majumdar, A.K.: Fuzzy Functional Dependencies and Lossless Join Decomposition of Fuzzy Relational Database Systems. ACM Transactions on Database Systems 13, 129–166 (1988)
20. Shanmugasundaram, J., Shekita, E., Barr, R., Carey, M., Lindsay, B., Pirahesh, H., Reinwald, B.: Efficiently Publishing Relational Data as XML Documents. The VLDB Journal 10, 133–154 (2001)
21. Thompson, H.S., Beech, M., Maloney, M., Mendelsohn, N.: XML Schema Part 1: Structures. W3C Recommendation (October 2004)
22. Turowski, K., Weng, U.: Representing and processing fuzzy information – an XML-based approach. Knowledge-Based Systems 15, 67–75 (2002)
23. Wang, C., Lo, A., Alhajj, R.: Novel Approach for Reengineering Relational Databases into XML. In: Proceedings of XSDM (in conjunction with IEEE International Conference on Data Engineering), Tokyo, Japan (April 2005)
24. Wang, S., Shen, J., Hong, T., Chang, B.C.H.: Incremental Discovery of Functional Dependencies from Similarity-based Fuzzy Relational Databases Using Partitions. In: Proceedings of the National Conference on Fuzzy Theory and Its Applications, pp. 629–636 (2001)
25. Yan, L., Ma, Z.M., Liu, J.: Fuzzy data modeling based on XML schema. In: Proceedings of the 2009 ACM Symposium on Applied Computing, pp. 1563–1567 (2009)



26. Yang, K.Y., Lo, A., Özyer, T., Alhajj, R.: DWG2XML: Generating XML Nested Tree Structure from Directed Weighted Graph. In: Proceedings of ICEIS, Miami (May 2005)
27. Zadeh, L.: Fuzzy Sets. Information and Control 8, 338–353 (1965)
28. Zvieli, A., Chen, P.P.: Entity-relationship modeling and fuzzy databases. In: Proceedings of IEEE International Conference on Data Engineering, Los Angeles, pp. 320–327 (1986)
29. Zemankova, M., Kandel, A.: Fuzzy Relational Data Bases – A Key to Expert Systems. Verlag TÜV Rheinland, Köln (1984)



Data Integration Using Uncertain XML

Ander de Keijzer

Abstract. Data integration has been a challenging problem for decades. In an ambient environment, where many autonomous devices have their own information sources and network connectivity is ad hoc and peer-to-peer, it even becomes a serious bottleneck. In addition, the number of information sources, both per device and in total, keeps increasing. To enable devices to exchange information without the need for interaction with a user at data integration time and without the need for extensive semantic annotations, a probabilistic approach seems rather promising. It simply teaches the device how to cope with the uncertainty occurring during data integration. Unfortunately, without any kind of world knowledge, almost everything becomes uncertain; hence maintaining all possibilities produces huge integrated information sources. Automatically integrating data sources, using very simple knowledge rules to rule out most of the nonsense possibilities, combined with storing the remaining possibilities as uncertainty in the database and resolving these during querying by means of user feedback, seems a promising solution. In this chapter we introduce this “good is good-enough” integration approach and explain the uncertainty model that is used to capture the remaining integration possibilities. We show that using this strategy, the time necessary to integrate documents drastically decreases, while the accuracy of the integrated document increases over time.

Ander de Keijzer
Institute of Technical Medicine, University of Twente, PO Box 217, 7500AE Enschede, The Netherlands
e-mail: a.dekeijzer@utwente.nl

1 Introduction

Data integration is a difficult task, even when tools to assist the user are available. The problem with most approaches is that the integrated data source can only be used after the entire source documents are integrated and (semantic) problems are resolved. In this chapter, we introduce a method to automate the data integration process.




First, the user can formulate (simple) knowledge rules that allow the system to determine which integrated results are highly likely, or very unlikely. Next, results are stored, even if there is uncertainty about the integrated data. The uncertainty in this case is simply stored as such. This allows the system to integrate without the user having to be present. The (possibly uncertain) integration result can be queried. Since the document can contain uncertainty, the results to a query can be uncertain too. A feedback mechanism allows the users to indicate if answers are definitely true, or definitely false. Both statements eliminate possibilities stored in the database. As a result, querying the document and providing feedback on the results ultimately leads to a fully integrated, certain document.

2 Uncertain Data Integration

Although schema integration still is a challenging topic (see for a survey [5], and [1] for a more recent, multi-strategy approach), from this point on, we assume schema integration to be resolved and we will focus on the problem of data integration. At data integration time, elements from different sources have to be merged into a new data source. There can be, and usually is, overlap between data from one source and data from the other source. By overlap we mean elements referring to the same real world object, but not necessarily containing exactly the same information, or even the same kind of information. This is shown in Table 1, where the two tables contain address books, but only the first table also contains the email address.

Table 1 Two semantically similar data sources with different schemas

(a) Data source A

  name  firstname  room     email
  Doe   John       ZI-3122  john@doe.com
  King  Ed         ZI-2012  ed@king.edu

(b) Data source B

  name      building  room  phone
  John Doe  ZI        3122  4243
  Ed King   ZI        2012  3519

Because the data sources may not contain the same kind of information, it is difficult to make a positive decision about equality of elements, partly because of the semantics hidden in the schema. If, for example, the name elements don't correspond, the likelihood of address book elements referencing the same person is low; whereas if just the email addresses don't correspond, then, taking into account that people have multiple email addresses and even change their address from time to time, the likelihood is higher.

2.1 General Approach

A device's database is a probabilistic XML document. When data integration with a foreign probabilistic XML document is initiated, the foreign document is considered to be a source of 'new' information on real world objects the device either already knows about or not. New information on 'new' real world objects is simply added to the database. Any differences in information on 'existing' real world objects are



regarded as different possibilities for that object. Note that we disregard possibilities concerning order. New information on 'new' real world objects is simply considered to come after information on known objects in document order. Since it is often not possible to determine with certainty that two specific XML elements correspond to the same real world object, we use a rule engine that determines the probability of two elements referring to the same real world object. In special cases, this rule engine may obviously decide on a probability of 0 (with certainty not the same real world object) or 1 (with certainty the same real world object). We abstract from the details of the Oracle, but imagine that it uses schema information to rule out possibilities. Or it may, for example, consult a digital street map to declare a certain street name very improbable as there exists no such street in that city. Or it may use Semantic Web techniques to reason away possibilities. On top of the Oracle, the integration system uses two rules to further limit the decisions about integration.

• The schema states that a certain element can appear only once. We assume that this means that the elements of both documents refer to the same real world object; hence, the subtrees are correspondingly merged. For example, if two person elements refer to the same real world person, their descendant elements that are declared in the schema as appearing only once (e.g., nm and tel) are merged. If two corresponding descendant elements differ, we store this as two possibilities for that element.
• The schema states that a certain element can appear multiple times. We assume that this means that the foreign document may contain new elements for this list. For example, the database contains knowledge about two persons “John” and “Rita”. The foreign document holds information on a person “Jon”. Note that “Jon” may be the same person as “John”, only misspelled, or it may refer to a different person. The data integrator will store both possibilities, i.e., one whereby it merges “John” and “Jon”, and one whereby it adds a new person element. Each possibility is assigned a probability by the Oracle. For example, it is not unthinkable that “Jon” and “John” are actually the same person. On the other hand, it is rather improbable that “Rita” and “Jon” are the same person.

The minimal set of rules used by our prototype also includes that there can only be one root in an XML document and that schemas of integrated documents are the same, so different tag names are assumed to refer to different real world objects.

2.2 Integrating Sequences

In general, integrating sequences produces possibilities for all elements referring to either the same or different real world objects. Since we made an assumption that the schemas are the same and that elements with different tag names refer to different real world objects, many of those possibilities are ruled out. However, this rule does not limit the possibilities for sequences of elements with the same tag name.


Table 2 Possibilities for merging sequences x = {A, B} and y = {C, D}

  Referral to real world object   Resulting sequence
  A ≠ B ≠ C ≠ D                   A, B, C, D
  A = C, B ≠ C ≠ D                A/C, B, D
  A = D, B ≠ C ≠ D                A/D, B, C
  A ≠ C ≠ D, B = C                A, B/C, D
  A ≠ C ≠ D, B = D                A, B/D, C
  A = C, B = D                    A/C, B/D
  A = D, B = C                    A/D, B/C

Example 1. We integrate address information of people. We are confronted with integrating sequences of person elements. Because our basic Oracle contains very limited knowledge, any two elements, one from each sequence, possibly refer to the same real world object. Therefore, when merging two sequences, X and Y, the resulting number of possibilities can be huge. Let, for example, X = [A, B] and Y = [C, D]. The possibilities to be generated during integration of X and Y are listed in Table 2. In the table, A = C indicates that A and C are considered to refer to the same real world object; hence, they should result in a single possibility where A and C are merged: A/C. Since the database already represents all possibilities explicitly, we do not need to consider two elements from one sequence to refer to the same real world object, so A = B and C = D are not valid possibilities.

Integration Formally Defined

We now formally define integration of two sequences of elements. There are two sets that are important during this phase. The first set is those element combinations that are certainly referring to the same real world object. This subset is called Must and contains exactly those (element, element) combinations for which the Oracle predicted a confidence score of 1. The other important set during this phase is called Not and contains all those (element, element) combinations for which the Oracle predicted a confidence score of 0; these combinations should therefore not be included in the integrated document. We first introduce some notation.

$A \to B = \{ f \subseteq A \times B \mid \forall a \, \exists_1 b \bullet (a, b) \in f \}$
$A \leftrightarrow B = \{ f : A \to B \mid \forall a, a' \bullet f\,a = f\,a' \Rightarrow a = a' \}$

Given $A$, $B$, $Must : A \leftrightarrow B$, $Not \subseteq A \times B$, the integrated document $R$ is defined as follows:

$R = \{ f : A \leftrightarrow B \mid Must \subseteq f \wedge f \cap Not = \emptyset \}$

Document R contains all those sets of (element, element) combinations, such that it includes Must and does not include any of the (element, element) combinations of Not.



We now define a function Compl that completes the integrated document with all possibly integrated elements. We start with Must and add to it those (element, element) combinations that are not in Not. The result of Compl is the fully integrated, uncertain document.

$Compl(f, A', B') = \{ g : A' \leftrightarrow B' \mid g \cap Not = \emptyset \bullet f \cup g \}$   (1)

then,

$Compl(Must, A \setminus dom(Must), B \setminus ran(Must))$
$\quad = \{ g : (A \setminus dom(Must)) \leftrightarrow (B \setminus ran(Must)) \mid g \cap Not = \emptyset \bullet Must \cup g \}$   (2)
$\quad = \{ f : A \leftrightarrow B \mid f \cap Not = \emptyset \wedge f \supseteq Must \bullet f \}$   (3)

From step 2 to 3 we assume $Must \cap Not = \emptyset$. This assumption always holds, since Must contains exactly those element combinations where the Oracle predicted a confidence score of exactly 1, whereas Not contains all those combinations where the Oracle returned a confidence score of 0. The Oracle only provides one confidence score for each (element, element) combination and can therefore never return both 1 and 0 as scores. As a result, $Must \cap Not = \emptyset$ always holds.
As can be seen from the formal definition, the number of possibilities generated from integrating two sequences is large, even for small documents. We show how many possibilities are generated if no world knowledge is taken into account. When all elements of X and Y refer to other real world objects, the number of resulting possible worlds is 1. But, when one element from X refers to the same real world object as an element from Y, there are $x \times y$ possible ways how this can be done, since every element from X can in principle be matched with every element from Y. In general, if i elements from X match with i elements from Y, then the number of possible ways to merge i elements from X with i elements from Y can be computed as follows. In the following, $i \leq \min(x, y)$, where x is the number of elements in X and y is the number of elements in Y. Choose i different elements from X, where the order of choosing the elements is unimportant, but an element cannot be chosen more than once. This can be done $\binom{x}{i} = \frac{x!}{(x-i)!\,i!}$ ways. Then, we choose i elements from Y to merge with those chosen from X. Since the first chosen element from X should be merged with the first element chosen from Y, order is important when choosing elements from Y. The number of ways to choose the i elements from Y is $\frac{y!}{(y-i)!}$.
The process of merging sequences is commutative, so we assume $x \leq y$. In determining all possibilities, any i ($0 \leq i \leq x$) elements of X may refer to the same real world object as elements of Y. Therefore, the resulting total number of possibilities for a merged sequence is

$\sum_{i=0}^{x} \binom{x}{i} \times \frac{y!}{(y-i)!}$

We see from this formula that merged documents can become huge quite rapidly. If we take, for example, x = 5 and y = 5, then the maximum number of possibilities is 1546.
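A few lines of Python reproduce this count; the function name is ours, the formula is the one above.

    from math import comb, factorial

    def merge_possibilities(x: int, y: int) -> int:
        """Number of ways to merge sequences of x and y elements when any
        element of one sequence may match any element of the other."""
        x, y = min(x, y), max(x, y)   # merging is commutative
        return sum(comb(x, i) * factorial(y) // factorial(y - i)
                   for i in range(x + 1))

    print(merge_possibilities(5, 5))  # -> 1546, as in the text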



The rule engine, however, may rule out certain possibilities. For example, if in the case of Table 2, A refers to a person named “John” and C to a person named “Rita”, the rule engine may assign probability 0 to the likelihood that A = C. In this way, it rules out two of the seven possibilities.

Integration Method Implementation

We implemented the integration method in XQuery. The implementation is kept as close to the formal definition as possible. A pseudo code implementation of the algorithm is given in Figure 1. Below, we first show an example of integrating two certain trees to illustrate the recursive process. The data integration function integrate takes two parameters D1 and D2. It returns the integration result as a probabilistic XML tree. In the diagrams below, we have omitted probability and possibility nodes whenever there is only one possibility; the original tree diagrams are rendered here in a bracketed text form, with prob(...) for probability nodes and poss[...] for possibility nodes. The example below shows how we can recursively integrate two certain trees:

  integrate( person["John"], person["Rita"] )

After the first integration step, we obtain:

  prob( poss[ person["John"], person["Rita"] ],
        poss[ person[ integrate("John", "Rita") ] ] )

The second integration step, integrate("John", "Rita"), results in the final integrated document:

  prob( poss[ person["John"], person["Rita"] ],
        poss[ person[ prob( poss["John"], poss["Rita"] ) ] ] )

Observe the difference between integrating person elements, which are specified as being part of a sequence, and other elements for which there can only be one, for example the name node. The former produces an additional possibility for the case that there exist two persons. In general, text nodes are also part of a sequence (e.g., paragraphs in a text document). Concatenating names of persons, however, does not make sense, so the integration system decides that, for example, the name of a person can not be “JohnRita”.


complete(A, B, Not) : P(A × B)
begin
  result := ∅
  for each a ∈ A, b ∈ B
    if ( (a → b) ∉ Not ) then
      c := complete(A \ {a}, B \ {b}, Not)
      result := result ∪ { c, {(a → b)} ∪ c }
  return result
end

combinations(A, B) : P(A × B ∪ A ∪ B)
begin
  mustbe := ∅; not := ∅; result := ∅
  for each a ∈ A, b ∈ B
    o := oracle(a, b)
    if (o = 1) then mustbe := mustbe ∪ {(a → b)}
    else if (o = 0) then not := not ∪ {(a → b)}
  c := complete(A \ dom(mustbe), B \ ran(mustbe), not)
  for each f in c
    f' := f ∪ mustbe
    result := result ∪ { f' ∪ (A \ dom(f')) ∪ (B \ ran(f')) }
  return result
end

integrate(E1, E2)
begin
  if (E1 and E2 are text nodes) then
    if (E1/text() = E2/text()) then
      result := <prob><poss>E1</poss></prob>
    else
      result := <prob><poss>E1</poss><poss>E2</poss></prob>
  else
    A := E1/child::node(); B := E2/child::node(); E := E1/name()
    comb := combinations(A, B)
    result := <prob/>
    for each f in comb
      p := <poss/>
      for each m in f
        if ("m of the form (a → b)") then
          p.addchild(<E>integrate(a, b)</E>)
        else
          p.addchild(m)
      result.addchild(p)
  return result
end

Fig. 1 Integration Algorithm



2.3 Equivalence Preserving Operation

An interesting property of the data integration approach described above is that it preserves equivalence. Let D1 and D2 be two probabilistic XML documents and A = integrate(D1, D2) the result of integrating them. Suppose D′1 and D′2 are equivalent to D1 and D2 respectively. Is then A′ = integrate(D′1, D′2) equivalent to A? Although proving equivalence preservation is future research, we give an example that illustrates this property.
There is a special case for which this property is especially interesting. The set of possible worlds can be represented as a probabilistic tree with one probabilistic node as root and all possible worlds as possibilities directly below it. Figure 4(a) is of this form. Since the above property holds, integrating two probabilistic trees amounts to integrating all combinations of possible worlds of both trees.
We first show the integration of a compact tree with two possibilities with a certain tree. Next, we show the integration of an equivalent tree in set-of-possible-worlds representation with the same certain tree. The algorithm presented in the previous section integrates two certain documents, producing one uncertain integrated document. Here we show, based on an example, the method to integrate a probabilistic tree and a certain document. In the bracketed notation used above:

  integrate( person[ prob( poss["John"], poss["Jon"] ) ],  person["Rita"] )

We would first integrate both person elements:

  prob( poss[ person[ prob( poss["John"], poss["Jon"] ) ], person["Rita"] ],
        poss[ person[ integrate( prob( poss["John"], poss["Jon"] ), "Rita" ) ] ] )

where

  integrate( prob( poss["John"], poss["Jon"] ), "Rita" )

intuitively leads to

  prob( poss["John"], poss["Jon"], poss["Rita"] )

The entire resulting tree looks like:

  prob( poss[ person[ prob( poss["John"], poss["Jon"] ) ], person["Rita"] ],
        poss[ person[ prob( poss["John"], poss["Jon"], poss["Rita"] ) ] ] )

If we restrict ourselves to names of persons, the resulting document can be described using a simplified form of boolean notation, as:

  (Rita ∨ John ∨ Jon) ∨ ((John ∨ Jon) ∧ Rita)   (4)

Moving the local possibility upwards in the tree, we get an equivalent, less compact tree that is in all-possible-worlds representation. The integrate function now behaves as being applied to each possible world separately:

  integrate( prob( poss[ person["John"] ], poss[ person["Jon"] ] ),  person["Rita"] )

We integrate the person named “Rita” over both possibilities, resulting in the following:

  prob( poss[ integrate( person["John"], person["Rita"] ) ],
        poss[ integrate( person["Jon"], person["Rita"] ) ] )

The final result is:

  prob( poss[ person["John"], person["Rita"] ],
        poss[ person[ prob( poss["John"], poss["Rita"] ) ] ],
        poss[ person["Jon"], person["Rita"] ],
        poss[ person[ prob( poss["Jon"], poss["Rita"] ) ] ] )

The boolean representation is:

  ((John ∧ Rita) ∨ (John ∨ Rita)) ∨ ((Jon ∧ Rita) ∨ (Jon ∨ Rita))   (5)

Note that this is equivalent to the earlier obtained boolean representation. The trees are equivalent.

3 Knowledge Rules

In the previous section, we did not use any world knowledge when integrating information sources. As a result, the number of possibilities in the resulting information source was huge. The size of this result can be reduced drastically [3], just by using very simple rules about the real world. Extensive experiments about the effect of knowledge rules can be found in [6]. Knowledge rules can be either generic, such as

  If two elements have at most one element for which the value differs, the elements can possibly refer to the same real world object

and domain specific rules, such as

  If title elements of movies match, then the movies themselves match



As can be seen from the above two examples, knowledge rules give an absolute statement about whether two elements refer to the same real world object. According to such a rule, two elements are either referring to the same real world object, or they are not. A knowledge rule can therefore be defined as a function that takes two elements as input and gives a boolean as output, indicating whether the elements refer to the same real world object (true), or not (false).

Definition 1. Let r : (element × element) → boolean be an interface to a function that returns whether, based on its implementation, the two elements given as parameters possibly refer to the same real world object.

In case the elements are considered not to refer to the same real world object, they are not integrated, hence not passed to the Oracle for evaluation. All element combinations that are positively evaluated by all enabled knowledge rules are passed to the Oracle. If none of the knowledge rules is enabled, the Oracle alone determines whether elements possibly refer to the same real world object. Knowledge rules are therefore used to reduce the number of possible matches, rather than to indicate whether two elements actually refer to the same real world object. By using combinations of knowledge rules, the accuracy of the process increases. Determining the probability that two items are equal is not the topic of this chapter. In [6], we also argue that the exact method used to choose this probability is not very important to obtain a good result.

Example 2. In the movie example, we have defined several knowledge rules. The first knowledge rule is a refinement of the one given earlier, which states that two movies are not equal if their titles are not similar. Similarity in this case is based on the edit distance of the titles. Movies with titles “King Kong” and “Die Hard” would, according to this rule, not be considered to refer to the same real world object, given some value for the threshold. As a result, they are not passed to the Oracle for comparison. Note that for some combinations of films this does not hold, e.g. sequels like Die Hard and Die Hard II. Since XML elements are nested, a knowledge rule can also use subelements of the elements passed to the rule. In this way, more elements can be included in the decision process at the same time. A sketch of such a rule is given below.
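The following is a minimal sketch of such a title-based knowledge rule, matching the signature of Definition 1; the 0.8 threshold and the use of Python's difflib ratio as a stand-in for a real edit-distance measure are our own illustrative choices.

    from difflib import SequenceMatcher
    from xml.etree.ElementTree import Element

    def titles_similar(m1: Element, m2: Element, threshold: float = 0.8) -> bool:
        """Knowledge rule r : (element x element) -> boolean. Returns False
        when two movie titles are clearly dissimilar, so the pair is never
        passed on to the Oracle; True keeps the pair in play."""
        t1 = m1.findtext("title", default="")
        t2 = m2.findtext("title", default="")
        return SequenceMatcher(None, t1.lower(), t2.lower()).ratio() >= threshold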

4 Storing Uncertainty in XML

In this section we will discuss the data model for probabilistic or uncertain XML [4], which is used to store the uncertain integrated data source. A formal definition of the uncertain XML structure is given and the semantics behind the data model is discussed. Some properties of the model are highlighted and two storage improvements on the data model are presented.



4.1 Possible Worlds

The semantics used in the probabilistic XML model is that of the possible worlds. This semantics is used in several other uncertain and probabilistic models and projects and is an intuitive interpretation of the uncertainty associated with the data. If a database is considered to hold information on real world objects, then an uncertain database holds possible representations of those real world objects. Each of those possible representations can have an associated probability. If one of the possibilities for a real world object is not to exist, then this also is considered to be one of the possible representations. A possible world is constructed by choosing one representation for each of the real world objects in the database. Instead of one database, an uncertain database can be seen as a set of possible databases. Or, if a database represents (part of) the real world, an uncertain database represents a set of (parts of) possible worlds.
As an example consider Table 3. In this table information about two people, named John and James, is stored. For both “John” and “James” the phone number is uncertain and in both cases there are two possibilities, or alternatives, for the value of the attribute Phone. From this table 2 × 2 = 4 possible worlds can be constructed, all combinations between different possibilities for each of the people stored in the database.

Table 3 Construction of Possible Worlds

(a) Source Database Addresses

  Name   Phone
  John   555-1234
  John   555-4321
  James  555-5678
  James  555-8765

(b) Possible Worlds

  World 1              World 2
  Name   Phone         Name   Phone
  John   555-1234      John   555-1234
  James  555-5678      James  555-8765

  World 3              World 4
  Name   Phone         Name   Phone
  John   555-4321      John   555-4321
  James  555-5678      James  555-8765
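The construction of Table 3(b) is simply a Cartesian product over the per-object alternatives. A minimal sketch, with the data copied from Table 3(a) and everything else our own:

    from itertools import product

    # Alternatives per real world object, as in Table 3(a).
    alternatives = {
        "John":  ["555-1234", "555-4321"],
        "James": ["555-5678", "555-8765"],
    }

    # Each possible world picks one alternative per object: 2 x 2 = 4 worlds.
    names = list(alternatives)
    for i, phones in enumerate(product(*alternatives.values()), start=1):
        print(f"World {i}:", dict(zip(names, phones)))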

4.2 Probabilistic XML

In this section, we will introduce the notion of probabilistic XML, using the possible world approach described earlier. Following the possible world approach, we store possible appearances of the database instead of one actual appearance using XML as underlying data model. Consequently, our data model is a probabilistic XML data model. The simplest way to construct uncertain XML using the possible world approach, is by enumerating all possible worlds in different subtrees and



combining those subtrees into one XML document. If desired, probabilities indicating the relative likelihood of each of the worlds, can be associated with the subtrees. This representation is called the possible world representation. Figure 2 shows the probabilistic XML representation of the possible worlds in Table 3. In this figure the actual XML nodes are replaced by (· · · ) to increase readability. These should be replaced by certain XML trees representing that particular world.

  [Figure 2: a probability root node with four possibility children, weighted P(World1) ... P(World4); each possibility carries one certain possible world (elided as “· · ·”) as its subtree.]

Fig. 2 Possible world representation of Address Book Example (XML)

Figure 2 shows that only the top level of the document contains a choice and all of the subtrees of the top level nodes are certain XML documents. Since most possible worlds largely overlap, most nodes in the document are duplicated in several possible worlds. Therefore, the possible world representation, although theoretically interesting, semantically sound and easy to understand, is not practical. However, it is used to demonstrate concepts and functionality in the probabilistic XML DBMS. The possible world representation is used as a starting point and in subsequent sections we will show improvements on this general possible world representation.

4.3 Compact Representation

This section builds upon normal XML and the possible world model described earlier. We improve the storage model by reducing redundancy in storage. Our model is viewed as a tree, made up of nodes, containing subtrees. We distinguish between three different kinds of nodes to be able to store possibilities and associated probabilities. The use of three different kinds of nodes increases expressiveness, as we will later show. Since order is important in XML, we first introduce some notation for handling sequences.

Notational convention 1. Analogous to the powerset notation $\mathcal{P}A$, we use a power sequence notation $\mathcal{S}A$ to denote the domain of all possible sequences built up of elements of A. We use the notation $[a_1, \ldots, a_n]$ for a sequence of n elements $a_i \in A$ $(i = 1..n)$. We use set operations for sequences, such as $\cup$, $\exists$, $\in$, whenever definitions remain unambiguous.

We start by defining the notions of tree and subtree as abstractions of an XML document and fragment. We model a tree as a node and a sequence of child subtrees.



Definition 2. Let $n = (id, tag, kind, attr, value)$ be a node, with
• id the node identity,
• tag the tag name of the node,
• kind the node kind,
• attr the list of attributes, which can be empty,
• value the text value of the node, which can be empty.
Equality on nodes is defined as equality on all of their properties. Deep-equality on nodes is defined as equality on nodes and their subtrees. We indicate that a certain node n is a root node by $\bar{n}$. Except for equality, however, we abstract from the details of nodes.

Definition 3. Let $\mathcal{N}$ be the set of nodes. Let $T_i$ be the set of trees with maximum level i inductively defined as follows:

$T_0 = \{(n, \emptyset) \mid n \in \mathcal{N}\}$
$T_{i+1} = T_i \cup \{(n, ST) \mid n \in \mathcal{N} \wedge ST \in \mathcal{S}T_i \wedge (\forall T' \in ST \bullet n \notin N^{T'}) \wedge (\forall T', T'' \in ST \bullet T' \neq T'' \Rightarrow N^{T'} \cap N^{T''} = \emptyset)\}$

where $N^T = \{n\} \cup \bigcup_{T' \in ST} N^{T'}$. Let $T_{fin}$ be the set of finite trees, i.e., $T \in T_{fin} \Leftrightarrow \exists i \in \mathbb{N} \bullet T \in T_i$. In the sequel, we only work with finite trees.

Definition 3 requires the document to be a tree instead of a graph. A node has a sequence of child nodes, which can be empty, and can have only one parent. We define some functions to obtain a subtree. We obtain a subtree from a tree T by indicating a node n in T which is the root node of the desired subtree. We also define a function child that returns the child nodes of a given node in a tree.

Definition 4. Let $subtree(T, n')$ be the subtree within $T = (n, ST)$ rooted at $n'$:

$subtree(T, n') = T$ if $n = n'$, and $subtree(T'', n')$ otherwise,

where $T'' \in ST$ such that $n' \in N^{T''}$. For $subtree(T, n) = (n, [(n_1, ST_1), \ldots, (n_m, ST_m)])$, let $child(T, n) = [n_1, \ldots, n_m]$.

4.3.1 Probabilistic Tree

The central notion in our model is the probabilistic tree. In an ordinary XML document, all information is certain. In probabilistic XML each XML node can have zero or more possibilities, or alternatives. More generally, if we consider a node to be the root node of a subtree, then there may exist zero or more possibilities for an




entire subtree.

  [Figure 3 shows a probabilistic tree: a certain root node ‘persons’; below it, a probability node with possibility .7 (one ‘person’ with nm ‘John’ and tel a probability node with possibilities .5 ‘1111’ and .5 ‘2222’) and possibility .3 (two ‘person’ nodes: nm ‘John’ with tel ‘1111’, and nm ‘John’ with tel ‘2222’).]

Fig. 3 Example probabilistic XML tree

We model a probabilistic tree by introducing two special kinds of nodes:

1. probability nodes, and
2. possibility nodes, depicted as ◦, which have an associated probability.

The root of a probabilistic XML document is always a probability node. Children of a probability node are always possibility nodes and enumerate all possibilities. The probabilities associated with the possibility nodes sum up to at most 1, or all probabilities of sibling possibility nodes are unknown. Ordinary XML nodes are depicted as • and are always child nodes of possibility nodes. A probabilistic tree is well-structured if the children of a probability node are possibility nodes, the children of a possibility node are XML nodes, and the children of XML nodes are probability nodes. Using this layered structure, each level of the tree only contains one kind of node.
Figure 3 shows an example of a probabilistic XML tree. The tree represents an XML document with a root node ‘persons’ (which exists with certainty). The root node has either one or two child nodes ‘person’ (with probabilities .7 and .3, respectively). In the case there is only one child, the name of the person is ‘John’ and the telephone number is either ‘1111’ or ‘2222’. The probabilities for both phone numbers are uniformly distributed. The second case, where there are two persons with name ‘John’, is less likely if we consider names to be a key-like element. However, we can store this more unlikely situation, and in that case the information of both persons is certain, i.e., they both have name ‘John’ and one has telephone number ‘1111’ and the other has phone number ‘2222’.
Figure 3 can be seen as a possible result of two documents having been integrated. One document stating the telephone number of a person named ‘John’ to be ‘1111’, and the other stating the telephone number of a person named ‘John’ to be ‘2222’. It is uncertain if both represent the same person (in the real world). A data integration matching rule apparently determined that, with a probability of .7, they represent



the same person. Therefore, the combined knowledge of the real world is described accurately by the given tree.
A probabilistic tree is defined as a tree, a kind function that assigns node kinds to specific nodes in the tree, and a prob function that assigns probabilities to possibility nodes. The root node is defined to always be a probability node. A special type of probabilistic tree is a certain one, which means that all information in it is certain, i.e., all probability nodes have exactly one possibility node with an associated probability of 1.

Definition 5. A probabilistic tree PT = (T, kind, prob) is defined as follows:
• $kind \in (\mathcal{N} \to \{prob, poss, xml\})$
• $N^T_k = \{n \in N^T \mid kind(n) = k\}$
• $kind(\bar{n}) = prob$ where $T = (\bar{n}, ST)$
• $\forall n \in N^T_{prob} \bullet \forall n' \in child(T, n) \bullet n' \in N^T_{poss}$
• $\forall n \in N^T_{poss} \bullet \forall n' \in child(T, n) \bullet n' \in N^T_{xml}$
• $\forall n \in N^T_{xml} \bullet \forall n' \in child(T, n) \bullet n' \in N^T_{prob}$
• $prob \in N^T_{poss} \rightharpoonup [0, 1]$
• $\forall n \in N^T_{prob} \bullet ((\sum_{n' \in child(T,n)} prob(n')) = 1 \vee (\forall n' \in child(T, n) \bullet prob(n') = \bot))$

where $A \rightharpoonup B$ denotes a partial function. A probabilistic tree PT = (T, kind, prob) is certain iff there is only one possibility node for each probability node, i.e., $certain(PT) \Leftrightarrow \forall n \in N^T_{prob} \bullet |child(T, n)| = 1$.
To clarify definitions, we use b to denote a probability node, s to denote a possibility node, and x to denote an XML node.

Subtrees under probability nodes denote local possibilities. In the one-person case of Figure 3, there are two local possibilities for the phone number: it is either ‘1111’ or ‘2222’. The other uncertainty in the tree are the possibilities that there are one or two persons. Viewed globally, and from the perspective of a device with this data in its database, the real world could look like one of the following:
• one person with name ‘John’ and phone number ‘1111’ (probability .5 × .7 = .35),
• one person with name ‘John’ and phone number ‘2222’ (probability .5 × .7 = .35), or
• two persons with name ‘John’ and respective phone numbers ‘1111’ and ‘2222’ (probability .3).
We get these possible worlds by making a decision for one of the possibility nodes at each of the probability nodes. For this reason, we also refer to probability nodes as decision points.

Definition 6. A certain probabilistic tree PT′ is a possible world of another probabilistic tree PT, i.e., pw(PT′, PT), with probability pwprob(PT′, PT) iff
• $PT = (T, kind, prob) \wedge PT' = (T', kind', prob')$
• $T = (\bar{n}, ST_{\bar{n}}) \wedge T' = (\bar{n}, ST'_{\bar{n}})$



• $\exists s \in child(T, \bar{n}) \bullet child(T', \bar{n}) = [s]$
• $X = child(T, s) = child(T', s)$
• $\forall x \in X \bullet child(T, x) = child(T', x)$
• $B = \bigcup_{x \in X} child(T, x)$
• $\forall b \in B \bullet PT_b = subtree(PT, b) \wedge PT'_b = subtree(PT', b) \wedge pw(PT'_b, PT_b)$
• $\forall b \in B \bullet p_b = pwprob(PT'_b, PT_b)$
• $pwprob(PT', PT) = prob(s) \times \prod_{b \in B} p_b$

The set of all possible worlds of a probabilistic tree PT is $PWS_{PT} = \{PT' \mid pw(PT', PT)\}$. A probabilistic tree is a compact representation of the set of all possible worlds, but there is not necessarily one unique representation. The optimal representation is the one with the least number of nodes, obtained through a process called simplification.

Definition 7. Two probabilistic trees $PT_1$ and $PT_2$ are equivalent iff $PWS_{PT_1} = PWS_{PT_2}$. $PT_1$ is more compact than $PT_2$ if $|N^{PT_1}| < |N^{PT_2}|$. The transformation of a probabilistic tree to an equivalent more compact one is called simplification.

The number of possible worlds captured by a probabilistic tree is determined by the number of decision points and possibilities at those points. We also define a function leaf that returns all the leaf nodes of a tree. The number of possible worlds defined by the tree PT, $N^{PW}_{PT}(T)$, is equal to the number of possible worlds at node $\bar{n}$, defined by $N^{PW}_{\bar{n}}(T)$, where
• $leaf(T) = \{n \mid n \in N^T \bullet child(T, n) = \emptyset\}$
• $N^{PW}_n(T) = 1$, if $n \in leaf(T)$
• $N^{PW}_n(T) = \prod_{n' \in child(T,n)} N^{PW}_{n'}(T)$, if $kind(n) = poss$
• $N^{PW}_n(T) = \sum_{n' \in child(T,n)} N^{PW}_{n'}(T)$, if $kind(n) = prob$
• $N^{PW}_n(T) = \prod_{n' \in child(T,n)} N^{PW}_{n'}(T)$, if $kind(n) = xml$

4.4 Expressiveness As mentioned earlier, relational approaches often disallow dependencies among attributes. The higher expressiveness of the probabilistic tree makes such a restriction


96

A. de Keijzer .8 qq MMM.2 MMM qqq q ◦< < ◦ << << < <

nm •

tel • nm •

1

1

• John (a) PT 1

• 1111

1

• John

1

◦MMMMM M • tel nm •

tel •

1

1

• 2222

• John (b) PT 2

< .2 << ◦ ◦ .8

• 1111

• 2222

Fig. 4 Probabilistic XML tree equivalence

personq•MMM

q qqq <q < <

MMM < << ◦ ◦

person •

qq MMMMM M qqq q ◦<<<< ◦<<< <

< << ◦ ◦ person •

nm • nm • tel • tel • nm • tel • nm • tel • Jon 1111 2222 John John 1111 Jon 2222 (c) Uncertainty (a) Independence (b) Dependence about existence Fig. 5 Expressiveness of probabilistic tree model

unnecessary. Figure 5 illustrates three common patterns: independence between attributes (Figure 5(a)), where any combination of ‘nm’ and ‘tel’ is possible. The advantage in XML is that values only have to be stored once, if they are independent of other elements or values. The second pattern is dependency between attributes (Figure 5(b)), where only the combinations ‘John’/‘1111’ and ‘Jon’/‘2222’ are possible. In this case the value of one element depends on the value of another element. The last pattern is uncertainty about the existence of an object (Figure 5(c)). Here one possibility is empty, i.e., has no subtree. The meaning of this empty subtree is not that the value is unknown, but rather that the subtree simply doesn’t exist. These patterns can occur on any level in the tree, which allows a much larger range of situations to be expressed.

5 Providing Feedback When a query result is returned to the user, he is already involved with the system and feedback on the validity of the query result can easily be given[2]. Uncertainty can be reduced by giving feedback on query results. Because the user posing the query also observes the real world, he can determine whether certain query answers are for certain correct or incorrect. By giving feedback in such cases, the database may conclude that certain possible worlds can no longer be correct and eliminate


Data Integration Using Uncertain XML

97

them. Feedback, in contrast with both integration and updating, can never introduce new worlds, or new elements in worlds. This feedback mechanism removes most of the work at integration time, but it does add some work at query time. Providing feedback is not mandatory, and can be omitted. Of course, in that case, the amount of uncertainty is not reduced. The cycle of repeated observations and information integration introduces possible worlds, the cycle of repeated user feedback eliminates them. In this way, the uncertainty in the information in the database keeps reflecting the actual uncertainty about the state of affairs in the real world. Feedback reduces the amount of uncertainty in the document. For experiments and results on the effect of feedback, we refer to [6].

observations

Real Realworld world

External DBs

observations

obse rvatio ns

in da te ta gr at io n

Database possible worlds

query Possible Possible query Possible query answer query answer answer User Feedback

Fig. 6 Information Cycle

5.1 Types of Feedback Consider the query given previously asking for the phone number of persons named “John”. The answer (see Figure 7) is uncertain: either ("1111"), ("2222"), or ("1111","2222"). A user could readily verify these answers, for example, by calling one or both phone numbers and checking if the person on the other end of the line is named “John”. He could then indicate his findings by stating for some query results whether they are true or false in the real world. The goal of our user feedback technique is to use this information to update the information in the database accordingly, thus reducing uncertainty. We claim that a semantically correct way of doing this, is by invalidating entire possible worlds that disagree with the statement on the query result. For example, if a person named “John” picks up the phone when dialing “1111”, then this is apparently a correct answer, hence any possible world not producing “1111” as an answer can be eliminated. This leaves two possibly correct possible worlds. Note that stating that “1111” is a correct answer, does not imply that “2222” in the answer is incorrect, since “2222” may be the phone number of another person named “John”; this corresponds with the third possible world. We distinguish two types of feedback: positive and negative feedback. With negative feedback, the user indicates that one or more possibilities from the query result do not correspond with his knowledge of the real world. Positive feedback indicates that the user is certain that one or more possibilities from the query result correspond


98

A. de Keijzer .35

{

.3

.35

()*+ /.-, seq

/.-, ()*+ seq

/.-, ()*+ seq < <

,

1

1

tel • 1111

,

1

1

}

tel • tel • 1111 2222

tel • 2222

(a) Set of possible query results .35 qq SSSS.3 SSS qqq.35 S

◦q

()*+ /.-, seq

/.-, ()*+ seq

/.-, ()*+ seq < <

1

1

1

1

tel • tel • tel • tel • 1111 2222 1111 2222 (b) Query result as probabilistic tree Fig. 7 Probabilistic query result

with the real world. Let RWuser be a user’s certain knowledge of the real world. For simplicity, we represent RWuser with an XML tree, i.e. RWuser ∈ Tfin . Definition 8. Let Q q (PT) be a set of possible query answers for some query q and probabilistic XML tree PT, and S ∈ Q q (PT) In XQuery and XPath, a query answer is always a sequence, so we assume S to be a sequence. Negative feedback is a statement “a is false” for some a ∈ S. The meaning of this statement is a ∈ Q q (RWuser ). Analogously, positive feedback is a statement “a is true” meaning a ∈ Qq (RWuser ). Q q (PT) = {(”1111”), (”2222”), (”1111”, ”2222”)} in our example, which in a system will probably be represented as {”1111”, ”2222”}. The positive feedback that “1111” is a correct phone number means that ”1111” ∈ Qq (RWuser ), i.e. the user states that the combination (“John”, “1111’) is for certain known by him. As a result, all worlds represented by the database, that do not contain (“John”, “1111’) are deleted from the database.

5.2 Effect of Feedback As stated before, our approach is to invalidate, or rather eliminate, those possible worlds from the database that do not correspond with the user’s knowledge of the real world.


Data Integration Using Uncertain XML

99

Definitions 9 and 10 show how to construct the new possible worlds after giving positive and negative feedback, respectively. Definition 9. Let PT be the result of user feedback “a is true” for some database PT, query q, and a ∈ S, where S ∈ Q q (PT). PT is defined by PWSPT = {T ∈ PWSPT | a ∈ Qq (T)} Definition 10. Let PT be the result of user feedback “a is false” for some database PT, query q, and a ∈ S, where S ∈ Q q (PT). PT is defined by PWSPT = {T ∈ PWSPT | a ∈ Qq (T)} hh SSSSS P(T |PT ) SSSn ◦

P(T1 |PT ) hhhhhh h P(T |PT ) ◦hh 2

T1

T2

··· ···

Tn

Fig. 8 Construction of a probabilistic tree representation from its set of possible worlds

Observe that Definitions 9 and 10 show that we only need to eliminate possible worlds from the database. It is never necessary to create a possible world, create or delete a local possibility, or change or delete a part of a possible world. This is explained by the fact that, from the set of original worlds, we select only those worlds satisfying the feedback constraints, leaving out those that do not. As a result, whole possible worlds are either kept or deleted. Because feedback deletes entire worlds, it is a powerful mechanism and should be used with caution. We will address this more thoroughly in Section 5.5. We have defined PT′ by means of its possible worlds. Note that it is not hard to construct PT′ from the set of possible worlds: simply create a probability node with as many children as there are possible worlds, and attach each possible world, which is a certain probabilistic tree, as a subtree (see Figure 8). In this way, we obtain a probabilistic tree representing exactly this set of possible worlds. Any probabilistic tree equivalent to a PT′ constructed in this way, preferably the compact representation, can be used as the resulting database.

5.3 Recalculating Probabilities

When possible worlds are removed from the database as a result of feedback, the probabilities of all remaining possible worlds have to be recalculated. Unfortunately, the databases from which the probabilistic information source originated are typically unavailable. Therefore, re-integrating the sources taking feedback into account is not a viable approach. Instead, we recalculate the new probabilities based on the probabilities that the remaining possible worlds had in the original database. Below, we argue that the correct way of recalculation amounts to simple normalization.



Our notation P(T | PT) suggests that we consider the database PT as the universe. To emphasize this fact, we use the symbol U for the original database. Eliminating possible worlds from this universe means constructing a (new) database PT′. Let us first consider the case of a possible world T that is eliminated. Its probability P(T | PT′) is, of course, 0. In the other case, we can calculate the probability of the possible world in the new universe using the laws of conditional probability as follows:

P(T | PT′) = P(T ∧ PT′) / P(PT′) = P(PT′ | T) · P(T) / P(PT′)

Here P(PT′ | T) = 1, because we are considering the case that T is a member of the new universe, hence the existence of the new universe given possible world T is certain. The probability of the occurrence of the new database, i.e., the new set of possible worlds, is P(PT′) = ∑_{T∈PWS_PT′} P(T). Note that P(T) is the probability of T given our universe, hence P(T) = P(T | U). After substitution we finally derive

P(T | PT′) = 0                                        if T is eliminated
P(T | PT′) = P(T | U) / ∑_{T′∈PWS_PT′} P(T′ | U)      otherwise

As one can observe, the new probabilities can be obtained by simply normalizing the original ones; the calculation given above shows that this normalization semantically fits the possible-worlds approach.
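To make the mechanism concrete, the following is a minimal Python sketch of feedback processing as possible-world elimination followed by renormalization (Definitions 9 and 10 and the formula above). The world representation, the query function, and all names are illustrative assumptions, not part of the chapter's system; the sketch assumes at least one world survives the feedback.

    def apply_feedback(worlds, query, answer, positive=True):
        # worlds: dict mapping each possible world T to its probability P(T | U)
        # query:  function mapping a world to its set of query answers Q_q(T)
        # answer: the answer 'a' on which the user gave feedback
        if positive:   # "a is true": keep worlds that produce a
            kept = {T: p for T, p in worlds.items() if answer in query(T)}
        else:          # "a is false": keep worlds that do not produce a
            kept = {T: p for T, p in worlds.items() if answer not in query(T)}
        total = sum(kept.values())  # P(PT') = sum of surviving world probabilities
        # Eliminated worlds implicitly get probability 0; survivors are normalized.
        return {T: p / total for T, p in kept.items()}

    # The running example: three worlds with probabilities .35, .3, .35.
    worlds = {
        frozenset({("John", "1111")}): 0.35,
        frozenset({("John", "2222")}): 0.30,
        frozenset({("John", "1111"), ("John", "2222")}): 0.35,
    }
    phones_of_john = lambda w: {tel for name, tel in w if name == "John"}
    # Positive feedback "1111 is correct" keeps worlds 1 and 3, each now 0.5.
    print(apply_feedback(worlds, phones_of_john, "1111", positive=True))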

5.4 Properties of Feedback

For validation purposes, we analyze some desirable properties of our user feedback technique. This is a kind of analytical validation.

Property 1. Given an original database PT and a resulting database PT′ after user feedback, the amount of uncertainty does not grow, i.e.,

PWS_PT′ ⊆ PWS_PT

This property follows directly from Definitions 9 and 10.

Property 2. Given an original database PT and a resulting database PT′ after user feedback, the probabilities of the remaining possible worlds do not decrease:

∀T ∈ PWS_PT′ : P(T | PT′) ≥ P(T | PT)

This property follows from the formula derived in Section 5.3. T ∈ PWS_PT′ means that T is not eliminated. Since P(T | PT) = P(T | U), we conclude that P(T | PT′) is P(T | PT) divided by some number. Since ∑_{T∈PWS_PT} P(T | PT) = 1 and PWS_PT′ ⊆ PWS_PT (Property 1), this number is guaranteed to be larger than 0 and no larger than 1. Hence P(T | PT′) ≥ P(T | PT).

Property 3. The probabilities in the new database, P(T | PT′), indeed form a probability distribution, i.e.,

∑_{T∈PWS_PT′} P(T | PT′) = 1

The property follows directly from substituting the formula from Section 5.3:

∑_{T∈PWS_PT′} P(T | PT′) = ∑_{T∈PWS_PT′} ( P(T | U) / ∑_{T′∈PWS_PT′} P(T′ | U) )
                         = ( ∑_{T∈PWS_PT′} P(T | U) ) / ( ∑_{T′∈PWS_PT′} P(T′ | U) )
                         = 1

5.5 Give Feedback Carefully

We mentioned earlier that a database is a representation of the real world. Although this is true, there is a need for caution, because the real world changes; hence an observation of the real world at one moment can differ from an observation at a later time. Furthermore, knowledge about the real world is always incomplete. We denote the representation of the real world as captured in the database by RW. The representation of the real world as seen by the user at query time will be denoted by RWuser. Due to the possible (non-)overlap between the real-world knowledge of the database and that of the user, feedback to a query in terms of absolute statements should be given with caution. We will show different scenarios of mismatch in knowledge and their impact on the feedback process. Figure 9 shows four examples of observations of the real world by the database and by the user. The examples are restricted to sets of people with the same name; in this case we show all people named “John”. In each example, the left figure shows the people named “John” known by the database and the right figure shows the people named “John” known by the user. Figures 9(a) and 9(b) show an ideal situation, where both the database and the user have knowledge about the same persons. Even though their respective knowledge of these persons may differ, there is no significant mismatch between database and user, and the risk of wrong feedback is minimal. Figures 9(c) and 9(d) show the situation where the knowledge of the database and that of the user differ, but the number of real-world objects is equal. Here, the database has information on a person John2 and the user does not, while the user has information on a person John3 that is unknown to the database. In other words, RW ≠ RWuser. Feedback about the non-existence of John2 by this particular user could result in the deletion of all possible worlds containing John2, while in fact that person does exist, but is just not known to the user querying the database. The user should only give such negative feedback if he is certain that it is universal, i.e., that a database containing John2 is certainly incorrect. Figures 9(e) and 9(f), as well as Figures 9(g) and 9(h), show situations where the number of real-world objects known by the database is also different from that known by the user.


Fig. 9 Possible scenarios of DB–user (mis)match: (a) DB1, (b) Person1, (c) DB2, (d) Person2, (e) DB3, (f) Person3, (g) DB4, (h) Person4

In such cases, feedback on queries with aggregates is likely to produce unwanted results and should only be given with special care. Suppose a user poses the query

let $grp := distinct-values(//person/name)
for $n in $grp
return <group>{ $n, count(//person[./name eq $n]) }</group>

to see how many people with the same name he knows, i.e., how many are contained in the database. It could happen that the query result for a name differs from what he expects, and he would like to give feedback on this.



For example, in the situation of Figures 9(g) and 9(h), the query result for “John” is 2, but the user knows 3 persons named “John”. Here the user should be aware that any feedback should not only be a universal truth, but also something the database, with its incomplete knowledge, should know. Giving the feedback that the query result should be 3 would eliminate all possible worlds with fewer than or more than 3 persons named “John”; hence one could possibly end up with an empty database. Nevertheless, feedback can be a powerful mechanism for reducing uncertainty in the database if users (or application developers) use it with care, i.e., state only universal truths or falsehoods, and only in cases where a database with incomplete knowledge should have knowledge about the matter; in other words, the database should have possessed the correct information. Instead of deleting possible worlds that do not adhere to the feedback statements, the associated probability of invalid items can be set to 0. In this case, the item is not physically deleted, but will always be shown with a probability of 0. Queries can easily be constructed to exclude items with 0-probabilities.






Exploiting Vague Queries to Collect Data from Heterogeneous XML Sources

Bettina Fazzinga

Abstract. This chapter describes a framework for querying heterogeneous XML data sources that extends previous approaches to approximate query evaluation by providing techniques for combining partial answers coming from different sources. The approach does not rely on a global schema shared by the sources; instead, it automatically adapts the query to the available data, providing the user with the XML elements satisfying the query to a certain extent. Based on this framework, a query language is described which allows the collection of as much information as possible from several heterogeneous XML sources. An algorithm for approximately evaluating a query on a single source and a strategy for joining partial results coming from different sources are provided. Finally, an experimental validation of the approach in a peer-to-peer application scenario is presented.

1 Introduction

Nowadays, millions of people and corporations exploit the Internet and, more generally, networks for several purposes, such as retrieving, divulging, and sharing information. The need to ease the retrieval and exchange of data over networks has led to the development of techniques for automatically collecting and integrating information stored in different and multiple sources. One of the main issues arising in this field is the heterogeneity of the ways of structuring data, even when the data concern the same kind of information. In order to cope with this heterogeneity, a language allowing some flexibility in the definition of the structure of the data is needed. To this aim, the eXtensible Markup Language (XML) [38] has been proposed, and it is now the de facto standard language used for formatting data to be shared over networks.

Bettina Fazzinga
DEIS - Università della Calabria, Via P. Bucci - 87036 Rende (CS), Italy
e-mail: bfazzinga@deis.unical.it




XML query languages provide flexible mechanisms that are capable of resolving some differences among the actual data schemas. For instance, XPath [39] provides the descendant axis, which allows users to select elements that are direct or indirect children of a given element without specifying exactly the path connecting them. However, XPath still requires some knowledge of the data schema; for instance, an XPath query must use the exact terms appearing in the data when specifying conditions on element names. Moreover, standard query languages for XML are not capable of coping with the fact that data stored in the sources can be partial, in the sense that sources may describe the same data, but from different points of view, thus storing only partial information. For coping with the heterogeneity of the source schemas and the partiality of the information stored in the sources, several approaches have been proposed, mostly aimed at building a mediator system that hides from the user the number of sources and the differences among them, and interacts with all sources to provide the user with a meaningful answer. In this case, queries are expressed with respect to a global schema which is related to the local ones by means of mappings (see, e.g., [6, 22, 40]). Therefore, queries expressed with respect to the global schema are translated to comply with the local schemas. Other approaches do not use a global schema but require schema mappings to be specified between pairs of data sources. Queries are expressed w.r.t. a local schema and then propagated to other sources through proper translations [13, 34]. In general, however, mapping-based approaches limit the autonomy of data sources, since sources must share their own schemas and, in some cases, they are forced to store mappings from their schemas to those of their neighbors.

In this chapter, we consider the scenario where the user is not aware of the local data schemas, and no inter-schema mapping is provided. This way, total autonomy is guaranteed to data sources, and the main problem to be dealt with is information retrieval based on the specification of some properties of the objects to be retrieved. Obviously, since classical database-like queries are exact, they are expected to provide poor results in this setting. Several approaches have been proposed for approximate XML querying that add flexibility to XPath by automatically adapting queries to the available data. Each of these approaches adopts different semantics and different sets of transformations for adapting queries [2, 3, 4, 16, 33, 35]. Query transformation-based approaches have proved useful to tackle the problem of query answering over single XML documents. The problem gets more complicated when different sources provide information on the same subject from different points of view, i.e., by considering different properties of the same objects. Here, besides isolating users from the possibly complex interaction with data sources, query evaluation mechanisms must be capable of combining “partial” information provided by the sources to obtain results as complete as possible. Example 1 shows a possible scenario where the needed information is available but spread across different sources.

Example 1. Fig. 1 shows a scenario where partial descriptions of movies are provided by three XML data sources D1, D2, and D3, and each source employs a different schema and focuses on different properties. In particular, D1 focuses on movie



titles, years, and main actors; D2 looks at titles and actors for movies of the current year; and D3 has more details about actors' names and ages.

Fig. 1 Motivating example: the sources D1, D2, D3, the query q, and its relaxed versions q1–q5

In this scenario, a user

interested in finding information about 2009 movies starring Clive Owen may issue an XPath query q of the form //movie[actor='C. Owen'][year='2009']. The query is depicted in Fig. 1 (in a rounded box) as a tree pattern [23], where a box surrounds the output node. The exact evaluation of q over the fragments yields no result; even adapting q to the available data would retrieve a set of XML elements, each not providing enough information by itself. However, the sources provide enough information to correctly characterize the searched movies and therefore answer q: D3 has information about a movie starring Owen and D1 knows that that movie is a 2009 movie; moreover, D2 stores information about another movie of the current year (2009) starring Owen.
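As a small, self-contained illustration of why exact evaluation fails here, the following Python sketch runs the query against a simplified stand-in for source D3, using the standard library's ElementTree (which supports a limited XPath subset, Python 3.7+); the document content is invented for the example.

    import xml.etree.ElementTree as ET

    # A fragment in the style of D3: the actor name is nested under actor/name,
    # and no year element is present at all.
    d3 = ET.fromstring(
        "<db><movie><title>Duplicity</title>"
        "<actor><name>C. Owen</name><age>45</age></actor></movie></db>")

    # Exact query: actor must be a child whose full text equals 'C. Owen'
    # and a year child must equal '2009'. Neither condition holds in D3.
    print(d3.findall(".//movie[actor='C. Owen'][year='2009']"))  # -> []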



The aim and strategy proposed in this chapter are thus different from past work on querying heterogeneous XML data. Our objective is twofold: besides coping with the difference between query and data schemas, we aim at enabling the retrieval of objects that satisfy a query even if their descriptions are spread across different sources. As said before, we aim at providing a querying mechanism that requires neither semantic schema mappings nor explicit knowledge of the local schema used by each data source. Our proposed technique, whose logical phases are depicted in Fig. 2 (for the scenario of Fig. 1), is based on the idea of “vaguely” evaluating a query, i.e., relaxing some of its conditions, executing these transformed versions, grouping the retrieved partial answers (joining), and finally returning those groups which satisfy the query (selection). We call this process vague query evaluation.

Fig. 2 Logical phases of vague query evaluation

The transformation process is performed locally at each data source and is driven by the available data, in the sense that proper transformations are chosen to match the data. The transformation process should never produce queries that retrieve elements too different from those required by the user. Thus, costs can be associated with basic query transformation operations, and an answer is considered valid only if the overall transformation cost associated with the transformed query that retrieves it is under a certain (local) threshold. In the example, it is assumed that the transformation cost of query q2 is above the threshold, thus q2 is not actually executed on source D2. After evaluating transformed queries, retrieved elements that provide partial information about the same real-world objects are grouped. As will become clearer in the following, this process is based on the evaluation of semantic similarity among XML elements. In the example, elements e1 and e4 are grouped, as well as elements e3 and e5. Finally, the correspondence of the grouped elements with the original query is assessed, by comparing their “overall” transformation cost with a global threshold. In the example, it is assumed that the overall cost of {e3, e5} exceeds the global threshold (as neither e3 nor e5 contains information about the movie year), thus {e3, e5} is not part of the final query result. This evaluation mechanism allows us to retrieve the desired information even when it is disseminated over several sources. By distinguishing two levels at which query relaxation is applied, we are able to collect, from each source, pieces of information that would not completely satisfy the query by themselves, but that can be


Exploiting Vague Queries to Collect Data from Heterogeneous XML Sources

111

subsequently completed with information coming from other sources. Therefore, it is reasonable to allow the original query to be more deeply modified when it is evaluated locally at each source, whereas after joining the (partial) answers coming from each source, the elements obtained this way should better satisfy the conditions specified by the user. Hence, the local threshold is assumed to be less restrictive than the global one (i.e., its value is greater).

1.1 Plan of the Chapter

In this chapter we investigate the problem of retrieving XML data from multiple heterogeneous data sources that do not share a global schema and do not specify schema mappings. The main contribution of this chapter is the definition of an approximate querying approach based on XPath, which extends previous work on approximate tree patterns [2, 33] to the querying of multiple heterogeneous XML data sources. The rest of the chapter is organized as follows: Section 2 discusses related work, Section 3 presents our query language, Section 4 introduces our query evaluation strategy, Section 5 presents an application scenario and shows an experimental evaluation of our proposed techniques, and Section 6 draws some conclusions.

2 Related Work

Several approaches addressing the problem of querying heterogeneous XML data have been proposed recently, mainly based on some form of global schema summarizing all the information contained in the different sources [1, 6, 7, 10, 13, 22, 28, 32, 36, 40]. The global schema hides from the user the number of sources and the differences among them, and interacts with all sources to provide the user with a meaningful answer. The global schema is virtual in the sense that it is used for posing queries, but not for storing data. Mappings are established between the global schema and the source schemas, forming a two-tier architecture in which queries are posed over the global schema and evaluated over the underlying source data. Two main formalisms have been proposed for schema mediation: global-as-view (GAV), where the concepts in the global schema are defined as views over the concepts in the sources, and local-as-view (LAV), where the concepts in the sources are specified as views over the global schema. In many applications, the schemas of the source data are so different from one another that a global schema would be very hard to build, and to maintain over time. Hence, a solution is to provide architectures for decentralized data sharing. In this context, mappings between disparate schemas are provided locally between pairs or small sets of sources [9, 18, 34]. When a query is posed at a source, relevant data from other sources can be obtained through a semantic path of mappings. The key step in query processing in this kind of architecture is reformulating a query over other sources along the available semantic paths. When a source joins the network, it is necessary to build mappings with several neighbors in order to make it reachable through a “chain” of



semantic mappings for query processing. Therefore, the volatility of sources and the dynamism of the network become a critical aspect. Other approaches do not rely on the availability of a global schema and directly match an XML query to the schema of the target data, exploiting semantic similarity and some restricted form of structural similarity [11, 12, 20, 21]. Two other approaches that do not use a global schema are described in [14, 27]. [14] is one of the first approaches to the fuzzy evaluation of a query; it proposes a strategy where the answer to a query is a set of tuples associated with numbers indicating, for each tuple, how much the tuple satisfies the original query. This approach only considers conjunctive queries over relational databases. [27] describes a framework for feedback-driven XML query refinement. Specifically, queries are split into sub-queries, which are evaluated allowing the renaming of the specified labels. The result of each sub-query, along with a number representing the label satisfaction degree, is used to refine the labels specified in the original query. The approaches most similar to the approximate query evaluation described in this chapter are [2, 33], which exploit a form of tree pattern relaxation obtained through the application of transformations associated with costs. In both these approaches, a set of transformations applicable to a tree pattern query is defined, and a strategy for evaluating the relaxed versions of the query is proposed. In [2], the allowed transformations are node renaming, leaf deletion, edge relaxation, and subtree promotion. The last kind of relaxation allows a query subtree to be promoted so that the subtree is directly connected to its former grandparent by an ancestor-descendant edge. Node renaming is supported only w.r.t. fixed name hierarchies that must be provided. For each node and each edge of the query, the user specifies two weights, an exact weight and a relaxed weight, where the first is greater than the second. The former is the score associated with an exact match of the node or the edge, whereas the latter is the score associated when a relaxation is applied to the node or the edge. The score associated with a query is obtained by summing relaxed weights for nodes and edges on which some relaxation is applied and exact weights for non-relaxed nodes and edges. This means that a relaxed version of a query has an associated score that is lower than the score of the original one. The key idea underlying this approach is to encode all the possible query relaxations in a single query evaluation plan, so that only relevant answers are selected. Three algorithms are proposed: (1) Thres, which takes a weighted query tree pattern and a threshold and computes all the approximate answers whose scores are at least as large as the threshold; (2) OptiThres, an adaptive optimization of Thres that uses scores of intermediate results to dynamically “undo” relaxations encoded in the evaluation plan without compromising the set of answers returned; (3) TopK, which takes a weighted query tree pattern and finds the top-k approximate answers. [33] proposes a query language that allows node insertion, deletion, and renaming. Costs are associated with labels, and only node renamings completely specified by the user are allowed, independently of the available data.
Two polynomial-time algorithms that find the top-k answers to the query are presented: the first algorithm finds all the approximate results, sorts them by increasing cost, and prunes the result list after the k-th entry. The second algorithm is an extension of the first one. It uses



the schema of the database to estimate the best transformed queries, sorts them by cost, and executes them against the database to find the top-k results. The evaluation of a query is based on an expanded representation of the query that implicitly includes all the so-called “semi-transformed” queries. All embedding images of the semi-transformed query are computed using a bottom-up algorithm based on a list algebra. The approach described in this chapter extends the set of transformations proposed in the previous approaches, introducing three predicate transformations and node renaming without any user specification. Moreover, this approach, differently from the existing proposals for heterogeneous XML querying, besides applying query relaxations (without schema knowledge) in order to collect approximate answers, also employs techniques that properly assemble partial query answers obtained from different sources to provide the user with answers as complete as possible. Several approaches for combining partial results have been proposed, focusing in particular on duplicate detection and entity resolution [5, 17, 19, 24, 30, 31, 41], but they present some limitations. In [41], the edit distance between two trees is evaluated, defined as the minimum number of operations (node insertion, deletion, and renaming) required to transform one tree into the other. [17] presents a framework for approximate XML joins where upper and lower bounds are given for the edit distance, and reference sets are used to reduce the number of distances to be computed. Exploiting these two approaches to establish whether two elements are duplicates fails in the case that two elements referring to the same real-world entity have very different structures, as the value of the tree edit distance will be very high. The technique introduced in [19] exploits a Bayesian network model to compute the probability of two objects, represented by XML elements, being duplicates. The main assumption is that all XML documents comply with the same schema, or that a schema mapping step has been performed. Therefore, only differences concerning repeated nodes and strings are considered. In [24], structural dissimilarity between XML elements is taken into account, but under the assumption that XML elements can be duplicates only if their paths from the root are the same. The technique proposed in this chapter does not present any of the above-mentioned limitations and does not rely on any of the above-mentioned assumptions.

3 Vague Queries on Multiple Heterogeneous XML Data Sources

In this section we define a query language, named VXQL, whose flexibility enables users to find the information they are interested in, even when such information is disseminated over different XML data sources. The language is essentially an extension of XPath; we use the graphical formalism of tree patterns to represent XPath expressions in the examples. VXQL supports query relaxations similar to those proposed in [2, 33] and introduces transformations concerning textual predicates. Unlike these approaches, which adopt tree pattern formalisms to define query relaxations, our



language is directly based on XPath. Let s be an XPath step of the form axis::l[f]. The language supports the following basic transformations, applicable to the axis and to the node test l:

- node renaming: if l ≠ '∗', l is replaced with a different label l′;
- node deletion: s is replaced with descendant-or-self::∗[f];
- axis relaxation: if axis is child, s is replaced with descendant::l[f].

Moreover, transformations are also applicable to XPath predicates, which appear in the leaf nodes of the corresponding tree pattern (comparison predicates). Let f be an XPath predicate of the form text() op 'abc', where op is a comparison operator; the language supports the following transformations:

- ∗-node insertion: f is replaced by descendant-or-self::∗[f];
- relaxation of equality predicates: if the comparison operator used in f is =, f is replaced with contains(text(),'abc');
- predicate deletion: f is deleted.

Example 2. In the scenario of Fig. 1 (where deleted nodes are not shown), we can transform q in 5 different ways:

- q1 is obtained by renaming the root element from movie to film and removing the textual predicate of the actor element. This query captures element e1.
- q2 and q3 are obtained by transforming the child edge between movie and actor into a descendant one and removing the subtree rooted in the year element. Moreover, the textual predicate on the actor element is removed in q2. These queries capture elements e2 and e3.
- q4 and q5 are obtained by adding a ∗-labeled descendant to the actor element and removing the subtree rooted in the year element. These queries capture elements e4 and e5.

The costs for node renamings, node deletions, and axis relaxations are associated with query steps. In particular, a renaming cost represents the cost of renaming a node with a label having the maximum semantic distance from the original one. This cost is weighted by a semantic distance measure (such as the one provided in [37]) that evaluates the dissimilarity of a label in the query with respect to the labels in the data. The costs for the insertion of ∗-labeled nodes as descendants of leaf nodes, for the relaxation of equality predicates, and for the deletion of predicates are associated with predicates. Observe that the possibility of specifying the cost of each transformation allows users to give “priority” to some of the conditions expressed in the query. That is, specifying a high cost for a certain transformation t1 and a low cost for another transformation t2 means that elements which satisfy t1 are preferred over those satisfying t2. In general, transformation costs are expressed as natural numbers. Nevertheless, since when evaluating a VXQL query the overall costs of its transformed versions are compared with a given threshold, it is possible to define the cost of each transformation with respect to this threshold, and vice versa. Therefore, we just assume three predefined cost levels: high (denoted as h), medium (m), and low (l).
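To fix ideas, the following Python sketch enumerates the relaxed versions of a single step together with their costs; the function name, its parameters, and the semantic-distance callback are illustrative assumptions, not part of VXQL itself.

    def relax_step(axis, label, cost_ren, cost_del, cost_axis,
                   data_labels, sem_dist):
        """Yield (axis, label, cost) triples for one XPath step axis::label."""
        yield (axis, label, 0)                          # exact step, no cost
        if label != "*":                                # node renaming
            for l2 in data_labels:
                if l2 != label:
                    # renaming cost weighted by semantic distance (cf. [37])
                    yield (axis, l2, cost_ren * sem_dist(label, l2))
        if axis == "child":                             # axis relaxation
            yield ("descendant", label, cost_axis)
        yield ("descendant-or-self", "*", cost_del)     # node deletion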



In some cases, besides associating costs with applicable transformations, it can be useful to prevent some of the allowed transformations, for instance those removing or modifying important constraints of the original query, from being applied together, yielding query answers too different from the requested ones. VXQL supports this by allowing users to mark the more relevant transformations and specify a maximum number of marked transformations applicable to the query.

Example 3. Query q of Fig. 1, augmented with transformation costs represented as labels attached to nodes and edges, takes the form of Fig. 3, where the output node is surrounded by a solid box and marked transformations are surrounded by dashed boxes.

Fig. 3 A VXQL query: each allowed transformation t is annotated with its cost c (t : c); the legend distinguishes renaming (a→b), axis relaxation, node/predicate deletion, and insertion of a ∗-labeled node

The relaxations allowed by the query in Fig. 3 are:

1: renaming of the movie element;
2: transformation of the child axis between movie and actor into a descendant one;
3 and 4: removal of the actor and year elements;
5: insertion of a ∗-labeled node as a descendant of the actor element;
6 and 7: deletion of the predicates on the content of the actor and year elements.

By specifying a low cost for transformations 2 and 5, a medium cost for transformations 3, 4, 6, and 7, and a high cost for transformation 1, the user essentially states that the queries obtained by relaxing the requirement that the actor element be a child of the returned movie element, or by relaxing the requirement that the text C. Owen be directly contained inside the actor element, are preferred over the queries that remove the actor or year elements or their predicates. Moreover, the lowest preference is given to transformed queries which return elements named differently from “movie”. Since the deletions of the actor predicate and the year predicate are marked, if the maximum number of allowed marked transformations is set to 1, the elements gathered from each document must satisfy either the condition on the actor name or the condition on the movie year. Relaxed versions of the original query are obtained by applying different sequences of basic transformations, each sequence entailing a cost equal to the sum of the costs



of the single transformations. We compute the cost entailed by a relaxed version q′ of the original query q by taking the minimum among the costs entailed by all possible sequences of transformations that yield q′ from q. The same applies to the number of marked transformations.

Example 4. Consider query q of Fig. 3 and the relaxed queries q1, . . . , q5 of Fig. 1. The transformation costs entailed are:

a) cost of q1 = h + m;
b) cost of q2 = l + 3 · m;
c) cost of q3 = l + 2 · m;
d) cost of q4 = cost of q5 = l + 2 · m.

The numbers of marked transformations applied are:

a) 1 for q1, q3, q4, and q5;
b) 2 for q2.

VXQL allows a very fine-grained specification of transformation costs, e.g., a different cost can be specified for each transformation. In many practical cases, less-detailed cost specifications are enough for expressing users' needs. Indeed, if the user is not interested in specifying a different cost for each transformation, or does not have a precise idea of how costs can affect the results, she can choose one of the predefined query forms. Thus, we introduce three kinds of simplified query:

- exact query, where an infinite cost is assumed for all transformations;
- uniform-cost query, where the same cost is assumed for all transformations, and no transformation is marked;
- count-based query, where the same cost is assumed for all transformations, but some of them are marked and a maximum number of applicable marked transformations is specified.

It should be observed that uniform-cost queries are appropriate when the user has no knowledge about the relative importance of the conditions in the query. Count-based queries additionally allow the user to impose that no more than a certain number of conditions can be relaxed. In the case that the user is not satisfied with the obtained results, she can tune the predefined costs and restart the query evaluation process. This way, even a non-expert user can exploit the system's capabilities to collect interesting data from the sources through approximate query evaluation. As previously explained, the evaluation of a VXQL query q on a set of (heterogeneous) XML data sources requires evaluating q on each source, then combining the (partial) answers obtained from this evaluation, and finally filtering out combined answers which do not satisfy q. To capture this behavior, local and global cost thresholds (τl and τg) and maximum numbers of marked transformations applicable locally and globally (κl and κg) are specified.

Example 5. Consider the elements and the queries of Fig. 1, and the entailed costs computed in Example 4. Suppose that τl = l + 2 · m, τg = l, κl = 1, κg = 0. Elements


Exploiting Vague Queries to Collect Data from Heterogeneous XML Sources

117

e1, e3, e4, e5 are retrieved, since the costs of queries q1, q3, q4, q5 are equal to or less than τl (suppose that h + m is less than l + 2 · m) and the numbers of marked transformations applied to the queries are equal to or less than κl as well. Element e2 is not retrieved, since the cost of q2 exceeds τl and, moreover, the number of marked transformations applied to q2 is greater than κl. We can note that none of the elements e1, e3, e4, e5 can be considered separately as a final answer, since none of the costs of queries q1, q3, q4, q5 is lower than or equal to τg.

In order to facilitate choosing the global and local cost thresholds, denoted as τg and τl, VXQL allows users to express them as percentages of the maximum allowed cost of a transformed query. Moreover, we consider three predefined (global) cost thresholds, low, medium, and high, corresponding to 10%, 30%, and 50% of the maximum cost of a transformed query, respectively. In real-world scenarios, it is the responsibility of the system administrators to set the predefined thresholds, while users just choose their preferred thresholds when specifying queries. Users may also avoid specifying the local transformation threshold, as it can be set by default by increasing the global value by a fixed percentage (we use 25%).
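As an illustration, the three simplified query forms introduced above can be thought of as cost configurations; the following sketch is a hypothetical encoding, in which the field names, numeric values, and marked-transformation identifiers are invented for the example.

    INF = float("inf")

    # exact query: any transformation is effectively forbidden
    exact_query = {"default_cost": INF, "marked": set(), "max_marked": 0}

    # uniform-cost query: one flat cost, nothing marked
    uniform_cost_query = {"default_cost": 1, "marked": set(), "max_marked": None}

    # count-based query: flat costs, but at most one of the marked
    # transformations (here: the two predicate deletions of Fig. 3) may fire
    count_based_query = {"default_cost": 1,
                         "marked": {"del_actor_pred", "del_year_pred"},
                         "max_marked": 1}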

4 Evaluating Vague Queries

We now briefly describe the overall process of vague evaluation of a VXQL query over a set of sources. The process is composed of three main steps:

Local evaluation. The query is evaluated over each source by possibly applying suitable transformations. At each step of the evaluation, the evaluation algorithm tries to apply all possible transformations to a query step that make it match the available data while not violating the local thresholds.

Joining. The partial answers (XML elements) yielded by the local evaluation which are likely to refer to the same object are joined. This process is based on a function measuring the dissimilarity of the objects that two elements describe by looking at their keys. XML elements whose dissimilarity value is under a certain join threshold are grouped in sets (named vague XML elements). The output of this step is a set of vague XML elements, each representing a query answer and having an “overall” transformation cost.

Selection. Vague elements whose associated transformation cost is under the global thresholds are selected. As the joining step may produce vague elements that are subsets of others and thus needless in the final result, redundant vague elements are pruned from the result.

In the next sections, we give a more detailed description of the three phases of our approach.
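The following Python sketch summarizes the three phases as a single driver function; all helper functions (local_evaluate, join_partial_answers, overall_cost, merge, prune_redundant) are hypothetical placeholders for the components detailed in Sections 4.1–4.3, not an actual API.

    def vague_evaluate(query, sources, tau_l, kappa_l, tau_g, kappa_g, join_thr):
        # 1. Local evaluation: relaxed matches within the local thresholds.
        partial = []
        for src in sources:
            partial += local_evaluate(src, query, tau_l, kappa_l)  # [(elem, cost, marks)]
        # 2. Joining: group elements likely to describe the same object.
        vague_elements = join_partial_answers(partial, join_thr)
        # 3. Selection: keep groups whose overall cost respects the global thresholds.
        results = []
        for v in vague_elements:
            cost, marks = overall_cost(query, merge(v))
            if cost <= tau_g and marks <= kappa_g:
                results.append(v)
        return prune_redundant(results)  # drop groups subsumed by others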

4.1 Local Evaluation

In this section we describe our approach to the evaluation of a VXQL query against the local XML database of a source.



In our algorithm, we use a textual form for denoting a VXQL query. A VXQL query is an XPath expression with associated costs. An XPath expression exp is a sequence of XPath steps s1/.../sk such that each si is of the form ss[f1]...[fn], where ss is said to be a simple step and is of the form axis::l, and each fj is said to be a filter expression and is, in turn, an XPath expression. We evaluate a VXQL query by applying relaxations on the corresponding XPath expression, taking into account the specified costs and thresholds in order to avoid unnecessary relaxations. The strategy implemented in our algorithm consists in scanning the steps of the input XPath expression exp in order of occurrence in exp, and evaluating all the possible relaxations of each step against the examined document. The output of the algorithm is a set of XML elements that are exact answers to relaxed versions of the original expression. The intermediate results consist of sets of tuples, each tuple representing a node binding. In particular, a tuple ⟨n, c, m⟩ represents the binding of a node n that has been obtained by applying basic transformations entailing a total cost equal to c, m of which are marked. Function evExpression takes as input a node binding (in the first invocation, costs are 0 and the context node is the root of the document), an XPath expression exp, and the two local thresholds, and evaluates exp w.r.t. a set of bindings, by essentially invoking evStep on each step in exp. Function evStep evaluates a step w.r.t. a set of node bindings. Specifically, first the simple step is evaluated and the output set of node bindings is pruned. Second, every filter is evaluated, by recursively invoking function evExpression, since each filter is, in turn, an expression. The result of each filter evaluation, that is, a set of bindings having as nodes the context nodes and as costs the costs resulting from the filter evaluation, is given as input to the subsequent filter evaluation. Finally, the result of the last filter evaluation is returned. Function prune is used to eliminate useless bindings. More precisely, for each binding of an XML element n, function prune eliminates every other binding of n with higher costs. Function evSimpleStep returns a set Bnew of new node bindings, representing the result of evaluating all the possible relaxations of ss. Obviously, function evSimpleStep applies a relaxation only if the resulting costs are not greater than the thresholds, and it provides new bindings having costs properly updated on the basis of the applied relaxations. Function transfCost returns the cost specified in the query for the input step and the input relaxation. Functions MarkDel, MarkAxisRel, and MarkRen evaluate to true if the corresponding relaxations (i.e., node deletion, axis relaxation, and node renaming, respectively) are marked for the input step in the query. Function retrieve accesses the document and retrieves the nodes satisfying the specified label and reachable from the context node n through the specified axis. In the case that the input label is sim(l), function retrieve retrieves nodes reachable from the context node n through the specified axis and whose label is semantically similar to the specified label. This means that the dissimilarity d between the specified label and the data label is measured, and it is checked that the current cost c plus d multiplied by the cost r specified for the renaming relaxation in the step does not exceed τl (i.e., c + d · r ≤ τl).



Predicate relaxations are applied to the current step in the same way as the above relaxations and, for the sake of conciseness, their evaluation is not described.

Algorithm 1

function evExpression(⟨r, 0, 0⟩, s1/.../sk, τl, κl)
  B1 = evStep({⟨r, 0, 0⟩}, s1, τl, κl)
  for each i ∈ [2..k] do
    Bi = evStep(Bi−1, si, τl, κl)
  end for
  return Bk

function evStep(B, ss[f1]...[fh], τl, κl)
  Bnew = prune(evSimpleStep(B, ss, τl, κl))
  for each i ∈ [1..h] do
    B′ = ∅
    for each ⟨n, c, m⟩ in Bnew do
      B′′ = evExpression(⟨n, c, m⟩, fi, τl, κl)
      for each ⟨n′, c′, m′⟩ in B′′ do
        B′ = B′ ∪ {⟨n, c′, m′⟩}
      end for
    end for
    Bnew = prune(B′)
  end for
  return Bnew

function prune(B)
  Bnew = ∅
  for each ⟨n, c, m⟩ ∈ B do
    if (¬∃⟨n, c′, m′⟩ ∈ B | (c′ < c and m′ ≤ m) or (c′ ≤ c and m′ < m))
      Bnew = Bnew ∪ {⟨n, c, m⟩}
  end for
  Bnew = removeSameCost(Bnew)
  return Bnew

function evSimpleStep(B, axis::l, τl, κl)
  Bnew = ∅
  for each ⟨n, c, m⟩ ∈ B do
    Bnew = Bnew ∪ retrieve(n, axis, l)
    if (!MarkRen(axis::l) or m + 1 ≤ κl)
      Bnew = Bnew ∪ retrieve(n, axis, sim(l), τl)
    if ((transfCost(axis::l, axisRelaxation) + c ≤ τl) and
        (!MarkAxisRel(axis::l) or m + 1 ≤ κl))
      Bnew = Bnew ∪ retrieve(n, descendant, l, τl)
    if ((transfCost(axis::l, axisRelaxation) + c ≤ τl) and
        (((!MarkAxisRel(axis::l) or m + 1 ≤ κl) and (!MarkRen(axis::l) or m + 1 ≤ κl))
         or (MarkAxisRel(axis::l) and MarkRen(axis::l) and m + 2 ≤ κl)))
      Bnew = Bnew ∪ retrieve(n, descendant, sim(l), τl)
    if ((transfCost(axis::l, deletion) + c ≤ τl) and
        (!MarkDel(axis::l) or m + 1 ≤ κl))
      if (MarkDel(axis::l)) then m = m + 1
      Bnew = Bnew ∪ {⟨n, c + transfCost(axis::l, deletion), m⟩}
  end for
  return Bnew
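As a runnable counterpart, the following Python sketch implements the dominance check performed by function prune on (cost, marked-count) pairs; the tuple representation of a binding is an assumption made for illustration.

    def prune(bindings):
        """bindings: iterable of (node, cost, marks) tuples."""
        bindings = list(bindings)
        kept = []
        for n, c, m in bindings:
            # a binding is dominated if another binding of the same node is
            # strictly better in cost or in marked-transformation count
            dominated = any(
                n2 == n and ((c2 < c and m2 <= m) or (c2 <= c and m2 < m))
                for n2, c2, m2 in bindings)
            if not dominated:
                kept.append((n, c, m))
        # removeSameCost: keep one binding per (node, cost, marks) combination
        return list(dict.fromkeys(kept))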

4.2 Joining Partial Results

The semantics of VXQL is in general independent of the particular technique adopted to measure the semantic dissimilarity between two XML elements. Any technique whose aim is to assess whether two elements refer to the same real-world object can be used. In this section, we start by discussing possible approaches to this problem; then we describe our proposed dissimilarity evaluation function.

e1

actor

actor

name

age

e2

movies

name

Clive Owen

movie

Clive Owen

movie

title

year

title

year

The International

2009

Inside Man

2006

actor

e3

45

ID

name

nationality

Clive Owen

England

filmografy … movie

title The International

movie

title Elizabeth: The Golden Age

Fig. 4 Three elements describing the same object

city London



In general, different sources may employ different representations of the same information, so it is difficult to establish whether differently-structured XML elements refer to the same object. A naive strategy is to measure the structural or content dissimilarity between (whole) elements. The problem with this approach is that it suffers when different XML structures are used; for instance, elements e1 and e2 in Fig. 4 are very different, although they definitely refer to the same actor. A more suitable approach would be that of evaluating the dissimilarity between keys, if available. This way, elements e1 and e2 in Fig. 4, having the same key (name), are recognized to be similar. Nevertheless, since keys are defined locally by each source, elements referring to the same object could be identified by keys having different structures. Specifically, when comparing elements' keys, the possibility that two elements refer to the same object but are identified by completely different keys must be taken into account (for instance, elements e2 and e3 in Fig. 4, as the key of e2 is name whereas the key of e3 is ID). In these cases, any dissimilarity function that looks at key dissimilarity, even by applying an edit distance-based approach [41, 17], fails in recognizing elements referring to the same object. A more effective approach is that of testing whether the information represented by the key of one element is contained in the other, or vice versa. The XML dissimilarity measure we currently adopt is based on this idea. We exploit VXQL queries to measure the dissimilarity degree of the objects described by two XML elements. Specifically, given two XML elements e and e′, our approach uses VXQL queries to check whether the information contained in the key of e is “represented” in e′, and vice versa. That is, if either the information in the key of e is represented in e′, or the information in the key of e′ is represented in e, we say that e and e′ refer to the same object. For instance, consider elements e2 and e3 in Fig. 4. It can be noted that the keys identifying the two elements are completely different, but the information contained in the key of e2 is fully contained in e3. In this case, we conclude that e2 and e3 refer to the same actor. In order to check whether the information contained in the key of e is represented in e′, we first translate the key of e into a key-testing VXQL query and then execute it on e′ (named the target element). The VXQL query associated with the key of e must consider any relevant information represented in the key of e, while allowing this information to be represented differently in e′. The relative order of subelements in the key is disregarded in the VXQL query, since it is unusual to represent key information using the relative order of subelements (this feature is not supported by XML key constraint languages). Moreover, some flexibility is provided in the execution of the query. Specifically, lower weights are associated with transformations which only alter the structure of the key, retaining the semantics of its information. The key-testing VXQL query associated with an element e does not allow deleting steps that correspond to textual predicates or leaf elements, since the information contained in these parts of e is the most important when assessing the dissimilarity between the key and a target element. We instead assume a unitary cost for axis relaxation, ∗-node insertion, and relaxation of equality predicates, since these transformations only modify the structure of the key. Moreover, the key-testing VXQL query associated with an element e assigns a higher cost to node renaming, because



this transformation may correspond to modifications of the semantics of elements in the key. Finally, (internal) node deletion essentially corresponds to applying both node renaming and axis relaxation. Hence, the cost assigned to node deletion in the key-testing VXQL query associated with an element e is greater than the sum of the costs assigned to node renaming and axis relaxation.

Example 6. The key-testing VXQL query associated with element e2 in Fig. 4 is shown in Fig. 5. In this case, the cost of axis relaxation, equality predicate relaxation (denoted as ∼), and ∗-node insertion is 1, the cost of node renaming is 2, and the cost of node deletion is 4.

Fig. 5 Key testing VXQL query associated with element e2 in Fig. 4
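Following Example 6, the cost assignment of a key-testing query can be summarized as a simple configuration; this is a sketch with dictionary keys invented for the example (only the numeric values come from Example 6).

    KEY_TESTING_COSTS = {
        "axis_relaxation": 1,           # structural change only
        "star_node_insertion": 1,       # structural change only
        "equality_relaxation": 1,       # '=' relaxed to contains()
        "node_renaming": 2,             # may change the semantics of the key
        "node_deletion": 4,             # > renaming + axis relaxation (2 + 1)
        "leaf_or_predicate_deletion": float("inf"),  # never allowed
    }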

Given two XML elements e and e′, and the minimum-cost relaxed version rxp of the key-testing VXQL query obtained from the key of e such that e′ satisfies rxp, we take the cost of rxp as a measure of the semantic dissimilarity between e and e′. For instance, consider elements e2 and e3 of Fig. 4; since the key-testing VXQL query obtained from the key of e2 is satisfiable on e3 applying no transformation at all, we conclude that e2 and e3 refer to the same actor. In this phase, we build sets of XML elements referring to the same object, by putting in the same set the XML elements whose dissimilarity value, evaluated by means of the above key-testing VXQL queries, is less than a given threshold. Specifically, we evaluate the dissimilarity among the XML elements (belonging to different sources) retrieved in the local evaluation phase, and we group those elements that are judged to represent the same real-world object in sets called vague elements.
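A minimal sketch of this grouping step, assuming a dissimilarity callback implementing the key-testing measure above; the greedy single-link grouping strategy is an illustrative choice, not prescribed by the chapter.

    def build_vague_elements(elements, dissimilarity, join_threshold):
        """Group elements whose pairwise dissimilarity stays under the threshold."""
        groups = []
        for e in elements:
            for g in groups:
                if any(dissimilarity(e, e2) <= join_threshold for e2 in g):
                    g.append(e)   # e is judged to describe the same object as g
                    break
            else:
                groups.append([e])  # e starts a new vague element
        return groups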

4.3 Selection of Final Results

In this phase, we examine the vague elements produced in the joining phase and select only those that satisfy the original query to a reasonable extent; that is, we compute the overall cost of each whole set w.r.t. the original query, and select only those vague elements whose overall cost is lower than or equal to the global thresholds. Given a vague element v, as each element in v describes a certain object o, it is reasonable to merge the elements in v to provide a single description of o. This can be achieved by just concatenating the contents of the elements in v. Function



merge applied to a vague element v returns the concatenation of the contents of the elements e ∈ v. Observe that the same information can be reported several times in merge(v), and that the relative order among the contents of the elements merged together is irrelevant to computing their correspondence w.r.t. the original query. Intuitively, the relevance of a vague element v w.r.t. a VXQL query q can be assessed by executing q on merge(v). The minimum cost of a relaxed version of the query that matches the merged element represents the cost of the whole vague element.

Example 7. Consider elements e1 and e4 of Fig. 1, and suppose that they are recognized as referring to the same object in the joining phase, and thus are put in a vague element v. The merged element e obtained by joining all the elements of v (suppose that v only contains e1 and e4) is shown in Fig. 6. In this case, the relaxed version q′ (shown in Fig. 6) of the original query q matching the merged element e entails a cost equal to l with no marked transformations applied, so the cost of evaluating q on e is l. Supposing that the local and global thresholds are those of Example 5, we obtain that the merged element obtained by joining e1 and e4 is a final result, since the cost and the number of marked transformations applied are not greater than τg and κg, respectively. We can easily note that the merged element e′ obtained by joining e3 and e5 is not a final result, since the minimum overall cost of a query matching e′ is l + 2 · m, which is greater than τg.

movie

title The International

actor

e

title

year 2009

N. Watts

name

age

C. Owen

45

q’

movie

actor

actor

The International

year 2009

* C. Owen

Fig. 6 The merged element obtained by joining e1 and e4 of Fig. 1 and the query matching it
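A sketch of this selection phase, under the same assumptions as the earlier pipeline sketch (merge and the minimum-cost evaluator are hypothetical placeholders):

    def select_final(vague_elements, query, tau_g, kappa_g, evaluate_min_cost):
        """Keep only vague elements whose merged description satisfies the
        query within the global thresholds."""
        final = []
        for v in vague_elements:
            merged = merge(v)  # concatenation of the contents of the elements in v
            cost, marks = evaluate_min_cost(query, merged)
            if cost <= tau_g and marks <= kappa_g:
                final.append(v)
        return final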



5 An Application Scenario

The techniques proposed in this chapter have been implemented in a peer-to-peer (P2P) application scenario [15], a typical scenario where mapping-based approaches fail due to the volatility of the sources and the strong requirement of maximum source autonomy. In particular, we considered a hybrid P2P system [8, 25, 26], where some distinguished peers (super-peers) act as resource information indices that maintain meta-information about the resources made available by the different peers, and are possibly organized in P2P networks themselves. The system implements the architecture shown in Fig. 7. In particular, the left-hand side of the figure depicts the modules implemented by peers, and its right-hand side depicts those implemented by super-peers. Each peer is connected to a single super-peer.

Fig. 7 System architecture (peer side: querying API/user interface, query engine with local and global sub-engines, local XML repository, synopsis builder, P2P network sublayer; super-peer side: routing module, synopsis repository, P2P network sublayer)

Besides the underlying database management subsystem, the architecture of peers comprises four main modules: the P2P network sublayer, the Synopsis builder, the Querying API/User interface, and the Query engine. The P2P network sublayer manages the interactions with the underlying network. The synopsis builder computes concise representations of the stored XML data (whose structure will be detailed in the following) and sends them to the super-peer of reference, through the P2P network sublayer. The querying API/user interface module manages the interactions with users. It provides an API for submitting queries in their textual form and collecting results. A user interface allows the user to (i) specify queries in both graphical and textual form; (ii) obtain a graphical representation of the results as they are received (as will become clearer in the following, the system aims at first contacting the peers that are more likely to provide results); (iii) decide, on the basis of his/her degree of satisfaction, when to stop the process. The query engine implements the query evaluation algorithm and the logic for combining partial answers coming from different sources. These functionalities are managed separately by two submodules:



- The Local query engine applies the vague query evaluation process over the local XML database, producing partial answers. Such answers may bring along the information which is subsequently used to evaluate the degree of dissimilarity among different XML elements. The results of the local query evaluation process are returned to the global query engine if the query was submitted to the local peer; otherwise they are sent back through the P2P network sublayer. The local query engine also connects to an external ontology (not shown in the figure) that provides the semantic distance function between two element names.
- The Global query engine is employed when a query is issued locally. It forwards the query to the super-peer of reference and collects answers through the P2P network sublayer, then completes the global query evaluation process by joining the partial results obtained and returning them to the user through the querying API.
The architecture of super-peers comprises three main modules: the Synopsis repository, the P2P network sublayer, and the Routing module. The P2P network sublayer receives data synopses from peers and stores them into the repository. Moreover, it receives vague queries from peers and passes them to the routing module. The routing module works in co-operation with the other super-peers. It gathers data synopses from its local repository and from the repositories of other super-peers, then it applies a routing strategy that, by exploiting the information in the synopses, is capable of (i) reducing the number of queries issued on non-relevant peers, i.e., peers whose local schema ensures that the local query evaluation would not provide results; (ii) giving priority to peers that will possibly provide more results. Thus, the routing strategy brings several benefits. First, it reduces the overall load posed on the system (even though this, by itself, does not improve query answering performance). Second, in the case of highly-loaded systems (both in terms of computational and network load), the load can be better tailored to the actual user needs, by first posing queries to the peers that are more likely to provide answers. This way, interactive querying sessions can be supported, where the user is quickly presented with the first obtained answers and the retrieval lasts only as long as needed. The routing strategy is described in the next section.

5.1 Routing Strategy Our proposed routing strategy uses the XSketch data synopses proposed in [29]. The XSketch synopsis associated with an XML document is a graph whose nodes represent sets of elements with the same name. Each node in the synopsis is annotated with the cardinality and the shared element name of the corresponding set. An edge between two nodes n1, n2 represents a parent-child relationship between an element in n1 and an element in n2. Moreover, the edge from n1 to n2 is labeled with F iff every element in n1 has at least one child in n2; the edge is labeled with B iff, for every element in n2, its parent is in n1. For instance, in the document represented by the synopsis in Fig. 8 (for the sake of readability, textual nodes are not represented in the figure), (i) there are 19 movie and 8 series elements; (ii)




each movie and series has a title; (iii) each series has a season, and seasons are children of series elements only; (iv) the 64 actor elements are children of both movie and series elements, and all of the 64 name elements are children of actor elements.


Fig. 8 An example XSketch synopsis

An XSketch synopsis can be exploited to estimate the selectivity of an XPath expression, that is, the number of XML elements selected by the expression. In general, the selectivity estimation of an XPath query q using a synopsis S is performed by first computing the whole set of embeddings of q in S, then summing up the selectivity associated with each embedding. In particular, our system uses the algorithm proposed in [29] to compute the selectivity of an XPath query q w.r.t. a node n of the synopsis, denoted as sel(q, n). Selectivity estimation is used by the query routing module to compute an overall score given to a synopsis with respect to a VXQL query. This score is then employed to drive routing decisions, i.e., more priority is given to the peers whose synopses exhibit higher scores. For each node in the synopsis, the selectivity of the transformed versions of the query w.r.t. the node is computed. Since a transformed query may not represent all the original query conditions, we weigh the selectivity associated with a node in the synopsis with the "relative" cost of the transformed query which selects the node. Given a VXQL query q and a relaxed query q′ obtained from q, the relative transformation cost of q′ is given by the cost for transforming q into q′ (denoted as cost(q, q′)) divided by the maximum transformation cost of q (denoted as maxCost(q)). The score given to a synopsis S w.r.t. a VXQL query q is defined as follows:

$$\mathit{score}(q, S) = \sum_{n \in S} \max_{q' \in \mathit{RelVers}(q, \tau_l, \kappa_l, n)} \mathit{sel}(q', n) \cdot \left(1 - \frac{\mathit{cost}(q, q')}{\mathit{maxCost}(q)}\right)$$

where RelVers(q, τl, κl, n) denotes the set of all the relaxed versions of q matching n ∈ S whose cost is not greater than the local thresholds. Note that the formula correctly rules out the nodes that no relaxed version of q matches under the cost thresholds: such nodes contribute no score.
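The following Python sketch (our own illustration; function and variable names are hypothetical, not part of the system) shows how this score could be computed, given per-node selectivity estimates and relative transformation costs:

```python
def synopsis_score(nodes, rel_vers, sel, cost, max_cost):
    """Score of a synopsis w.r.t. a VXQL query q.

    nodes:    the nodes of the synopsis S
    rel_vers: rel_vers(n) -> relaxed versions of q matching n whose cost
              respects the local thresholds (RelVers(q, tau_l, kappa_l, n))
    sel:      sel(q1, n) -> estimated selectivity of query q1 at node n
    cost:     cost(q1)   -> cost of transforming q into relaxed version q1
    max_cost: maximum transformation cost of q
    """
    score = 0.0
    for n in nodes:
        versions = rel_vers(n)
        if not versions:   # no relaxed version under the thresholds:
            continue       # the node contributes nothing to the score
        score += max(sel(q1, n) * (1 - cost(q1) / max_cost)
                     for q1 in versions)
    return score
```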



<!ELEMENT University (Professor*)>
<!ELEMENT Professor (name, city, course*, article*)>
<!ELEMENT course (name, year, argument*, book*)>
<!ELEMENT article (title, volume, year, co-author*)>
<!ELEMENT book (title, author*, editor*, publisher*)>

<!ELEMENT University (FullProfessor*, AssociateProfessor*, AssistantProfessor)>
<!ELEMENT FullProfessor (PersonalData)>
<!ELEMENT AssociateProfessor (PersonalData)>
<!ELEMENT AssistantProfessor (PersonalData)>
<!ELEMENT PersonalData (code, name, age, city, course*, paper*)>
<!ELEMENT course (name, year, topic*, credits)>
<!ELEMENT paper (title, volume, year, keyword*)>

<!ELEMENT University (Employee*)>
<!ELEMENT Employee (Professor|GenericStaff)>
<!ELEMENT Professor (code, name, city, course*, researchActivity*)>
<!ELEMENT researchActivity (topic, journalPaper*, conferencePaper*)>
<!ELEMENT course (name, year, topic*, numberOfStudents)>
<!ELEMENT journalPaper (title, volume, year, co-authors*)>
<!ELEMENT conferencePaper (title, titleConference, year, co-authors*)>

Fig. 9 Example DTDs

Table 1 Queries used in the experiments

Query ID  Meaning
Q1        Professors teaching a course having a specific argument and a specific number of credits
Q2        Professors who are author of an article having a specific keyword
Q3        Professors who are co-author of a certain professor in a certain year
Q4        Professors of a specific city teaching a course adopting a specific book
Q5        Courses having a specific number of students and having a specific number of credits
Q6        Full professors of a specific age whose research activity includes a specific topic

5.2 Experimental Evaluation In this section we describe the experimental evaluation we performed to assess the effectiveness of our proposed techniques in the previously-described P2P scenario. The setting of the experiments is described by the following parameters:
- the system was composed of a network of 20 Pentium IV machines, with RAM ranging from 1GB to 2GB;
- the peers provided data about professors, courses, and articles, and adopted 8 different schemas and differently-structured keys comprising personal and fiscal data (Fig. 9 shows 3 example DTDs; the labels in the original DTDs are in Italian);
- 6 uniform-cost queries, with different degrees of selectivity, were issued against the system; the queries are reported in Table 1;
- 4 of the 20 peers acted as super-peers and were part of a fully-connected network;
- the data had an overall size of 60MB;




Fig. 10 Correct answers returned with different thresholds

Table 2 Precision, recall, and gain obtained

        High threshold           Medium threshold         Low threshold
Query   Prec.   Recall  Gain     Prec.   Recall  Gain     Prec.   Recall  Gain
Q1      88.9%   80.0%   60.0%    94.1%   80.0%   60.0%    100.0%  75.0%   50.0%
Q2      93.8%   95.7%   40.6%    95.5%   89.4%   31.3%    100.0%  89.4%   31.3%
Q3      87.5%   70.0%   133.3%   87.5%   70.0%   133.3%   85.7%   60.0%   100.0%
Q4      80.0%   80.0%   100.0%   75.0%   60.0%   50.0%    100.0%  40.0%   0.0%
Q5      92.3%   85.7%   71.4%    92.3%   85.7%   71.4%    91.7%   78.6%   57.1%
Q6      91.7%   86.8%   57.1%    93.9%   81.6%   47.6%    96.8%   78.9%   42.9%

Table 3 Average precision, recall, and gain

            High threshold   Medium threshold   Low threshold   Exact
Precision   91.4%            93.3%              97.2%           99.0%
Recall      87.3%            82.8%              79.1%           54.1%
Gain        56.0%            48.0%              41.3%           –

- three different global cost thresholds were employed, corresponding to 50% (high), 30% (medium), and 10% (low) of the maximum cost of the transformed versions of the queries;
- the local cost threshold was set equal to the global one increased by 25%;
- the timeout was set to 2 minutes.
Fig. 10 shows the number of correct answers returned. Specifically, for each of the 6 queries considered, the diagram reports the number of actual objects satisfying the query, the number of correct answers retrieved through vague evaluation when varying the cost threshold, and the number of correct answers retrieved through exact evaluation. The number of objects satisfying each query has been computed by manually translating the queries to the schemas used by the sources. A vague element is considered an incorrect answer if it contains an element describing an object that is not an answer to the query, or if it contains two elements describing different objects.



The results show that, in all cases, relaxed queries allow the retrieval of significantly more answers than exact queries (48.4% more on average). In many cases, we obtain a number of distinct correct answers that is close to the number of objects in the data set; this shows the effectiveness of our joining technique. Regarding the choice of a threshold, we can conclude that a medium threshold is a good compromise between correctness and completeness. Furthermore, with a low threshold we obtain a very high precision (close to that obtained by exact evaluation) and still a considerable gain in terms of correct answers (41.3%). This shows the effectiveness of our technique when the user gives more importance to correctness than to completeness of results. Table 2 reports, for each query and for each threshold, the precision obtained, defined as the ratio between the number of correct answers and the total number of answers, and the recall, defined as the ratio between the number of correct answers and the number of objects satisfying the query. Note that, if more than one vague element in the query result refers to the same object, these vague elements are not counted separately when computing the recall. The table also reports a value that indicates the increase in the number of correct answers obtained through vague evaluation. This value, called gain, is defined as ans/exAns − 1, where ans is the number of correct answers to the query, and exAns is the number of correct answers to the exact version of the query. Table 3 reports the average precision, recall, and gain. The experiments show that our approach is able to retrieve and properly combine data from heterogeneous sources, providing high precision (94% on average) and recall (83.1% on average). We also evaluated the effectiveness of our proposed routing strategy by looking at how the number of partial answers returned by peers is related to the score given to their synopses. Fig. 11 reports the percentage of partial answers retrieved as the evaluation proceeds; the X-axis reports the percentage of peers already contacted (we recall that peers are contacted in decreasing score order). We averaged the values over the 6 queries with the medium threshold. The results obtained show that the routing policy gives proper priority to the peers that are more likely to contribute to the query results. Specifically, in the case depicted in the figure, almost 80% of the total number of answers are returned to the user after having accessed just 68% of the contributing peers.

Fig. 11 Effect of the routing policy
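For reference, the measures reported in Tables 2 and 3 can be computed as follows (a Python sketch; the sample counts are hypothetical, chosen so that they reproduce the percentages reported for Q3 under the high threshold):

```python
def precision(correct, returned):    # correct answers / total answers
    return correct / returned

def recall(correct, satisfying):     # correct answers / satisfying objects
    return correct / satisfying

def gain(ans, ex_ans):               # increase w.r.t. exact evaluation
    return ans / ex_ans - 1

# 7 correct answers out of 8 returned, 10 objects satisfying the query,
# and 3 correct answers under exact evaluation:
print(precision(7, 8), recall(7, 10), gain(7, 3))  # 0.875 0.7 1.333...
```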



6 Conclusions In this chapter we focused on the retrieval of XML data from multiple heterogeneous sources. We presented the main issues arising in this field, and we discussed the state-of-the-art approaches tackling these problems. We proposed a new approach enabling the retrieval of meaningful answers from different sources by exploiting vague querying and approximate join techniques. It essentially consists in first applying transformations to the original query, obtaining relaxed versions of it, each matching the schema adopted at a single source; then using the relaxed queries to retrieve partial answers from each source; and finally combining them using information about the retrieved objects. The approach has been experimentally validated and has proved effective in a P2P setting. Several issues remain open in the context of approximately querying heterogeneous XML data sources. On the one hand, it will be worth investigating the use of more expressive rewriting mechanisms, based on sets of transformation rules richer than the one investigated in this chapter and employed in most state-of-the-art systems where approximate query evaluation is obtained through query relaxation. Augmenting the set of transformation rules with further relaxation primitives (other than node renaming, node deletion, subtree promotion, axis relaxation, and predicate relaxation) would improve the flexibility of the rewriting mechanism, thus enhancing its capability of adapting a query to different data schemas. On the other hand, a great deal of attention should be devoted to the optimization of the evaluation of approximate query answers.

References

1. Abiteboul, S., Benjelloun, O., Milo, T.: The Active XML project: an overview. The VLDB Journal 17(5), 1019–1040 (2008)
2. Amer-Yahia, S., Cho, S., Srivastava, D.: Tree pattern relaxation. In: Proc. Int. Conf. on Extending Database Technology, pp. 496–513 (2002)
3. Amer-Yahia, S., Koudas, N., Marian, A., Srivastava, D., Toman, D.: Structure and Content Scoring for XML. In: Proc. Int. Conf. on Very Large Databases, pp. 361–372 (2005)
4. Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: FleXPath: Flexible Structure and Full-Text Querying for XML. In: Proc. ACM SIGMOD Conf. on Management of Data, pp. 83–94 (2004)
5. Augsten, N., Böhlen, M.H., Dyreson, C.E., Gamper, J.: Approximate Joins for Data-Centric XML. In: Proc. ACM SIGMOD Conf. on Management of Data, pp. 814–823 (2008)
6. Baru, C.K., Gupta, A., Ludäscher, B., Marciano, R., Papakonstantinou, Y., Velikhov, P., Chu, V.: XML-based information mediation with MIX. In: Proc. ACM SIGMOD Conf. on Management of Data, pp. 597–599 (1999)
7. Beneventano, D., Bergamaschi, S., Guerra, F., Vincini, M.: The SEWASIE Network of Mediator Agents for Semantic Search. Journal of Universal Computer Science 13(12), 1936–1969 (2007)
8. BitTorrent, http://www.bittorrent.com



9. Bonifati, A., Chang, E.Q., Ho, T., Lakshmanan, L.V.S., Pottinger, R.: HePToX: Marrying XML and heterogeneity in your P2P databases. In: Proc. Int. Conf. on Very Large Databases, pp. 1267–1270 (2005)
10. Camillo, S.D., Heuser, C.A., Mello, R.S.: Querying heterogeneous XML sources through a conceptual schema. In: Proc. Int. Conf. on Conceptual Modeling, pp. 186–199 (2003)
11. Chen, C.X., Mihaila, G.A., Padmanabhan, S., Rouvellou, I.: Query translation scheme for heterogeneous XML data sources. In: Proc. ACM Int. Work. on Web Information and Data Management, pp. 31–38 (2005)
12. Do, H., Rahm, E.: COMA – A system for flexible combination of schema matching approaches. In: Proc. Int. Conf. on Very Large Databases, pp. 610–621 (2002)
13. Doan, A., Domingos, P., Halevy, A.: Reconciling schemas of disparate data sources: A machine-learning approach. In: Proc. ACM SIGMOD Conf. on Management of Data, pp. 509–520 (2001)
14. Fagin, R.: Combining Fuzzy Information from Multiple Systems. J. Comput. Syst. Sci. 58(1), 83–99 (1999)
15. Fazzinga, B., Flesca, S., Pugliese, A.: Retrieving XML data from heterogeneous sources through vague querying. ACM Trans. on Internet Technology 9(2) (2009)
16. Fuhr, N., Großjohann, K.: XIRQL: An XML query language based on information retrieval concepts. ACM Trans. on Information Systems 22(2), 313–356 (2004)
17. Guha, S., Jagadish, H.V., Koudas, N., Srivastava, D., Yu, T.: Integrating XML data sources using approximate joins. ACM Trans. on Database Systems 31(1), 161–207 (2006)
18. Halevy, A.Y., Ives, Z.G., Madhavan, J., Mork, P., Suciu, D., Tatarinov, I.: The Piazza Peer Data Management System. IEEE Trans. on Knowledge and Data Engineering 16(7) (2004)
19. Leitão, L., Calado, P., Weis, M.: Structure-based inference of XML similarity for fuzzy duplicate detection. In: Proc. Int. Conf. on Information and Knowledge Management, pp. 293–302 (2007)
20. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with Cupid. In: Proc. Int. Conf. on Very Large Databases, pp. 49–58 (2001)
21. Mandreoli, F., Martoglia, R., Tiberio, P.: Approximate query answering for a heterogeneous XML document base. In: Proc. Int. Conf. on Web Information Systems Engineering, pp. 337–351 (2004)
22. Manolescu, I., Florescu, D., Kossmann, D.: Answering XML queries on heterogeneous data sources. In: Proc. Int. Conf. on Very Large Databases, pp. 241–250 (2001)
23. Miklau, G., Suciu, D.: Containment and equivalence for a fragment of XPath. Journal of the ACM 51(1), 2–45 (2004)
24. Milano, D., Scannapieco, M., Catarci, T.: Structure-aware XML Object Identification. IEEE Data Eng. Bull. 29(2), 67–74 (2006)
25. Napster, http://www.napster.com
26. Nejdl, W., Wolf, B., Qu, C., Decker, S., Sintek, M., Naeve, A., Nilsson, M., Palmér, M., Risch, T.: EDUTELLA: A P2P networking infrastructure based on RDF. In: Proc. Int. World Wide Web Conf., pp. 604–615 (2002)
27. Pan, H.: Relevance Feedback in XML Retrieval. In: Proc. Int. Conf. on Extending Database Technology Workshops, pp. 187–196 (2004)
28. Pitoura, E., Abiteboul, S., Pfoser, D., Samaras, G., Vazirgiannis, M.: DBGlobe: A service-oriented P2P system for global computing. ACM SIGMOD Record 32(3), 77–82 (2003)
29. Polyzotis, N., Garofalakis, M.N.: XSketch synopses for XML data graphs. ACM Trans. on Database Systems 31(3), 1014–1063 (2006)



30. Puhlmann, S., Weis, M., Naumann, F.: XML Duplicate Detection Using Sorted Neighborhoods. In: Proc. Int. Conf. on Extending Database Technology, pp. 773–791 (2006)
31. Ribeiro, L., Härder, T.: Entity Identification in XML Documents. In: Grundlagen von Datenbanken, pp. 130–134 (2006)
32. Rodriguez-Gianolli, P., Mylopoulos, J.: A semantic approach to XML-based data integration. In: Proc. Int. Conf. on Conceptual Modeling, pp. 117–132 (2001)
33. Schlieder, T.: Schema-driven evaluation of approximate tree-pattern queries. In: Proc. Int. Conf. on Extending Database Technology, pp. 514–532 (2002)
34. Tatarinov, I., Halevy, A.Y.: Efficient query reformulation in peer-data management systems. In: Proc. ACM SIGMOD Conf. on Management of Data (2004)
35. Theobald, A., Weikum, G.: Adding Relevance to XML. In: Proc. Int. Work. on the Web and Databases, pp. 35–40 (2000)
36. Vdovjak, R., Houben, G.: RDF-based architecture for semantic integration of heterogeneous information sources. In: Proc. Work. on Information Integration on the Web, pp. 51–57 (2001)
37. WordNet, http://wordnet.princeton.edu/
38. The World Wide Web Consortium: Extensible Markup Language (XML), http://www.w3.org/XML
39. The World Wide Web Consortium: XML Path Language, http://www.w3.org/TR/xpath
40. Yu, C., Popa, L.: Constraint-based XML query rewriting for data integration. In: Proc. ACM SIGMOD Conf. on Management of Data, pp. 371–382 (2004)
41. Zhang, K., Shasha, D.: Simple fast algorithms for the editing distance between trees and related problems. SIAM J. on Computing 18(6), 1245–1262 (1989)


Fuzzy XQuery

Marlene Goncalves and Leonid Tineo

Abstract. This chapter presents a fuzzy-set-based extension to XQuery which allows users to express preferences on XML documents and retrieves documents discriminated by their satisfaction degree. This extension consists of the new xs:truth built-in data type, intended to represent gradual truth degrees, as well as the xml:truth attribute, which handles satisfaction degrees in nodes of fuzzy XQuery expressions. The XQuery language is extended to declare fuzzy terms and use them in query expressions. Additionally, several kinds of expressions, such as FLWOR, are fuzzified. Finally, an evaluation mechanism is presented in order to avoid the superfluous calculation of truth degrees.

1 Introduction The World Wide Web plays an essential role in many online companies and has made available an enormous amount of data from several websites. Many websites offer common online services such as travel agencies, shopping stores, car rental, encyclopedias, and so on; thus, the Web has become a popular tool for this kind of service. In fact, many of these websites contain engines which query data from other existing sites. Most of these websites use the XML (Extensible Markup Language) format [20] to interchange data. XML is a format recommended by the World Wide Web Consortium (W3C) and it has been extensively used to transfer data among websites. XML documents may be queried through declarative query languages such as XPath [18] and XQuery [19]. Both languages are XML-centric, i.e., their data model and type system are based on XML. XQuery is an extension of XPath conceived to integrate multiple XML sources, and it is the W3C standard language for XML data. Most database engines (e.g., those of IBM, Oracle, and Microsoft) support XQuery.

Marlene Goncalves
Universidad Simón Bolívar, Apartado 89000, Caracas 1080-A, Venezuela
e-mail: mgoncalves@usb.ve

Leonid Tineo
Universidad Simón Bolívar, Apartado 89000, Caracas 1080-A, Venezuela
e-mail: leonid@usb.ve



Despite the expressive power of XQuery, this language cannot handle search criteria based on user preferences expressed using linguistic terms. Furthermore, XQuery is not able to discriminate query answers according to user criteria. This weakness is often referred to as the rigidity problem of query languages, and it is due to the fact that query conditions are based on Boolean logic [2]. To illustrate user preference criteria, suppose some researchers want to attend a conference. They decide to query a travel company website, searching for the best flight trip according to their preferences. For one person, a trip may be selected if and only if it is very cheap and makes few connections to reach the city where the conference will be held. A second person, however, prefers a direct flight to a nearby city, followed by a train to the conference city. In this example, the preference criteria involve linguistic terms of a vague nature such as very, cheap, few, and near. The semantics of such terms is context-dependent and may vary according to user preferences. On the other hand, as many optional trips might exist, it would be helpful to discriminate them in terms of compatibility with the user's criteria. A possible solution for this kind of query is a query system based on fuzzy logic. Such a system may allow defining user criteria and ranking query answers using a membership function; a membership function quantifies the satisfaction degree of each answer with respect to the user's criteria and induces a total order on the dataset. At present, software developers must rank query answers in the application logic layer, because XQuery does not support queries based on fuzzy logic. In consequence, this can be a source of errors that increases development costs and requires highly skilled programmers. In addition, current websites do not provide this kind of query capability, so the expression of user preferences is limited. In this chapter, we propose to incorporate fuzzy logic into the XQuery language as a step towards providing a more flexible native XML language which is able to discriminate answers based on user criteria. We extend almost all valid XQuery expressions in order to allow fuzzy conditions. We assume the reader is familiar with XQuery syntax; details of XQuery are presented by the W3C [19]. The presentation of the proposed extension is as follows:
⎯ First, we dedicate a section to introducing the basic background on fuzzy logic which is needed for the comprehension of our extension.
⎯ Second, we briefly review related work in the fields of XML query languages and fuzzy query languages.
⎯ In the third section of this chapter, we detail a first extension of XQuery to support fuzzy logic capabilities, that is, the extension of the data model to include truth degrees as well as the traditional Boolean values true and false.
⎯ Fourth, we extend XQuery with the capability of declaring linguistic terms; terms are interpreted based on fuzzy sets according to user preferences.
⎯ The fifth section describes the main contribution, a fuzzy-logic-based extension for different kinds of query expressions in XQuery, i.e., filter expressions, comparison expressions, quantified expressions, conditional expressions, and FLWOR expressions.



⎯ Sixth, we propose a query evaluation mechanism that attempts to keep the extra cost of fuzzy query evaluation low.
⎯ The seventh and final section points out the conclusions of this chapter and addresses some future work.

2 Background Zadeh [21] introduced fuzzy sets as an extension of classical ones. A fuzzy set is a set whose elements possess membership degrees determined by a membership function. The membership function μ of a fuzzy set F, denoted by μF, is a function into the range [0,1] that induces a totally ordered set. Some important concepts of fuzzy set theory are support, core, and α-cut. They allow establishing relationships between fuzzy sets and classical ones. The support corresponds to those elements that are not fully excluded from the fuzzy set, i.e., the set of elements whose membership degree is greater than zero. The core identifies those elements that are fully included in the set, that is, the set of elements whose membership degree is 1. Finally, the α-cut operator defines the set of elements whose membership degree is greater than or equal to α ∈ [0,1]. Fuzzy sets are the basis of fuzzy logic, a multi-valued logic whose truth values are represented by real numbers on the closed interval [0,1]. The truth value of a proposition s is denoted by μ(s). The value μ(s)=0 means that sentence s is completely false, and μ(s)=1 means that s is completely true. A truth degree μ(s)∉{0,1} means that s is possibly true with a degree μ(s). Fuzzy logic sentences are built using linguistic terms: predicates, modifiers, comparisons, connectives, and quantifiers. The semantics of such terms is context-dependent and expresses user preferences. Positive adjectives or terms in natural language, such as good, bad, cheap, and expensive, may be used in fuzzy logic as fuzzy predicates. The semantics of these linguistic terms is expressed by means of fuzzy sets. Fuzzy modifiers are linguistic terms whose effect is to transform the membership function of a predicate by increasing, decreasing, translating, or reversing it. In most cases, the adverbs very and extremely indicate the presence of a fuzzy modifier. Linguistic terms may also be used to establish comparisons. Comparative adjectives, expressed in English by the termination -er or the word more, and pure adjectives such as better and worse, are examples of linguistic terms represented as fuzzy comparators. The semantics of these terms is also expressed by means of fuzzy sets, but in this case they are fuzzy sets of pairs, i.e., binary fuzzy relations. Fuzzy logic propositions may be combined by means of fuzzy connectives. This category includes the fuzzy extensions of the traditional logic operators: negation, conjunction, and disjunction. Furthermore, users may define their own connectives. Universal and existential quantifiers have been extended as fuzzy quantifiers. Quantitative adjectives such as few and many, and relative superlatives such as most of, are linguistic quantifiers. The semantics of these fuzzy terms may be given by fuzzy sets of numbers with a non-empty core and a convex membership function.
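A small Python sketch of these notions (our own illustration, not part of the formalism), representing a finite fuzzy set by its membership function and extracting its support, core, and α-cuts:

```python
# A fuzzy set F over a small universe, given by its membership function mu_F.
F = {"a": 1.0, "b": 0.7, "c": 0.3, "d": 0.0}

support = {x for x, mu in F.items() if mu > 0}   # not fully excluded
core = {x for x, mu in F.items() if mu == 1.0}   # fully included

def alpha_cut(fuzzy_set, alpha):
    """Elements whose membership degree is >= alpha, for alpha in [0,1]."""
    return {x for x, mu in fuzzy_set.items() if mu >= alpha}

print(support)            # {'a', 'b', 'c'}
print(core)               # {'a'}
print(alpha_cut(F, 0.5))  # {'a', 'b'}
```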



Fuzzy quantifiers are classified according to their nature as absolute or proportional [22]. Absolute quantifiers are defined on the real numbers and describe fuzzy quantities such as about 5 or more than 20. Proportional ones correspond to numbers in the closed interval [0,1] and define relative quantities such as at least half or most of. According to the behavior of their membership functions, fuzzy quantifiers may also be classified as increasing, decreasing, or unimodal [22]. Linguistic expressions such as at least 5 and most of are examples of increasing fuzzy quantifiers; expressions such as at most 5 and few of are examples of decreasing fuzzy quantifiers; expressions such as around 5 and half of are examples of unimodal fuzzy quantifiers.

3 Related Work XQuery is a native XML query language proposed by the World Wide Web Consortium as a standard language [19]. In the XQuery data model, documents are represented as trees of nodes; a node may be a document, element, attribute, text, namespace, processing instruction, or comment. Additionally, the data model allows atomic values and literals, such as strings, Booleans, decimals, integers, floats and doubles, and dates. However, few efforts have been made to incorporate fuzzy logic into XML query languages. Some ideas to extend XPath with fuzzy terms were introduced by Braga et al. [5]: ranked lists through annotations (or comments) in the result, and the use of fuzzy predicates and fuzzy quantifiers. Nevertheless, they do not allow users to define fuzzy terms; instead, they offer a small set of built-in predicates. Modifiers, comparators, and connectives are not considered by Braga et al. [5], and the combination of fuzzy predicates is made by means of arithmetic operations on ranking variables instead of using fuzzy operators. In addition, quantified expressions and FLWOR expressions are not considered in this work. Additionally, Goncalves and Tineo [12] integrate fuzzy logic and XPath expressions. Damiani et al. [7] extend XPath selection queries using flexible constraints on the structure and content of XML documents. Fazzinga et al. [9] introduced a semantics for fuzzy XPath queries that ranks the top-k answers in terms of their approximation degrees. Barranco et al. [1] define a new fuzzy query language called XFSQL (XML Fuzzy Structured Query Language) using the XML formatting rules. Thomson et al. [15] proposed a fuzzy-logic-based XQuery language and some techniques to extract data from XML documents. Nevertheless, none of these works proposes to extend quantified and FLWOR expressions with fuzzy logic. The W3C [17] has proposed a more flexible XML query language supporting full-text search and associating scores with the results. Such scores express result relevance according to full-text-search conditions. Scores are bound to a variable returned by the XQuery query. However, they do not consider fuzzy quantified



and FLWOR expressions, nor the specification of user criteria by defining linguistic terms and combining them with fuzzy operators. On the other hand, in the relational database context, several approaches have been proposed to support fuzzy queries. Some of these are SQLf [3], FSQL [10][11], and SoftSQL [2]. Among these proposals, SQLf is the most complete, because it extends more SQL statements and it is the only solution updated to the features of SQL:2003 [8]. Based on these experiences, we think we may incorporate some of their ideas in order to enrich the XQuery language. Finally, we have the problem of the evaluation of fuzzy queries. Since membership functions must be calculated and fuzzy sets contain more elements than classical ones, the time complexity of fuzzy query evaluation may be higher than that of crisp query evaluation. Proposals such as FSQL [10][11] and SoftSQL [2] evaluate fuzzy queries by means of naïve mechanisms. However, Bosc and Pivert [4] have conceived an evaluation mechanism, based on the distribution of α-cuts over the conditions of a fuzzy query, that translates a fuzzy query into a regular one; this mechanism is known as the Derivation Principle. The translated query selects the desired subset of data without computing the satisfaction degree of the fuzzy condition for the whole input data. More recently, Ma and Yan [14] have proposed the use of this principle as a unified approach for fuzzy query evaluation on relational databases. The principle has been widely studied for SQLf queries by Tineo's research group [6][13][16], showing better performance than naïve mechanisms. To illustrate the principle, we denote by DS(<fuzzy sentence>) the regular sentence translated from a fuzzy sentence using α-cut distribution, and by DNC(<fuzzy condition>, ≥, α) the derived necessary condition that specifies the α-cut of a fuzzy condition. Consider the user requirement Find cheap flights with a threshold of 0.5, which may be specified in SQLf [3] as:

select * from flights where price = cheap with calibration 0.5;

where with calibration is a clause for the threshold specification and the fuzzy predicate cheap is represented by the trapezium of Fig. 1.


Fig. 1 Representation of the fuzzy predicate cheap

When the Derivation Principle is applied, we obtain a classic SQL query that retrieves the same rows as the fuzzy one.



DS(select * from flights where price = cheap with calibration 0.5) =
    select * from flights where price ≤ 1400

The query may be translated using the definition of the α-cut of a fuzzy set, which allows deriving the Boolean condition as follows:

DNC(price = cheap, ≥, 0.5) = price ≤ 1400

Then, the fuzzy query is evaluated on the result of the Boolean query. In this case, superfluous access to and computation on rows whose price > 1400 is avoided. In this chapter, we fuzzify the data model, the data definition language, and the expressions of the XQuery language with fuzzy logic. First, we extend the data model to include gradual truth degrees. Second, we allow the definition of fuzzy terms in the XQuery language. Third, we make XQuery expressions such as filtering, comparison, conditional, quantified, and FLWOR expressions more flexible. Fourth, we propose an evaluation algorithm for XQuery queries based on the Derivation Principle.
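Returning to the derivation example above: for a predicate like cheap, whose membership is 1 up to the end of its core and decreases linearly to 0 at the end of its support, the α-cut bound can be computed directly. A Python sketch (our own illustration; the bounds 1200 and 1600 are assumed from the trapezium of Fig. 1, consistent with the derived condition price ≤ 1400):

```python
def dnc_upper_bound(core_end, support_end, alpha):
    """Upper bound of the alpha-cut for a decreasing right border:
    mu(x) >= alpha  <=>  x <= core_end + (1 - alpha) * (support_end - core_end)."""
    return core_end + (1 - alpha) * (support_end - core_end)

# cheap: membership 1 up to 1200, linearly down to 0 at 1600
print(dnc_upper_bound(1200, 1600, 0.5))  # 1400.0 -> "price <= 1400"
```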

4 Truth Degrees To support fuzziness in XQuery, it is necessary to give a representation to the satisfaction degrees obtained from fuzzy conditions and to establish a standard notation for XML data annotated with the fuzzy degrees produced by fuzzy queries. Thus, we define a new xs:truth data type and a new xml:truth attribute in the two following subsections.

4.1 The xs:truth Data Type First, we need to extend the XQuery and XPath Data Model [19] in order to generalize the xs:boolean data type to gradual truth values. Thus, we define the primitive data type xs:truth, which represents the set of possible truth degrees. The value space of xs:truth is the subset of xs:decimal restricted to the real unit interval [0.0,1.0]. The order relation on xs:truth is the order relation on real numbers, restricted to this subset. Values of xs:truth have a lexical representation consisting of a finite-length sequence of decimal digits (#x30-#x39) separated by a period as a decimal indicator; leading and trailing zeroes are optional. If the fractional part is zero, the period and the following zeroes may be omitted. For example: 0, 1, 1.0, 0.333, 0.75 are valid xs:truth values. An instance of xs:truth may also have the legal literals {true, false} as alternative representations of 1 and 0, respectively. The canonical representation and constraining facets of xs:truth are the same as those of xs:decimal.



The xs:boolean built-in data type will be derived from xs:truth. The usual casting rules of the xs:boolean and xs:decimal data types are not affected. We remark that, since xs:truth is derived from xs:decimal, casting an xs:truth value into xs:boolean will produce false for zero and true otherwise. For example, the source value 0.75 of type xs:truth is converted to true, while the value 0 is converted to false. If the source value is one of the special xs:float or xs:double values NaN, INF, or -INF, or is out of the real interval [0.0,1.0], an error is raised [err:FOCA0002], which means invalid lexical value. In XDM, fn:true and fn:false are additional Boolean constructor functions. They construct the xs:boolean values true and false, respectively. These functions are compatible with the xs:truth data type. The type xs:truth inherits the comparison operators on numeric values, e.g., the expression (0.25 ge false) returns true while (true gt 1.0) returns false. Notice that the usual comparison operators eq, ne, lt, le, gt, ge (and their generalized versions =, !=, <, <=, > and >=) produce usual Boolean results. The comparison operators returning gradual truth values will be defined later. On the other hand, the fn:not function will be changed to the form fn:not($arg as item()*) as xs:truth. This function inverts the value of its argument: $arg is first reduced to an effective truth value v by applying the fn:truth function, and the complement 1-v is returned. The fn:truth($arg as item()*) as xs:truth function is intended to compute the effective truth value of a data input by means of the following rules:
⎯ If $arg is the empty sequence, fn:truth returns false;
⎯ If $arg is a sequence whose first item is a node, fn:truth returns true;
⎯ If $arg is a singleton value of type xs:truth or a type derived from xs:truth, fn:truth returns $arg;
⎯ If $arg is a singleton value of type xs:string or a type derived from xs:string, xs:anyURI or a type derived from xs:anyURI, or xs:untypedAtomic, fn:truth returns false if the operand value has zero length; otherwise it returns true;
⎯ If $arg is a singleton value of any numeric type or a type derived from a numeric type, other than xs:truth, fn:truth returns false if the operand value is NaN or is numerically equal to zero; otherwise it returns true;
⎯ In all other cases, fn:truth raises a type error [err:FORG0006], which means invalid argument type.
The effective truth value of a sequence is implicitly calculated during the processing of the following types of expressions: logical expressions, the fn:not function, predicates, conditional expressions, quantified expressions, and comparisons. We will discuss the processing of these kinds of expressions later in this document.
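The rules above are easy to mirror procedurally. A simplified Python sketch (our own illustration; Node is a stand-in for an XDM node, and floats in [0,1] play the role of xs:truth values, a simplification of the real type system):

```python
class Node:                       # stand-in for an XDM node
    pass

def effective_truth(arg):
    """Simplified version of the fn:truth rules."""
    if isinstance(arg, list):
        if not arg:
            return 0.0                      # empty sequence -> false
        if isinstance(arg[0], Node):
            return 1.0                      # first item is a node -> true
        if len(arg) == 1:
            return effective_truth(arg[0])  # singleton value
        raise TypeError("err:FORG0006")     # invalid argument type
    if isinstance(arg, float) and 0.0 <= arg <= 1.0:
        return arg                          # treated as an xs:truth degree
    if isinstance(arg, str):
        return 0.0 if len(arg) == 0 else 1.0
    if isinstance(arg, (int, float)):
        return 0.0 if (arg != arg or arg == 0) else 1.0  # NaN or zero -> false
    raise TypeError("err:FORG0006")

def fn_not(arg):
    return 1.0 - effective_truth(arg)       # complement of the truth value
```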

4.2 The xml:truth Attribute The introduction of fuzzy logic terms and conditions in XQuery expressions will produce discriminated answers to queries on XML documents. Therefore, we must



provide a mechanism to annotate XML documents with truth degrees. For this reason, we propose the attribute xml:truth, which is similar to the standard xml:id attribute. This section defines the meaning and processing of the attribute xml:truth as a truth degree attribute in XML documents. Truth degrees may also be declared through external mechanisms. Nevertheless, we would like to guarantee that any user will have a "correct" schema. Thus, it is desirable to have a mechanism that allows truth degrees to be recognized by all conformant XML processors. According to the XML standards [19], prefixes beginning with xml are reserved for XML use and XML-related specifications. This licenses the use of the attribute xml:truth as a common syntax for truth degrees in XML with the semantics specified herein. Therefore, authors of XML documents are encouraged to name their truth degree attributes xml:truth to increase the interoperability of these identifiers on the Web. All xml:truth attributes must have xs:truth as their type. In this manner, declarations of such attributes may omit the type specification; otherwise, xs:truth must be used in the specification. An xml:truth processor is a software module that works in conjunction with an XML processor to provide access to the truth degrees in an XML document. An xml:truth processor must ensure that the type constraint holds for all xml:truth attributes. A violation of the constraints in this specification results in an xml:truth error. Such errors are not fatal, but should be reported by the xml:truth processor. For example, the following is an XML Schema declaration for elements of tag flight annotated with the xml:truth attribute (note that the attribute declaration goes inside the complex type, after the content model):

<xs:element name="flight">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="origin" type="xs:string"/>
      <xs:element name="destination" type="xs:string"/>
      <xs:element name="airline" type="xs:string"/>
      <xs:element name="number" type="xs:string"/>
      <xs:element name="depart" type="xs:time"/>
      <xs:element name="arrival" type="xs:time"/>
      <xs:element name="price" type="xs:decimal"/>
    </xs:sequence>
    <xs:attribute name="xml:truth" type="xs:truth"/>
  </xs:complexType>
</xs:element>

An example XML document resulting from a fuzzy query producing elements of tag flight, annotated with xs:truth degree values, is the following:



<preferredFlights>
  <flight xml:truth="true">
    <origin>Caracas</origin>
    <destination>New York</destination>
    <airline>AL01</airline>
    <number>357</number>
    <depart>07:00</depart>
    <arrival>12:00</arrival>
    <price>1200</price>
  </flight>
  <flight xml:truth="0.75">
    <origin>Caracas</origin>
    <destination>Paris</destination>
    <airline>AL02</airline>
    <number>468</number>
    <depart>16:00</depart>
    <arrival>06:00</arrival>
    <price>1700</price>
  </flight>
  <flight xml:truth="0.85">
    <origin>Caracas</origin>
    <destination>Los Angeles</destination>
    <airline>AL03</airline>
    <number>751</number>
    <depart>08:00</depart>
    <arrival>13:00</arrival>
    <price>1200</price>
  </flight>
  <flight xml:truth="0.25">
    <origin>Caracas</origin>
    <destination>London</destination>
    <airline>AL05</airline>
    <number>545</number>
    <depart>17:00</depart>
    <arrival>06:00</arrival>
    <price>1300</price>
  </flight>
</preferredFlights>

5 Linguistic Terms Fuzzy queries comprise linguistic terms which are defined by the user and represented as fuzzy terms. To create fuzzy terms, a definition sublanguage is presented in this section. In addition to the built-in functions and the xs:truth datatype operations, we allow users to define linguistic terms by giving their xs:truth degree. The declaration of such terms is similar to the usual function declarations in XQuery. Fuzzy terms may be declared in the query prolog, imported from a library module, or provided by an external environment as part of a static context. Their declaration causes a declared operator to be added to the operator signatures of the module in which it appears. Finally, the static and dynamic consistency constraints of fuzzy terms are like those of user-defined functions.

5.1 Predicates A fuzzy predicate is a particular kind of user-defined function. Its declaration includes its name, and the name and data type of each parameter. The returned data type is always xs:truth and may therefore be omitted.



FuzzyPredDecl ::= "declare" "fuzzy" "predicate" QName "(" Param ")" ("as" "xs:truth")?
                  (EnclosedExpr | "trapezium" "(" Expr ")" | "extension" "(" Expr ")")
Param ::= "$" QName ("as" SequenceType)?

A predicate declaration contains an expression that defines how its result is computed from its parameters. In the case of a trapezium specification, the parameter must be of a numeric datatype. The parenthesized expression is a sequence of four static-value numeric expressions in increasing order: v1, v2, v3, and v4. They define a fuzzy set F characterized by core(F)=[v2,v3] and support(F)=]v1,v4[, where the membership function on the border of F is given by the line segment (v1,0)-(v2,1) on the left side ]v1,v2[ and by (v3,1)-(v4,0) on the right side ]v3,v4[. The returned value is the membership degree of the actual parameter value in this fuzzy set. For example, the following are valid fuzzy predicate declarations:

declare fuzzy predicate young ( $age as xs:int ) trapezium (-INF, 0, 25, 65)
declare fuzzy predicate high ( $salary as xs:decimal ) trapezium (2000, 4000, xs:double(INF), xs:double(INF))

With the previous declarations we would have, for example: young(20) is true, young(45) returns 0.5, high(1500) is false, and high(2500) results in 0.25. In the case of an extension specification, the parenthesized expression is an even-length sequence of static-value expressions. From left to right, each pair of elements must consist of one element of the xs:truth datatype followed by one element of the parameter datatype. The returned value is the truth degree paired with the actual parameter value in this sequence, or false if the value is not in the sequence. Examples of valid fuzzy predicate declarations are:

declare fuzzy predicate lucky ( $num as xs:int )
  extension ( (true,13), (.75,7), (1,49), (false,18), (0.33,33), (.8,40) )
declare fuzzy predicate preferred ( $color as xs:string )
  extension ( .33, "orange", .66, "yellow", .66, "blue", 1.00, "green" )

Based on these declarations we would have, for example: lucky(13) is true, lucky(40) returns 0.8, lucky(666) is false, preferred("red") results in false, preferred("blue") gives 0.66, and preferred("green") is true.
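As a sketch of these declaration semantics (our own Python illustration, not part of the proposed language), the trapezium and extension forms can be modeled as follows; the printed values match the examples above:

```python
import math

def trapezium(v1, v2, v3, v4):
    """Membership in the fuzzy set with core [v2,v3] and support ]v1,v4[;
    v1 and v4 may be -inf/+inf for one-sided predicates."""
    def mu(x):
        if x <= v1 or x >= v4:
            return 0.0
        if v2 <= x <= v3:
            return 1.0
        if x < v2:   # left border, segment (v1,0)-(v2,1)
            return 1.0 if math.isinf(v1) else (x - v1) / (v2 - v1)
        # right border, segment (v3,1)-(v4,0)
        return 1.0 if math.isinf(v4) else (v4 - x) / (v4 - v3)
    return mu

def extension(pairs):
    """Truth degree paired with the value; false (0.0) if absent."""
    table = {value: degree for degree, value in pairs}
    return lambda x: table.get(x, 0.0)

young = trapezium(-math.inf, 0, 25, 65)
high = trapezium(2000, 4000, math.inf, math.inf)
lucky = extension([(1.0, 13), (.75, 7), (1.0, 49), (0.0, 18), (0.33, 33), (.8, 40)])

print(young(20), young(45))    # 1.0 0.5
print(high(1500), high(2500))  # 0.0 0.25
print(lucky(40), lucky(666))   # 0.8 0.0
```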

5.2 Modifiers A fuzzy modifier is a unary operator on xs:truth. It applies only to fuzzy predicates to build new fuzzy predicates, and its declaration specifies the modifier



name. The returned data type and the parameter type are always xs:truth and may therefore be omitted.

FuzzyModDecl ::= "declare" "fuzzy" "modifier" QName "(" Param ")" ("as" "xs:truth")?
                 (EnclosedExpr | "translation" "(" Expr ")" | "power" "(" Expr ")")
Param ::= "$" QName ("as" "xs:truth")?

A modifier declaration includes an expression that defines how its result is evaluated. In the case of a translation specification, any predicate modified by this modifier must have a parameter of a numeric datatype. The parenthesized expression must be a static-value numeric expression; the semantics is that this value is added to the actual value of the predicate parameter before the predicate is evaluated. For example, the following would be valid fuzzy modifier declarations:

declare fuzzy modifier really ( $dummy ) translation (+10)
declare fuzzy modifier weakly ( $x as xs:truth ) translation (-10)

Using the modifier and predicate declarations previously defined, we would have, for example: really young(20) is 0.875, really young(45) returns 0.25, weakly young(65) gives 0.25, weakly young(45) retrieves 0.75, and weakly young(35) is true. In the case of a power specification, the parenthesized expression must be a static positive numeric value. This value is the exponent applied to the truth degree given by the predicate. For example, the following would be valid fuzzy modifier declarations:

declare fuzzy modifier very ( $dummy ) power (+2.0)
declare fuzzy modifier relatively ( $dummy ) power (+0.5)

With the previous declarations we would have: very lucky(13) gives true, very lucky(40) returns 0.64, relatively lucky(666) is false, very preferred("red") results in false, relatively preferred("green") is true, and relatively preferred("blue") gives 0.812403840463596036. Two additional fuzzy modifiers would be built in:
⎯ op:not ( $arg as xs:truth ) as xs:truth: its result is a modified predicate whose returned values are the complement (as in fn:not) of those of the original predicate;
⎯ op:ant ( $arg as xs:truth ) as xs:truth: its result is a modified predicate whose returned values are those of the original predicate applied to the complement of the actual value of its argument, where the complement of a value v is defined as M-v, M being the maximum value in the domain of the original predicate; this domain must be derived from a numeric data type.



According to the previous declarations, we would have, for example: not lucky(13) gives false, not very lucky(40) returns 0.36, very not lucky(40) is 0.04, and not preferred("blue") gives 0.34. If the argument $age in the declaration of the young predicate were constrained to the sub-range {0 .. 120}, then ant really young(100) would give 0.875, while ant preferred("green") raises a type error [err:FORG0006], which means invalid argument type.
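A Python sketch of the modifier semantics (our illustration), treating a modifier as a function from predicates to predicates; young and lucky are simplified re-declarations of the earlier examples:

```python
def young(age):                       # trapezium(-INF, 0, 25, 65), simplified
    return 1.0 if age <= 25 else max(0.0, (65 - age) / 40)

lucky_table = {13: 1.0, 7: .75, 49: 1.0, 18: 0.0, 33: 0.33, 40: .8}
lucky = lambda n: lucky_table.get(n, 0.0)

def translation(delta):               # shift the argument before evaluation
    return lambda pred: (lambda x: pred(x + delta))

def power(exponent):                  # raise the truth degree to a power
    return lambda pred: (lambda x: pred(x) ** exponent)

def op_not(pred):                     # complement of the truth degree
    return lambda x: 1.0 - pred(x)

really, very, relatively = translation(10), power(2.0), power(0.5)

print(really(young)(20))              # young(30) = 0.875
print(very(lucky)(40))                # 0.8 ** 2 = 0.64
print(op_not(very(lucky))(40))        # 1 - 0.64 = 0.36
print(relatively(lucky)(40))          # 0.8 ** 0.5 ~ 0.894
```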

5.3 Comparators A fuzzy comparator is a particular kind of user-defined comparison operator. Its declaration specifies the operator name, as well as the name and datatype of its two operands (parameters). The returned data type is always xs:truth and may be omitted.

FuzzyCompDecl ::= "declare" "fuzzy" "comparator" QName "(" Param "," Param ")" ("as" "xs:truth")?
                  (EnclosedExpr
                   | "(" "$" QName ("-"|"/") "$" QName ")" "trapezium" "(" Expr ")"
                   | "(" "$" QName "," "$" QName ")" "extension" "(" Expr ")"
                   | "(" "$" QName "," "$" QName ")" "similarity" "(" Expr ")")
Param ::= "$" QName ("as" SequenceType)?

The comparator declaration includes an expression that defines how its result is computed from its parameters. In the case of a trapezium specification, the parameters must be of a numeric datatype. The two "$" QName in the parenthesis before the trapezium keyword are the names of the parameters, in the same order as their declaration. The parenthesized expression must be a sequence of four static-value numeric expressions in increasing order: v1, v2, v3, and v4. They define a fuzzy set F characterized by core(F)=[v2,v3] and support(F)=]v1,v4[, where the membership function on the border of F is given by the line segment (v1,0)-(v2,1) on the left side ]v1,v2[ and by (v3,1)-(v4,0) on the right side ]v3,v4[. The returned value is the membership degree, in this fuzzy set, of the distance between the actual parameter values. This distance is calculated as the difference when the symbol "-" is used, and as the quotient in the case of the "/" symbol. For instance, some valid fuzzy comparator declarations are:

(: mlt abbreviates much less than :)
declare fuzzy comparator mlt ( $x as xs:int, $y as xs:int ) ($x-$y) trapezium (-INF, -INF, -100, 0)
(: mol abbreviates more or less :)
declare fuzzy comparator mol ( $x as xs:int, $y as xs:int ) ($x/$y) trapezium (.5, 1, 1, 2)



With the previous declarations we would have, for example: (3 mlt 120) gives true, (45 mlt 95) returns 0.5, (1500 mlt 15) is false, (500 mol 800) results in 0.25, (500 mol 1000) is false, and (1500 mol 1000) results in 0.5. In the case of an extension specification, the two "$" QName in the parenthesis before the extension keyword are the names of the parameters, in the same order as their declaration. The parenthesized expression must be a sequence of static-value expressions whose length is a multiple of three. From left to right, each triplet of elements must consist of one element of the xs:truth datatype followed by two elements of the parameter datatype. The returned value is the truth degree associated with the pair of actual parameter values in this sequence, or false if the pair of values is not in the sequence. For example, a valid fuzzy comparator declaration is:

declare fuzzy comparator approx ( $x as xs:string, $y as xs:string ) ($x,$y)
  extension ( .75, "red", "orange", .50, "orange", "red",
              .50, "yellow", "orange", .25, "orange", "yellow",
              .50, "yellow", "green", .25, "green", "yellow",
              .75, "blue", "green", .50, "green", "blue" )

Based on the previous declaration we would have, for example: ("red" approx "green") results in false, ("green" approx "blue") gives 0.50, and ("orange" approx "orange") is false. The case of a similarity specification has the same restrictions as the extension specification and almost the same semantics; the difference is that the comparator is a reflexive and symmetric fuzzy relation. For example, a valid fuzzy comparator declaration for similar is:

declare fuzzy comparator similar ( $x as xs:string, $y as xs:string ) ($x,$y)
  similarity ( .75, "red", "orange", .50, "yellow", "orange",
               .50, "yellow", "green", .75, "blue", "green" )

Examples of fuzzy comparisons using the previous declaration are: ("red" similar "green") results in false, ("green" similar "blue") gives 0.75, and ("orange" similar "orange") is true.
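A Python sketch of the trapezium comparator semantics (our illustration; mol uses the bounds (.5, 1, 1, 2), which are the ones consistent with the worked values above):

```python
import math

def trapezium(v1, v2, v3, v4):
    def mu(d):
        if d <= v1 or d >= v4:
            return 0.0
        if v2 <= d <= v3:
            return 1.0
        if d < v2:
            return 1.0 if math.isinf(v1) else (d - v1) / (v2 - v1)
        return 1.0 if math.isinf(v4) else (v4 - d) / (v4 - v3)
    return mu

def comparator(distance, mu):
    """Fuzzy comparator: membership of the distance between the operands."""
    return lambda x, y: mu(distance(x, y))

mlt = comparator(lambda x, y: x - y, trapezium(-math.inf, -math.inf, -100, 0))
mol = comparator(lambda x, y: x / y, trapezium(.5, 1, 1, 2))

print(mlt(3, 120), mlt(45, 95))         # 1.0 0.5
print(mol(500, 800), mol(1500, 1000))   # 0.25 0.5
```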

5.4 Connectives A fuzzy connective is a particular kind of user-defined logical operator. Its declaration specifies the connective name. The datatypes of its parameters and of its result are always xs:truth and may, in consequence, be omitted. A fuzzy connective



declaration causes a declared operator to be added to the operator signatures of the module in which it appears. The precedence of these operators is the lowest.

FuzzyConnDecl ::= "declare" "fuzzy" "connective" QName "(" Param "," Param ")" ("as" "xs:truth")? EnclosedExpr
Param ::= "$" QName ("as" SequenceType)?

The connective declaration involves an expression that defines how its result is computed from its parameters. For example, valid fuzzy connective declarations could be:

(: imp abbreviates implies :)
declare fuzzy connective imp ( $x, $y ) { (not($x) or $y) }
(: por abbreviates probabilistic or :)
declare fuzzy connective por ( $x, $y ) { ($x + $y) - ($x * $y) }

Two additional fuzzy connectives would be built in:
⎯ op:and ( $arg1 as xs:truth, $arg2 as xs:truth ) as xs:truth: the result of this operator is the minimum of the effective truth values of its arguments;
⎯ op:or ( $arg1 as xs:truth, $arg2 as xs:truth ) as xs:truth: its result is the maximum of the effective truth values of its arguments.

5.5 Quantifiers A fuzzy quantifier is a particular kind of user-defined operator. The absolute and proportional keywords are used to indicate the nature of the fuzzy quantifier. The parenthesized expression following the keyword trapezium must be a sequence of four static-value numeric expressions in increasing order: v1, v2, v3, and v4. They define a fuzzy set F characterized by core(F)=[v2,v3] and support(F)=]v1,v4[, where the membership function on the border of F is given by the line segment (v1,0)-(v2,1) on the left side ]v1,v2[ and by (v3,1)-(v4,0) on the right side ]v3,v4[.

FuzzyQuanDecl ::= "declare" "fuzzy" "quantifier" QName ("absolute"|"proportional") "trapezium" "(" Expr ")"

For example, valid fuzzy quantifier declarations are as follows:

declare fuzzy quantifier atLeast30 absolute trapezium (25, 30, INF, INF)
declare fuzzy quantifier around20 absolute trapezium (10, 17, 25, 50)
declare fuzzy quantifier fewOf proportional trapezium (-INF, 0, .25, .50)
declare fuzzy quantifier mostOf proportional trapezium (.50, .75, 1, +INF)

Fuzzy quantifiers will be used in quantified XQuery expressions later in this document.



6 Fuzzy Queries The query language for XML data, XQuery, is very rich in the variety of query expressions it provides. They are [19]: filter expressions, comparison expressions, logical expressions, quantified expressions, conditional expressions, and FLWOR expressions. Combinations of such expressions may be used in order to build complex queries. In this section, we show how we extend these expressions in order to support fuzzy querying of XML data. This section comprises the main contribution of this chapter; nevertheless, the previous sections are indispensable because they deal with the representation of the elementary concepts that we need for a fuzzy XQuery language.

6.1 Filter Expressions XQuery allows filtering sequences by means of expressions known as predicate expressions. These expressions are enclosed in square brackets. In the case of multiple predicates, they are applied from left to right, and the result of applying each predicate is the input for the following one. For each item in the input sequence, the result of the predicate expression is coerced to an xs:boolean value, called the predicate truth value. We propose to extend it to an xs:truth value instead of just an xs:boolean value, thus allowing fuzzy predicate expressions and giving discriminated answers.

::= ::= ::= ::=

PrimaryExpr PredicateList Predicate* “[“ (Expr | TholdExpr)“]” “threshold” DecimalLiteral

In traditional Boolean predicates, a filtering expression retains those items that have truth value of true, and those with a truth value of false are discarded. Fuzzy nature of xs:truth type leads us to make more flexible filtering. We simply discard those items with a truth value of false. Predicate truth value is derived from application of similar rules to the corresponding XQuery xs:booelan valued predicates: ⎯ The first rule remains unaltered. If the value of the predicate expression is a singleton atomic value of a numeric type or derived from a numeric type, the predicate truth value is true if the value of the predicate expression is equal (by the eq operator) to the context position, and is false otherwise. ⎯ The second rule changes: Otherwise, the predicate truth value is the effective truth value of the predicate expression. Here, we use the effective truth value instead of the effective Boolean value. For example, the following expression selects all the descendants of the context node that are elements named toy and whose color attribute satisfies the fuzzy predicate preferred. descendant::toy [preferred(attribute::color)]



The next expression produces a sequence of products held in a variable and returns those products whose price is much less than 300 and whose color is relatively preferred, in terms of the user-defined semantics for these xs:truth terms.

$products[price mlt 300][relatively preferred(color)]

In order to give discriminated answers, we must return not only the selected elements but also the corresponding truth value of the predicate filter expression. This is done by means of the xml:truth attribute. For this reason, answer discrimination is only possible when the original input sequence is a sequence of nodes. In case of filtering an atomic value sequence, the result loses the effective truth values, as if they were coerced to xs:boolean. When a node is retained, a value is assigned to its xml:truth attribute as follows:

- First, we guarantee that such an attribute exists; if not, we add this attribute to the node with the initial value xml:truth=true;
- Thereafter, we compute the value of xml:truth as the minimum between the current value and the effective truth value of the predicate expression.

In order to illustrate this, consider the following fuzzy predicate:

declare fuzzy predicate cheap ( $price as xs:decimal) trapezium (-INF,0,1200,1600)

Also, suppose that the variable $flight is instantiated as:

<flight> <origin>Caracas</origin> <destination>Paris</destination> <airline>AL02</airline> <number>468</number> <depart>16:00</depart> <arrival>06:00</arrival> <price>1700</price> </flight>
<flight> <origin>Caracas</origin> <destination>New York</destination> <airline>AL03</airline> <number>751</number> <depart>08:00</depart> <arrival>13:00</arrival> <price>1200</price> </flight>
<flight> <origin>New York</origin> <destination>Beijing</destination> <airline>AL04</airline> <number>958</number> <depart>19:00</depart> <arrival>19:00</arrival> <price>1300</price> </flight>
<flight> <origin>Frankfurt</origin> <destination>Beijing</destination> <airline>AL06</airline> <number>601</number> <depart>20:00</depart> <arrival>10:00</arrival> <price>1400</price> </flight>

And consider the filter expression:

$flight[cheap(price)]



The result of this expression is a sequence of nodes annotated with their truth degrees according to the effective truth value of the predicate, as follows:

<flight xml:truth = true > <origin>Caracas</origin> <destination>New York</destination> <airline>AL03</airline> <number>751</number> <depart>08:00</depart> <arrival>13:00</arrival> <price>1200</price> </flight>
<flight xml:truth = 0.75 > <origin>New York</origin> <destination>Beijing</destination> <airline>AL04</airline> <number>958</number> <depart>19:00</depart> <arrival>19:00</arrival> <price>1300</price> </flight>
<flight xml:truth = 0.5 > <origin>Frankfurt</origin> <destination>Beijing</destination> <airline>AL06</airline> <number>601</number> <depart>20:00</depart> <arrival>10:00</arrival> <price>1400</price> </flight>

A predicate with the keyword threshold has the effect of rejecting those nodes whose xml:truth attribute is below the specified decimal value. For example:

$flight[cheap(price)][threshold 0.6]

With the previous data and definitions, this filter expression would not include the following node in the resulting sequence:

<flight xml:truth = 0.5 > <origin>Frankfurt</origin> <destination>Beijing</destination> <airline>AL06</airline> <number>601</number> <depart>20:00</depart> <arrival>10:00</arrival> <price>1400</price> </flight>
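A small Python sketch of this filtering semantics, under the assumption that nodes are represented simply as (description, price) pairs (names illustrative, not part of the proposal); it reproduces the degrees above and the effect of threshold 0.6.

def trapezium(x, v1, v2, v3, v4):
    if v2 <= x <= v3: return 1.0
    if x <= v1 or x >= v4: return 0.0
    if x < v2: return 1.0 if v1 == float('-inf') else (x - v1) / (v2 - v1)
    return 1.0 if v4 == float('inf') else (v4 - x) / (v4 - v3)

cheap = lambda price: trapezium(price, float('-inf'), 0, 1200, 1600)

flights = [("Caracas-Paris", 1700), ("Caracas-New York", 1200),
           ("New York-Beijing", 1300), ("Frankfurt-Beijing", 1400)]

# fuzzy filter: items with degree 0 (false) are discarded, the rest are
# annotated with their truth degree (the role of xml:truth)
annotated = [(f, cheap(p)) for f, p in flights if cheap(p) > 0]
print(annotated)                              # degrees 1.0, 0.75 and 0.5

# [threshold 0.6] additionally rejects nodes whose degree is below 0.6
print([(f, d) for f, d in annotated if d >= 0.6])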

6.2 Comparison Expressions

XQuery provides three kinds of comparison expressions:

- Value comparisons, which correspond to the usual comparison of values in traditional programming languages;
- General comparisons, which generalize value comparisons and are intended to compare sequences with a comparison operator under the scope of an implicit existential quantifier; and



- Node comparisons, which allow nodes to be compared by their identity or their document position.

We propose to extend just the value comparisons, allowing user-defined fuzzy comparators. Thus, an identifier (QName) that the user has defined to be a fuzzy comparator may be used in place of the traditional ones: eq, ne, lt, le, gt, and ge.

ComparisonExpr ::= RangeExpr ( CompOper RangeExpr )?
CompOper       ::= ValueComp | GeneralComp | NodeComp | FuzzyComp
ValueComp      ::= "eq" | "ne" | "lt" | "le" | "gt" | "ge"
GeneralComp    ::= "=" | "!=" | "<" | "<=" | ">" | ">="
NodeComp       ::= "is" | "<<" | ">>"
FuzzyComp      ::= QName

Fuzzy comparison evaluation is similar to that of traditional value comparisons. First, the operands are evaluated and checked for type compatibility, as XQuery does. If the operand types are not a valid combination for the given operator according to the user's definition, a type mismatch error is raised [err:XPTY0004]. Finally, if the operand types are a valid combination, the operator is applied to the operands. The difference is that fuzzy comparison operators are user-defined, while the traditional ones are built-in. We remark that in case of fuzzy comparison, the evaluation result is of the xs:truth datatype instead of just xs:boolean. For example, suppose the fuzzy comparator similar is declared as:

declare fuzzy comparator similar ( $x as xs:string, $y as xs:string) similarity ( .75, "red", "orange", .50, "yellow", "orange", .50, "yellow", "green", .75, "blue", "green" )

And consider the following fuzzy comparison expression:

$car/color similar "green"

The evaluation of this expression proceeds as follows. The node(s) returned by the expression $car/color are atomized. If the result of atomization is an empty sequence, the comparison result is an empty sequence. If the atomization result is a sequence containing more than one value, a type error is raised [err:XPTY0004]. If the atomization result is any atomic value other than "green", "blue" and "yellow", the expression returns false, while for "green" it returns true, for "blue" it returns 0.75, and for "yellow" it returns 0.50.
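Such an extension-based comparator can be read as a symmetric, reflexive lookup table. A sketch (ours; names illustrative) reproducing the four outcomes above:

# similarity pairs from the declaration of similar; the relation is
# symmetric and reflexive (degree 1.0 on identical values)
PAIRS = {("red", "orange"): 0.75, ("yellow", "orange"): 0.50,
         ("yellow", "green"): 0.50, ("blue", "green"): 0.75}

def similar(x, y):
    if x == y:
        return 1.0
    return PAIRS.get((x, y), PAIRS.get((y, x), 0.0))

for color in ("green", "blue", "yellow", "red"):
    print(color, similar(color, "green"))   # 1.0, 0.75, 0.5, 0.0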



Now, suppose the fuzzy comparator mol is defined (as later in this section) as:

declare fuzzy comparator mol ( $x as xs:int, $y as xs:int) ($x/$y) trapezium (0.5, 1, 1, 2)

Although the operand expressions have different identities and/or names as nodes in the following comparison expressions, these expressions are valid because the two constructed nodes have values compatible with xs:int after atomization.

<a>500</a> mol <b>800</b>
<b>500</b> mol <b>1000</b>
<b>1500</b> mol <b>1000</b>

The results of these expressions will be 0.25, false and 0.5, respectively.
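Reading mol as the trapezium (0.5, 1, 1, 2) applied to the ratio of its operands, a quick check of the three results in Python (a sketch, not the chapter's implementation):

def mol(x, y):
    r = x / y                     # ($x/$y), then trapezium (0.5, 1, 1, 2)
    if r <= 0.5 or r >= 2:
        return 0.0
    return (r - 0.5) / 0.5 if r < 1 else (2 - r) / 1.0

print(mol(500, 800), mol(500, 1000), mol(1500, 1000))   # 0.25 0.0 0.5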

6.3 Logical Expressions

In XQuery, a logical expression is either an and-expression or an or-expression whose data type is xs:boolean. It does not raise an error, but always gives the value true or false. We extend this kind of expression by allowing user-defined fuzzy connectives and by giving a fuzzy logic semantics to the built-in and and or.

FuzzyExpr ::= OrExpr ( QName OrExpr )*
OrExpr    ::= AndExpr ( "or" AndExpr )*
AndExpr   ::= FuzzyLiteral ( "and" FuzzyLiteral )*

The first step in evaluating a logical expression is to find the effective truth value of each of its operands. A logical expression raises an error if the evaluation of at least one operand raises an error. Nevertheless, some operands might not be executed if a short-cut evaluation strategy is implemented, and in consequence, the evaluation of some erroneous operands might not be performed. If no error occurs, the truth value of the logical expression is calculated according to the definitions of its operands. The built-in or and and connectives are interpreted as the maximum and minimum, respectively, of the effective truth values of their operands. The QName that connects two OrExpr in a FuzzyExpr must be the name of a user-defined fuzzy connective, whose semantics is specified by the user in the corresponding declaration. Logical expressions may use comparison expressions and/or fuzzy predicate expressions FuzzyPred.

FuzzyLiteral ::= ComparisonExpr | FuzzyPred
FuzzyPred    ::= ("not" | "ant" | QName)* QName "(" ExprSingle ")"

In the FuzzyPred syntax, the rightmost QName corresponds to a user-defined fuzzy predicate. Its application is similar to a function call. The other QNames correspond to user-defined fuzzy modifiers, while the keywords not and ant are



built-in fuzzy modifiers. Consecutive modifiers are applied from right to left: the result of applying a modifier is the predicate to be modified by the one adjacent to its left. Consider the following declarations:

declare fuzzy predicate young ( $age as xs:int) trapezium (-INF,0,25,65)
declare fuzzy predicate high ( $salary as xs:decimal) trapezium (2000,4000,xs:double(INF),xs:double(INF))

The truth degree of the following expression will be 0.25:

young(45) and high(2500)

Suppose the following user-defined predicates:

declare fuzzy predicate lucky ( $num as xs:int) extension ( (true,13), (.75,7), (1,49), (false,18), (0.33,33), (.8,40) )
declare fuzzy predicate preferred ( $color as xs:string) extension ( .33, "orange", .66, "yellow", .66, "blue", 1.00, "green" )

The corresponding truth degrees of the following expressions will be 0.66 and 0.8:

preferred("blue") or lucky(666)
preferred("red") or young(20) and lucky(40)

Now, consider also the following declarations:

declare fuzzy modifier really ( $dummy ) translation (+10)
declare fuzzy modifier very ( $dummy ) power (+2.0)

The truth degrees of the following expressions will be 0.0625 and 0.5625:

very really young(45)
very not high(2500)

Finally, consider the declarations of these fuzzy terms:

declare fuzzy comparator mol ( $x as xs:int, $y as xs:int) ($x/$y) trapezium (0.5, 1, 1, 2)
declare fuzzy connective imp ( $x, $y) { (not($x) or $y) }
declare fuzzy connective por ( $x, $y) { ($x + $y) - ($x * $y) }

The results of the following expressions will be 0.75 and 0.625, respectively:

500 mol 800 imp 500 mol 1000
500 mol 800 por 1500 mol 1000
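The following Python sketch checks these degrees under the stated semantics (and = minimum, or = maximum, not = complement). The reading of the modifiers is our assumption, chosen to be consistent with the stated results: translation(+10) shifts the predicate's argument by 10, and power(+2.0) squares the degree.

young = lambda a: 1.0 if a <= 25 else 0.0 if a >= 65 else (65 - a) / 40
high  = lambda s: 1.0 if s >= 4000 else 0.0 if s <= 2000 else (s - 2000) / 2000
mol   = lambda x, y: (lambda r: 0.0 if r <= 0.5 or r >= 2 else
                      (r - 0.5) / 0.5 if r < 1 else 2 - r)(x / y)

NOT  = lambda d: 1 - d
very = lambda d: d ** 2.0                          # power(+2.0)

print(min(young(45), high(2500)))                  # and -> 0.25
print(very(young(45 + 10)))                        # very really young(45) -> 0.0625
print(very(NOT(high(2500))))                       # very not high(2500) -> 0.5625

# user-defined connectives from the declarations above
imp = lambda x, y: max(NOT(x), y)                  # not($x) or $y
por = lambda x, y: (x + y) - (x * y)
print(imp(mol(500, 800), mol(500, 1000)))          # 0.75
print(por(mol(500, 800), mol(1500, 1000)))         # 0.625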



6.4 Quantified Expressions

In XQuery, quantified expressions support existential and universal quantification. We extend them in order to allow user-defined fuzzy quantifiers as follows:

QuantifiedExpr  ::= Quantifier VarBind ("," VarBind)* "satisfies" ExprSingle
Quantifier      ::= ("some" | "every" | QName)
VarBind         ::= "$" VarName TypeDeclaration? "in" ExprSingle
TypeDeclaration ::= "as" SequenceType

A quantified expression begins with a quantifier, which is either the keyword some or every, or a QName identifying a user-defined fuzzy quantifier. It is followed by one or more in-clauses that are used to bind variables, the keyword satisfies, and a test expression. Each in-clause associates a variable with an expression that returns a sequence of items, called the binding sequence for that variable. The in-clauses generate tuples of variable bindings, including one tuple for each combination of items in the binding sequences of the respective variables. Conceptually, the test expression is evaluated for each tuple of variable bindings. The result depends on the effective truth values of the test expressions. To define the semantics of a quantified expression, we must distinguish two main cases. The first case arises when the generated tuples of variable bindings are crisp, i.e., they do not carry an xml:truth degree. The second case arises when the generated tuples of variable bindings are fuzzy, i.e., they do carry an xml:truth degree. The value of the quantified expression is defined by the following rules.

First main case (crisp tuples)

Let $n$ be the number of generated tuples of variable bindings, and let $(\rho_0,\ldots,\rho_n)$ be defined as follows. For a user-defined fuzzy quantifier, it is the sequence of membership degrees, in the fuzzy set defining the quantifier, of the quantities $0,\ldots,n$ if the quantifier is absolute, or of the quantities $0, 1/n,\ldots,n/n$ if the quantifier is proportional. For the built-in some quantifier, $\rho_0=0$ and $\rho_1=\rho_2=\cdots=\rho_n=1$. For the built-in every quantifier, $\rho_0=\cdots=\rho_{n-2}=\rho_{n-1}=0$ and $\rho_n=1$. Let $\mu_0$ be the truth value 1.0, $\mu_{n+1}$ the truth value 0.0, and $(\mu_1,\ldots,\mu_n)$ the sequence of effective truth values obtained for the test expression, arranged in decreasing order ($\mu_1 \geq \mu_2 \geq \cdots \geq \mu_n$). Then the truth value of the quantified expression is:

- $\max_{i\in\{0,\ldots,n\}} \min(\rho_i, \mu_i)$ for an increasing quantifier;

- $\max_{i\in\{0,\ldots,n\}} \min(\rho_i, 1-\mu_{i+1})$ for a decreasing quantifier;

- $\min\big(\max_{i\in\{0,\ldots,n\}} \min(\rho_i, \mu_i),\ \max_{i\in\{0,\ldots,n\}} \min(\rho_i, 1-\mu_{i+1})\big)$ for a unimodal quantifier.



Second main case (fuzzy tuples)

Let $n$ be the number of generated tuples of variable bindings. For $i\in\{1,\ldots,n\}$, let $(\rho_{0,i},\ldots,\rho_{i,i})$ be defined as follows. For a user-defined fuzzy quantifier, it is the sequence of membership degrees, in the fuzzy set defining the quantifier, of the quantities $0,\ldots,i$ if the quantifier is absolute, or of the quantities $0, 1/i,\ldots,i/i$ if the quantifier is proportional. For the built-in some quantifier, $\rho_{0,i}=0$ and $\rho_{1,i}=\rho_{2,i}=\cdots=\rho_{i,i}=1$. For the built-in every quantifier, $\rho_{0,i}=\cdots=\rho_{i-2,i}=\rho_{i-1,i}=0$ and $\rho_{i,i}=1$. Let $\rho_{0,0}$ be the truth value 1.0 when the quantifier is decreasing; otherwise let $\rho_{0,0}$ be the truth value 0.0. Let $\tau_0$ be the truth value 1.0, $\tau_{n+1}$ the truth value 0.0, and $(\tau_1,\ldots,\tau_n)$ the sequence, in decreasing order, of the truth degrees of the generated tuples of variable bindings. Let $\upsilon_0$ be the truth value 1.0, $\upsilon_{n+1}$ the truth value 0.0, and $(\upsilon_1,\ldots,\upsilon_n)$ the sequence, in decreasing order, of the degrees obtained as the minimum between the truth degree of each generated tuple and the corresponding effective truth value of the test expression. Then the truth value of the quantified expression is:

- $\max_{i\in\{0,\ldots,n\}} \min\big(\tau_i,\ 1-\tau_{i+1},\ \max_{j\in\{0,\ldots,i\}} \min(\rho_{j,i}, \upsilon_j)\big)$ for an increasing quantifier;

- $\max_{i\in\{0,\ldots,n\}} \min\big(\tau_i,\ 1-\tau_{i+1},\ \max_{j\in\{0,\ldots,i\}} \min(\rho_{j,i}, 1-\upsilon_{j+1})\big)$ for a decreasing quantifier;

- $\max_{i\in\{0,\ldots,n\}} \min\big(\tau_i,\ 1-\tau_{i+1},\ \max_{j\in\{0,\ldots,i\}} \min(\rho_{j,i}, \upsilon_j),\ \max_{j\in\{0,\ldots,i\}} \min(\rho_{j,i}, 1-\upsilon_{j+1})\big)$ for a unimodal quantifier.

With the defined semantics, the effective truth value of the following quantified expression would be true, as we expect from the traditional XQuery semantics:

some $x in (1, 2, 3), $y in (4, 3, 2) satisfies $x + $y = 5

The following expression also shows that the semantics of the traditional quantifiers is preserved. In this case, the expression gives the truth value false:

every $x in (1, 2, 3), $y in (4, 3, 2) satisfies $x + $y = 5

Consider the user-defined fuzzy predicate high as follows:

declare fuzzy predicate high ( $salary as xs:decimal) trapezium (2000,4000,xs:double(INF),xs:double(INF))

The following quantified expression evaluates to 0.50:

some $salary in 2500 to 3000 satisfies high($salary)

The effective truth value of this other expression will be 0.25:



every $salary in 2500 to 3000 satisfies high($salary)

Suppose atLeast30 is an increasing absolute fuzzy quantifier defined by:

declare fuzzy quantifier atLeast30 absolute trapezium (25,30,INF,INF)

And the young predicate defined as:

declare fuzzy predicate young ( $age as xs:int) trapezium (-INF,0,25,65)

The following quantified expression has satisfaction degree 0.90:

atLeast30 $age in 0 to 120 satisfies young($age)
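A sketch of the first-case (crisp tuples) evaluation for this example, in Python (ours): it sorts the test degrees in decreasing order, prepends the degree μ0 = 1.0, and applies the increasing-quantifier formula.

young = lambda a: 1.0 if a <= 25 else 0.0 if a >= 65 else (65 - a) / 40

def at_least_30(k):               # absolute trapezium (25, 30, INF, INF)
    if k >= 30: return 1.0
    if k <= 25: return 0.0
    return (k - 25) / 5.0

mus = [1.0] + sorted((young(a) for a in range(0, 121)), reverse=True)  # mu_0..mu_n
n = len(mus) - 1
print(max(min(at_least_30(i), mus[i]) for i in range(n + 1)))          # 0.9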

If we define around20, an absolute fuzzy quantifier with unimodal behavior, as:

declare fuzzy quantifier around20 absolute trapezium (10,17,25,50)

And mol as a fuzzy comparator (more or less) defined as:

declare fuzzy comparator mol ( $x as xs:int, $y as xs:int) ($x/$y) trapezium (0.5, 1, 1, 2)

Then, the truth degree of the following expression will be 0.60:

around20 $x in 1 to 7, $y in 1 to 7 satisfies $x mol $y

Suppose the decreasing proportional fuzzy quantifier fewOf:

declare fuzzy quantifier fewOf proportional trapezium (-INF,0,.25,.50)

And the fuzzy predicate preferred:

declare fuzzy predicate preferred ( $color as xs:string) extension ( .33, "orange", .66, "yellow", .66, "blue", 1.00, "green" )

The satisfaction degree of the following quantified expression will be 0.34:

fewOf $c in ("red", "orange", "yellow", "green", "blue") satisfies preferred($c)

Finally, assume the declaration of a proportional increasing quantifier:

declare fuzzy quantifier mostOf proportional trapezium (.50,.75,1,+INF)



In the following quantified expression, the generated binding tuples carry xml:truth degrees because of the filter expression in the in-clause. Assuming the predicates young and high as above, this query expression yields the truth degree 0.50.

mostOf $emp in (
  <e xml:id="e1"> <age>25</age> <salary>2500</salary> </e>
  <e xml:id="e2"> <age>35</age> <salary>3000</salary> </e>
  <e xml:id="e3"> <age>45</age> <salary>3500</salary> </e>
  <e xml:id="e4"> <age>55</age> <salary>4000</salary> </e>
  <e xml:id="e5"> <age>65</age> <salary>2500</salary> </e>
  <e xml:id="e6"> <age>25</age> <salary>3000</salary> </e>
  <e xml:id="e7"> <age>35</age> <salary>3500</salary> </e>
  <e xml:id="e8"> <age>45</age> <salary>4000</salary> </e>
) [high(salary)]
satisfies not young($emp/age)
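The second-case (fuzzy tuples) evaluation of this example can be traced with a short Python sketch (ours, only illustrating the formula). Here tau holds the sorted xml:truth degrees coming from the [high(salary)] filter, ups holds the sorted minima of each tuple degree and its test degree, and most_of is the proportional increasing quantifier.

young = lambda a: 1.0 if a <= 25 else 0.0 if a >= 65 else (65 - a) / 40
high  = lambda s: 1.0 if s >= 4000 else 0.0 if s <= 2000 else (s - 2000) / 2000
most_of = lambda p: 1.0 if p >= 0.75 else 0.0 if p <= 0.50 else (p - 0.50) / 0.25

emps = [(25, 2500), (35, 3000), (45, 3500), (55, 4000),
        (65, 2500), (25, 3000), (35, 3500), (45, 4000)]
taus = [high(s) for a, s in emps]                     # tuple degrees (xml:truth)
upsilons = [min(t, 1 - young(a)) for t, (a, s) in zip(taus, emps)]

n = len(emps)
tau = [1.0] + sorted(taus, reverse=True) + [0.0]      # tau_0 .. tau_{n+1}
ups = [1.0] + sorted(upsilons, reverse=True) + [0.0]  # upsilon_0 .. upsilon_{n+1}

truth = max(min(tau[i], 1 - tau[i + 1],
                max(min(most_of(j / i if i else 0.0), ups[j])
                    for j in range(i + 1)))
            for i in range(n + 1))
print(truth)                                          # 0.5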

6.5 Conditional Expressions

XQuery supports conditional expressions based on the keywords if, then, and else:

IfExpr    ::= "if" "(" Expr TholdExpr? ")" "then" ExprSingle "else" ExprSingle
TholdExpr ::= "threshold" DecimalLiteral

In our extension, the expression following the if keyword, called the test expression, might give a satisfaction degree different from the usual true and false. The first step in processing a conditional expression is to find the effective truth value of the test expression. The value of a conditional expression is defined as follows: if the effective truth value of the test expression is false, the value of the else-expression (the expression in the else clause) is returned; otherwise, the value of the then-expression (the expression in the then clause) is returned. For example, the following conditional expression returns the value of the then-expression when the effective truth value of the test expression is over the threshold 0.5.

if (young($emp/age) threshold 0.50) then $emp/age + " is young." else ""

Another example of a conditional expression may be: if $color is "orange", "yellow", "blue" or "green", return "It might please me!"; otherwise, return "I don't like it at all!".

if (preferred($color)) then "It might please me!" else "I don't like it at all!"



6.6 FLWOR Expressions

In order to build complex queries involving multiple document sources, XQuery provides a query structure named the FLWOR expression; FLWOR corresponds to the initials of the keywords identifying the clauses of this kind of expression: For, Let, Where, Order by and Return. Its fuzzy extension involves no change in the FLWOR syntactic rules, and therefore we do not present its syntax schema here. Our focus is how the fuzziness of other expressions and data structures may affect the result of a FLWOR expression. In the following, we discuss each of the FLWOR clauses.

XQuery allows iterations over sequence-contained data by means of the for clause in the FLWOR expression. Each variable in a for clause is associated with a sequence obtained from another expression. The iteration is done over all possible combinations of values for the variables according to the corresponding sequences. When the expression related to a variable gives a sequence of nodes, it may provide truth degrees for each node by means of the xml:truth attribute. In this case, we compute a global truth degree for each combination of variable values. This degree logically corresponds to the conjunction of the conditions that originally gave birth to those degrees; therefore, it is computed as the minimum.

The let clause in the FLWOR expression assigns the result of another expression to be held in a variable. If the expression gives a node with an xml:truth attribute, the value of this attribute is also combined with the other truth values with the minimum operator. When each node of a node sequence has an xml:truth attribute, the attributes are not automatically combined, because they must first be aggregated. This aggregation can be done in the where clause using a quantified expression.

The where clause establishes a filtering criterion. In our extension, it can be any expression giving a truth degree, i.e., an expression with an effective truth value such as those presented in the previous sections. The effective truth value of the where clause condition is combined with the truth degree obtained from the for and let clauses. The combination is done, as ever, with the minimum operator. In case the for and let clauses are crisp, their truth degrees are 1, the neutral element of the minimum.

The fuzzy extension presented here does not directly affect the order by clause. However, it is possible for a FLWOR expression to iterate, in its for clause, over a sequence of nodes provided with xml:truth attributes and to order them by that attribute.

The return clause specifies the result produced by each iteration; the final result of the FLWOR expression is a sequence containing all of them. In case of fuzziness, in each iteration a truth degree is calculated according to the for, let and where clauses. When the return clause builds a new node, the computed truth degree is added as an xml:truth attribute of the new node.

Let us illustrate an extended FLWOR expression. Suppose a variable $flights holds a document comprising flights between cities as in the previous examples. Another variable $opinions contains customer scores of airports in these cities. Someone searches for a cheap trip from Caracas to Beijing with just one connection in an intermediate city, using some of his/her preferred airlines. This person also wants to consider intermediate cities where



most of the middle-aged customers' opinions about the airport give high scores. Finally, the result must be given in decreasing order based on the user's criteria. This query would be expressed as follows:

for $c in (
  for $f1 in $flights//flight [origin=Caracas][good(airline)],
      $f2 in $flights//flight [destination=Beijing][good(airline)]
  let $v := $opinions//opinion [airport=$f2/origin][age=middle]
  where $f1/destination = $f2/origin
    and mostOf($x in $v) satisfies score=high
    and cheap($f1/price+$f2/price)
  return <connection> $f1 $f2 </connection>
)
order by $c.xml:truth descending
return $c

Using the variable $c, the outer for clause iterates over a sequence of nodes labeled <connection> obtained from the inner FLWOR expression. Each of these nodes is provided with an xml:truth attribute that is computed while processing the inner FLWOR expression. Iteration in the outer FLWOR expression is done in decreasing order of the $c.xml:truth attribute, thus giving the desired output.

The inner for clause iterates over all possible pairs of values for the variables $f1 and $f2, obtained from sequences of nodes labeled <flight> with origin Caracas and with destination Beijing, respectively. Each of these nodes has an xml:truth attribute resulting from the filtering expression with the predicate [good(airline)]. The combination of a pair of nodes for $f1 and $f2 produces the truth degree minimum($f1.xml:truth, $f2.xml:truth).

For each pair of values $f1, $f2, the let clause instantiates the variable $v to a sequence of nodes labeled <opinion>, each with an xml:truth attribute resulting from the filtering expression with the predicate [age=middle] over opinions for the airport of the origin city of flight $f2. These truth degrees are not immediately combined; they remain in the corresponding nodes.

According to the quantified expression semantics, the xml:truth attribute values of the nodes in $v are used for computing the effective truth value of the expression mostOf($x in $v) satisfies score=high. The effective truth value of the expression cheap($f1/price+$f2/price) is also computed. The condition in the where clause is a conjunction; therefore its effective truth value is obtained as the minimum of the three combined test expressions. Thus the where clause rejects those pairs $f1, $f2 that do not coincide at the intermediate city, because their effective truth value would be false.

The inner return clause builds a new node labeled <connection> in each iteration. The xml:truth attribute of each node is computed as the



minimum between the effective truth value of the condition in the where clause and the truth degree produced from the combination of the pair of nodes bound to $f1 and $f2.
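The degree bookkeeping described above reduces to taking minima. A tiny Python sketch (ours) of how one iteration's degree is obtained from the binding degrees and the where condition; crisp clauses contribute the neutral degree 1.0.

def iteration_degree(binding_degrees, where_degree):
    # conjunction of all contributing degrees -> minimum
    return min(min(binding_degrees, default=1.0), where_degree)

print(iteration_degree([0.8, 0.7], 0.9))   # 0.7
print(iteration_degree([], 0.9))           # crisp for/let clauses -> 0.9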

7 Query Processing

Beyond the definition of the Fuzzy XQuery language, an important issue concerns fuzzy XQuery query evaluation. We propose a mechanism based on the Derivation Principle [4] to evaluate fuzzy XQuery queries. First, a regular XQuery query is derived from a fuzzy one in terms of an α-cut; second, the derived XQuery query retrieves the data whose degrees are greater than or equal to α; finally, the data are sorted by degree value. Thus, a fuzzy query may be evaluated avoiding an exhaustive scan of the whole input XML data. We illustrate the evaluation of fuzzy XQuery queries using the Derivation Principle through the following example. Consider the document "flights.xml" whose content is:

<flights>
<flight> <origin>Caracas</origin> <destination>New York</destination> <airline>AL01</airline> <number>357</number> <depart>07:00</depart> <arrival>12:00</arrival> <price>1200</price> </flight>
<flight> <origin>Caracas</origin> <destination>Paris</destination> <airline>AL02</airline> <number>468</number> <depart>16:00</depart> <arrival>06:00</arrival> <price>1700</price> </flight>
<flight> <origin>Caracas</origin> <destination>Los Angeles</destination> <airline>AL03</airline> <number>751</number> <depart>08:00</depart> <arrival>13:00</arrival> <price>1200</price> </flight>
<flight> <origin>Caracas</origin> <destination>London</destination> <airline>AL05</airline> <number>545</number> <depart>17:00</depart> <arrival>06:00</arrival> <price>1300</price> </flight>
<flight> <origin>Caracas</origin> <destination>Frankfurt</destination> <airline>AL06</airline> <number>632</number> <depart>17:00</depart> <arrival>08:00</arrival> <price>1300</price> </flight>
<flight> <origin>New York</origin> <destination>Beijing</destination> <airline>AL04</airline> <number>958</number>



<depart>19:00</depart> <arrival>19:00</arrival> <price>1300</price> </flight>
<flight> <origin>Paris</origin> <destination>Beijing</destination> <airline>AL02</airline> <number>888</number> <depart>7:00</depart> <arrival>21:00</arrival> <price>1400</price> </flight>
<flight> <origin>Los Angeles</origin> <destination>Beijing</destination> <airline>AL03</airline> <number>975</number> <depart>16:00</depart> <arrival>12:00</arrival> <price>1400</price> </flight>
<flight> <origin>London</origin> <destination>Beijing</destination> <airline>AL05</airline> <number>577</number> <depart>20:00</depart> <arrival>11:00</arrival> <price>1400</price> </flight>
<flight> <origin>Frankfurt</origin> <destination>Beijing</destination> <airline>AL06</airline> <number>601</number> <depart>20:00</depart> <arrival>10:00</arrival> <price>1400</price> </flight>
</flights>

Suppose that the user wants to retrieve information about flights from Caracas that are served by good airlines. For this purpose, the user defines a fuzzy predicate good (the last entry presumably refers to AL06, correcting a duplicated AL05 in the original listing):

declare fuzzy predicate good( $airline as xs:string) extension( .5,'AL01', 1.0,'AL02', .8,'AL03', .4,'AL04', .7,'AL05', .3,'AL06')

The user also wishes to restrict the results to those flights whose truth degree for the query is over the threshold 0.6. This requirement may be specified in fuzzy XQuery as the expression:

doc("flights.xml")/flights/flight [origin='Caracas' and good(airline)][threshold 0.6]

Applying the concept of α-cut, we can derive the classic XQuery expression:

doc("flights.xml")/flights/flight [origin='Caracas' and airline = ('AL02', 'AL03', 'AL05')]
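For an extension-based predicate, the α-cut used in this derivation is just the set of values whose declared degree reaches the threshold. A Python sketch (ours; names illustrative) reproducing the derived value list:

GOOD = {'AL01': 0.5, 'AL02': 1.0, 'AL03': 0.8,
        'AL04': 0.4, 'AL05': 0.7, 'AL06': 0.3}

def alpha_cut(extension, alpha):
    """Crisp set of values whose membership degree is at least alpha."""
    return sorted(v for v, d in extension.items() if d >= alpha)

print(alpha_cut(GOOD, 0.6))   # ['AL02', 'AL03', 'AL05']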



This classic filter expression would produce the result:

<flight> <origin>Caracas</origin> <destination>Paris</destination> <airline>AL02</airline> <number>468</number> <depart>16:00</depart> <arrival>06:00</arrival> <price>1700</price> </flight>
<flight> <origin>Caracas</origin> <destination>Los Angeles</destination> <airline>AL03</airline> <number>751</number> <depart>08:00</depart> <arrival>13:00</arrival> <price>1200</price> </flight>
<flight> <origin>Caracas</origin> <destination>London</destination> <airline>AL05</airline> <number>545</number> <depart>17:00</depart> <arrival>06:00</arrival> <price>1300</price> </flight>

The Derivation Principle-based evaluation mechanism evaluates the fuzzy conditions only for the elements in this selected set. In this way, the superfluous computation of truth values for seven nodes is avoided. The final result of the query expression would be as follows; notice that the flight nodes are annotated with the xml:truth attribute.

<flight xml:truth = true > <origin>Caracas</origin> <destination>Paris</destination> <airline>AL02</airline> <number>468</number> <depart>16:00</depart> <arrival>06:00</arrival> <price>1700</price> </flight>
<flight xml:truth = 0.8 > <origin>Caracas</origin> <destination>Los Angeles</destination> <airline>AL03</airline> <number>751</number> <depart>08:00</depart> <arrival>13:00</arrival> <price>1200</price> </flight>
<flight xml:truth = 0.7 > <origin>Caracas</origin> <destination>London</destination> <airline>AL05</airline> <number>545</number> <depart>17:00</depart> <arrival>06:00</arrival> <price>1300</price> </flight>



Since our evaluation mechanism is based on the Derivation Principle, membership degrees are calculated only for these three answers. If the Derivation Principle is not applied, a naïve evaluation strategy must scan the XML document completely, calculate the membership degree for each element and, finally, discard irrelevant answers. The previous example intuitively shows the efficiency of the Derivation Principle-based strategy. This strategy has been successfully used for fuzzy SQL queries [6][13][16].

8 Conclusion and Future Work

We have presented here a fuzzy set based extension to XQuery. This extension allows users to specify preferences in XML queries and to retrieve answers discriminated by the user's preferences. The extension comprises the new xs:truth built-in data type, intended to represent gradual truth degrees. This datatype is defined as derived from xs:decimal, restricted to the interval [0,1]; at the same time, xs:boolean was redefined as derived from xs:truth. With this new type, the concept of effective Boolean value has been replaced by the effective truth value. The standard xml:truth attribute of type xs:truth was introduced in order to handle satisfaction degrees in nodes resulting from fuzzy XQuery expressions, and possibly stored in XML documents. The language is extended to declare fuzzy terms: predicates, modifiers, comparators, connectives and quantifiers. These terms are treated as user-defined operators that are placed in corresponding work spaces. We have extended FLWOR expressions, as well as all other XQuery expressions, to work with fuzzy terms and produce gradual answers. Also, an evaluation mechanism based on the Derivation Principle is presented in order to avoid superfluous computation of truth degrees. It would be interesting to incorporate into XQuery other user preference handling operators, such as skyline and top-k.

Acknowledgments. We give thanks to Venezuela's FONACIT project G-200500278 and France's IRISA/ENSSAT project Pilgrim for supporting this research work. We express a great acknowledgement to Jesus Christ, source of force and inspiration: "I will lift up mine eyes unto the hills, from whence cometh my help. My help cometh from the LORD, which made heaven and earth. He will not suffer thy foot to be moved: he that keepeth thee will not slumber. Behold, he that keepeth Israel shall neither slumber nor sleep. The LORD is thy keeper: the LORD is thy shade upon thy right hand. The sun shall not smite thee by day, nor the moon by night. The LORD shall preserve thee from all evil: he shall preserve thy soul. The LORD shall preserve thy going out and thy coming in from this time forth, and even for evermore." (Psalm 121)

References

1. Barranco, C.D., Campaña, J.R., Medina, J.M.: Towards a XML Fuzzy Structured Query Language. In: Proceedings of the Joint 4th Conference of the European Society for Fuzzy Logic and Technology and the 11th Rencontres Francophones sur la Logique Floue et ses Applications, pp. 1188–1193 (2005)



2. Bordogna, G., Psaila, G.: Customizable Flexible Querying in Classic Relational Databases. In: Galindo, J. (ed.) Handbook of Research on Fuzzy Information Processing in Databases, Hershey, PA, USA. Information Science, vol. VIII, pp. 189–215 (2008)
3. Bosc, P., Pivert, O.: SQLf: A Relational Database Language for Fuzzy Querying. IEEE Transactions on Fuzzy Systems 3(1), 1–17 (1995)
4. Bosc, P., Pivert, O.: SQLf Query Functionality on Top of a Regular Relational Database Management System. In: Pons, O., Vila, M., Kacprzyk, J. (eds.) Knowledge Management in Fuzzy Databases, pp. 171–190. Physica-Verlag (2000)
5. Braga, D., Campi, A., Damiani, E., Pasi, G., Lanzi, P.L.: FXPath: Flexible Querying of XML Documents. In: Proceedings of EuroFuse (2002)
6. Curiel, M., González, C., Tineo, L., Urrutia, A.: On the Performance of Fuzzy Data Querying. In: Greco, S., Lukasiewicz, T. (eds.) SUM 2008. LNCS (LNAI), vol. 5291, pp. 134–145. Springer, Heidelberg (2008)
7. Damiani, E., Marrara, S., Pasi, G.: A Flexible Extension of XPath to Improve XML Querying. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 849–850 (2008)
8. Eisenberg, K., et al.: SQL:2003 Has Been Published. ACM SIGMOD 33(1), 119–126 (2004)
9. Fazzinga, B., Flesca, S., Pugliese, A.: Top-k Answers to Fuzzy XPath Queries. In: Bhowmick, S.S., Küng, J., Wagner, R. (eds.) Database and Expert Systems Applications. LNCS, vol. 5690, pp. 822–829. Springer, Heidelberg (2009)
10. Galindo, J.: New Characteristics in FSQL, a Fuzzy SQL for Fuzzy Databases. WSEAS Transactions on Information Science and Applications 2(2), 161–169 (2005)
11. Galindo, J., Urrutia, A., Piattini, M.: Fuzzy Database Modeling, Design and Implementation. Idea Group Publishing (2006)
12. Goncalves, M., Tineo, L.: A New Step Towards Flexible XQuery. Revista Avances en Sistemas e Informática 4(3), 27–34 (2007)
13. López, Y., Tineo, L.: About the Performance of SQLf Evaluation Mechanisms. CLEI Electronic Journal 9(2), Paper 8 (2006)
14. Ma, Z.M., Yan, L.: Generalization of Strategies for Fuzzy Query Translation in Classical Relational Databases. Information and Software Technology 49(2), 172–180 (2007)
15. Thomson, E., Fredrick, J., Radhamani, G.: Fuzzy Logic Based XQuery Operations for Native XML Database Systems. International Journal of Database Theory and Application 2(3), 13–20 (2009)
16. Tineo, L.: SQLf Horizontal Fuzzy Quantified Query Processing. In: Proceedings of the XXXI Conferencia Latinoamericana de Informática (2005)
17. W3C: XQuery 1.0 and XPath 2.0 Full-Text. W3C Working Draft 3 (2005), http://www.w3.org/TR/xquery-full-text
18. W3C: XML Path Language, XPath (2007), http://www.w3.org/TR/xpath20
19. W3C: XQuery 1.0: An XML Query Language (2007), http://www.w3.org/TR/xquery/
20. W3C: Extensible Markup Language (XML) 1.0, 5th edn. (2008), http://www.w3.org/TR/REC-xml/
21. Zadeh, L.A.: Fuzzy Sets. Information and Control 8(3), 338–353 (1965)
22. Zadeh, L.A.: A Computational Approach to Fuzzy Quantifiers in Natural Languages. Computers and Mathematics with Applications 9, 149–183 (1983)



Attractive Interface for XML: Convincing Naive Users to Go Online

Keivan Kianmehr, Jamal Jida, Allan Chan, Nancy Situ, Kim Wong, Reda Alhajj, Jon Rokne, and Ken Barker

Keivan Kianmehr, Allan Chan, Nancy Situ, Kim Wong, Jon Rokne, Ken Barker: Computer Science Department, University of Calgary, Calgary, Alberta, Canada
Jamal Jida: Department of Informatics, Faculty of Sciences III, Lebanese University, Tripoli, Lebanon
Reda Alhajj: Computer Science Department, University of Calgary, Calgary, Alberta, Canada; Department of Computer Science, Global University, Beirut, Lebanon; e-mail: alhajj@ucalgary.ca

Abstract. Traditionally, searching in general, or querying in particular, required the exact matching of values to return results. As technology improves in the information sector, the complexity of these systems also increases. This is fairly common, especially in the area of databases, as new models, like XML, are emerging. Searching for information is becoming more challenging for most users, as the user population is increasing rapidly to include more less-skilled (naive) users. This is especially true when web-based search is considered. Most users are not familiar with structured languages like SQL and XQuery. Using relative linguistic terms for querying seems to be the most




reasonable and logical approach to making any composite resource a more searchable database of information, while implementing fuzziness in XML accounts for the lack of structure that results from pooling databases together. This seems to be the natural evolution of the technology as it moves away from complex and confusing interfaces to more user-friendly, user-centric and intuitive ones. To address these concerns, this chapter describes the design and implementation of a fuzzy nested querying system for XML databases. The research involved is outlined and examined to decide on the most fitting solution that incorporates fuzziness into a user interface intended to be attractive to naive users. After researching the task, we applied our findings via the implementation of a prototype that covers the intended scope of a demonstration of fuzzy nested querying. This prototype has been integrated into VIREX (a user-friendly system that allows users to view and use relational data as XML); the developed prototype includes an easy-to-use graphical interface that allows the user to apply fuzziness in order to search XML documents more easily. The goal of this is to provide insight into creating more intuitive ways of searching and using XML databases, thus increasing the size of the population using and addressing XML data. We intend to expand into relational and object-oriented databases.

Keywords: Fuzzy logic, XML schema, XML documents, linguistic terms, query interface, searching, nested database elements.

1 Introduction

As XML (eXtensible Markup Language) is becoming the standard for Internet databases and large scale data transfer [29], it becomes ever more useful to everyone, from seasoned computer users to typical users who see computers as black boxes. Unfortunately, large pools of resources do not have centralized controllers. The lack of structure is also followed by the disadvantage of being extremely difficult to query with imprecise quantifying search terms. Natural language terms are vital to bridging the gap between casual users and professionals. Looking at a typical commercial office serves as a scenario for a concrete example. Assume that the office employs a wide variety of personnel whose time sheets are stored in a relational database; an IT support staff member who is asked to find the employees who have taken a large number of sick days is able to do so quickly by utilizing a simple SQL query, whereas a sales manager is not expected to have expertise in coding queries and would generally attempt to go through the listings by hand. The end result is that the IT employee saves a lot of time by running a simple query, whereas the manager wastes valuable hours that could have been directed towards more important matters. Clearly, the IT employee is capable of using the database more efficiently and effectively than a typical user with little



technical experience, such as the manager. This result occurs for a variety of reasons, but can be summarized by the fact that the IT employee has a greater depth of technical expertise and understanding of databases and their querying languages than the manager. This raises the question of how to address this disadvantage in order to give the manager equal footing for a task of this nature.

Another disadvantage present is the lack of uniform structure when several resources are combined to provide users with a large data set or pool of resources. A simple example of this problem is a large online bookstore that offers books published by several international publishing houses. In order to keep their commercial website up-to-date, they must display what books are available for ordering, which publishing house is offering the books, the prices of each book, and their print status. Each publishing house has its own internal databases that store various data regarding its own products, yet the bookstore must be able to keep up with all of them, despite their differently structured databases that may not have the same fields, or even the same number of fields, for each record. For example, one publishing house may list all previous copyrighted versions of a textbook, whereas another will only list its most recent copyright date. We address these problems by focusing on the two main concerns: the lack of uniformly structured information and, primarily, the inability to translate from human natural language to a database query. These problems can be handled by first introducing XML as the umbrella for integrating the different structures, and then providing a fuzzy querying facility that encourages naive users to search structured databases.

1.1 Overview of XML

XML is a general purpose format that is structured similarly to HTML. XML is a language that simply describes data and, aside from this, adds no additional functionality. It is structured from elements as a tree, with an unlimited number of children for any one element node, thus allowing data elements to be nested within one another. Elements may also have additional attributes set to describe the data further. Figure 1 depicts an illustrative example of a simple XML file containing a list of undergraduate students. Note that there is only one root element, <undergraduates>. The elements <fname>, <lname>, <faculty>, <major>, and <gpa> are children of <student>, and no elements are strictly required to be present, again with the exception of the root element. year is an attribute that describes what year of the program the student is in. Note that an "attribute" in XML is not defined to mean the same thing as in traditional databases. Whereas an attribute in a relational database is a column of a table, an attribute in an XML file is what is described in Figure 1.
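Since the figure itself is not reproduced here, a plausible file consistent with this description (the student names and values are illustrative, not taken from the original figure) would look like:

<undergraduates>
  <student year="3">
    <fname>Alice</fname>
    <lname>Smith</lname>
    <faculty>Science</faculty>
    <major>Computer Science</major>
    <gpa>3.4</gpa>
  </student>
</undergraduates>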



Fig. 1 Example XML file

Fig. 2 Example XML Structure

Since the element tags are defined purely by the generator of the data file, the result is usually a file that is human-readable and self-described. Most people, even those unfamiliar with XML, would be able to deduce the meaning of the data in Figure 1 just from context. Nested XML is similar to regular XML, differing in that it contains additional sub-trees on the children. For example, by looking at Figure 2, we can see that <course> has the sub-tree of <cname> and <gpa>. This adds additional ease of readability and usability to XML. XML is also platform-independent, since the format itself is an open standard. Applications are free to use any parser they wish, as long as they conform to the W3C guidelines. Thus there is no monopoly on an XML API to form borders around development involving XML. XQuery is the standardized accompanying querying language for XML. It is semantically similar to SQL, but instead of Select-From-Where statements, XQuery is structured in For-Let-Where-Order-Return statements, which are more suitable for searching through tree-structured elements [29].

1.2 Overview of Fuzzy Theory

Fuzzy set theory was pioneered in the mid-1960s by Zadeh [9]. His theory outlined a method for defining boundaries on "humanistic" math problems [25]. The



current standard would be bivalent set theory, which in comparison is very strict. Many databases currently use bivalent theory, or in other words, a crisp set when returning results. However, as technology improves, the inherent flaws of crisp databases are becoming more apparent, especially where non-technical users are concerned. Take, for example, a simple problem such as trying to find the amount of fish in stock at a store. We want to find the kinds of fish that we have "lots" of. Using the crisp set method, we simply find the range and return all the fish that fit into the "lots" range. This approach is very simple but also flawed. Using fuzzy sets for the same problem would return the same fish; however, the solution would also take into consideration that some kinds of fish falling into the "lots" range are closer to the ideal "lots" amount than others in the same set. In such a search, it would be unreasonable to discard all other results that do not fit into a set solely because they differ by a relatively small amount. Fuzzy set theory is important in this application because it allows results to be imprecise, much like the human language. A major advantage of utilizing fuzzy sets, when contrasted with crisp sets, stems from the fact that human language is vague. Specifically, the meaning of the same word may not be identical to each person. Also, quantifying linguistic terms do not have set boundaries, but instead have vague limits. Individual fuzzy membership definitions take into account that what one user may consider "lots" may not be what other users deem "lots." By ranking results, as well as returning results that may not be exact but are still applicable, fuzziness adapts to the human language much better.

To accompany this informal discussion and example of fuzzy sets and bivalent theory, we can look at the details of how they operate and where they differ. In Figure 3, we have a chart displaying crisp set results. The horizontal axis shows some relative amount of fish, while the vertical axis shows the measure by which a result belongs in the set, or in other words its accuracy. Using this, we assume that the fish the store sells will fall into a certain set, based on the amount of that fish in stock. For instance, in our previous example we would return all fish that fall into the "lots" category. While not incorrect, this result set may not be as accurate as possible, because the function only returns true or false as to whether or not some type of fish is in a set (1 or 0 on the vertical axis). This can lead to a few problems of non-specificity. For example, fish that are nearing the "some" category are returned with no indication that they could potentially fall out of the "lots" category relatively soon (that is, if some fish are sold after the query takes place). Also, like the previous example, an element either belongs to a set or it does not. We cannot return elements that might be worth including but that are outside of the discrete boundaries of a set.

Figure 4 is the same chart of the fish stocking problem, but using fuzzy sets instead. By using triangles instead of rectangles to represent the function, we can now measure a proper degree of membership. This means that if a



Fig. 3 Crisp set results

Fig. 4 Fuzzy set representation

type of fish is at the ideal "lots" amount, it will return a full 1.0, and as we get further away from the ideal amount, the degree of membership decreases, signifying a less-than-ideal result. This also allows the triangle to cover a larger range of results with no loss of accuracy, as the user can easily discard results that return a small degree of membership. In addition to triangular, there are several other ways to represent the gradual membership change in fuzzy sets; trapezoidal is another commonly used representation. With a good conceptual model of fuzzy set operation explained, we can now explore some of the more technical aspects, such as the mathematical processes involved. More specifically, we will examine how adding a membership function to normal set theory allows fuzzy theory to operate.

µA(x) : D → [0, 1]    (1)



The main difference between a range search over a set of values and a search over a fuzzy membership set is that, within a fuzzy membership, each value has a degree expressing how well it fits the membership. Mathematically speaking, every value within a defined fuzzy membership will fall somewhere in the range [0, 1], reflecting how well the term matches the membership. Using this fact, we can use Equation 1 to determine the membership coefficient a value has within a membership set. In Equation 1, the value between 0 and 1 is the membership coefficient, and µA is the user-defined membership function over the values x of the domain D.

All of the material discussed above has been integrated into VIREX, a system under development by our research group to allow for the representation and querying of relational data as XML. VIREX has a user-friendly interface that allows the user to specify queries with minimum keyboard input. Thus it equally addresses naive users' concerns; there is no need to learn any query language in particular: the user runs VIREX and gets on the screen a diagram summarizing all the database content and links. This way, the user is able to code queries without any need to know the database details. Whatever a professional is expected to know before he/she codes queries is displayed by VIREX, to put all users at the same level. Queries are then coded as a sequence of mouse clicks that specify the items to show in the result. Once a query is coded, VIREX displays the result as an XML schema and documents. Our extension as described in this chapter will empower VIREX with more capabilities to the benefit of naive users: instead of specifying conditions in a traditional way using a drop-down table, they will be allowed to use fuzzy terms in their queries, and it is the responsibility of the VIREX engine to translate the fuzzy terms into XQuery format to be executed at the backend; the result is returned to the user in fuzzy terms. The user is given the opportunity to display the XQuery produced by the VIREX engine; it is also possible to display the corresponding SQL statement. We demonstrate the effectiveness of the proposed approach by running a user study on computer science students who were enrolled in a database course without any background related to database design or query coding; they may be considered naive users with almost the same level of potential to learn, as they all passed the prerequisites and are fourth-year students.
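Returning to the membership function of Equation 1 and the fish example, a minimal Python sketch of a triangular membership function; the boundary numbers for the "lots" set are hypothetical, chosen only for illustration.

def triangular(x, a, b, c):
    """Degree of membership: 1.0 at the peak b, falling to 0 at a and c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

lots = lambda amount: triangular(amount, 50, 100, 150)   # hypothetical bounds
print(lots(100), lots(75), lots(60))                     # 1.0 0.5 0.2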

2 Related Work and Current Solutions A variety of solutions and their respective implementations currently exist for the problem at hand. However, each implementation uses different methods to achieve results that are correct, but may not be replicable by non-technical users. By researching other currently implemented solutions, we hoped to gain insight on how to construct a new, different, and improved solution.



While our focus is on XML, by viewing other fuzzy database implementations, we can compare the potential pitfalls of our own design. Kacprzyk [10] discussed how to make relational databases, such as SQL or Access [10, 11], fuzzy by inserting a fuzzy attribute for every numerical attribute into the database. This is similar to the method already developed by our group and outlined in [1]. This additional fuzzy attribute represents the application of fuzzy set theory on existing data; therefore the query would return all relevant records according to that fuzzy numerical attribute. While this solution addresses the core concepts of fuzzy querying, our team believes this to be a very ineffective solution for the same reasons given for the Fuzzy XML implementation.

2.1 Techniques in Relational Databases to Represent Fuzzy Data

Existing literature discusses many different techniques for representing fuzziness within relational databases. In general, it seems that the following ideas are agreed upon: a fuzzy relational database (FRDB) either allows for queries that let preferences be expressed instead of exact Boolean conditions, or allows for the storage and querying of a new type of data that directly stores fuzzy sets. In other terms, an FRDB can accommodate two types of imprecision: impreciseness in the association among data values, or impreciseness in the data values themselves [19]. The two most common techniques used for working with imprecision are similarity relations and possibility distributions, or a combination of the two. These are discussed in the next subsections.

Table 1 An instance of a Student relation

FName    LName    Avg Marks  Attitude
Jeremy   Scott    A          Unhappy
Jenny    Wong     A          Negative
George   Yuzwak   C          Positive
Jose     Sanchez  B          Cheerful

Table 2 Similarity relation for the 'Attitude' attribute of the Student relation (Table 1)

          Unhappy  Negative  Positive  Cheerful
Unhappy   1        0.8       0.2       0
Negative  0.8      1         0         0
Positive  0.2      0         1         0.95
Cheerful  0        0         0.95      1



2.1.1 Similarity-Based Techniques

Buckles and Petry were the first to introduce the similarity-based relational model [5]. The basis of this model is the replacement of equality with a similarity relation. A similarity relation s(x, y) is a mapping of every pair of elements within the Universe of Discourse (the domain of an attribute) to the unit interval [0,1] [5]. This is best visualized in the form of a matrix. An example of this, based on the Attitude attribute of the Student relation described in Table 1, is given in Table 2. The matrix illustrates that the similarity relation is reflexive and symmetric. In this model of FRDB, a similarity relation is defined over the elements of each attribute, in each relation [23]. Where a crisp definition of equality is still desired, the matrix representation of the similarity relation is reduced to the identity matrix. When queries are written for a similarity-based FRDB, a minimal similarity threshold value must be given for any attribute in the relation that is to be matched based on similarity rather than equality. If no threshold value is specified, it is assumed that the standard definition of equality applies [5]. Using the similarity relation defined in Table 2, one could construct a query on the Student relation requesting all students with 'Positive' attitude with a threshold of 0.8. This would then include 'Cheerful' students as well as 'Positive' students.

Another feature of the similarity-based FRDB is that it allows for non-atomic domain values. In their model, Buckles and Petry [5] define that any member of the power set of the domain may be a domain value, except the null set. This feature allows uncertainty of data values to be expressed, but is not in first normal form and suffers the associated implementation problems [6]. Similarity relations are best used on finite and discrete domains of linguistic sets [4]. The structure does not lend itself to infinite domains.

2.1.2 Possibility-Based Techniques

Instead of understanding a membership function µF(x) as the grade of membership of x in F, possibility-based FRDBs interpret it as a measure of the possibility that a variable Y has the value x [19]. Such fuzzy sets are referred to as possibility distributions and are conventionally represented by the symbol π (the symbol was lost in the original rendering). These possibility distributions can be used to indicate the possibility that a tuple has a particular value for an attribute. For example, if a tuple in a Person table has the value 'Young' for the attribute 'Age', a possibility distribution describes the likelihood that such a person has a particular value for the age:

πyoung = {1.0/22, 1.0/23, 0.8/24, 0.6/25, ...} [4]

So the likelihood that the Young person is 24 years old is 0.8. This allows the linguistic identifier to be used as a value in the domain, while the actual possibility distribution is given elsewhere in the database in the form of a relation having the name of the linguistic identifier [4].
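As a data structure, such a possibility distribution is simply a mapping from domain values to possibilities; a small Python sketch using the Young distribution quoted above (truncated as in the text):

# possibility distribution for the linguistic value 'Young'
pi_young = {22: 1.0, 23: 1.0, 24: 0.8, 25: 0.6}
print(pi_young.get(24, 0.0))   # possibility that a Young person is 24 -> 0.8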



Raju et al. [19] describe two different ways of implementing a possibility-based FRDB. Each represents a fuzzy relation r by a table with an additional column for µr(t), the membership degree of tuple t in r. The first (Type-1) stipulates that the domain of each attribute is a fuzzy set (recall that a classical set is a special case of a fuzzy set). Given crisp values in a relation, there exist membership functions that map the values to linguistic terms with associated possibilities. The second implementation (Type-2) described by Raju et al. [19] permits more uncertainty in the data values: it allows ranges or possibility distributions to be the actual values of attributes. This cannot be implemented in current commercial relational database systems, since it allows different data types in the same column and/or multiple values per attribute; its representation would require a new abstract data type to handle the new possibilities for attribute values. Finally, possibility distributions work well to provide information about objects that 'may be' a valid response to a query [4]; the model is well suited to representing imprecise data values.
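A compact sketch of the common core of both implementations (our own illustration, with hypothetical names) is a relation whose tuples carry a membership degree:

```java
import java.util.List;
import java.util.Map;

public class FuzzyRelation {
    // A fuzzy relation: attribute values plus a membership degree
    // mu(t) stored with each tuple, as in both schemes of Raju et al.
    record Tuple(Map<String, Object> values, double mu) {}

    public static void main(String[] args) {
        // Type-1 style: crisp attribute values, fuzzy tuple membership.
        List<Tuple> youngEmployees = List.of(
            new Tuple(Map.of("Name", "Jenny", "Age", 23), 1.0),
            new Tuple(Map.of("Name", "Jose", "Age", 25), 0.6));
        // In a Type-2 relation, an attribute value could itself be a range
        // or a possibility distribution instead of a crisp value, which is
        // why a new abstract data type would be required.
        youngEmployees.forEach(t ->
            System.out.println(t.values() + "  mu=" + t.mu()));
    }
}
```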

2.1.3 Hybrid Techniques

Other techniques for representing fuzziness in relational databases have been proposed that include characteristics of both the similarity-based and possibility-based models, allowing them to address more than one kind of imprecision. An example is GEFRED, a Generalized Model of Fuzzy Relational Databases [18]. In this model, each attribute in a relation has an underlying domain that can be represented in one of many ways. Values can contain possibility distributions, ranges of values, approximate values, or linguistic terms, each denoted by a syntactic identifier. Linguistic terms are linked to possibility distributions stored in external relations; these possibility distributions generally take the form of trapezoidal functions. The model also allows linguistic terms in a column to be related via a 'proximity relation', which is identical to the similarity relation described by Buckles and Petry [5]. If no proximity relation exists for an attribute, the classical definition of equality is assumed to apply for values in its domain [18]. GEFRED would require some sort of middleware product or a specialized query language to interpret its data values, but it is a good example of how the discussed techniques can be combined to represent imprecision of data values as well as imprecision in the relationships between data values.
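One way to picture such a generalized domain (our own Java illustration in the spirit of GEFRED, not its actual implementation) is as a tagged union of the possible value forms:

```java
public class GeneralizedDomain {
    // A generalized attribute value: a column may hold any of these
    // forms, each denoted by its own kind, as in GEFRED.
    sealed interface FuzzyValue permits Crisp, Range, Approx, Linguistic {}
    record Crisp(double v) implements FuzzyValue {}
    record Range(double lo, double hi) implements FuzzyValue {}
    record Approx(double v, double margin) implements FuzzyValue {}
    // A linguistic term is resolved through a possibility distribution
    // (typically trapezoidal) stored in an external relation.
    record Linguistic(String term) implements FuzzyValue {}

    public static void main(String[] args) {
        FuzzyValue a = new Crisp(24);
        FuzzyValue b = new Range(20, 30);
        FuzzyValue c = new Linguistic("Young");
        System.out.println(a + " " + b + " " + c);
    }
}
```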

3 XML Schema and Fuzzy Data in XML

The W3C XML Schema has become a standard [13, 20]. An XML schema defines the structure of an XML document instance. XML schemas allow for strong data typing, modularization, and reuse. The XML Schema specification allows a developer to define new data types (using the <complexType> tag) and also to use the built-in data types provided by the specification. The developer can also define the structure of an XML document instance and constrain its content.



As well, the XML schema language supports inheritance, so that developers do not have to start from scratch when defining a new schema. These features of the W3C XML Schema specification allow for schemas that are effective in defining and constraining attribute and element values in XML documents [13]. There has already been some research on representing fuzzy data in XML. The fuzzy object-oriented modeling technique (FOOM) proposed by Lee et al. [13] is one such approach. This method builds upon object-oriented modeling (OOM) to also capture requirements that are imprecise in nature and therefore 'fuzzy'. The FOOM schema defines a class of XML documents that can describe fuzzy sets, fuzzy attributes, fuzzy rules, and fuzzy associations. This method would be useful for representing data contained in object-oriented databases; however, it is too specific in its object-oriented nature to be applied directly to relational databases. Another, more general approach is proposed by Turowski et al. [21]. Their method aims at creating a common interchange format for fuzzy information using XML, to reduce integration problems between collaborating fuzzy applications. XML tags with a standardized meaning are used to encapsulate fuzzy information, and a formal syntax for important fuzzy data types is introduced. This technique of using XML to represent fuzzy information is general enough to be built upon and applied to relational databases. However, it uses DTDs rather than the currently accepted method of XML schemas to define and constrain the information held in an XML document; it would be beneficial to extend this approach by defining the XML document class for holding data from fuzzy relational databases with an XML schema rather than a DTD. In this chapter, however, we follow a different trend by specifying/deriving membership functions for the attributes intended to be queried using fuzzy terms. The latter attributes are expected to have numeric domains; alternatively, categorical values are discretized. The best starting point for our research was to examine other solutions tied directly to the problem, so fuzzy XML implementations were the first topic researched. A previous implementation by our group outlined a method of adding fuzziness to an XML database by mapping a new subelement to an existing element that stores its fuzziness value [1]. While this method is straightforward and easy to implement, it relies on changing the database, which poses a problem because our work is directed towards users who might not have the technical expertise to do so. Ownership boundaries also pose a problem: if the database does not belong to the user, it cannot be modified in this way, so the mapping must be accomplished by some other means. This solution also requires an initial calculation over the entire database, which may prove extremely expensive, depending on its size. As the initial phase requires inserting new fuzzy attributes into the database for each numerical value, the overall volume of the database may increase considerably.



This problem is compounded when the database contains data that changes over time. In the example given above, each time the bookstore's repository is updated with changes from a publisher's database, fuzzy values must be recalculated for each updated attribute, increasing the overhead even more. In addition, this approach requires that fuzzy linguistic terms be pre-defined, which prevents users from customizing their own terms. As each user's definitions of quantifiers and qualifiers usually deviate somewhat from another's, this approach fails to provide personal flexibility. However, the logic behind our group's previous effort was sound and remains useful to us, as its ideas for applying fuzziness to XML carry over to our own implementation.

4 The Proposed Solution

This chapter proposes to solve these problems with two conceptual decisions. The first deals with the inability to handle several differently structured databases when they are combined. This concern is especially prominent when attempting to combine several relational databases, which have rigid schemas. The second is concerned with making the data more accessible to users who may lack in-depth technical knowledge of database querying. Combining the two yields a solution that handles data under variable schemas and makes it more searchable. We developed a stand-alone prototype that incorporates all of these ideas as a proof of concept. We then integrated these ideas into VIREX in order to empower VIREX with more sophisticated capabilities.

4.1 Variable Schema Compromise

This chapter proposes to solve the structure variability issue by using XML to eliminate the barrier of rigid database schemas. By nature, XML is a markup language that consists of user-defined nested elements, as discussed previously. This removes the need for a strictly defined schema, as relational databases require, especially in the scenario where several databases are combined to form a large repository. XML is also an ideal choice for cross-database querying, since similar databases may be combined into one XML file for querying. By designing our solution around XML data files, the database may be read and used by a variety of other applications on different systems, making this type of database a preferable choice for platform independence. Our approach works from a database in the form of an XML data file consisting of regular data and nested data. This chapter will not cover the methods by which other databases may be converted into XML data, as that is outside the scope of this chapter and has already been handled by our group and others [7, 12, 15, 16, 22]. We assume that we begin with an existing XML data set and its corresponding schema file. This assumption is reasonable because such conversion has been thoroughly addressed by our group, and its output serves as the testbed for our research.



4.2 Fuzzy Querying

As previously discussed, fuzzy set theory plays a major role in our approach. It is integrated into data querying to provide imprecise searching to non-technical users. Specifically, we use fuzzy set theory to support fuzzy searching on all numerical attributes in an XML database, and we extend this functionality to nested XML elements. User-defined fuzzy relations allow users to search for sets of data by abstracting the strict data into broader linguistic terms, so that they can query the database in a way that is more understandable and intuitive, regardless of their level of technical knowledge. We also perform the fuzzy calculation entirely dynamically, as opposed to changing the data layer of the system as the previously referenced solutions do. We believe this is a faster, more efficient, and more intuitive solution for non-technical users.

4.3 Combining Two Solutions into One

The use of an XML database and fuzzy logic, combined into an intuitive graphical user interface (GUI), allows users to create search queries with fuzzy, linguistically based terms. The interface allows the user to freely associate imprecise linguistic terms with fuzzy ranges for any numerical element. These terms can then be used to construct an executable search query for the XML database. The end result is an application that is capable of parsing an XML database and lets users employ fuzzy natural-language terms to search for desired information.

Fig. 5 Basic view of our application

5 Implementation

5.1 Application Design Concepts

After the initial research phase, we found it necessary to plan and assess the scope of the prototype being developed to satisfy the requirements of the problem. This involved settling on some design decisions to be adhered to during implementation. We decided that three key aspects needed to be present throughout the system. The first was that the prototype would need to have a logical flow for a typical user: the user should not have any difficulty in discerning the proper order of use when presented with the application.



Details such as tab-ordering at the top of the window were kept in mind to reflect a logical presentation to the user. This is depicted in Figure 5, which shows the tabs along the top of our application. These tabs are placed in logical order, with focus given to the first tab upon start-up, to clearly communicate the proper flow of use. This also helps the user make fewer input mistakes, and thus causes fewer errors within the prototype itself. Another key design idea was that the prototype should be easy to use, as it is geared towards non-technical users. Again, this is reflected in the placement of the graphical components on the interface and in the clear labeling of each field and button. Each tab was designed with only minimal components to ensure that the task remains simple for the user. Lastly, the prototype would have to balance ease of use with functionality, as it should not sacrifice flexibility for aesthetics. Although the main focus is to provide users with a simple, intuitive interface, we wanted to keep a large degree of flexibility in the provided functionality. An additional tab was added to allow users with some familiarity with XQuery to execute their own complex queries for non-numerical data, as well as for general testing purposes. After reaching these conclusions, the necessary development stages were prioritized based on their importance to these design concerns and to the overall scope. Baseline functionality, such as parsing and querying an XML file, took precedence over fleshing out more advanced functions such as additional membership function shapes.

5.2 Graphical User Interface

The first task to be completed was the creation of the GUI, constructed using Java Swing. We concluded that each step of user querying should be divided among tabs, so that screen real estate would not be monopolized by the prototype. Thus, the interface consists of five tabs, each serving a necessary function. The tabs are logically sequenced in the order the user follows for each step of querying. The "Browse" tab is placed first, giving access to browse functionality, as shown in Figure 6. This, simply put, enables the user to open an XML file and browse its contents. More importantly, when an XML file is loaded, the related schema file is also opened and parsed automatically (the purpose of this is discussed in Section 5.3). When the user has loaded the desired data file, they may proceed to the next tab. The "Schema File" tab appears next. This tab displays the loaded file's associated schema to the user, allowing them to determine at which nesting level any desired fuzzy terms occur. The schema file is loaded automatically, requiring no intervention on the user's part; it is displayed in order to help the user determine the nesting level of the terms they need.



Fig. 6 XML data file

Fig. 7 XML Schema file

Fig. 8 Specifying membership functions



Fig. 9 Fuzzy query

The "Membership Functions" tab comes next, giving the user the ability to define membership functions on numerical attributes from the data file. As Figure 8 shows, a list of existing fuzzy membership definitions appears in the table at the top, according to which numerical element is selected by the user in a drop-down menu. This drop-down component is automatically populated with attribute names after the XML data file is loaded. We chose this form of automation in order to decrease possible user errors: populating these values into drop-down GUI components restricts the user's input just enough to ensure that they do not enter erroneous attribute names, effectively eliminating this source of frustration. A change in element selection also dynamically updates a label to the right, which displays the minimum and maximum values of the selected element in the XML database. This acts as a guideline for the user when defining ranges for a fuzzy term. For instance, if the user knows that the lowest amount of fish currently in stock is 3, that value gives them a basis around which to define their "few" range. Text fields below the drop-down box allow the user to enter a fuzzy term for their range (such as "few"). This fuzzy term must be formatted as "few,x", where x is the nesting level of the term; the user can determine the nesting level by inspecting the schema file through the Schema tab. While we concede this is not an ideal solution, given the constraints placed on us by Swing and by time, it was difficult to implement a more effective query builder that determines the nesting level on its own. The user can then define their range via the Min and Max text fields and click the "Add a New Fuzzy Term" button, which adds the new term to the list for that numerical element. The user may define as many fuzzy terms as they wish, on as many numerical attributes as they wish; a sketch of the bookkeeping involved is given below.
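As a concrete illustration (our own sketch, not the prototype's actual code; all names are hypothetical), a term definition entered as "few,x" together with the Min and Max fields can be parsed and stored as follows:

```java
public class FuzzyTermDefinition {
    // One user-defined membership function: a linguistic term, the
    // nesting level of the element it applies to, and its value range.
    record FuzzyTerm(String term, int nestingLevel, double min, double max) {}

    // Parse an entry such as "few,2" plus the Min and Max text fields.
    static FuzzyTerm parse(String termField, String minField, String maxField) {
        String[] parts = termField.split(",");
        return new FuzzyTerm(parts[0].trim(),
                             Integer.parseInt(parts[1].trim()),
                             Double.parseDouble(minField),
                             Double.parseDouble(maxField));
    }

    public static void main(String[] args) {
        System.out.println(parse("few,2", "0", "10"));
        // FuzzyTerm[term=few, nestingLevel=2, min=0.0, max=10.0]
    }
}
```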



Once the user has finished defining their fuzzy membership functions, the "Fuzzy Queries" tab comes next, as shown in Figure 9. The user may now perform queries on the XML file using the fuzzy terms they have created. Here, the user is presented with two drop-down boxes; the one on the right changes depending on the selection on the left. The drop-down box on the left is populated with the numerical elements for which the user has defined fuzzy terms, whereas the drop-down box on the right dynamically changes to list all fuzzy terms defined for the selected element. The user selects an element and a fuzzy term to form a query condition, which can be added to the current query by clicking the "Add to Query" button; the condition is also displayed in the query list. This list represents the entire query that the user has constructed so far. The user can easily remove any condition they no longer want by selecting it and clicking the "Remove from Query" button. In effect, the user visually builds a query using the interface, rather than typing out a confusing and syntax-specific XQuery query. Clicking the "Run Query" button in the upper right executes the query and displays the returned results in the text area at the bottom. The user has the option of saving these results to an XML file by clicking the "Save Results" button. Lastly, the fifth tab, "XQuery", allows the user to enter FLWOR queries that the prototype uses to poll the database. This was added to allow more technical users to formulate queries that they are comfortable with; it also allows users to pose non-numerical queries. Note that the results show up in the previous tab, as the XQuery tab was implemented mainly for testing purposes.

5.3 Functionality Implementation

Our underlying implementation was constructed with Java, in conjunction with NUX, an open-source XML API. NUX is an invaluable library that provides the ability to parse and query XML from a data file. Unfortunately, it has several drawbacks, such as not being schema-aware, being oversimplified, and not being designed for this specific task; NUX was originally designed for messaging software, not large-scale XQuery over data. Despite these drawbacks, we integrated NUX because it provides a simple and effective API that hastened the implementation process. NUX assists us by effectively being a wrapper API for several different components [28], ranging from XOM trees for storage to XSLT for output. This cuts down on the time needed to learn each individual component and provides an accessible way of using each. Though problems arose from the aforementioned drawbacks, they were worked around through coding tweaks and design choices. As discussed in the previous section, we automatically load the names of the numerical elements into a drop-down box in order to eliminate confusion and error-prone typing. While this takes a little more time in the initial file-loading phase, it benefits the user by doing some of the tedious work for them: finding all numerical elements and displaying their minimum and maximum values. This is accomplished entirely without intervention from the user, except that they must provide the corresponding schema file for the data file they wish to query.



The prototype accomplishes this by locating the schema file in the same directory as the data file, identifying it by the identical filename with the "xml" extension replaced by "xmls". This is necessary in order to poll the schema file for all numerical attributes, and it also allows the application to determine the minimum and maximum values of each one. Performing these functions in the backend allows us to display these values to the user, in order to aid them in building their own membership functions later on. All data pertaining to membership functions is stored in an internal hash map for easy reference. Using the aforementioned Fuzzy Queries tab, users build their queries by defining conditions, which are stored in the visible table component. Clicking the "Run Query" button translates each of the conditions into XQuery query fragments. The fragments are combined into a full XQuery statement, which is then executed on the loaded data file. Our query works by placing the whole document in a variable, say $a. With this variable in place, we can put the individual fragments into other variables, say $b, $c, and so on. We then run a comparison between $a and $b and, where they match, return the results. We have found that this works well, especially for nested queries. In our implementation, fuzzy memberships are defined by a triangle spanning the defined range on a graph, which determines how closely a value relates to the fuzzy term. To do this, we assume the following:

1. All fuzzy memberships have a minimum and maximum range;
2. The highest membership is reached at the midpoint between the min and max, where the fuzzy coefficient is 1.

With these assumptions, we can determine the fuzzy coefficient using the equation of a line, y = mx + b, where m is the slope, x and y are the coordinates of a point on the line, and b is the constant that positions the line on the graph. In our application, we first determine the slope m. This can be done using two points, (x1, y1) and (x2, y2), and the equation m = (y1 − y2)/(x1 − x2). The two points used are an endpoint of the membership function and the middle point. Initially, only the x values of these points are known, since the user defines just the beginning and end of each range; but because the middle point gives the membership degree coefficient its maximum value of 1, and the minimum and maximum values bordering the fuzzy membership function give a coefficient of 0, both coordinates can be derived. Once the slope is determined, we need to find the line's position on the graph, which is the value b. By substituting a known coordinate on the line, we can determine the value of b through algebra.



Once we have found the values of m and b, our equation is complete. We can then substitute a returned value into x and solve the equation, giving us the value of y, which is the membership coefficient of that value within the fuzzy membership function; a sketch of this computation is given below. We return this coefficient along with the results. The prototype thus satisfies the problem by providing an easy-to-use interface that allows a user to perform fuzzy searches with no additions to existing databases and no changes in querying language.
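The following Java sketch condenses the computation just described (our own rendering of the line-equation approach, not the prototype's actual code; names are hypothetical). For each half of the triangle, the slope m and intercept b are derived from the known endpoints, and a value's membership coefficient is obtained by evaluating y = mx + b:

```java
public class TriangularMembership {
    final double min, mid, max; // mid = (min + max) / 2 has membership 1

    TriangularMembership(double min, double max) {
        this.min = min;
        this.max = max;
        this.mid = (min + max) / 2.0;
    }

    // Membership coefficient of value x: a line rising from (min, 0)
    // to (mid, 1), and a line falling from (mid, 1) to (max, 0).
    double coefficient(double x) {
        if (x <= min || x >= max) return 0.0;
        double x1 = (x <= mid) ? min : max;  // endpoint with degree 0
        double m = (1.0 - 0.0) / (mid - x1); // slope via (x1, 0) and (mid, 1)
        double b = 1.0 - m * mid;            // intercept via known point (mid, 1)
        return m * x + b;                    // y = mx + b
    }

    public static void main(String[] args) {
        // A user-defined range from 15 to 40; the midpoint 27.5 has degree 1.
        TriangularMembership young = new TriangularMembership(15, 40);
        System.out.println(young.coefficient(24)); // approximately 0.72
    }
}
```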

6 Integrating the Proposed Approach into VIREX

After developing the prototype and thoroughly testing its functionality from a software engineering perspective, we proceeded to the next step: integration with VIREX. The integration was very successful and empowered VIREX with greater expressiveness and functionality.

6.1 VIREX System Overview

VIREX is a powerful tool for querying relational databases to produce XML documents and corresponding XML schemas. It is also capable of creating views that transform part of a relational database into XML. VIREX has different modules that interact with the database to achieve this. From the user's perspective, VIREX takes a relational database as input, extracts its schema, and generates an interactive diagram, similar to an extended entity-relationship (EER) diagram, on which queries can be constructed and views can be defined. Queries and views are constructed visually using VRXQuery, and the resulting XML documents and schemas are generated accordingly. In the front end of VIREX, there are four modules with which a user interacts to code queries and define views; all of this is done with the mouse and minimal keyboard input. After a visual query is constructed and submitted for execution, several modules within VIREX participate in generating the target XML document and schema. VIREX works in a systematic way to satisfy user needs. The process starts with schema conversion. Based on the query constructed on the interactive diagram and the database schema extracted earlier, the schema conversion module produces a schema object, which is provided to the XML generation module. This module has two submodules: query generation and data processing. The former creates SQL queries to be executed based on the specified visual query; these SQL queries are executed against the underlying relational database by the data processing submodule. After the XML document object is created, both the XML document object and the schema object are passed on to the result generation module, which produces an expandable Java tree representation of the XML document and schema; it generates colored XML documents and schemas as tree structures.



These documents are returned to the front end and displayed to the end-user. In addition to the process described above, extra steps are taken when a user decides to store the result of a query as a materialized view. After the XML document and schema objects are generated, they are passed to the view maintenance module before being displayed. The view maintenance module is responsible for materializing and updating views. In this case, the XML document and schema objects are analyzed by the view generating submodule to find an appropriate mapping of the XML view into the relational database. The database created for the XML view is then populated based on the data in the XML document object. A materialized view must be kept consistent with its source database. In the approach adopted by VIREX, the update of a materialized view is deferred until the next time it is accessed, at which point extra steps are executed before the XML document is generated by the XML transformation module. To update a materialized view, the view maintenance module consults the internal representation model to obtain information on modifications made to the database since the last update of the view, which is then processed. The visual query used to generate the view is taken into account when the view is updated.

Fig. 10 Query result: colored document on the left; schema on the right

In parallel with the deferred update of materialized views, the corresponding XML views are also updated. The view maintenance module checks the internal representation model and then updates the corresponding XML document object directly, before the actual XML document is displayed by the result generation module.



VIREX has a visual query module named VRXQuery, which allows interactive querying of relational and XML data and facilitates specifying results in an arbitrary XML document structure. A corresponding XML schema describing the resulting XML document is also generated. VRXQuery is simple, user-oriented, efficient, and effective; there are no textual query or transformation languages to learn. Finally, a sample query result in XML format is displayed in Figure 10, where part of the XML document and the XML schema are shown; a sample fuzzy query in VIREX is shown in Figure 11.

Fig. 11 Sample VRXQuery query with a fuzzy term on the AGE attribute and the corresponding SQL and XQuery queries

6.2 The Operations Supported by VRXQuery

VRXQuery is intended to be expressive, user-friendly, and closed. To achieve expressiveness, we first analyzed the problem and identified the basic operations required to manipulate a given database in order to produce the target XML structure. Projection and selection are necessary for reducing the information that appears in the output. Join is necessary to combine information from different sources; however, nesting is more powerful and expressive, since join produces flat structures while nesting produces nested structures. Union and difference are also necessary. We added order-by to give the user the opportunity to sort the information in the result. We also support group-by, which has already been investigated by other researchers extending XQuery and XPath, e.g., [2].



Finally, the renaming of relations and attributes was added as the basic schema evolution function needed to make the integration of databases easier and more straightforward once names are unified. VRXQuery is user-friendly because all queries can be specified directly on the visual diagram as a sequence of mouse clicks with minimal keyboard input; this is illustrated later in Figure 11, where a user specifies the query on the displayed diagram and the condition in the corresponding interactive table. The user is not expected to be an expert in relational or XML technology. Queries may be coded by trial and error, which makes VRXQuery an attractive learning tool for people interested in learning XML technology; they may specify simple queries and visualize the derived XML schema and document. VIREX provides online help for different aspects of the process to guide users whenever they get stuck. Finally, closure enriches expressiveness; it is also necessary to incorporate fuzziness into VRXQuery to serve a wider user community. The latter property is described in the next section, where membership functions are automatically determined and optimized.

6.3 From Fuzzy VRXQuery to SQL and XQuery

After specifying the elements/attributes to be queried in fuzzy terms, it becomes possible to code queries using fuzzy terms for those elements/attributes. However, it is still possible to query these elements/attributes without fuzziness, simply because the actual values in the database are not fuzzy. Fuzziness reflects only the perspective of one user group; it is not binding on all user groups accessing a common data source. This introduces more flexibility by allowing users to query the database from different perspectives. In this study, we used triangular membership functions; this shape is appropriate for our purpose and is, in general, widely used in fuzzy systems. As a result, a table with the following structure is derived to hold the summarized fuzzy information: FuzzyAttributes(Attribute, fuzzy term, left x, middle x, right x), where the triangular shape intersects the x-axis at left x and right x, i.e., these are the extremes of the fuzzy triangle, and between them middle x is the point having membership degree one. After the fuzzy sets are decided, the user is expected to specify a fuzzy term for each fuzzy set. These fuzzy terms are stored in the table FuzzyAttributes, to be used for processing and transforming the fuzzy terms appearing in user queries. For each attribute to be queried in fuzzy terms, the table FuzzyAttributes includes one row per fuzzy term to specify the boundaries and the middle point of the corresponding triangular shape. Table 3 lists four fuzzy terms that classify the 'AGE' attribute into four groups, namely kid, young, adult, and senior. This structure can be smoothly adjusted to accommodate other forms of membership functions, such as trapezoidal.



Table 3 Fuzzy terms defined on the 'AGE' attribute, with the boundaries (Left x, Right x) and middle point (Middle x) of each triangular membership function

Attribute  Fuzzy Term  Left x  Right x  Middle x
AGE        kid         0       25       0
AGE        young       15      40       27
AGE        adult       30      60       49
AGE        senior      55      100      90

Fuzziness is incorporated in the condition (where-clause) of a VRXQuery query, as demonstrated in Figure 11; notice how 'AGE' has been specified as 'YOUNG' in the condition part. Before the actual query is executed, fuzziness is resolved by consulting the membership function(s) that correspond to the fuzzy term(s) appearing in the query. Each fuzzy term specified in the query is replaced by a condition that returns all values in the range covered by the fuzzy term. The query is then transformed into an equivalent query expressed either in SQL, to retrieve information from the underlying relational database, or in XQuery, to retrieve information from data stored in XML format (both are shown in Figure 11). The process is illustrated in Figure 11 and by the fuzzy VRXQuery described in Example 1.

Example 1. Consider an XML schema describing citizens, and assume that the 'AGE' attribute is intended to be expressed in queries in fuzzy terms. Further, assume that for the 'AGE' attribute the system decided on the four membership functions listed in Table 3. A fuzzy VRXQuery to "find young employees who are managing projects located in Calgary" could be expressed as follows:

FOR $e IN distinct(document("company.xml")//EMPLOYEE)
LET $p := document("company.xml")//PROJECT[mngr=$e/SSN]
WHERE $p/city = 'Calgary' and $e/AGE is 'YOUNG'
RETURN $e/SSN

This query is transformed into the following XQuery:

FOR $e IN distinct(document("company.xml")//EMPLOYEE)
LET $p := document("company.xml")//PROJECT[mngr=$e/SSN]
WHERE $p/city = 'Calgary' and $e/AGE >= 15 and $e/AGE <= 40
RETURN $e/SSN
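A minimal Java sketch of this rewriting step (our own illustration of the mechanism, not VIREX code; names are hypothetical) looks up a fuzzy term in the FuzzyAttributes table and replaces it with a crisp range condition:

```java
import java.util.Map;

public class FuzzyTermRewriter {
    // One row of the FuzzyAttributes table:
    // (Attribute, fuzzy term) -> (left x, middle x, right x).
    record Row(double leftX, double middleX, double rightX) {}

    // Keyed by "ATTRIBUTE/term"; values taken from Table 3.
    static final Map<String, Row> FUZZY_ATTRIBUTES = Map.of(
        "AGE/young", new Row(15, 27, 40),
        "AGE/senior", new Row(55, 90, 100));

    // Replace "ATTR is 'TERM'" by a range condition on ATTR.
    static String rewrite(String attr, String term) {
        Row r = FUZZY_ATTRIBUTES.get(attr + "/" + term.toLowerCase());
        return "$e/" + attr + ">=" + r.leftX()
             + " and $e/" + attr + "<=" + r.rightX();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("AGE", "YOUNG"));
        // $e/AGE>=15.0 and $e/AGE<=40.0
    }
}
```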

Similarly, it is possible to express queries that require a compound fuzzy condition, such as "find young employees who earn a high salary". As a result of executing a query with fuzzy term(s), the degree of membership is computed for each returned instance, and the user may request the result in either text or graphical format: the result of a fuzzy query is delivered to the user after finding the membership degree for each returned value, and the information is displayed either in text format, with the membership degree explicitly specified for each returned value, or as a plotted graph that specifies, for each returned value, its degree of membership along the two inclined lines forming the triangular shape of the fuzzy term.




7 User Study

We believe that a tool like VIREX, extended with fuzzy capabilities, would be excellent for learning both XQuery and SQL. Users think in fuzzy terms and can see how their fuzzy way of thinking is translated by VIREX into XQuery and SQL. This concept was validated and tested in a fourth-year database class whose students start the course with no background in database design, query coding, etc. Before the students were exposed to relational databases, XML, SQL, and XQuery, they were asked to use VIREX for self-learning. The students were split into two groups, each of around 25 students with different skills and capabilities. The split was based on the results of the first quiz and assignment, which covered the entity-relationship model, and we tried to balance the groups by their marks. The first group of students was asked to use VIREX to learn SQL and XQuery, with no previous background expected. The second group was taught SQL and XQuery in the classical way, through a series of in-class lectures using a combination of slides and whiteboard. After both teaching rounds were completed, the two groups were given a quiz and an assignment on the covered material. We compared the two groups based on the following criteria: (1) the highest mark in each group; (2) the lowest mark in each group; (3) the average mark of each group; and (4) the average percentage of completeness and efficiency of their solutions. The results are shown in Table 4.

Table 4 Test results reported by the user study to check the effectiveness of VIREX as a learning tool: VIREX-based learning versus classical learning

                           Group 1  Group 2
Highest Mark               7.5      10
Lowest Mark                2        2.5
Average Mark               5.25     5.8
Completeness & Efficiency  54.7%    60.3%

According to the results, which are not explicitly reflected in Table 4, students who learned using VIREX (Group 1) had their marks mostly distributed around the average, whereas students who learned in the classical way (Group 2) had marks either close to the top or close to the bottom, with very few around the average. This is an interesting result indeed. It shows how dedicated the students who used VIREX were to learning: some who received around-average marks on the VIREX-based assignment and quiz had below-average marks on the first assignment and quiz, which were used to classify the students into the two groups; it seems they pushed their capabilities to the limit.



On the other hand, students in the second group continued to show the same attitude towards learning, i.e., their marks on the second quiz and assignment were very comparable to their marks on the first. Should we move toward tool-driven learning? We would say it is too early to draw such a conclusion; more extensive testing is required and will be conducted in the following terms. We will judge how consistent the effect of VIREX-based learning is only after we have collected results over a number of terms. Another important piece of information to share with readers is the overall performance of the students in the final exam. As both groups were enrolled in a required course, the same material was repeated in class for both groups combined, i.e., one group received the material twice, while the other learned with VIREX once, followed by the classical way. Both groups performed well in the final, but the VIREX-based group achieved a better overall result, as shown in Table 5.

Table 5 Test results reported by the user study to check the effectiveness of VIREX as a learning tool: VIREX-based learning combined with classical learning versus classical learning applied twice

                           Group 1  Group 2
Highest Mark               10       10
Lowest Mark                4.75     3.25
Average Mark               7.75     6.0
Completeness & Efficiency  79.1%    64.9%

Comparing the results reported in Table 4 and Table 5, we can draw some interesting observations that still need to be confirmed by more extensive testing over the next few terms. Students who used VIREX in the first round ended up learning better than students who went through the classical way of learning twice. Comparing the columns related to each of the two groups, we can easily see the visible improvement due to VIREX-based learning. The most interesting result to comment on is the completeness and efficiency achieved with VIREX-based learning: students in the first group developed themselves as professionals, with a completeness and efficiency score close to 80%.

8 Summary and Conclusions

While fuzzy logic has been applied to a broad range of applications, from movies to electronics, we have focused on the theory's usefulness to databases and on how it can make querying more intuitive and accurate. We have proposed a method for increasing the ease of use of fuzzy nested XML databases for non-technical users. This could be useful in a wide variety of fields that involve usability for casual users.



Small businesses could benefit from having a small, easy-to-maintain database that uses a modified version of the existing prototype. Other applications could include calculating the relevance of articles in relation to each other: news databases could benefit greatly from the logic implemented in our project, allowing them to calculate and dynamically display fuzzy related links to the user. Much like research itself, the use of fuzzy databases is limited only by a designer's imagination. Though we have integrated the proposed approach into VIREX, we are still running user studies to validate the overall approach. Our initial results with students registered in a fourth-year database course are very promising. We found that students become more engaged and learn better than with in-class lecturing; every student who tried to learn using VIREX ended up grasping some skills in using SQL and XQuery, whereas students who learned in class were sometimes not as effective in coding the queries. As mentioned earlier, this conclusion is based on one test result; to generalize, we still need to run this test on other database students over the next few semesters, including Fall, Winter, and Spring/Summer.

References

1. Alhajj, R., Guarav, A.: Incorporating Fuzziness in XML and Mapping Fuzzy Relational Data into Fuzzy XML. In: Proc. of ACM SAC (April 2005)
2. Beyer, K., et al.: Extending XQuery for Analytics. In: Proc. of ACM SIGMOD, pp. 503–514 (2005)
3. Bosc, P., Galibourg, M., Hamon, G.: Fuzzy querying with SQL: extensions and implementation aspects. Fuzzy Sets and Systems 28, 333–349 (1988)
4. Buckles, B.P., Petry, F.E.: Fuzzy Databases in the New Era. In: Proc. of ACM SAC, pp. 497–502 (1995)
5. Buckles, B.P., Petry, F.E.: A Fuzzy Representation of Data for Relational Databases. Fuzzy Sets and Systems 7, 213–226 (1982)
6. Dey, D., Sumit, S.: A Probabilistic Relational Model and Algebra. ACM TODS 21, 339–369 (1996)
7. Fernandez, M., Tan, W.-C., Suciu, D.: SilkRoute: Trading between Relations and XML. In: Proc. of WWW, Amsterdam (May 2000)
8. Fong, J., Pang, F., Bloor, C.: Converting Relational Database into XML Document. In: Proc. of the Intern. Workshop on Electronic Business Hubs, September 2001, pp. 61–65 (2001)
9. Hellman, M.: Fuzzy Logic Introduction (2001)
10. Kacprzyk, J.: Fuzzy Logic in DBMSs and Querying. Systems Research Institute, Polish Academy of Sciences (1995)
11. Kacprzyk, J., Zadrozny, S.: Fuzzy Queries Against a Crisp Database Over the Internet: An Implementation (2000)
12. Lee, D., Mani, M., Chiu, F., Chu, W.W.: Schema Conversion Methods between XML and Relational Models. Knowledge Transformation for the Semantic Web (2003)
13. Lee, J., et al.: Modeling Imprecise Requirements with XML. Fuzzy Systems 2, 861–866 (2002)



14. Lo, A., Kianmehr, K., Özyer, T., Kaya, M., Alhajj, R.: Wrapping VRXQuery with Self-Adaptive Fuzzy Capabilities. In: Proceedings of IEEE International Conference on Web Intelligence, Silicon Valley, CA (2007)
15. Lo, A., Alhajj, R., Barker, K.: Flexible User Interface for Converting Relational Data into XML. In: Proc. of the International Conference on Flexible Query Answering Systems. Springer, Lyon (2004)
16. Lo, A., Alhajj, R., Barker, K.: VIREX: Visual relational to XML conversion tool. Journal of Visual Languages and Computing 17(1), 25–45 (2006)
17. Lo, A., Alhajj, R., Barker, K.: VIREX: Interactive Approach for Database Querying and Integration by Re-engineering Relational Data into XML. In: Proc. of IEEE Conference on Web Intelligence, Hong Kong (2006)
18. Medina, J.M., Pons, O., Vila, M.A.: GEFRED: A Generalized Model of Fuzzy Relational Databases Version 1.1. Information Sciences (1994)
19. Raju, K.V., Majumdar, A.K.: Fuzzy Functional Dependencies and Lossless Join Decomposition of Fuzzy Relational Database Systems. ACM TODS 13, 129–166 (1988)
20. Thompson, H.S., et al.: XML Schema Part 1: Structures. W3C Recommendation (October 2004)
21. Turowski, K., Weng, U.: Representing and processing fuzzy information - an XML-based approach. Knowledge-Based Systems 15, 67–75 (2002)
22. Wang, C., Lo, A., Alhajj, R.: Novel Approach for Reengineering Relational Databases into XML. In: Proc. of XSDM (in conjunction with ICDE), Tokyo (2005)
23. Wang, S., et al.: Incremental Discovery of Functional Dependencies from Similarity-based Fuzzy Relational Databases Using Partitions. In: Proc. of the National Conference on Fuzzy Theory and Its Applications, pp. 629–636 (2001)
24. Yang, K.Y., Lo, A., Özyer, T., Alhajj, R.: DWG2XML: Generating XML Nested Tree Structure from Directed Weighted Graph. In: Proc. of ICEIS, Miami (2005)
25. Zadeh, L.: Fuzzy Sets. Information and Control 8, 338–353 (1965)
26. Zadeh, L., Klir, G.J., Yuan, B.: Fuzzy Sets. Information and Control (1996)
27. Zvieli, A., Chen, P.P.: Entity-relationship modeling and fuzzy databases. In: Proc. of IEEE ICDE, Los Angeles, pp. 320–327 (1986)
28. NUX website, http://dsd.lbl.gov/nux/
29. World Wide Web Consortium (W3C), http://www.w3.org/



An Overview of XML Duplicate Detection Algorithms

Pável Calado, Melanie Herschel, and Luís Leitão

Abstract. Fuzzy duplicate detection aims at identifying multiple representations of real-world objects in a data source, and is a task of critical relevance in data cleaning, data mining, and data integration. It has a long history for relational data stored in a single table or in multiple tables with identical schemas. However, algorithms for fuzzy duplicate detection in more complex structures, such as hierarchies of a data warehouse, XML data, or graph data, have only recently emerged. These algorithms use similarity measures that consider the duplicate status of their direct neighbors to improve duplicate detection effectiveness. In this chapter, we study different approaches that have been proposed for XML fuzzy duplicate detection. Our study includes a description and analysis of the different approaches, as well as a comparative experimental evaluation performed on both artificial and real-world data. The two main dimensions used for comparison are the methods' effectiveness and efficiency. Our comparison shows that the DogmatiX system [44] is the most effective overall, as it yields the highest recall and precision values for various kinds of differences between duplicates. Another system, called XMLDup [27], has a similar performance, being most effective especially at low recall values. Finally, the SXNM system [36] is the most efficient, as it avoids executing too many pairwise comparisons, but its effectiveness is greatly affected by errors in the data.

Pável Calado, IST/INESC-ID, Av. Prof. Cavaco Silva, 2744-016 Porto Salvo, Portugal, e-mail: pavel.calado@ist.utl.pt
Melanie Herschel, Universität Tübingen, Sand 13, 72076 Tübingen, Germany, e-mail: melanie.herschel@uni-tuebingen.de
Luís Leitão, IST/INESC-ID, Av. Prof. Cavaco Silva, 2744-016 Porto Salvo, Portugal, e-mail: luis.leitao@ist.utl.pt

Z. Ma & L. Yan (Eds.): Soft Computing in XML Data Management, STUDFUZZ 255, pp. 193–224. c Springer-Verlag Berlin Heidelberg 2010 springerlink.com



1 Introduction

Electronic data plays a central role in numerous business processes, applications, and decisions. As a consequence, the quality of the data is essential. This fact has been illustrated by several studies and initiatives, such as those described by Batini and Scannapieco [3]. For example, a study on data quality conducted in 2002 by the Data Warehousing Institute shows that data quality problems cost U.S. businesses more than 600 billion dollars a year. Rahm and Do [38] classify data errors that lead to poor data quality as either occurring within a single data source or as being the result of integrating multiple sources. Furthermore, errors can exist both at the schema level and at the data level. In this chapter, we focus on one particular type of data-level error that occurs both within a data source and during data integration, namely duplicates. Essentially, duplicates are multiple representations of the same real-world object, where by object representation we mean its representation in a data source, such as a relational database or an XML document. The challenge in duplicate detection is to detect duplicate representations that are not exactly equal, due to errors in the data, and that cannot be identified using a universal identifier (e.g., the ISBN of a book). Typical errors are typos, misspellings, the lack of a standard representation for the data, and missing, outdated, or contradictory data. As a consequence, in order to detect duplicates, methods more sophisticated than the comparison of object representations based on equality have to be devised. Algorithms to detect duplicates are very important in many applications, two of which we illustrate in the following.

1.1 Application Scenarios

Data cleaning is the process of correcting errors and inconsistencies in a data source to improve overall data quality. High data quality is a prerequisite for obtaining meaningful results in data analysis applications, such as data mining, report generation on data warehouses, and customer relationship management. Detecting and removing duplicates prevents analysis applications from generating wrong results based on the false assumption that each representation describes a different object. Fig. 1 shows an example duplicate-data scenario for a credit scoring company (a real-world application of data cleaning in such a scenario is described in [46]). The company stores information about persons, e.g., name and date of birth, as well as contracts the persons have with various institutions. During data analysis, a credit score is computed for each person. Intuitively, the score is higher the more banking-relevant contracts a person has. Eventually, credit scores affect business decisions made by customers of the credit scoring company. For instance, a bank will not grant a loan to a person who has a bad credit score. Considering the source data in Fig. 1, we see that the person named John Doe is represented twice. Each of these two representations is associated with contracts: Person 1, named J. Doe, has a bank account and a loan, whereas Person 2 is associated with a cell phone and a membership card for some department store.



An Overview of XML Duplicate Detection Algorithms

195

Person (PID, Name, DOB): (1, J.Doe, 13.06.1974); (2, John Doe, 13.6.74)
Contract (PID, CID, Description): (1, 1, Bank account); (1, 2, Loan); (2, 1, Cell phone); (2, 2, Membership card)
Data analysis yields Credit Score (PID, Score): (1, Good); (2, Bad)
Business decision: John Doe is not granted a loan; the bank loses potentially good business.

Fig. 1 Duplicate data on a credit scoring company
During analysis, the duplicates are considered as two distinct persons, so Person 1 has a good credit score, whereas Person 2 has a bad credit score. Assuming a bank performs a credit check based on John Doe's ID, hence knowing his first name, the result will be a bad credit rating and John Doe will thus not be granted a new loan. As a consequence, customers of the credit scoring company, e.g., banks, lose business due to erroneous source data, and the popularity of the company with people like John Doe, who are declined a loan, decreases. In this scenario, data cleaning involving duplicate detection would help the credit scoring company maintain customer satisfaction and a good image. Another application is data integration, which combines data from distributed and heterogeneous data sources into a unique, complete, and correct representation for every real-world object. The data integration process can be divided into three main steps. First, during schema matching and schema mapping, semantically equivalent attributes are determined before the data residing in the different sources is mapped into a common schema [37]. Next, duplicate detection takes place. Finally, based on the result of duplicate detection, data fusion combines duplicates into a single tuple [9]. Fig. 2 shows an example integration process, based on the credit scoring company considered previously. Assume this company, denoted Company A, acquires Company B. The latter also stores information about persons, and Company A wants to consolidate both Person tables (one from each source) into a single Person table. To do so, the data is first mapped to a common schema, in this case the schema of Company A (Fig. 2, Step (1)). Schema matching is used to determine that the concatenation of Firstname and Lastname is semantically equivalent to Name. Also note that during the mapping phase, the date of birth values are standardized. Once all the data is mapped to a common schema, duplicate detection determines that the tuples with ID = 1 and ID = P1 are actually duplicates (Fig. 2, Step (2)). Duplicate detection is a prerequisite for data fusion (Fig. 2, Step (3)), which in this example combines the found duplicates by taking the longer first name as the primary first name and the shorter one as an alias appended to the primary name. Although the origin of duplicates differs in the two scenarios (i.e., duplicates in one source vs. duplicates introduced through the integration of multiple sources), both scenarios have in common that duplicates must be detected in data that conforms to a single schema. This assumption underlies all methods we discuss in this chapter.



Person (Company A) (PID, Name, DOB): (1, John Doe, 13.06.1974); ...
Person (Company B) (PID, Firstname, Lastname, DOB): (P1, Jonathan, Doe, June 13, 1974); (P2, Peter, Miller, May 30, 1968); ...
(1) Schema matching & schema mapping yields the common schema (PID, Name, DOB): (1, John Doe, 13.06.1974); (P1, Jonathan Doe, 13.06.1974); (P2, Peter Miller, 30.05.1968); ...
(2) Duplicate detection identifies tuples 1 and P1 as duplicates.
(3) Duplicate fusion yields (Firstname, Lastname, DOB): (Jonathan (John), Doe, 13.6.1974); (Peter, Miller, 30.5.1968); ...

Fig. 2 A data integration scenario

1.2 Duplicate Detection

Duplicate detection has been studied extensively for relational data stored in a single table. Ironically, it has appeared under various names, including (in chronological order) record linkage [19], entity identification [29], deduplication [39], duplicate detection [7], object matching [15], entity resolution [4], fuzzy duplicate identification [12], object consolidation [13], reference reconciliation [16], and object identification [41], to name just a few. Algorithms performing duplicate detection in a single table generally compare tuples, each of which represents an object, based on attribute values. However, data usually comes in more complex structures. For instance, data stored in a relational table relates to data in other tables through foreign keys. Recently, duplicate detection algorithms for data stored in such complex structures have been proposed [1, 24, 44]. These approaches have in common that they consider not only attribute values, but also relationships to related data. Among such approaches, several focus on the special case of detecting duplicates in XML data. Detecting duplicates in XML is more challenging than in relational data because there is no schematic distinction between object types, among which duplicates are detected, and attribute types describing the objects. Furthermore, instances of the same object type may have different structures at the instance level, whereas tuples within relations always have the same structure. We call these challenges candidate-description ambiguity and structural diversity [44]. On the other hand, XML duplicate detection allows exploiting the hierarchical structure to improve both the runtime and the quality of the duplicate detection results, which is not the case when detecting duplicates in relational or graph-structured data. In this chapter, we discuss and compare several XML duplicate detection algorithms. The goal is to provide readers with an overview of existing approaches and with an understanding of their respective advantages and disadvantages. More specifically, we discuss DogmatiX [44], SXNM [36], a structure-aware duplicate detection algorithm [31], and an algorithm using Bayesian networks, which we refer to as XMLDup [27].



In an extensive experimental evaluation using both artificial and real-world data, we compare the above algorithms (all except the structure-aware algorithm of [31], for which we were unable to obtain a working version at the time of writing), both in terms of effectiveness and in terms of efficiency. Experiments show that, in general, DogmatiX is the most effective algorithm overall, as it yields the highest recall and precision values for various kinds of differences between duplicates. The XMLDup algorithm shows a very similar performance, yielding its best results at low recall values. However, when duplicates among several types of objects are detected simultaneously, DogmatiX is restricted to scenarios where parent-child relationships in the XML data reflect 1:N relationships of the real world. When, instead, an M:N relationship exists between parent and child, SXNM and XMLDup are applicable. SXNM is the most efficient, as it avoids the largest number of superfluous pairwise comparisons, but its effectiveness is greatly affected by errors in the data. Hence, if effectiveness is the primary concern, XMLDup should be used. The remainder of this chapter is structured as follows. In Sec. 2, we briefly survey research in duplicate detection. Sec. 3 then delves into the details of selected XML duplicate detection algorithms. A comparative evaluation of the discussed algorithms is the subject of Sec. 4. We conclude the chapter with a summary and a discussion of open issues in Sec. 5.

2 The State of the Art in Duplicate Detection Before focusing on XML duplicate detection approaches, we provide an overview of the general duplicate detection field in order to put the XML case in perspective within the research area. For a more complete and detailed discussion of related work, we refer readers to the survey by Elmagarmid et al. [18].

2.1 Classification of Duplicate Detection Algorithms To discuss duplicate detection algorithms, we resort to the three-dimensional classification that we introduced in [27], of which a more complete and fine-grained version is provided in [42]. Essentially, we classify algorithms along the dimensions data, algorithm type, and algorithm focus. Data. The data dimension specifies what type of data an approach applies to. We distinguish (i) data in a single table, without multi-valued attributes, (ii) tree data, such as the hierarchical organization of data warehouse tables or XML data, and (iii) data represented as a graph, e.g., XML with keyrefs or data for personal information management. Algorithm Type. The algorithm type dimension distinguishes three types of algorithms used for duplicate detection: (i) iterative algorithms, which iteratively


detect pairs of duplicates, (ii) clustering algorithms, which split or merge the data to obtain duplicate clusters, and (iii) machine learning algorithms, where models and similarity measure parameters are learned. Algorithm Focus. Research in duplicate detection focuses on three goals, namely (i) effectiveness, (ii) efficiency, and (iii) scalability. Research on the first problem is concerned with improving precision and recall [2] by minimizing classification errors, i.e., classifying two objects as duplicates when the representations do not describe the same object (false positives) and classifying two objects as non-duplicates when the representations in fact describe the same real-world object (false negatives). High effectiveness is, for instance, achieved by developing sophisticated similarity measures [7, 12] or by considering relationships [41]. Research on the second problem assumes a given similarity measure and develops algorithms that try to avoid applying the measure to all pairs of objects [1, 33]. Research on the third problem tries to develop algorithms applicable to very large datasets. For such methods, it is essential to scale not only in time by increasing efficiency, but also in space. To this end, relational databases are commonly used [21, 36]. In Tab. 1, we summarize some duplicate detection methods, classifying them along the two dimensions data and algorithm type. We further show whether an approach mainly focuses on effectiveness (Q, for quality), efficiency (T, for time), or scalability (S). Within the above classification, XML duplicate detection algorithms belong to the class of algorithms that operate on tree data. In principle, if we allow key and foreign key constraints specified, for instance, via key and keyref specifications in an XML schema, the logical structure can represent a graph. However, the corresponding finite instance can always be represented as a tree. We observe that all algorithms for tree-structured data are iterative duplicate detection algorithms. A more precise definition of this type of algorithm will be provided in Sec. 3.1. Among the methods summarized in Tab. 1, the methods specifically designed for XML duplicate detection are [27, 31, 36, 43, 44], marked with an asterisk in the table. As we will discuss these methods in more detail in the subsequent section, the remaining discussion in this section focuses on other selected algorithms.

2.2 Related Work Jin et al. [23] explore how existing techniques that map similarity spaces into similarity/distance-preserving multidimensional Euclidean spaces can be used for duplicate detection. They first consider rules to select attributes relevant for similarity measurement, and then process each attribute individually as follows. All attribute values are mapped to a high-dimensional Euclidean space in which possible duplicates are then detected based on their Euclidean distance. These candidate pairs represent a subset of the set of all possible pairs, and the similarity measure is then only applied to these pairs. Essentially, a low-cost similarity join is performed in Euclidean space to reduce the number of pairs to which the more complex similarity measure is applied.



Table 1 Classification of duplicate detection methods along the dimensions data and algorithm type. In parentheses, T stands for efficiency (time), Q for effectiveness (quality), and S for scalability. Methods specifically designed for XML duplicate detection are marked with an asterisk.

                                        DATA
ALG. TYPE     Table                     Tree                        Graph
Iterative     Her.95 [21] (T,S)         Ana.02 [1] (Q,T,S)          Bhat.04-07 [6] (Q)
              Mon.97 [33] (Q,T,S)       Weis04-05 [43, 44]* (Q,T)   Dong05 [16] (Q)
              Jin03 [23] (Q,T)          Puh.06 [36]* (T,S)          Weis06-08 [45, 22] (Q,T,S)
              Chau.03 [11] (Q,T,S)      Mil.06 [31]* (Q)
              Doan03 [15] (Q)           Leit.07 [27]* (Q)
              Whang.09 [47] (Q,T)
Clustering    Chau.05 [12] (Q,T,S)                                  Kala.06 [24, 13] (Q,T)
                                                                    Yin06 [48] (Q,T)
Learning      Sar.02 [39] (Q)                                       McC.03 [30] (Q)
              Coh.02 [14] (Q)                                       Sin.04-05 [40, 41] (Q)
              Elf.02 [17] (Q)                                       Bhat.06 [5] (Q)
              Bil.03 [7] (Q)                                        Min.06 [32] (Q)
              Leht.06 [26] (Q)

In [15], Doan et al. present PROM, a duplicate detection method that bases its decisions not only on similar information, but also on non-similar information. More specifically, domain-specific rules are applied over all attributes (similar and non-similar) to check if a duplicate decision is reasonable. For instance, consider a relation Person(age, name, salary). Assume that an algorithm detects tuples (9, Mike, ⊥) and (⊥, Mike, 200K) to be duplicates. Then, according to a Person profile existing in PROM, a rule stating that the combination of age 9 and salary 200K is highly unlikely may revoke the duplicate decision. Using such domain-specific profiles was shown to increase effectiveness. All of the above approaches require a pairwise similarity (or distance) measure and a fixed similarity (or distance) threshold. Setting the threshold manually is not straightforward, and different objects may require different thresholds. In [12], Chaudhuri et al. observe that duplicates of the same object usually have a small distance to each other and only have a small number of other objects within a small distance. Based on these properties, clusters of duplicates can be detected by appropriately setting the threshold (a distance radius) for every object. This method improves effectiveness, and the authors propose a technique to efficiently determine the appropriate thresholds. Machine learning duplicate detection approaches work by learning the various parameters of a duplicate detection process. For instance, MARLIN [7] includes learnable similarity measures that can be trained. An example is the learnable string edit distance, where the weight of each edit operation necessary to transform one string into another is adapted to a particular domain, using labeled sample data. Tuples, which consist of several attributes, can then be compared by considering individual attribute similarities, obtained with trained string similarity measures, in a final predicate, which is again learned from labeled sample data. MARLIN, like the other learning techniques, primarily has the goal of improving effectiveness.
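To make the idea of a trainable edit distance concrete, the following Python sketch shows an edit distance with per-operation weights. The function name and the fixed default costs are ours for illustration; in a learning approach such as MARLIN's, these costs are precisely the parameters fitted from labeled duplicate and non-duplicate pairs.

```python
def weighted_edit_distance(s, t, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Dynamic-programming edit distance with tunable operation costs.
    A learning approach would fit ins_cost/del_cost/sub_cost from data."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]  # d[i][j]: cost of s[:i] -> t[:j]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,   # delete s[i-1]
                          d[i][j - 1] + ins_cost,   # insert t[j-1]
                          d[i - 1][j - 1] + sub)    # substitute or match
    return d[m][n]

print(weighted_edit_distance("John S.", "J. Smith"))
```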



Having described iterative, clustering, and learning algorithms on table data, let us now shift the focus to algorithms on graph data. As a representative of an iterative algorithm on graph data, consider [16]. The algorithm proposed by Dong et al. performs duplicate detection in a personal information management scenario by using relationships to propagate similarities from one duplicate classification to another, within a dependency graph, thereby improving effectiveness. Essentially, pairs of candidates to be compared are maintained in an active priority queue, from which the first pair is retrieved at every iteration. The pair is compared and classified as a non-duplicate or a duplicate. In the latter case, relationships are used to propagate the decision to neighbors, which potentially become more similar due to the found duplicate. The priority queue is also reordered, with neighbors being sent to the front or the back of the queue, depending on the domain-dependent type of the relationships. Singla et al. [40, 41] propose a graph duplicate detection algorithm that considers pairs of candidates connected to each other in a reference graph. Similarities are then propagated to neighboring candidate pairs. This concept is similar to the one used in [16]. However, Singla et al. use an alternative method to detect duplicates and to propagate similarities. Their algorithm is based on Conditional Random Fields, whose parameters are learned.
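The queue-driven propagation scheme just described can be sketched as follows. The names and the interface (a `similarity` function over hashable pairs, a `neighbors` function returning related pairs) are our own simplifying assumptions, not the data structures of [16].

```python
import heapq
from itertools import count

def detect_with_propagation(pairs, similarity, neighbors, threshold=0.8):
    """Pop the most promising candidate pair, classify it, and on a
    duplicate decision re-queue related pairs with refreshed scores."""
    tie = count()                                   # tiebreaker for equal scores
    queue = [(-similarity(p), next(tie), p) for p in pairs]  # max-heap via negation
    heapq.heapify(queue)
    duplicates, done = set(), set()
    while queue:
        _, _, pair = heapq.heappop(queue)
        if pair in done:
            continue
        done.add(pair)
        if similarity(pair) >= threshold:
            duplicates.add(pair)
            # Propagation: neighboring pairs may now score higher,
            # so they are re-inserted with updated priorities.
            for nb in neighbors(pair):
                if nb not in done:
                    heapq.heappush(queue, (-similarity(nb), next(tie), nb))
    return duplicates
```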

3 Detecting Duplicates in XML Although duplicate detection techniques have been around since at least 1959 [34], duplicate detection in XML data has only recently started to receive attention. In this section, we present some of the most representative approaches to this problem.

3.1 First Approaches to XML Duplicate Detection Early work in XML duplicate detection was mostly concerned with the efficient implementation of XML join operations. A pioneering approach was presented by Guha et al. [20], who suggested an algorithm to perform approximate joins in XML databases. However, their main concern was how to efficiently join two sets of similar objects, not how accurate the joining process was. Thus, they focused on an efficient implementation of a tree edit distance [8], which could later be applied in an XML join algorithm. Accuracy was later addressed by Carvalho and Silva [10]. Although not specifically focused on XML, their work proposes a solution to the problem of integrating tree-structured data extracted from the Web. Two objects are compared by transforming each object into a vector of terms and using a variation of the cosine measure to evaluate their similarity [2]. Object structure is mostly ignored, and a linear combination of weighted similarities is used to account for the relative importance of the different fields within the objects. The authors show that this simple strategy manages to achieve high precision values on a collection of scientific publications. Nevertheless, because of its more general nature, their



approach does not take advantage of the many useful features existing in XML databases, such as their object structure or tag semantics. Only more recently has research been performed with the specific goal of discovering duplicate object representations in XML databases [27, 31, 36, 44] (note that [36, 44] are discussed in more detail in [42]). These works differ from previous approaches since they were specifically designed to exploit the distinctive characteristics of XML object representations: their structure, textual content, and the semantics implicit in the XML labels. As discussed in Sec. 2, all algorithms that have been developed for XML duplicate detection fall into the category of iterative duplicate detection algorithms. A characteristic of algorithms in this class is that they use a measure that computes a similarity score between two object representations. If the similarity is above a predefined threshold, the pair of object representations is classified as a duplicate. As there can be more than two representations of the same object, a common postprocessing phase of these algorithms is the computation of the transitive closure over the detected pairs of duplicates. The final result is a set of non-overlapping clusters, each cluster ideally containing all representations of the same real-world object. In the following sections, we explain each of these algorithms in more detail. To better illustrate the functioning of iterative XML duplicate detection algorithms, we start by presenting an example, first used in [27], that we will use throughout the chapter.
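The generic pipeline of an iterative algorithm, pairwise classification followed by a transitive closure, can be sketched in a few lines of Python. The code is a minimal illustration under our own interface assumptions (candidates given as a list of hashable objects, a pairwise `similarity` function); it is not taken from any of the discussed systems.

```python
def duplicate_clusters(candidates, similarity, threshold):
    """Compare all pairs, then compute the transitive closure of the
    detected duplicate pairs with a union-find structure, yielding
    non-overlapping clusters of candidate representations."""
    parent = {c: c for c in candidates}

    def find(x):                           # root lookup with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, u in enumerate(candidates):
        for v in candidates[i + 1:]:
            if similarity(u, v) >= threshold:
                parent[find(u)] = find(v)  # merge the two clusters

    clusters = {}
    for c in candidates:
        clusters.setdefault(find(c), []).append(c)
    return list(clusters.values())
```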

3.2 An Illustrative Example Fig. 3 shows the tree representation of two XML elements U and U′. Nodes are labeled by their XML tag name and contain an index, for future reference. Both trees represent XML elements named mv (an abbreviation for movie). These elements have two attributes, namely year and title. They nest further XML elements representing directors (dr) and casts (cst). A cast consists of several actors (ac), represented as children elements of cst. Year, title, dr, and ac have a text value that stores the actual data. For instance, title has a text value of “Pros and Cons” in both U and U′. In the following, we refer to the nodes in trees U and U′ by their tag name, adding their index number and tree name when necessary (e.g., mv1U or ac2U′). The goal of duplicate detection is to determine whether both movies are duplicates. This should be achieved despite the fact that the year and the directors’ and actors’ names are represented differently. Duplicate detection methods usually work by comparing the structure and/or the values in the XML elements. The individual similarities of each value, and possibly the structural similarities, are then combined, using a more or less complex strategy, to compute a final similarity value between the objects. Using the above example and notation, we now describe four state-of-the-art approaches for XML duplicate detection.


[Figure 3: Two XML trees, U and U′, each representing a movie (mv) with directors (dr), cast (cst), and actors (ac). Tree U: year 1983, title “Pros and Cons”, director “John S.”, and a cast with two actors, ac1 “Templeton P.” and ac2 “H.M. Murdock”. Tree U′: year 1984, title “Pros and Cons”, director “J. Smith”, and a cast with three actors, ac1 “T. Peck”, ac2 “B.A. Baracus”, and ac3 “Murdock”.]

3.3 The DogmatiX Framework Weis and Naumann [44] proposed one of the first frameworks capable of detecting duplicates in XML, called DogmatiX. Their framework aims at both efficiency and effectiveness in duplicate detection. The framework works in three main steps: candidate definition, duplicate definition, and duplicate detection. In the candidate definition phase, the goal is to define which object representations in the XML database are relevant for comparison. It consists of designating which parts of the XML schema of the database will be used in the ensuing comparisons. For example, considering the schema of the database of Fig. 3, if we were interested in finding duplicate actors, the candidate definition phase would take as input the set S = {/mv/cst/ac}, indicating the path to the elements to be compared. The result would be the set of all XML elements in the database that represent actors. In Fig. 3, this would correspond to the nodes labeled aci in both trees. This set is called the set of duplicate candidates. Once a set of duplicate candidates is specified, not all information contained in the XML elements may be useful to determine if they are indeed duplicates. This issue is handled in the duplicate definition phase, where we choose which elements of a candidate will actually be compared. Duplicate definition is done by defining a description for every candidate. This description is defined by a set of projection and selection operations performed on the candidate. In DogmatiX, these operations are represented as an XQuery expression. The result of applying the query is stored in a relation containing (text, xpath) pairs, called the object description (OD). The value xpath contains the absolute XPath expression of the element to compare and the value text contains the actual data. To illustrate, assume that both trees in Fig. 3 are in the set of duplicate candidates. It is reasonable to assume that the title of the movie and the director’s name are good sources of information to determine if candidate U is a duplicate of candidate U′. We therefore define their description by the set of operations yielding, as a result, the



title and director name of each movie. Tab. 2 shows the resulting object description relations.

Table 2 Object description relations for the XML elements in Fig. 3, considering that U and U′ are in the duplicate candidates set

Object   Object Description
U        { (“Pros and Cons”, /mv@title), (“John S.”, /mv/dr/text()) }
U′       { (“Pros and Cons”, /mv@title), (“J. Smith”, /mv/dr/text()) }
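For illustration, the following Python sketch extracts such an object description from a movie element using the standard library's ElementTree API; this stands in for the XQuery machinery of DogmatiX, and the actor values come from our reconstruction of Fig. 3.

```python
import xml.etree.ElementTree as ET

def object_description(movie):
    """Collect (text, xpath) pairs for a movie element, mirroring Tab. 2."""
    od = []
    title = movie.get("title")               # the @title attribute
    if title is not None:
        od.append((title, "/mv@title"))
    for dr in movie.findall("dr"):           # director text values
        od.append((dr.text, "/mv/dr/text()"))
    return od

u = ET.fromstring('<mv year="1983" title="Pros and Cons"><dr>John S.</dr>'
                  '<cst><ac>Templeton P.</ac><ac>H.M. Murdock</ac></cst></mv>')
print(object_description(u))
# [('Pros and Cons', '/mv@title'), ('John S.', '/mv/dr/text()')]
```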

Selecting the object descriptions need not be done by an expert. Instead, the authors propose and implement a set of heuristics to automatically choose the elements to be compared. Given a root element u, three heuristics are proposed:
• r-distant ancestors, which chooses the ancestors of u up to a distance of r;
• r-distant descendants, which chooses the descendants of u down to a distance of r;
• k-closest descendants, which chooses the first k descendants of u, starting from u and proceeding in breadth-first order (a code sketch of this heuristic is given after Fig. 4).
Fig. 4 shows the elements selected by each of the above heuristics when applied to tree U′ of Fig. 3, considering elements /mv/cst as duplicate candidates. The selection of descriptions as determined by a heuristic can be further refined based on either schema-based conditions (content model, data type, number of possible occurrences, etc. [44]) or on instance-based conditions (value distribution, number of nulls, relevance of values, etc. [42]).

[Figure 4: Object descriptions obtained when using the heuristics for description selection proposed in DogmatiX, with /mv/cst of tree U′ as the candidate. (a) The distant ancestors heuristic with r = 1 selects the mv element with its year (1984) and title (“Pros and Cons”); (b) the distant descendants heuristic with r = 1 selects the actor elements (“T. Peck”, “B.A. Baracus”, “Murdock”); (c) the closest descendants heuristic with k = 3 selects cst itself and its first two actors.]
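A minimal sketch of the k-closest-descendants heuristic follows; the Node class and its `children` attribute are an illustrative interface, not DogmatiX's actual data model.

```python
from collections import deque

class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def k_closest_descendants(root, k):
    """Collect the first k nodes reachable from root (root included),
    proceeding in breadth-first order."""
    selected, queue = [], deque([root])
    while queue and len(selected) < k:
        node = queue.popleft()
        selected.append(node)
        queue.extend(node.children)
    return selected

cst = Node("cst", [Node("ac1"), Node("ac2"), Node("ac3")])
print([n.label for n in k_closest_descendants(cst, 3)])  # ['cst', 'ac1', 'ac2']
```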

Having defined all object descriptions, we now proceed to the duplicate detection phase. DogmatiX starts by reducing the number of pairwise comparisons



needed by filtering out candidates that share few descriptions with all other candidates (and hence all comparisons involving them). This is achieved by defining a filtering function f(·), which computes the ratio between the amount of information in a given object description and the amount of information in all other object descriptions. The amount of information is measured using a variation of the well-known Inverse Document Frequency (IDF) measure [2]. Essentially, this so-called soft IDF treats similar values as the same term. After filtering, all remaining candidates are compared pairwise, and the pairs whose similarity is above a given threshold are considered duplicates. Note that the filter function f(·) is defined as an upper bound to the similarity measure used to classify candidates, so the filtering phase does not affect the effectiveness of duplicate detection. The similarity measure used by DogmatiX computes the relevance of similar object descriptions between two candidates, relative to the relevance of non-similar object descriptions. Similar object descriptions are determined by computing the normalized edit distance between their values. If the normalized edit distance is below a predefined distance threshold, the values are considered similar; otherwise they are non-similar. For both similar and non-similar descriptions, the relevance is given by the soft IDF measure. As an example, let us again consider the object descriptions and the candidates shown in Tab. 2. Further, we assume that the titles are recognized as similar, whereas the directors are not. DogmatiX determines the overall similarity of candidates U and U′ as

sim(U, U′) = softIDF(“Pros and Cons”) / [ softIDF(“Pros and Cons”) + softIDF(“John S.”) + softIDF(“J. Smith”) ]    (1)
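A sketch of this measure in Python follows. The soft IDF computation and the edit-distance-based similarity predicate are passed in as functions (`soft_idf`, `are_similar`); both names, and the greedy matching of description values, are our simplifying assumptions rather than the exact DogmatiX implementation.

```python
def dogmatix_similarity(od_u, od_v, soft_idf, are_similar):
    """Relevance of similar description values divided by the total
    relevance of similar and non-similar values, as in Eq. (1)."""
    similar, dissimilar, matched_v = [], [], set()
    for text_u, _ in od_u:
        match = next((tv for tv, _ in od_v
                      if tv not in matched_v and are_similar(text_u, tv)), None)
        if match is not None:
            matched_v.add(match)
            similar.append(text_u)       # counted once per matched pair
        else:
            dissimilar.append(text_u)
    dissimilar += [tv for tv, _ in od_v if tv not in matched_v]
    num = sum(soft_idf(t) for t in similar)
    den = num + sum(soft_idf(t) for t in dissimilar)
    return num / den if den else 0.0
```

With the descriptions of Tab. 2, a matched title, and unmatched directors, this reproduces Eq. (1).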

So far, we have discussed DogmatiX for the case where only one candidate type is defined during candidate definition. To detect duplicates in XML when multiple candidate types have been specified, we use a top-down algorithm, first proposed for XML data in [43]. We start by comparing candidates on the highest level of the hierarchy, e.g., movies in our example. Once we have detected duplicates among movies, we assume that duplicate children, e.g., equal casts, can only occur if the movies are the same. Hence, casts are only compared when their parent movies have been classified as duplicates. This assumption is valid when a parent-child relationship describes a 1:N relationship, and it improves efficiency since it saves expensive candidate comparisons. However, effectiveness potentially degrades when the relationship is not 1:N. Indeed, an actor can star in multiple movies, but with the above algorithm duplicate actors are not detected if the movies they star in are not themselves duplicates. This algorithm was inspired by the DELPHI system [1], where a similar top-down approach was used to detect duplicates in the dimensional tables of a data warehouse.



3.4 A Structure-Aware XML Distance Measure In [31], Milano et al. propose a distance measure between two XML candidates that takes into account both their structure and their data values. As is common to all iterative duplicate detection algorithms, this measure is used to perform a pairwise comparison between all candidates. If the distance measure determines that two XML candidates are closer than a given threshold, the pair is classified as a duplicate. This distance measure is defined based on the concept of overlays. An overlay between two XML trees U and V is a mapping between their nodes, such that a node u ∈ U is mapped to at most one node v ∈ V, and only if both have the same path from the root. This means that u and v can be mapped to each other only if all their corresponding ancestor nodes are also mapped to each other. Using the trees in Fig. 3, a possible overlay is shown in Fig. 5, where dashed lines indicate which nodes are matched.

[Figure 5: Optimal overlay between trees U and U′ from Fig. 3 when using Levenshtein’s distance [28] to compute the cost.]

As in the top-down approach described for DogmatiX, this definition of overlay improves efficiency and, potentially, also effectiveness by avoiding false positives. However, it also prevents finding duplicate actors that star in different movies. Hence, this algorithm is, like DogmatiX, only suited for cases where the objects described by the XML elements are in a 1:N relationship that is reflected by the parent-child relationships in the XML document. Two trees can have several overlays. An overlay is said to be complete if it is not contained in any other overlay. An overlay has a cost, which is defined as the sum of the string distances between the data in each mapped pair of nodes. The string distances can be measured using any string comparison function. An overlay is said to be optimal if it is complete and there is no other overlay with a lower cost. The proposed structure-aware distance between two XML trees U and V is thus defined as the cost of an optimal overlay between U and V. To illustrate the above ideas, consider the three overlays represented in Fig. 6.


[Figure 6: Examples of overlays between two small trees and their associated costs, computed as the sum of the edit distances between matched node labels. (a) An arbitrary overlay O1 with cost 2; (b) a complete overlay O2 ⊃ O1 with cost 3; (c) an optimal overlay O3 with cost 1.]

Fig. 6(a) shows an arbitrary overlay O1 between two trees. The roots match and have equal labels; hence, the cost of this match, computed as the edit distance between the two labels, equals 0. One level further below, we see that two nodes with respective labels y and x are matched. This results in a matching cost of 1. No further match exists on that level of the hierarchy, and we observe that the overlay is therefore not complete, because an overlay exists that consists of a superset of the matches in O1 (see Fig. 6(b) for a complete overlay). On the leaf level, note that we can only match children of y to children of x, because of the restriction that nodes can only match if all their ancestors match. In total, the cost of O1 equals 2, as we add up the edit distances of all matched labels. A complete overlay O2 is depicted in Fig. 6(b). It is a superset of O1, and it can easily be verified that there is no other overlay O such that O2 ⊂ O. The cost of this overlay is 3. This overlay is not optimal, because it does not have minimal cost, compared to the cost of other possible complete overlays. For instance, Fig. 6(c) shows another complete overlay O3 that has a cost of 1, which is less than the cost of O2. It can be verified that O3 is the optimal overlay in this example, and hence, both XML trees have a structure-aware distance of 1. Note that the overlay represented in Fig. 5 is an optimal overlay between the two trees U and U′ of our running example. Milano et al. [31] propose an algorithm to compute the optimal overlay between two XML trees, together with its cost. The algorithm works recursively, comparing the root nodes first, and then proceeding to their children, their children’s children, and so on. As it descends the tree, it computes the string distances between the values in the nodes. To match pairs of nodes, it uses a variation of the Hungarian method [25], which is able to discover the lowest-cost matching in polynomial time.
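The recursive structure of this computation can be sketched as follows. We use SciPy's linear_sum_assignment as the Hungarian step; the Node interface (`tag`, `text`, `children`) is an illustrative assumption, and the sketch omits the optimizations of the original algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edit_distance(s, t):
    """Plain Levenshtein distance; missing texts are treated as ''."""
    s, t = s or "", t or ""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def overlay_cost(u, v):
    """Cost of an overlay of u and v: children may only be overlaid under
    matching parents and equal tags, and each same-tag group is matched
    pairwise by a minimal-cost assignment (the Hungarian step)."""
    cost = edit_distance(u.text, v.text)
    shared_tags = {c.tag for c in u.children} & {c.tag for c in v.children}
    for tag in shared_tags:
        cu = [c for c in u.children if c.tag == tag]
        cv = [c for c in v.children if c.tag == tag]
        matrix = np.array([[overlay_cost(a, b) for b in cv] for a in cu])
        rows, cols = linear_sum_assignment(matrix)  # minimal-cost matching
        cost += int(matrix[rows, cols].sum())
    return cost
```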



3.5 SXNM: Sorted Neighborhood for XML Duplicate Detection SXNM (Sorted XML Neighborhood Method) [36] is a duplicate detection method that adapts the relational sorted neighborhood approach (SNM) [21] to XML data. Like the original SNM, the idea is to avoid performing useless comparisons between objects by grouping together those that are more likely to be similar. In a relational database, SNM works by first generating a key for each tuple to be compared. This key attempts to summarize the tuple contents. It can, for instance, contain parts of each of the tuple fields. The keys are then sorted lexicographically, which should group the keys of duplicates in nearby positions. To detect duplicates, we then only need to go through the set of sorted keys and compare the corresponding candidates to those whose keys are within a fixed-size window in the sorted list. If the keys were appropriately defined, this process avoids comparisons between very dissimilar candidates, thus speeding up the duplicate detection process while maintaining high effectiveness. The runtime complexity of this algorithm is dominated by the complexity of the sorting phase; hence, the total complexity is O(n log n), where n is the number of candidates. To compensate for a possibly poor choice of sorting key, SNM (as well as SXNM) can perform the above process in multiple passes, using multiple different keys. All duplicate pairs detected based on any key are combined into clusters by applying the transitive closure. Similar to the DogmatiX framework (see Sec. 3.3), SXNM takes as input an XML database and a definition of the duplicate candidates. From each duplicate candidate we can extract its object description. Differently from DogmatiX, however, object descriptions also include a relevance value, representing the relative importance of each field in the duplicate detection process. In the key generation phase, one or more keys are built for each candidate, summarizing the information in its object descriptions. For example, say we use the elements /mv@title and /mv@year to define the object descriptions of the movies in Fig. 3. Also, to summarize their information, suppose we use the first three characters of the first object description and the last two digits of the second to form the object key. The resulting descriptions and keys are shown in Tab. 3.

Table 3 Object descriptions and keys for the movies in Fig. 3

Object   Object Description           Key
U        {“Pros and Cons”, 1983}      PRO83
U′       {“Pros and Cons”, 1984}      PRO84
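A single sorted-neighborhood pass can be sketched in a few lines of Python. The key function implements the illustrative rule above; `compare` stands for any pairwise similarity decision, and the dictionary-based candidate representation is our own assumption.

```python
def generate_key(movie):
    """Illustrative key: first three characters of the title, uppercased,
    plus the last two digits of the year (cf. Tab. 3)."""
    return movie["title"][:3].upper() + str(movie["year"])[-2:]

def sorted_neighborhood(candidates, compare, window=10):
    """Sort candidates by key, then compare each candidate only to its
    neighbors inside a fixed-size window of the sorted order."""
    ordered = sorted(candidates, key=generate_key)
    duplicates = []
    for i, u in enumerate(ordered):
        for v in ordered[i + 1 : i + window]:
            if compare(u, v):
                duplicates.append((u, v))
    return duplicates

movies = [{"title": "Pros and Cons", "year": 1983},
          {"title": "Pros and Cons", "year": 1984}]
print([generate_key(m) for m in movies])   # ['PRO83', 'PRO84']
```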

Candidates are compared using a similarity measure that combines the similarities of the object descriptions and the similarities of their descendants, if they exist. The total similarity of the object descriptions is simply the average of the similarities of each object description, weighted by the given relevance value. The similarity of an individual pair of object descriptions can be given by any similarity measure, such as their string edit distance.



Since an XML element can have several sets of descendants (e.g., in the XML elements of Fig. 3, the mv element has two sets of descendants, mv/dr and mv/cst), the similarity of the descendants is computed by (1) applying a similarity function to each set and (2) aggregating the individual set similarities into a single value. Thus, if we are comparing two XML elements u and v, for the first case the authors suggest using the Jaccard coefficient of the descendant clusters, i.e., the ratio between the cardinalities of the intersection and of the union of the sets of duplicate clusters to which the descendants of u and v belong. For the second case, we can simply use the average of all set similarities. To illustrate, suppose we are comparing the elements /mv/cst of the XML elements in Fig. 3, according to their descendants. SXNM executes the detection process in a bottom-up fashion, hence it would have already discovered the duplicate /mv/cst/ac elements. Note that this bottom-up traversal distinguishes SXNM from the previously described algorithms and allows it to detect duplicate actors starring in different movies. Assume that this resulted in the duplicate clusters C1 = {ac1U, ac1U′}, C2 = {ac2U, ac3U′}, and C3 = {ac2U′}. To measure the descendant similarity between U and U′, we observe that the /mv/cst/ac elements in U occur in clusters κ = {C1, C2}, while the /mv/cst/ac elements in U′ occur in clusters κ′ = {C1, C2, C3}. Using the ratio suggested above, the similarity would therefore be |κ ∩ κ′| / |κ ∪ κ′| = 2/3.
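In code, this descendant similarity is a one-liner over the cluster sets; the example reproduces the 2/3 computed above.

```python
def descendant_similarity(clusters_u, clusters_v):
    """Jaccard coefficient over the duplicate clusters that the
    descendants of two elements belong to."""
    ku, kv = set(clusters_u), set(clusters_v)
    return len(ku & kv) / len(ku | kv)

print(descendant_similarity({"C1", "C2"}, {"C1", "C2", "C3"}))  # 0.666...
```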

3.6 Bayesian Networks for XML Duplicate Detection Using a similar bottom-up strategy, but based on a different principle, Leitão et al. [27] propose a Bayesian network model for XML duplicate detection, called XMLDup. Bayesian networks provide a graphical representation to specify a joint probability distribution [35]. This representation is based on a directed acyclic graph where a set of random variables makes up the nodes of the network. An edge from one node to another means that the first has a direct influence on the second. This influence is quantified through a conditional probability distribution correlating the states of each node with the states of its parents. The model of Leitão et al. is based on the assumption that whether two nodes are duplicates depends only on whether their values are duplicates and whether their children are duplicates. In Fig. 3, this means that whether the two movies (mv) are duplicates depends on whether or not their children nodes (dr and cst) and their values for the attributes year and title are duplicates. Furthermore, the nodes tagged dr are duplicates depending on whether or not their values are duplicates, and the nodes tagged cst are duplicates depending on whether or not their children nodes (tagged ac) are duplicates. This process goes on recursively until the leaf nodes are reached. The actual algorithm for determining if two candidates are duplicates can be implemented by the Bayesian network shown in Fig. 7. In this network, the node labeled mv11 represents the possibility of node mv1U being a duplicate of node mv1U′. This node has two parent nodes: Vmv11, which represents the possibility of the values in the mv nodes being duplicates, and Cmv11, which represents the possibility of the children of the mv nodes being duplicates.


[Figure 7: Bayesian network to compute the similarity of the trees in Fig. 3, as shown in [27]. Round nodes represent the duplicate probability of two nodes (mv11, dr11, cst11, ac11, ..., ac23) and of value sets (V) or children sets (C); rectangles represent the duplicate probability of two values (mv11[year], mv11[title], dr11[value], ac11[value], ..., ac23[value]). The root mv11 has parents Vmv11 and Cmv11; Cmv11 has parents dr11 and cst11; cst11 depends, through Ccst11 and the set nodes ac∗∗, ac1∗, and ac2∗, on the pairwise actor nodes.]

Following the same reasoning, node Vmv11 has two parent nodes, shown as rectangles in Fig. 7, which represent the possibility of the year values in the mv nodes being duplicates and of the title values in the mv nodes being duplicates. This can be repeated recursively until all node comparisons are represented. A slightly different procedure is taken when representing multiple nodes of the same type, as is the case of the XML nodes labeled ac. Since the full set of nodes needs to be compared, instead of each node independently, this is represented by the nodes ac∗∗, ac1∗, and ac2∗ in the network of Fig. 7. All nodes are assigned binary random variables, taking the value 1 to represent the fact that the corresponding data in trees U and U′ are duplicates, and taking the value 0 to represent the opposite. Thus, to decide if two candidates are duplicates, the algorithm has to compute the probability of the root node being a duplicate, P(mv11 = 1), which can be interpreted as a similarity value between the two XML elements. To obtain this probability, the algorithm propagates the prior probabilities associated with the network's leaf nodes, which set the intermediate node probabilities, until the root probability is found. This yields the following formula (further details of the derivation can be found in [27]):

P(mv11) = [ (P(mv11[year]) + P(mv11[title])) / 2 ] × P(dr11[value])
          × (1/2) × [ (1 − ∏_{i=1}^{3} (1 − P(ac1i[value]))) + (1 − ∏_{i=1}^{3} (1 − P(ac2i[value]))) ]    (2)
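As reconstructed above, Eq. (2) averages value priors, combines same-type child sets with a noisy-OR, and multiplies the resulting factors. The following sketch computes it directly; the function names and the example priors are ours.

```python
def noisy_or(probs):
    """Probability that at least one pairing is a duplicate: 1 - prod(1 - p)."""
    out = 1.0
    for p in probs:
        out *= 1.0 - p
    return 1.0 - out

def p_mv11(p_year, p_title, p_dr, p_ac1, p_ac2):
    """Eq. (2): p_ac1 and p_ac2 hold the prior duplicate probabilities of
    each actor of U paired with the three actors of U'."""
    values = (p_year + p_title) / 2.0
    children = p_dr * (noisy_or(p_ac1) + noisy_or(p_ac2)) / 2.0
    return values * children

# Example with moderately similar years, titles, directors, and actors.
print(p_mv11(0.75, 1.0, 0.4, [0.9, 0.1, 0.2], [0.1, 0.2, 0.8]))
```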



In order to decide if two XML candidates are duplicates, the prior probabilities in Eq. (2) need to be given a value. This can be accomplished using a similarity measure between the values in the corresponding XML nodes, with the restriction that it should return a value between 0 and 1 (0 meaning that the values are totally dissimilar and 1 meaning that they are exactly equal). Thus, we can execute the duplicate detection process using any measure, or any combination of measures. In [27], the authors use a normalized version of Levenshtein's string edit distance [28] for all values. Since the resulting Bayesian network has no cycles, computing the probabilities is linear in the number of network nodes. Although, in the worst case, this number of nodes can be quadratic in the number of nodes of the XML trees, in practice this would only occur if all the nodes in the XML objects were of the same type, which is unlikely. Also, all probabilities can be computed in linear or constant time, if we take the cost of computing the similarity between XML node values as constant.

3.7 Summary In this section, we discussed various XML duplicate detection methods. Before we evaluate them experimentally, we summarize the main features of the different algorithms in Tab. 4. More specifically, we highlight (i) the theoretical worst-case time complexity in terms of the number of comparisons performed by an algorithm, given n candidates as input, (ii) the similarity measure each method uses, (iii) the method employed to potentially save similarity comparisons and thus improve runtime, and (iv) the traversal order in the hierarchy. As a reminder, a top-down traversal order means that the algorithm is best suited for a nesting that reflects 1:N relationships, whereas a bottom-up traversal is advisable when the nested XML data actually represents objects in an M:N relationship.

4 Experimental Comparison To complement our overview of XML duplicate detection methods, we performed an empirical comparison of DogmatiX, SXNM, and XMLDup (Sec. 3.3, 3.5, and 3.6, respectively). To better understand how these approaches perform on data from several contexts, we used five different datasets from four distinct domains. In this section, we present the setup prepared for the experiments and show the effectiveness and efficiency results achieved by each method.

4.1 Datasets Our tests were performed using five different datasets, representing different data domains. The first dataset, IMDB, consists of a set of XML objects taken from a real database and artificially polluted by inserting duplicate data and different types



Table 4 Summary of XML duplicate detection algorithms

DogmatiX [43, 44]
  Time complexity (worst case): O(n²)
  Similarity measure: relevance of similar descriptions relative to the relevance of non-similar descriptions, where description similarity is computed based on the normalized edit distance and the relevance is based on soft IDF.
  Comparison pruning: exact filter defined as an upper bound to the similarity measure, plus pruning of children comparisons in case of a 1:N parent-child relationship.
  Traversal order: top-down

Structure-Aware [31]
  Time complexity (worst case): O(n²)
  Similarity measure: based on overlays that represent a 1:1 matching of XML nodes; requires parents to match in order to also match children elements. The similarity is computed from the cost of an optimal overlay; computing the cost involves comparing string values using edit distance and matching elements using the Hungarian algorithm.
  Comparison pruning: none
  Traversal order: top-down

SXNM [36]
  Time complexity (worst case): O(n log n)
  Similarity measure: weighted sum of description similarity and children similarity. Description similarity is the weighted average of description similarities using a secondary (string) similarity function; children similarity is the Jaccard similarity of the clusters the children belong to.
  Comparison pruning: heuristic based on key definition and window size.
  Traversal order: bottom-up

XMLDup [27]
  Time complexity (worst case): O(n²)
  Similarity measure: based on prior probabilities of children and attribute nodes that propagate through a Bayesian network; probabilities can be computed by any similarity measure returning a result between 0 and 1.
  Comparison pruning: none
  Traversal order: bottom-up

of errors. The last four datasets, Restaurant, Cora, IMDB+FilmDienst, and FreeDB, are composed exclusively of real data, containing naturally occurring duplicates. The artificial duplicates were generated using the Dirty XML Generator tool (http://www.hpi.uni-potsdam.de/naumann/projekte/dirtyxml.html). Each original object has exactly one duplicate in the database. This duplicate is a replica of the original object, but contains three different types of randomly generated errors: (i) typographical errors, (ii) missing data (e.g., a missing title), and (iii) duplicate erroneous data (e.g., the same movie containing two different titles). We now describe each dataset in detail. Table 5 shows a summary of their relevant statistics. Unless otherwise stated, in the artificially generated duplicates, typographical errors, missing data, and duplicate erroneous data occur with 20%, 10%, and 8% probability, respectively.



Table 5 Datasets used in the comparison experiments. Avg. depth is the average of the depths of each branch of the XML object, starting from the root node (at depth zero).

Dataset            # candidates   # duplicates   size (KB)   avg. depth
IMDB               5748           2874           9932.8      1.54
Restaurant         864            112            699.6       2.2
Cora               1878           1693           928.1       2.2
FreeDB             9763           298            7577.6      1.14
IMDB+FilmDienst    1000           500            1433.6      1.67

IMDB Dataset. IMDB is a movie dataset containing 2874 objects, randomly extracted from the Internet Movie Database (http://www.imdb.com/), plus their generated duplicates. The attributes in each object are: title, director, author, year, movie key, rating, genre, keywords, cast member, runtime, country, language, and certification. Restaurant Dataset. The Restaurant dataset consists of a collection of 864 restaurant records, composed by joining 533 and 331 non-duplicate records from Fodor's and Zagat's restaurant guides (the original data can be found at http://www.cs.utexas.edu/users/ml/riddle/data.html). The set contains 112 duplicates, yielding an average of 0.13 duplicates per object. Each object holds the following attributes: name, address, city, type, phone, code, and geographic coordinates. Cora Dataset. The Cora dataset contains bibliographical information, extracted from citations in scientific papers (the dataset used is available at http://www.hpi.uni-potsdam.de/fileadmin/hpi/FG_Naumann/projekte/repeatability/CORA/cora-all-id.xml; the original data can be found at http://www.cs.umass.edu/~mccallum/). It consists of 1878 objects, of which 1693 are duplicates. Thus, the average number of duplicates per object is approximately 9.15. Each object contains the following attributes: author, title, venue name, volume, and date. FreeDB Dataset. The FreeDB dataset contains 9763 objects, representing CDs, extracted from the FreeDB database (http://www.freedb.org/). Of these, 298 objects are duplicates (this dataset can be obtained at http://www.hpi.uni-potsdam.de/naumann/projekte/repeatability/datasets/cd_datasets.html). Each CD object contains the attributes artist, disc title, category, genre, year, CD extra information, and track titles. IMDB+FilmDienst Dataset. Finally, the IMDB+FilmDienst dataset was generated by integrating 500 movie objects extracted from the Internet Movie Database and the same 500 movies extracted from the Film Dienst website (http://film-dienst.kim-info.de/), yielding 1000



candidates (the dataset is available at http://www.hpi.uni-potsdam.de/naumann/projekte/repeatability/datasets/movie_dataset.html). After integration, the object attributes from the two datasets were mapped into the following attributes: year, title, aka-title, info, genre, and actor.

4.2 Experimental Setup Experiments were performed to compare the effectiveness and efficiency of the tested algorithms. To assess effectiveness, we apply the commonly used precision and recall measures [2]. Precision measures the percentage of correctly identified duplicates over the total set of objects determined as duplicates by the system. Recall measures the percentage of duplicates correctly identified by the system over the total set of duplicate objects. To compare the object data, all the tested methods consider the attribute values as textual strings and use the formula

p = 1 − ed(V1, V2) / max(|V1|, |V2|)    (3)

as the string similarity measure, where ed(V1, V2) is the string edit distance between values V1 and V2, and |Vi| is the length (number of characters) of string Vi. Due to resource constraints, it was not possible to perform all experiments in exactly the same environment. Nevertheless, we took some measurements of the algorithms' runtime, to give a sense of their efficiency from a user's perspective. Although not directly comparable, these numbers can still provide the reader with a sense of the time it would take to run each system in a real-case scenario. All algorithms are fully implemented in Java, using the DOM API to process XML documents. The SXNM algorithm also uses the DB2 database management system to process some of its internal data. The implementations used were kindly provided by the authors of the respective methods.
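For reference, a direct Python implementation of Eq. (3) looks as follows; the edit-distance helper is a plain Levenshtein implementation of our own.

```python
def edit_distance(s, t):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def string_similarity(v1, v2):
    """Eq. (3): p = 1 - ed(V1, V2) / max(|V1|, |V2|)."""
    if not v1 and not v2:
        return 1.0
    return 1.0 - edit_distance(v1, v2) / max(len(v1), len(v2))

print(string_similarity("John S.", "J. Smith"))
```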

4.3 Results We now present the experimental results obtained when comparing the three XML duplicate detection algorithms. We start by testing the impact of different levels of data quality on the performance of the algorithms. This was done on the artificially generated dataset, by varying the probability of occurrence of each type of error. Then, to compare the algorithms in a more realistic scenario, we perform experiments on the four real-world datasets. We conclude each section with a discussion of the results.

4.3.1 Impact of Data Quality

To test the impact of varying the quality of the data being processed, we used the IMDB dataset, varying the probability of occurrence of typographical errors,


Fig. 8 Precision and recall values for different amounts of typographical errors on (a) DogmatiX, (b) XMLDup, and (c) SXNM. For improved readability, only precision values above 90% are shown

missing data, and duplicate erroneous data. Each probability was varied between 0% and 50% when generating duplicates, while keeping the others at a fixed value. To detect duplicates, the DogmatiX and SXNM algorithms use all the attributes described in Sec. 4.1 as object descriptors. Additionally, SXNM uses a window of size 10, and its key is formed by the concatenation of the title and year attribute values. Regarding XMLDup, the model only considers the attributes title, director, author, and cast member. These combinations of parameters achieved the best results for all models. Fig. 8 shows the results obtained by the three approaches when varying the occurrence of typographical errors. As we can see, all models perform well, maintaining near-100% precision scores until late recall values. The DogmatiX system is able to hold 100% precision until approximately 85% recall, even when dealing with the highest amount of typographical errors. Interestingly, it shows some instability at low recall values, which reveals the presence of some false positives with high similarity scores.



Fig. 9 Precision and recall values for different amounts of duplicate erroneous data on (a) DogmatiX, (b) XMLDup, and (c) SXNM. Note that different precision and recall scales are used in the DogmatiX chart, in order to improve its readability

The SXNM model, on the other hand, despite also starting with high precision scores, shows a significant decrease in performance when typographical errors are introduced. This behavior is due to its method of comparing object pairs according to the proximity of their key values. Typographical errors in key attribute values can cause similar objects to be placed in far-apart positions, and thus never be reached by the comparison window. The XMLDup model seems to be uniformly reliable when dealing with this kind of error, maintaining high precision results until approximately 85% recall. Next, we observed the behavior of the three algorithms in the presence of duplicate erroneous data. We present the results in Fig. 9. Clearly, the algorithms do not suffer much from the introduction of erroneous data. This is confirmed by the high precision values, even when the data contains 50% erroneous fields. Contrary to the other two models, DogmatiX is capable of maintaining the same high precision values at high recall values as before. Nevertheless, an increase in the number of false positives with high similarity scores occurs, which explains the lower initial precision values.



Fig. 10 Precision and recall values for different amounts of missing data on (a) DogmatiX, (b) XMLDup, and (c) SXNM

For XMLDup the impact at high recall is more significant. Still, it maintains near-100% precision up to 80% recall. The results achieved by SXNM reveal that it is not affected by duplicate erroneous data, producing very similar output independently of the amount of errors. Finally, Fig. 10 presents the results for the last test performed on the IMDB dataset, in which we can see how the different algorithms deal with the absence of information. DogmatiX is capable of reaching high precision values for amounts of missing data below 40%. However, for higher values its initial precision starts to fall drastically, again revealing many false positives with high similarity values. XMLDup is able to maintain its initial 100% precision up to, at least, a recall value of 20%, in the worst-case scenario. The SXNM behavior is similar to that observed for XMLDup. In this case, however, precision starts to fall much sooner, even before 10% recall, for 40% and 50% of missing data. In sum, we can see that the amount of missing data affects the three algorithms much more than the remaining types of errors, whereas duplicate erroneous data



has very little impact. We can argue that DogmatiX is less precision-oriented at low recall values, but more resilient to data deterioration than the remaining models. SXNM seems to be the algorithm most affected by changes in the data, likely due to its initial filtering strategy, which depends heavily on a small set of attributes. Finally, XMLDup has a behavior similar to that of DogmatiX, although it is affected by errors mostly at high recall values.

4.3.2 Experiments with Real Data

To give a sense of the performance achievable by the three algorithms in a real-world scenario, we now present their evaluation on the real-world datasets. Results for the Restaurant Dataset To perform experiments on the Restaurant dataset, the SXNM model used a key formed by the name and address attributes, together with a window of size 10. The XMLDup and DogmatiX models used all the attributes described in Sec. 4.1. In Fig. 11 we show the results produced by the three models.

Fig. 11 Precision and recall values for the Restaurant dataset

Contrary to what we observed with the artificial data, DogmatiX was able to maintain a constant 100% precision score that only starts to fall after 50% recall. The remaining two models are also capable of maintaining near-100% precision scores before they start to produce false positives. Overall, all models handle this dataset similarly, easily detecting about half of the duplicate objects. Curiously, we observe that the algorithms that start losing precision sooner are also those that maintain higher precision at higher recall values. For DogmatiX, this happens because there is a relatively wide gap between the similarity scores of pairs of objects that are, in fact, duplicates and those that are not. The opposite holds for SXNM and XMLDup, where there is no clear separation between the scores.



Results for the Cora Dataset The Cora dataset was tested using all attributes. For this dataset we configured the SXNM model to use four keys, in multiple passes. The keys were formed by the following pairs: author and date, author and volume, author and venue name, and title and date. The size of the sliding window was 20. We present the results in Fig. 12.

Fig. 12 Precision and recall values for the Cora dataset

DogmatiX and XMLDup present very similar curves, with a smooth decrease of precision that reaches 50% recall with approximately 90% precision. On the other hand, the SXNM model remains close to the other two curves but falls sharply after passing 30% recall. Additionally, it is only capable of identifying 39% of the duplicates in the database. This low recall value can be explained by the fact that none of the available attributes is good enough to produce differentiating keys. Since this algorithm depends directly on such keys, its results are negatively affected. This lack of a group of distinctive attributes also explains the smooth curves presented by the remaining algorithms. Their progressive decrease in precision happens because false positives are equally distributed along the sorted result set. Another cause is missing data in some objects of this dataset, which tends to raise the similarity score of distinct object pairs. Results for the FreeDB Dataset The results of the experimental evaluation performed on the FreeDB dataset are shown in Fig. 13. The attributes used for XMLDup were disc title, artist, and track title. SXNM performed its filtering strategy using keys formed by the attribute groups artist plus year and genre plus year plus artist. In this case we used a window of size 5. On the FreeDB dataset, XMLDup starts with low precision scores, achieving the highest values only at about 20% recall. The fact that this dataset has many dummy



Fig. 13 Precision and recall values for the FreeDB dataset

values in the track title attribute (e.g., “track1”, “track2”, etc.) contributes to lowering the precision values, by causing a high number of false positives. This effect is also caused by the occurrence of many similar classical music CDs, such as multiple different editions of the “Best of Mozart”. DogmatiX is somewhat less affected by this phenomenon since it is able to use the soft IDF measure to counterbalance the effect (see Sec. 3.3). Nevertheless, we can see that both XMLDup and DogmatiX perform significantly better than the SXNM model, which falls linearly, reaching precision scores below 60% shortly after 20% recall. This is due to the fact that, although the chosen set of key attributes was initially enough to detect part of the duplicate objects, many other objects would require comparing the remaining attributes. Thus, XMLDup and DogmatiX are able to maintain performance, while SXNM drops rapidly. Although close to each other, XMLDup maintains consistently higher precision scores than DogmatiX, until about 75% recall. Results for the IMDB+FilmDienst Dataset We conclude this series of experiments with the real-world dataset IMDB+FilmDienst. For this last test, the XMLDup algorithm considered the attributes title, year, aka-title, and actor. The key used by the SXNM algorithm was year plus title, and the window size was 10. This final test further reinforces some of the conclusions stated before. Again we can observe the two patterns mentioned previously regarding XMLDup and DogmatiX: XMLDup is more precision-oriented and is therefore able to maintain steadily higher precision over the initial (low) recall range, while DogmatiX is more resilient to data deterioration and can more easily recover from precision drops, achieving high precision at high recall values. Again, the SXNM model proves sensitive to the data, achieving poor results on this dataset. An overview of this section shows that the experiments with real data yield very distinct sets of results. This is not unexpected, since we are dealing with very distinct



Fig. 14 Precision and recall values for the IMDB+FilmDienst dataset

domains. The SXNM algorithm showed poor performance on three of the test datasets, mostly due to the unavailability of relevant attributes to be used as keys. However, it is hard to say if this would be a problem in a real-life scenario, since it is possible that good keys will often be available. Also, further experiments would be necessary to determine exactly how this problem affects the results and how it can be minimized. The results of DogmatiX and XMLDup were very similar, with an apparent tendency of XMLDup to yield higher precision at low recall and of DogmatiX to show higher precision at high recall. Nevertheless, both models seem capable of dealing appropriately with real-world data, often reaching maximum precision values above 90%.

4.3.3 Experimental Runtime

To provide the reader with a reasonable perspective of the time the tested algorithms took to perform the duplicate detection process, we present the total runtime each one needed to produce the final outcome. Results are shown in Tab. 6. As mentioned before, these times are not directly comparable, but they can still provide a notion of the time it would take to run each system in a real-case scenario.

Table 6 Runtime spent to perform duplicate detection by DogmatiX, XMLDup, and SXNM

Dataset            DogmatiX    XMLDup        SXNM
IMDB               43m12s      19h17m22s     3m45s
Restaurant         1m58s       2m39s         26s
Cora               19m30s      36m23s        11m29s
FreeDB             5h41m34s    1d19h09m02s   5m54s
IMDB+FilmDienst    46m55s      2h33m52s      26s

We observe that XMLDup is the slowest system on all datasets. Computing the intermediate similarities requires building a model of the Bayesian Network for each



object, which, although linear in the number of nodes (for the tested cases), still adds a noticeable overhead. Also, unlike DogmatiX, all values are always compared and there is no provision to choose only a subset of nodes. In fact, about 85% of the time spent by XMLDup is in computing the prior probabilities. The fastest system is evidently SXNM. Its sliding window filtering strategy, although taking a toll on accuracy, clearly pays off in terms of efficiency. SXNM is capable of processing all datasets, on average, about 4 times faster than DogmatiX and 80 times faster than XMLDup. There is, therefore, a clear trade-off between precision and speed. SXNM is more appropriate for very large databases, where accuracy is not a fundamental requirement, whereas DogmatiX and XMLDup should be used in settings where high precision is needed and speed is not essential.

5 Conclusions and Future Directions

In this chapter, we discussed several algorithms to detect multiple representations of a real-world object, so-called duplicates, in XML data. Detecting duplicates is a crucial task in many applications, such as data cleaning and data integration. So far, all algorithms tackling the problem of XML duplicate detection belong to the class of iterative duplicate detection algorithms. Common to these algorithms is that they use a similarity measure to compare pairs of object representations, so-called candidates. If the similarity of a pair of candidates is above a predefined threshold, the candidates are considered to represent the same real-world object; otherwise, they are considered non-duplicates. Our discussion focused on four algorithms, namely DogmatiX [44], SXNM [36], XMLDup [27], and the structure-aware similarity measure of Milano et al. [31]. We provided a high-level description of these algorithms, summarized in Tab. 4, and performed an experimental evaluation on generated and real-world data. Experiments showed two main points. First, the effectiveness of all methods depends on the types of errors in the data. Indeed, we observe that the amount of missing data affects similarity measures much more than the remaining types of errors, whereas duplicate erroneous data has very little impact. Second, the DogmatiX system is the best in terms of effectiveness, as it obtains the highest recall and precision values in most scenarios. However, we only consider one candidate type throughout our experiments and, as established in [42], DogmatiX performs less well when multiple, interrelated candidate types that are not in a 1:N relationship are considered. In these cases, XMLDup and SXNM are applicable. In the tested scenarios, SXNM seems to be the algorithm most affected by errors in the data, whereas DogmatiX has a behavior similar to that of XMLDup, although being affected by errors mostly at low recall values. Not surprisingly, considering the theoretical complexity (see Tab. 4), our experiments on runtime show that with a small but reasonable window size, SXNM is the fastest system. Clearly, there is a trade-off between efficiency and effectiveness that has to be considered when choosing an appropriate XML duplicate detection algorithm.



One drawback of all approaches discussed in this chapter is that they all require setting various parameters, such as similarity thresholds, string similarity functions, weights of descriptions, etc. Furthermore, we have seen that not only the parameters, but also the choice of the algorithm itself depends on the scenario to which duplicate detection is applied. It is highly unlikely that even expert users can manually determine all parameters, or even the algorithm to use. An important improvement is therefore to support users with algorithms and tools to (semi-)automatically configure their duplicate detection tasks. In this chapter, we focused on duplicate detection as a batch process, where all duplicates in a data set need to be detected in one run. For data sets that change over time, i.e., through insertion, deletion, or update of the data, and that therefore require periodic cleaning, it would be helpful to have algorithms that incrementally detect duplicates. It is conceivable that such algorithms not only consider a snapshot of the data to classify object representations as duplicates, but also consider the changes from one snapshot to the next. This may make it possible to find additional duplicates, or even to revoke duplicates when newly inserted data suggests that candidates are no longer duplicates. Closely related to duplicate detection is the area of similarity search. Whereas duplicate detection detects duplicates in a batch process, similarity search aims at determining similar or even duplicate entries to a search query. The methods used for duplicate detection do not directly apply, as search is an on-line and time-critical process. Nevertheless, the metrics used to rank search results may be similar in spirit to similarity measures targeted towards duplicate detection.

Acknowledgments. The research performed on XML duplicate detection by M. Herschel was supported by the German Research Society (DFG grant no. NA 432).

References

1. Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Conference on Very Large Databases (VLDB), Hong Kong, China, pp. 586–597 (2002)
2. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston (1999)
3. Batini, C., Scannapieco, M.: Data quality – concepts, methodologies and techniques. Springer, Berlin (2006)
4. Bhattacharya, I., Getoor, L.: Relational clustering for multi-type entity resolution. In: KDD Workshop on Multi-Relational Data Mining (MRDM) (2005)
5. Bhattacharya, I., Getoor, L.: A latent Dirichlet model for unsupervised entity resolution. In: Conference on Data Mining (SDM), Bethesda, MD (2006)
6. Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data 1(1) (2007)
7. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Conference on Knowledge Discovery and Data Mining (KDD), Washington, DC, pp. 39–48 (2003)
8. Bille, P.: A survey on tree edit distance and related problems. Theoretical Computer Science 337(1–3), 217–239 (2005)



9. Bleiholder, J., Naumann, F.: Data fusion. ACM Computing Surveys 41(1), 1–41 (2008), http://doi.acm.org/10.1145/1456650.1456651
10. Carvalho, J.C.P., da Silva, A.S.: Finding similar identities among objects from multiple web sources. In: CIKM Workshop on Web Information and Data Management (WIDM), New Orleans, Louisiana, USA, pp. 90–93 (2003)
11. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: Conference on the Management of Data (SIGMOD), San Diego, CA, pp. 313–324 (2003)
12. Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: International Conference on Data Engineering (ICDE), Tokyo, Japan, pp. 865–876 (2005)
13. Chen, Z., Kalashnikov, D.V., Mehrotra, S.: Exploiting relationships for object consolidation. In: SIGMOD Workshop on Information Quality in Information Systems (IQIS), Baltimore, MD (2005)
14. Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, Alberta, Canada, pp. 475–480 (2002)
15. Doan, A., Lu, Y., Lee, Y., Han, J.: Profile-based object matching for information integration. IEEE Intelligent Systems 18(5), 54–59 (2003)
16. Dong, X., Halevy, A., Madhavan, J.: Reference reconciliation in complex information spaces. In: Conference on the Management of Data (SIGMOD), Baltimore, MD, pp. 85–96 (2005)
17. Elfeky, M.G., Elmagarmid, A.K., Verykios, V.S.: TAILOR: A record linkage tool box. In: International Conference on Data Engineering (ICDE), San Jose, CA, pp. 17–28 (2002)
18. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19(1) (2007)
19. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. Journal of the American Statistical Association (1969)
20. Guha, S., Jagadish, H.V., Koudas, N., Srivastava, D., Yu, T.: Approximate XML joins. In: Conference on the Management of Data (SIGMOD), Madison, WI (2002)
21. Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Conference on the Management of Data (SIGMOD), San Jose, CA, pp. 127–138 (1995)
22. Herschel, M., Naumann, F.: Scaling up duplicate detection in graph data. In: Conference on Information and Knowledge Management (CIKM), pp. 1325–1326 (2008)
23. Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Conference on Database Systems for Advanced Applications (DASFAA), Kyoto, Japan (2003)
24. Kalashnikov, D.V., Mehrotra, S.: Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems (TODS) 31(2), 716–767 (2006)
25. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2), 83–97 (1955)
26. Lehti, P., Fankhauser, P.: Unsupervised duplicate detection using sample non-duplicates. Journal on Data Semantics VII 4244, 136–164 (2006)
27. Leitão, L., Calado, P., Weis, M.: Structure-based inference of XML similarity for fuzzy duplicate detection. In: Conference on Information and Knowledge Management (CIKM), pp. 293–302 (2007)
28. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10, 707 (1966)
29. Lim, E.P., Srivastava, J., Prabhakar, S., Richardson, J.: Entity identification in database integration. In: International Conference on Data Engineering (ICDE), Vienna, Austria, pp. 294–301 (1993)



30. McCallum, A., Wellner, B.: Object consolidation by graph partitioning with a conditionally-trained distance metric. In: KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation, Washington, DC (2003)
31. Milano, D., Scannapieco, M., Catarci, T.: Structure aware XML object identification. In: VLDB Workshop on Clean Databases (CleanDB), Seoul, Korea (2006)
32. Minkov, E., Cohen, W.W., Ng, A.Y.: Contextual search and name disambiguation in email using graphs. In: Conference on Research and Development in Information Retrieval (SIGIR), Seattle, Washington, USA, pp. 27–34 (2006)
33. Monge, A.E., Elkan, C.P.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Tucson, AZ (1997)
34. Newcombe, H., Kennedy, J., Axford, S., James, A.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)
35. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 2nd edn. Morgan Kaufmann Publishers, San Francisco (1988)
36. Puhlmann, S., Weis, M., Naumann, F.: XML duplicate detection using sorted neighborhoods. In: Conference on Extending Database Technology (EDBT), Munich, Germany, pp. 773–791 (2006)
37. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal 10(4), 334–350 (2001)
38. Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin 23, 3–13 (2000)
39. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, Alberta, pp. 269–278 (2002)
40. Singla, P., Domingos, P.: Multi-relational record linkage. In: KDD Workshop on Multi-Relational Data Mining (MRDM), Seattle, WA (2004)
41. Singla, P., Domingos, P.: Object identification with attribute-mediated dependences. In: Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, pp. 297–308 (2005)
42. Weis, M.: Duplicate Detection in XML Data. WiKu-Verlag Verlag für Wissenschaft und Kultur (2008)
43. Weis, M., Naumann, F.: Duplicate detection in XML. In: SIGMOD Workshop on Information Quality in Information Systems (IQIS), Paris, France, pp. 10–19 (2004)
44. Weis, M., Naumann, F.: DogmatiX tracks down duplicates in XML. In: Conference on the Management of Data (SIGMOD), Baltimore, MD, pp. 431–442 (2005)
45. Weis, M., Naumann, F.: Detecting duplicates in complex XML data. In: International Conference on Data Engineering (ICDE), Atlanta, GA (2006)
46. Weis, M., Naumann, F., Jehle, U., Lufter, J., Schuster, H.: Industry-scale duplicate detection. In: Proceedings of the VLDB Endowment (PVLDB), vol. 1(2), pp. 1253–1264 (2008)
47. Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Conference on the Management of Data (SIGMOD), Providence, Rhode Island (2009)
48. Yin, X., Han, J., Yu, P.S.: LinkClus: Efficient clustering via heterogeneous semantic links. In: Conference on Very Large Databases (VLDB), Seoul, Korea, pp. 427–438 (2006)




Fuzzy-EPC Markup Language: XML Based Interchange Formats for Fuzzy Process Models

Oliver Thomas and Thorsten Dollmann

Oliver Thomas: Institute of Information Management and Corporate Governance, Chair in Information Management and Information Systems, University of Osnabrück, Katharinenstraße 3, 49074 Osnabrück, Germany, e-mail: oliver.thomas@uni-osnabrueck.de

Thorsten Dollmann: Institute for Information Systems (IWi) at the German Research Center for Artificial Intelligence (DFKI), Saarland University, Campus D 3 2, 66123 Saarbrücken, Germany, e-mail: thorsten.dollmann@iwi.dfki.de

Abstract. Recent research has led to proposals for the integration of fuzzy-based information and decision rules into business process models, using concepts based on fuzzy set theory. While the proposed fuzzy-EPCs provide an adequate method for the conceptual representation of fuzzy business process models, the issue of exchanging and transforming such models, together with their enclosed executable components, with other dedicated information systems, such as workflow management systems and fuzzy modeling tools, has not yet been approached. As a first step in this direction, our paper proposes a machine-readable fuzzy-EPC representation in XML based on the EPC Markup Language (EPML). We start with the formal fuzzy-EPC syntax definition and then introduce our extensions to EPML. An application scenario then highlights the potential and future application areas of the fuzzy-EPC schema.

1 Introduction

Many concepts have been developed for designing and improving business processes, generalizing them in reference models and using them in implementation projects. A large number of these approaches emphasize the intuitive usability of methods by approximating them to human ways of thinking. Decision-making situations, however, usually require the exact quantification of decision rules. Often, only uncertain, imprecise and vague information is available for business processes, whose procedures are frequently not technically determined. In addition, the underlying target system is usually characterized by vague formulations and implicit interdependencies. This is, for example, illustrated by the statement “the processing time for commissions with very high priority should be considerably reduced while retaining a high processing quality by proportionately reducing the processing intensity”.
In this example, neither the concrete parameter values of the named goals with regard to processing time and processing quality, nor the derived measures, can be quantified and thus made directly processable without loss of information. However, fuzziness is not only found in the meaning of the linguistic terms used when modeling expert knowledge; the knowledge formulated in this manner, and the decision-making process that draws on this knowledge, also contain fuzziness.

Fuzziness is usually defined by way of differentiation from deterministic, stochastic and uncertain states of information [22]. Here, fuzziness is seen as uncertainty with regard to data and its interdependencies. Different reasons for fuzziness can be identified in the business context [44]. First, fuzziness occurs due to the complexity of the environment and the limits of human perception when comprehending reality. The resulting informational fuzziness, determined by human language and thought, can be ascribed to a surplus of information [52]. Fuzziness also exists in human preference and goal conceptions. This leads to vagueness in the goal system, which is related to the informational fuzziness. The description of reality in natural language generates intrinsic (also: verbal or linguistic) fuzziness. The creation of a linguistic model and the context sensitivity of linguistic statements contribute to the creation of this fuzziness. Here, the cause of fuzziness is not in the language itself, but rather in the limitation and subjectivity of human reality perception [44]. Finally, fuzziness when comprehending reality results from the fact that data and relationships between data cannot or should not be recorded exactly. The use of inaccurate data can also be advantageous when suitable measuring methods are lacking, when the real world is characterized by high dynamics, or when dependencies exist that cannot be determined accurately. Humans tend to register reality with verbal descriptions, which is another reason for the intrinsic fuzziness described above.

The fuzzy event-driven process chain has been in discussion for over a decade as a means of considering this form of fuzziness. For example, Becker, Rehfeldt and Turowski [6; 23; 24; 25; 26; 27; 28] demonstrate the consideration of fuzzy data in process modeling with event-driven process chains, using the example of industrial order processing. Vague sales information transformed into tentative customer commissions is studied as relevant, exogenous input data with fuzziness in the form of uncertainty. Forte [9] also strives to extend the event-driven process chain in order to illustrate fuzziness. In doing so, he incorporates data and information objects into his approach. His extension is oriented towards keeping the structure and the logic of the EPC itself and, in addition, towards being able to derive new notations systematically and use them easily. A further approach to a fuzzy extension of the EPC is taken by Hüsselmann [12; 13]. At first, he sees no necessity to extend the EPC syntactically for the inclusion of fuzzy data in business processes. He also introduces methods for illustrating fuzziness in EPC models using new constructs, on the basis of the extension of Petri nets to fuzzy Petri nets [7; 11; 18; 35; 49; 50]: fuzzy functions, fuzzy events, fuzzy operators and the mapping of fuzzy resource capacities.
Thomas and co-authors [1; 3; 4; 5; 37; 38; 39; 40; 42; 43] investigate how fuzzy data and its implementation in application systems can be used for the design of knowledge-intensive and weakly structured business processes. They also use a fuzzy-extended EPC.



Existing works on fuzzy-EPCs have two essential similarities. First, modeling aspects are addressed in business process reengineering or when introducing ERP systems, i.e., these aspects apply to the build time of the process models. Second, the point in all of them is to embed the approaches successful in fuzzy logic for the control and regulation of the decision situations relevant for company processes. The latter thus applies to the runtime of the process models. Independent of the methodological basis of the existing works, almost all of the authors aim at the integration of two classes of tools: on the one hand, process modeling tools and, on the other, fuzzy systems. In doing so, the modeling-technical integration must be supported by a suitable information technology design. Still, despite the many studies available, up to now no interchange and storage format exists for the fuzzy-EPC that is supported by both classes of tools.

One requirement for the design of an interchange and storage format for fuzzy business process models is that the created models and model changes can be transferred and stored in standardized form, platform-independently, as XML documents. The specification of fuzzy-EPC models with an XML schema is the topic of our paper. EPML [21] forms the foundation for this specification. After introducing the fuzzy-EPC and its formal principles and syntactic elements in Chapter 2, related work on interchange formats for fuzzy systems is discussed in Chapter 3. The interchange format fuzzy-EPML is derived in Chapter 4 and parts of it are specified. An application scenario is presented in Chapter 5. The article closes in Chapter 6 with a summary and an outlook on future research challenges.

2 Fuzzy Event-Driven Process Chains

In formal notation, an EPC model is a quadruple

EPC = (E, F, C, A).

E is thereby a finite (non-empty) set of events, F a finite (non-empty) set of functions, and C = C_AND ∪ C_OR ∪ C_XOR a finite set of logical connectors, whereby C_AND, C_OR and C_XOR are pairwise disjoint subsets of C. A ⊆ (E × F) ∪ (F × E) ∪ (E × C) ∪ (C × E) ∪ (F × C) ∪ (C × F) ∪ (C × C) is a set of edges. The relation A specifies the set of directed control flow edges (arcs), which connect functions, events and connectors with each other. V = E ∪ F ∪ C is the set of all nodes of the EPC model [16; 29]. Further statements result through the use of the EPC as the central modeling language within the Architecture of Integrated Information Systems (ARIS) [32; 33]. These are based on the ARIS view concept. They are made through the annotation of other language constructs on EPC functions [34]. For example, language constructs that represent the environment data, news, manpower, machine resources and computer hardware, application software, outputs in the form of contributions in kind, services and information services, financial resources, organizational units or corporate goals are recommended.
The linkage of these constructs, which can only take place with functions of the EPC, is created with edges which, in addition to the control flow already introduced, can be differentiated into organization/resource, information, information services and contribution in kind, as well as financial resources flows. In this article, we choose the EPC elements of the organization, data and output views as additional artifacts for process modeling, add them to the formal representation of the EPC and, in a next step, enrich them with attributes. This extension will be drawn upon later for the demonstration of the exemplary processing of fuzziness in business processes. For this, we introduce an EPC model extended by ARIS language constructs as a tuple

EPC_ARIS = (E, F, C, A, O, D, L, R).

(E, F, C, A) is an EPC model with the set of control flow nodes V = E ∪ F ∪ C and the set of control flow edges A. The node sets, which represent the artifacts of the organization, data and output views, are O for the set of organizational units, D for the set of data objects, and L for the set of outputs. It is required that the sets O, D and L are pairwise disjoint. The set R contains sets of relations, which assign the various artifacts to functions (for further details cf. [43]). We define a fuzzy-EPC model

FEPC = (E, F, C, A, O, D, L, R, M, FC)

as an ARIS EPC model enriched with the following properties: M is the set of fuzzy attributes of the fuzzy-EPC model FEPC. The term “fuzzy attribute” refers to two aspects here. First, one assumes that the value domains of the attributes are not necessarily crisp sets, but rather may consist of fuzzy sets. Second, the attributes can be interpreted as linguistic variables. This implies that the name of the linguistic variable corresponds to the name of the attribute and that the value domain of the attribute is, at the same time, the basic set of the linguistic variable. O, D and L are the sets of organizational units, data objects and outputs, which contain the fuzzy organizational units, fuzzy data objects and fuzzy outputs, respectively. A fuzzy organizational unit, fuzzy data object or fuzzy output is an organizational unit, data object or output with fuzzy attributes. FC is a set of fuzzy systems; the possible input and output quantities of such a system are restricted by the function it is assigned to. F is the set of fuzzy functions of the EPC model. A fuzzy function is characterized either by one or more fuzzy attributes or by the assignment of a fuzzy system FS ∈ FC for decision support on the basis of fuzzily formulated rules during process execution. Thereby, all of the organizational units, data objects and outputs of the EPC model whose attributes represent the input and output quantities of the assigned fuzzy system must be connected with this fuzzy function via an edge. If the fuzzy system is used directly as a classifier for the decision on the next control flow, then only the events following this function may occur in the conclusion part of the rules.
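For concreteness, a minimal, purely illustrative instantiation of the control flow part of such a model, loosely based on the customer order process of Fig. 1 (the element names are abbreviated for this sketch and are not part of the formalism), could read: E = {customer calls, order defined, order accepted, order rejected}, F = {define order, check order}, C = C_XOR = {c1}, and A = {(customer calls, define order), (define order, order defined), (order defined, check order), (check order, c1), (c1, order accepted), (c1, order rejected)}, so that the XOR connector c1 routes the control flow from the fuzzy function “check order” to one of the two subsequent events.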



The set R contains sets of fuzzy relations between control flow objects and the various artifacts. The relations in the crisp model can thus be seen as a special instance of the fuzzy case, in the sense of Zadeh's extension principle. Fig. 1 shows an example of the reference process for customer order processing. The process is represented in the form of a fuzzy-EPC. The fuzzy constructs of the EPC are characterized by grey shading. After defining the customer order, its acceptance is checked. The checking of the individual functions in the “crisp” process is, however, extended by checks pertaining to the value of the order and the customer rating. These checks are not modeled as “subordinate” activities of the customer order check, but rather as fuzzy object attributes of the respective data object and input types, in the form of linguistic variables. In Fig. 1, the object attribute “Order value” of the data object type “Customer order” is shown. It has “very low”, “low”, “medium”, “high” and “very high” as linguistic terms. From the previous remarks it becomes clear that the integration of a tool for process modeling with the consideration of fuzzy data in the given working environment is of high importance. This integration can only be assured with an adequate information technology design. This concerns ergonomic user interfaces, as well as open interfaces to existing IT systems, and implies the usage of state-of-the-art technologies.

Fig. 1 Example of a fuzzy-EPC model (customer order processing: the functions “Define Customer Order”, “Check Customer Order”, “Accept Customer Order” and “Reject Customer Order”; the linguistic variable “Order Value” with the terms “very low”, “low”, “medium”, “high” and “very high” over the domain 0–100,000; a procurement fuzzy system with decision tree; and the rule base “Customer Order” mapping customer rating and order value to an estimation)

The fuzzy modeling tool should not only consider the fact that the process models have to be created with the usage of a modeling language (model development), but also that the process models shall be used (model usage) within an implementation project or when introducing standard software.
Because the period of time between the development and the application of a model can be quite long, a storage format has to be used which is largely independent of the short innovation cycles in information technology. Beyond this, the chosen storage format, used as a description language, has to be license-free, as well as platform- and manufacturer-independent.

To consider fuzziness in process modeling, and in amendment to the discussion taking place in the literature for over a decade [6; 9; 12; 26], the fuzzy event-driven process chain is presented in this contribution. Existing approaches, like the presented amendment of the fuzzy-extended EPC, have two basic similarities. First, they deal with methodological and content-functional problems that must be answered in business process reengineering or during the implementation of ERP systems, i.e., aspects concerning the build time of the process models. Second, they deal with embedding approaches successful in fuzzy set theory for control and regulation in relevant decision situations in business activities; the latter concerns the runtime of process models. Independent of the methodical basis of existing research, almost all authors target the integration of two tool classes: on the one hand, process modeling tools and, on the other, fuzzy information systems. The integration of the modeling techniques is, not least, made possible by a suitable information technology configuration. So far, no interchange and storage format for the fuzzy-EPC exists that is supported by both tool classes, despite the numerous available studies. It is desirable for the development of an interchange and storage format for fuzzy business process models that the generated models and model changes can be transmitted and stored as XML documents, in a standardized format and platform-independently.

3 Interchange Formats for Fuzzy Systems

There are a few articles in the literature that establish interchange formats in the field of soft computing on the basis of the Extensible Markup Language. For example, Lee & Fanjiang [17] developed a specific XML schema to capture fuzzy information, using a fuzzy-theoretical concept, in an object-oriented data model. This is a model-oriented specification that allows one to determine the affinity of objects based on their attributes. Thus, the degree of membership of objects to classes, or between classes and their super- or subclasses, is reviewed. Rules based on linguistic expressions are also established to express relationships between the attributes of a class. However, beyond the specification of fuzzy attributes, the approach does not offer a schema on the basis of which it would be possible to conceive a comprehensive specification of fuzzy systems with modules for inference. A similar focus can be identified in Witte [51], who introduces new data types for a fuzzy XML format with the help of a Document Type Definition (DTD), to allow data interchange between different fuzzy information systems. Here, object-oriented fuzzy models are discussed. The focus is also on fuzzy approaches that, beyond attribution, address class hierarchies and polymorphism, or the gradual degree of membership of a class.
The central question of the fuzzification of conceptual data modeling also comes to the fore in Ma [19; 20]. Thus, from the view of fuzzy process modeling, little knowledge can be transferred to a domain-specific XML schema. Whereas the proprietary schematization of fuzzy attributes is partly established, the focus of these works lies, from the process modeling point of view, primarily on the underlying data structure relationships. Therefore, in the following we only test interchange formats for the specification of fuzzy systems, tailored especially to the application aspect of fuzzy process modeling, for their suitability and integratability into EPML.

One of the first dedicated schemas for the representation of fuzzy systems in XML was offered by de Soto, Capdevila & Fernández [8] with iXSC (Extensible Soft Computing Language). Their approach follows the principle of reusability. In analogy to the logical structure of EPML, the composition of complex constructs takes place on the basis of simple elements and with the help of type definitions. This way, at the topmost level, the component fuzzyRulesSystem is generated, which is composed of cohesive blocks of inputs and outputs, fuzzy rule bases, a linguisticContext (a collection of variables) and a bindings construct. With the help of those “bindings”, variables, terms and operators of a rule base – there defined as parameters – are connected with concrete specifications. Figuratively, the names used as representatives are assigned to the concretely defined elements in the rule base. This is a name-based instantiation and referencing concept used to bind the parameter-based specifications of the rule parts to concrete operators, or to assign them to the variable specifications. iXSC takes a rather abstract approach with regard to the representation of operators and membership functions. For this purpose, the abstract types defuzzOperator and membershipFunction are defined, which can be arbitrarily refined in the instance document. A concrete choice of the most common operators and types of membership functions, or their schema description, is not given and is therefore not available as a standard. In Version 1.0 one can see that the demand in this direction has been recognized and that such definitions for membership functions and operators are being strived for. In separate documents, a schematization of membership functions and operators is begun, without however reaching the stage of universality or representing the elementary standard repertoire. For this reason, this format is not suitable for acting as a generic interchange or interface format without manual intervention. Therefore, beyond the possibility of manually embellishing those parameters with structure descriptions, a substantial standard operator library and a schema definition for the popular parameterized membership functions shall be provided in the schema of EPML. iXSC goes one step further, beyond the schematization of fuzzy systems: the vocabulary allows not only the description of fuzzy systems, but already contains language modules for the definition of neural networks [8]. Hence, there is future potential for the integration of learning methods.

Turowski & Weng [48] also face the need for the integration of fuzzy application systems on different platforms and conventional application systems, with the definition of XML data types for a fuzzy rule basis in Document Type Definition form.
Their approach does not include any cross-references in the schema, or similar binding concepts, so that redundant information (particularly linguistic variables and membership functions) in the sub-trees of a rule basis cannot be avoided. Whereas a set of fuzzy operators is syntactically specified, with the possibility of declaring needed parameters, the schema draft is restricted to only two general types of membership functions. Discrete membership functions can be declared via a direct assignment of membership degrees to each object in the domain. In the second case, that of continuous membership functions, only distinctive x-coordinates can be tagged with membership degrees, from which the interim values can be calculated by interpolation. Linear membership functions can be defined with this. Semantic or syntactic restrictions on the particular gradient, for instance non-convex gradients, are therefore not expressible in the schema.

A further XML schema definition for a fuzzy rule basis is the ABLE Rule Language (ARL) from the IBM Corporation [14]. Here, a subset of languages for the specification of fuzzy if-then-else rules over exact and fuzzy input and output variables, with membership functions of particular types, is definable. This information can be gathered in an AbleRuleSet. With the ABLE Rule Engine it is possible to evaluate fuzzy rules. The XML schema of the rule basis does not contain any syntactic elements for the definition of the fuzzy inference. While ARL provides suitable modules for the syntactic integration of different exact and fuzzy rule and inference models, the language is not entirely applicable for the use case of a universal representation of fuzzy systems. However, the use of attribute types for the identification and referencing of information modules in the XML schema is favoured, so that rules refer to existing variable definitions, which makes it possible to avoid redundant information in the schema.

Tseng, Khamisy & Vu [47] deal explicitly with the problem of the universal representation of fuzzy systems in XML for the integration of different representation formats. Against the background of the growing quantity of product-dependent and therefore proprietary formats, the interchange between different application systems, as well as their extension with fuzzy concepts, shall be made easier. The current XML schema definition is provided online by Tseng & Khamisy [46]. It consists of numerous XSD files that can be imported into each other if necessary. A referencing and identification mechanism is not used, which is the central weak point of this approach, because the rule basis, as well as the definition of the input and output variables, are semantically and syntactically separated and are thus provided redundantly. A similar argument also applies to the declaration of the variable definitions and membership functions in the specific rule components. The strength of this approach is the powerful language definition for fuzzy operators and membership functions. In addition to predefined operators (but without the possibility of defining parameters, e.g., for the compensation operator), there is the possibility of assembling an operator on the basis of step-by-step definitions with the help of given arithmetic, logical and trigonometric operators.
Thus, many degrees of freedom exist, not only in the choice of fuzzy operators, but also with respect to the types of membership functions. On the basis of twelve standard functions (among others, Gaussian and trapezoidal functions), the previously mentioned mathematical operators can also be used for the step-by-step definition of membership functions.
One problem with the language definition of the predefined membership functions is that semantic information is not included in the identifiers of the parameters. The crest of a triangle function does not carry its functionality in its XML name but is, like the other two parameters of a triangle function, specified simply as a parameter. Therefore, ambiguity is created in the definition, and problems arise in the translation and transformation to other formats, because beyond the sequence of their appearance, no information about the type of a parameter is available to the human creator of scripts and translations between different XML formats. This problem can be solved by naming the parameters, as far as possible, after their functionality and thus following the convention of readability.

To sum up, it can be stated that many of the approaches discussed only provide proprietary language definitions for standard fuzzy systems. On the one hand, some approaches cannot avoid large amounts of redundant information, due to the fact that they do without identification and referencing mechanisms. On the other hand, some of the approaches do without a standardized vocabulary for the operators and membership functions and thus restrict the universality of the language definition. Therefore, a basic vocabulary of the most common characteristics should be available to standardize the interchange of standard fuzzy systems. A step towards the acceptance of this method in an economic environment is therefore to be seen in providing a standardized vocabulary at the interchange level. In addition, it should be emphasized that none of the existing approaches explicitly addresses the application field of fuzzy classification systems. This is justifiable from the fuzzy-theoretical point of view by the embedding of classification systems as special fuzzy rule systems. However, the present article represents a prime example of the usage of fuzzy classification systems with special decision rules concerning the completion of business processes in companies, and justifies their independent handling. Furthermore, “normal” fuzzy rules should play a role when representing multi-level decision hierarchies with multi-level fuzzy systems.

In conclusion, we would like to complete the remarks on interchange formats in the field of soft computing with a comment on the term “fuzzy-XML”. This word creation, which is used in many other papers [10; 15; 19; 20; 51], is inadequate for the domain discussed in this article because of its suggestion of a “fuzzification” of the XML language. Here, we do not intend to “fuzzify” relationships between XML elements or the application of attribute values, so that the XML language modules would experience an extension by “fuzzification”. In fact, we intend to draw upon XML as a meta-description language for the syntax definition of fuzzy information. Because, however, the description “fuzzy-XML” has already been established in the literature for the identification of XML-based exchange formats for fuzzy systems, it is also used here. In the following, when we use the term, we always mean the family of all markup languages for fuzzy systems defined on an XML
basis. The extension of EPML with language elements for the representation of the described fuzzy information, as the result of our construction, is also characterized by the term “fuzzy-EPML”.

4 Fuzzy-EPC Markup Language

4.1 Extension Strategy

In this chapter, EPML is systematically extended with structures for the consideration of fuzzy information and is thus developed into a representation format for the fuzzy-EPC. In analogy to the terminology used so far in the scope of the formal and meta-model-based extensions of the EPC to the fuzzy-EPC, the fuzzy extension of EPML shall be called fuzzy-EPML. In the following, the essential modifications to the XML schema of EPML are outlined and the respective design decisions justified. At this point we can assume, in a figurative sense, that the exchange of data between the components of a fuzzy information system and the systems for the support of process execution is done via XML. Thus, it must be shown how a custom-designed XML document can be systematically extended to include fuzzy information.

New elements must be created in EPML for the definition of fuzzy-EPML. These elements should represent the introduced language constructs of the fuzzy-EPC. These are, in particular, fuzzy attributes, fuzzy functions, fuzzy organizational units, fuzzy data and capacity objects, and fuzzy rule systems [41]. Nevertheless, the extension should be oriented towards compatibility. Process modeling tools that support EPML, but not potential fuzzy extensions, should continue to be able to process the model elements related to business-logic models as they have up to now. This demand for compatibility implies that no special elements, such as fuzzy functions, fuzzy organizational units or fuzzy data, can be implemented in EPML. These considerations fit the meta-model-based definition of the fuzzy-EPC as a modeling language well. Also, in the meta-model of the fuzzy-EPC no separate tagging of the mentioned fuzzy constructs occurs, but rather a meta-model-oriented integration of the EPC language constructs and the language artifacts common in fuzzy sets. Thereby it is possible to realize the fuzzy extension of EPML through the definition of additional XML attributes or elements in the XML schema of the particular EPC language artifacts, and thus through the assignment of fuzzy information to EPC models in EPML. The only exception to this procedure is an extended attribution for the ARIS language artifacts in EPML. For this, first a new abstract attribute type with the notation typeExtendedAttribute is specified on the basis of the attribute type typeAttribute, which already exists in EPML; then the special attribute types typeQualitativeAttribute, typeCrispAttribute and typeFuzzyAttribute are specified. The following chapter discusses this in detail. All in all, this results in the relationships shown in Fig. 2 between XML, fuzzy-XML, EPML and fuzzy-EPML. EPML and all of the derivatives belonging to fuzzy-XML are based on XML.
This correlation is also valid for the definition of the fuzzy-EPML format. The findings from research on the definition of interchange formats for fuzzy systems are taken into account in the planned extension of EPML to fuzzy-EPML. In this respect, fuzzy-XML as well as EPML are, in a figurative sense, components of fuzzy-EPML, which is developed from them by a change construction. This relationship is implied in Fig. 2 by an aggregation, according to the typical UML notation form.

Fig. 2 Fuzzy-EPML definition framework (XML defines fuzzy-XML, EPML and fuzzy-EPML; fuzzy-XML and EPML are aggregated into fuzzy-EPML)

4.2 Extended Attribution and Fuzzy Attributes

A basic concept for the definition of fuzzy event-driven process chains is the extension of the EPC with attributes, which has not been discussed in previous articles. The object types in EPC models, e.g., the individual data and capacity objects, as well as the organizational units, have attributes that are used for the description of the individual object instances and also for their representation in application systems. Each attribute has a domain, which defines the set of possible attribute values. For example, the domain for the attribute “contract sum” of a data object called “contract” can be defined as a set of integer values. In the same manner, the domain for the attribute “name” of the object type called “customer” can be defined over a set of alphabetic characters. The respective attribute values, for example the amount of sales in Euro [€] or the rating of the customer in percent [%], represent the relevant information at decision nodes and thus must be especially considered in the fuzzy-EPC approach. Due to this, the EPML schema attributes id and defRef, used for referencing in instance documents, are first added to the original attribute type typeAttribute, beyond its specification in EPML.

<xs:complexType name="typeAttribute">
  <xs:attribute name="typeRef" type="xs:string" use="required"/>
  <xs:attribute name="value" type="xs:string" use="optional"/>
  <xs:attribute name="id" type="xs:positiveInteger"/>
  <xs:attribute name="defRef" type="xs:positiveInteger" use="optional"/>
</xs:complexType>
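As a simple illustration of this base type, a conventional (crisp) attribute could then appear in an instance document as follows; the attribute name and the values are invented for this sketch:

<!-- Hypothetical instance fragment: a crisp EPML attribute carrying the added
     referencing attribute id (all names and values are illustrative) -->
<attribute typeRef="orderValue" value="75000" id="101"/>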

For the new, extended attribution concept, the focus is also on designing the extension in such a way that crisp EPML documents keep their validity in the sense of the new schema and can be validated against it. Therefore, the above attribute type was kept in its simple form. However, to use the special characteristics necessary for the representation of fuzzy-EPC models and their values, special attribute types have to be defined. This is carried out by first introducing a new abstract attribute type typeExtendedAttribute, which is derived from the EPML basic type typeAttribute. In contrast to its basic type, this type has the restriction that the attribute value may not exist.

<xs:complexType abstract="true" name="typeExtendedAttribute">
  <xs:complexContent>
    <xs:restriction base="typeAttribute">
      <xs:attribute name="value" use="prohibited"/>
    </xs:restriction>
  </xs:complexContent>
</xs:complexType>

Defining this type as an abstract XML data type prevents this auxiliary construct from being used in instance documents. On the basis of this construct, it is possible in a next step to define new attribute types by once again deriving, in the form of an extension, from the attribute type typeExtendedAttribute. These new attribute types are more specific and can contain special kinds of characteristic values. Among these are qualitatively described characteristics on the basis of the data type string, crisp numeric values on the basis of the data type double, and fuzzy attribute values based on the special type typeMembershipFunction; this means characteristic values are describable in the form of a membership function. In addition, it is possible to add a qualitative description to a fuzzy attribute via the XML attribute qualitativeLabel. The following listings show the special attribute types typeQualitativeAttribute, typeCrispAttribute and typeFuzzyAttribute for ARIS language artifacts in fuzzy-EPML.

<xs:complexType name="typeQualitativeAttribute">
  <xs:complexContent>
    <xs:extension base="typeExtendedAttribute">
      <xs:attribute name="value" type="xs:string"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

<xs:complexType name="typeCrispAttribute">
  <xs:complexContent>
    <xs:extension base="typeExtendedAttribute">
      <xs:attribute name="value" type="xs:double"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

<xs:complexType name="typeFuzzyAttribute">
  <xs:complexContent>
    <xs:extension base="typeExtendedAttribute">
      <xs:sequence>
        <xs:element name="value" type="epml:typeMembershipFunction"
                    minOccurs="0" maxOccurs="1"/>
      </xs:sequence>
      <!-- the original listing is truncated here; type xs:string is assumed -->
      <xs:attribute name="qualitativeLabel" type="xs:string" use="optional"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

Based on the mentioned extension, data and capacity objects, as well as organizational units, can be complemented with additional attribute types without changing or adding to the schema at any other position in EPML. EPML is already prepared for this with the reference to the type typeAttribute, so that it is possible to use the attribute types typeQualitativeAttribute, typeCrispAttribute and typeFuzzyAttribute, derived in two steps, at this point.

<xs:complexType name="typeExtension">
  <xs:sequence>
    ..
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="attribute" type="epml:typeAttribute"/>
    </xs:choice>
  </xs:sequence>
  ..
</xs:complexType>

Ultimately, only the special derivation of the basic type has to be referenced in the instance documents, using xsi:type. Besides the ARIS artifacts, which possess the type typeExtension in EPML and may have attribute elements at the respective position, the schema file allows the possibility of assigning fuzzy attribute types to functions and events as carriers of attributes. This is not intended in the developed methodical approach of the fuzzy event-driven process chain, because the selected way of decision support is built upon the attributes of the ARIS language artifacts “data object”, “capacity object” and “organizational unit”.
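To illustrate how the derived types are selected in instance documents, consider the following hypothetical fragment; the attribute name, the numeric values and the exact nesting of the membership function inside the value element are assumptions made for this sketch:

<!-- Hypothetical instance fragment: a fuzzy attribute on a data object.
     xsi:type selects the derived type typeFuzzyAttribute; the value element
     carries a membership function (here a trapezoid). All names and values
     are illustrative. -->
<attribute typeRef="orderValue" id="102" defRef="101"
           xsi:type="epml:typeFuzzyAttribute" qualitativeLabel="high">
  <value>
    <trapezoidFuzzySet zeroLeftX="40000" oneLeftX="60000"
                       oneRightX="80000" zeroRightX="100000"/>
  </value>
</attribute>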

4.3 Basic Elements of Fuzzy Systems

The syntactic modules on the fuzzy level considered in the model of the fuzzy-EPC also have to find their equivalent in the representation format EPML. This refers to the element types as well as to the connection types. On the top level, a generic type typeRuleSystem is first defined as an abstract element. It represents the basic type from which the special types of fuzzy rule systems in the fuzzy-EPC approach can be derived.
Analogous to the design decision in the meta-model of the fuzzy-EPC, the elements for the defuzzification method defuzzMethod, the fuzzy operators fuzzyOperators and the variable base variableBase, as well as the attributes id and defRef, are defined as common syntax elements in the basic type and complemented by individually differing elements in the derived types.

Fig. 3 Structure of the XML-schema definition of the abstract type typeRuleSystem (child elements: defuzzMethod; fuzzyOperators with accumulationOperator, aggregationOperator and implicationOperator; variableBase of type typeVariableBase containing one or more linguisticVariable elements; attributes: id, defRef)

In Fig. 3 the structure of the XML-schema definition of the abstract type typeRuleSystem is illustrated. For the representation, a tree structure as implemented in almost all XML editors (comparable to the navigable structure view of a file system) is used, in this case that of the platform-independent XML editor <oXygen/> (http://www.oxygenxml.com/). Basically, XML editors are computer programs for editing XML documents. In comparison to pure text editors, which only allow the entering of plain text, XML editors are enriched with functionalities that support the user in entering data whose correct structure is defined by a document type definition belonging to the document, or by an XML schema.

4.3.1 Defuzzification Methods

Widespread methods for defuzzification are provided in the extended schema file. These are standardized methods whose specification requires no further parameters. Because the name of the method is sufficient for data interchange and for interpretation by software tools, the type of the defuzzification method can be given as a simple string value.

<xs:simpleType name="typeDefuzzMethod">
  <xs:restriction base="xs:string">
    <xs:enumeration value="meanOfMaximum"/>
    <xs:enumeration value="leftOfMaximum"/>
    <xs:enumeration value="rightOfMaximum"/>
    <xs:enumeration value="centerOfArea"/>
    <xs:enumeration value="centerOfMaximum"/>
    <xs:enumeration value="linear"/>
  </xs:restriction>
</xs:simpleType>
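In an instance document, a rule system then selects a method simply by name, for example (a hypothetical fragment):

<!-- Hypothetical instance fragment: selecting a standard defuzzification method -->
<defuzzMethod>centerOfArea</defuzzMethod>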

4.3.2 Fuzzy Operators

When specifying a fuzzy system, operators must be designated which form the basis of the different standardized (e.g., minimum and maximum operators) or individual (especially parameterized) inference strategies. In accordance with the meta-model for the fuzzy-EPC, the accumulation, aggregation and implication operators must be specified in the instance documents. In EPML, a type typeOperator is designated, on the basis of which the individual operator types can be specified.

<xs:element name="fuzzyOperators">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="accumulationOperator" type="typeOperator"/>
      <xs:element name="aggregationOperator" type="typeOperator"/>
      <xs:element name="implicationOperator" type="typeOperator"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Operators based on the type typeOperator can be classic operators like minimum/maximum, algebraic product/sum, intersection/union according to LUKASIEWICZ, or drastic product/sum, or parameterized operators like the HAMACHER product/sum, the fuzzy-and/-or according to WERNERS, or the algebraic γ-connection. In addition to the identifying name given as a string, in the latter cases the respective degree of freedom must be specified in order to define an operator; in the case of an algebraic γ-connection, the parameter γ has to be fixed. In the extension of EPML, the individual parameters of the parameterized operators are specified through the declaration of the parameter's name, its value and a description.

<xs:element name="parameter">
  <xs:complexType>
    <xs:attribute name="name" type="xs:string" use="required"/>
    <xs:attribute name="value" type="xs:double" use="required"/>
    <xs:attribute name="comment" type="xs:NMTOKEN" use="optional"/>
  </xs:complexType>
</xs:element>
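As an illustration, a complete inference strategy could be serialized as follows; the exact nesting of standard and parameterized operators inside typeOperator is inferred from Fig. 4, and all names and values are invented for this sketch:

<!-- Hypothetical instance fragment: the operators of a rule system. The element
     structure is inferred from Fig. 4; names and values are illustrative. -->
<fuzzyOperators>
  <accumulationOperator>
    <standardOperator>Max</standardOperator>
  </accumulationOperator>
  <aggregationOperator>
    <parameterizedOperator name="compensatoryAnd">
      <!-- gamma is the single degree of freedom of the compensatory operator -->
      <parameter name="gamma" value="0.5" comment="compensation_degree"/>
    </parameterizedOperator>
  </aggregationOperator>
  <implicationOperator>
    <standardOperator>Min</standardOperator>
  </implicationOperator>
</fuzzyOperators>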

An operator-dependent schematization of these parameters is – in analogy to the previously mentioned fuzzy-XML approaches – not included in the EPML schema; in most cases it can be clearly interpreted at the tool level, because of the dominance and distribution of parameters with only one degree of freedom. An individual definition of inference strategies and their representation in the interchange format are made possible by the chosen open schematization. Fig. 4 illustrates the structure of the XML-schema definition.

4.3.3 Variable Basis

Another main component of each fuzzy rule system is the basis of the given variables, which are used to express the fuzzy information in fuzzy-EPC models. Such a basis of the type typeVariableBase is, as shown in the following listing, a sequence of linguistic variables, with reference to the corresponding schema specification.

Especially characteristic for the type typeLinguisticVariable is – beyond the id-, defRef- and comment-attributes that are obligatory for the central language elements in EPML – the attribute label, which carries the name of the linguistic variable and should match the attribute that is directly connected via the reference in the attributeIdRef entry. This connection links the "process world" to the "fuzzy world". With the attribute typeRef, the variable is linked to the type of the corresponding attribute defined in EPML and is thereby inserted into the EPML attribute collection. The content of a linguistic variable lies in the element domain as well as in the enumeration of one or more linguisticTerm elements (linguistic terms). The corresponding type definitions of these elements are introduced in the following. First, an abstract basis type typeDomain is constructed for the specification of the domain; the two usable types continuousInterval and discreteInterval are derived from it. While the minimum and maximum values as well as the unit of the domain must be given in both cases, the cardinality of the domain is added in the discrete interval.



  <xs:complexType abstract="true" name="typeDomain"/>
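The two derived interval types are not reproduced in the text. A plausible form of the derivation, assuming the element names min, max and units that appear in the instance documents of Section 5 (the name cardinality for the additional element of the discrete variant is our assumption, based on the description above), is the following:

  <xs:complexType name="continuousInterval">
    <xs:complexContent>
      <xs:extension base="typeDomain">
        <xs:sequence>
          <xs:element name="min" type="xs:double"/>
          <xs:element name="max" type="xs:double"/>
          <xs:element name="units" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:complexType name="discreteInterval">
    <xs:complexContent>
      <xs:extension base="typeDomain">
        <xs:sequence>
          <xs:element name="min" type="xs:double"/>
          <xs:element name="max" type="xs:double"/>
          <xs:element name="units" type="xs:string"/>
          <!-- for discrete domains the cardinality must additionally be given -->
          <xs:element name="cardinality" type="xs:positiveInteger"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>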

Fig. 4 Structure of the XML-schema definition for the fuzzy-operators. An operator is either a standardOperator, whose name attribute is restricted to Min, Max, algebraicSum, algebraicProduct, drasticProduct, drasticSum, boundDifference, boundSum, hamacherProduct, hamacherSum, einsteinSum, geometricMean or arithmeticMean, or a parameterizedOperator carrying one or more parameter elements (attributes name, value, comment), whose name attribute is restricted to wernersAnd, wernersOr, yagerUnion, yagerAverage, minMaxCompensation, productSumCompensation, compensatoryAnd or weightedCompensatoryAnd.



As shown in the following listing, a linguistic term is characterized by the attribute label as well as by an element membershipFunction. While the former carries the linguistic information in the sense of a pertinent name, the semantics of this name are captured by the second element, whose type is specified by typeMembershipFunction.

  <xs:complexType name="typeLinguisticTerm">
    <xs:sequence>
      <xs:element name="membershipFunction" minOccurs="1" maxOccurs="1"
                  type="typeMembershipFunction"/>
    </xs:sequence>
    <xs:attribute name="label" type="xs:string" use="required"/>
    <xs:attribute name="number" type="xs:integer" use="required"/>
    <xs:attribute name="id" type="xs:positiveInteger" use="required"/>
    <xs:attribute name="defRef" type="xs:positiveInteger" use="optional"/>
  </xs:complexType>

Special types of membership functions are implemented in the schema (Fig. 5). Each type is marked by its own parameters, which were again defined with the demand for "readability" in mind, both for data interchange and for interpretation by software tools. The type trapezoidFuzzySet, for example, is specified by the parameters zeroLeftX, oneLeftX, oneRightX and zeroRightX, whose functional semantics can be recognized directly from the names of the data elements.

trapezoidFuzzySet

rectangleFuzzySet

triangleFuzzySet

singletonFuzzySet

userDefinedFuzzySet

sFuzzySet

zFuzzySet

piFuzzySet

alphaCut

Fig. 5 Structure of a XML-schema definition for the membership functions
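By way of example, the trapezoid type could be derived from the abstract basis type roughly as follows; this is a sketch assuming xs:double parameter elements, and the published schema may arrange the parameters differently:

  <xs:complexType name="trapezoidFuzzySet">
    <xs:complexContent>
      <xs:extension base="typeMembershipFunction">
        <xs:sequence>
          <!-- x-positions where membership leaves 0, reaches 1, leaves 1 and returns to 0 -->
          <xs:element name="zeroLeftX" type="xs:double"/>
          <xs:element name="oneLeftX" type="xs:double"/>
          <xs:element name="oneRightX" type="xs:double"/>
          <xs:element name="zeroRightX" type="xs:double"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>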

The above list is intended as a universal collection of the most commonly used types of membership functions. Beyond this, further, application-independent types of membership functions can be introduced by deriving one's own forms from the basis type typeMembershipFunction. Thus, the list should not be understood restrictively and allows for supplemental extensions.

4.4 Control Flow Fuzzy Classification Systems

Fuzzy systems are applied to conditional XOR decision points in EPC models for decision support or for automated process execution. On the basis of the abstract basis type typeRuleSystem introduced above, the particular extensions must be added during its derivation.

  <xs:complexType name="typeFlowFuzzyClassificationSystem">
    <xs:complexContent>
      <xs:extension base="typeRuleSystem">
        <xs:sequence>
          <xs:element name="ruleBase" type="typeControlFlowFuzzyRuleBase"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

The control-flow fuzzy classification system is marked by a special type of rule base: the special rules of the type typeControlFlowFuzzyRule are merged into a corresponding rule base typeControlFlowFuzzyRuleBase in the schema element ruleBase.

  <xs:complexType name="typeControlFlowFuzzyRule">
    <xs:sequence>
      <xs:element name="antecedent" minOccurs="1" maxOccurs="1">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="fuzzyProposition" type="typeFuzzyProposition"
                        minOccurs="1" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="consequence" minOccurs="1" maxOccurs="1">
        <xs:complexType>
          <xs:attribute name="decisionToEvent" type="xs:positiveInteger" use="required"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="id" type="xs:positiveInteger" use="required"/>
    <xs:attribute name="defRef" type="xs:positiveInteger" use="optional"/>
    <xs:attribute name="comment" type="xs:NMTOKEN" use="optional"/>
  </xs:complexType>

Embedded in the EPML structure, the rules also have to carry the attribute id and can possess the attribute defRef. This makes it possible to identify rules unambiguously at the rule level, and to establish connections between the same rules in different systems. In particular, the methodical surplus of the fuzzy-EPC approach with respect to the simplified reuse of rules is thereby also supported in the interchange format. Such a rule is described at the uppermost level by the two elements antecedent and consequence. The first element represents the conjunctively linked part of the rule and contains one or more elements declared as fuzzyProposition. The corresponding type contains an atomic fuzzy expression, which is composed in the schema as a pair of the two elements linguisticVariableId and linguisticTermId. A statement such as "total order value is high" is thus expressed by a reference to a linguistic variable and to one of its terms. In the present case, the consequence part of the rule consists of a reference to an action; this information is stored in the attribute decisionToEvent. Elementary fuzzy statements in the rules are therefore represented in the schema on the basis of the type typeFuzzyProposition: a fuzzy statement like "the total order value is high" is built by associating the involved linguistic variable and its corresponding term, referring to the ids of the respective components in the variable base.

  <xs:complexType name="typeFuzzyProposition">
    <xs:sequence>
      <xs:element name="linguisticVariableId" type="xs:positiveInteger"/>
      <xs:element name="linguisticTermId" type="xs:positiveInteger"/>
    </xs:sequence>
  </xs:complexType>

This mechanism ensures that the linguistic variables only have to be specified once, in the variable base. The specification thus appears at only one position in EPML and does not have to be stored redundantly.

4.5 Classic Fuzzy Systems

After the first special form of fuzzy-rule systems, the control-flow fuzzy classification systems, was introduced in the previous section, this section specifies classic fuzzy systems, with which it is possible to aggregate information into intermediate quantities, or to represent fuzzy statements about the characteristics of attributes at the bottom level of the decision hierarchy. The basis is again the abstract type typeRuleSystem, which has to be complemented accordingly.

  <xs:complexType name="typeFuzzyRuleBase">
    <xs:sequence>
      <xs:element minOccurs="1" maxOccurs="unbounded" name="rule" type="typeFuzzyRule"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:positiveInteger" use="required"/>
    <xs:attribute name="defRef" type="xs:positiveInteger" use="optional"/>
    <xs:attribute name="comment" type="xs:NMTOKEN" use="optional"/>
  </xs:complexType>



Classic fuzzy systems possess a different kind of rule base, which is expressed in the schema by the type typeFuzzyRuleBase. Such a rule base represents a set of one or more special rules, which are filed in rule elements of the type typeFuzzyRule.

  <xs:complexType name="typeFuzzyRule">
    <xs:sequence>
      <xs:element name="antecedent" minOccurs="1" maxOccurs="1">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="fuzzyProposition" type="typeFuzzyProposition"
                        minOccurs="1" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="consequence" minOccurs="1" maxOccurs="1">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="fuzzyProposition" type="typeFuzzyProposition"
                        minOccurs="1" maxOccurs="1"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="id" type="xs:positiveInteger" use="required"/>
    <xs:attribute name="defRef" type="xs:positiveInteger" use="optional"/>
    <xs:attribute name="comment" type="xs:NMTOKEN" use="optional"/>
  </xs:complexType>

The basic composition of such rules, in the form of the identification attributes and the partition into an antecedent and a consequence part, complies with the control-flow-fuzzy-classification rules already described. The main difference of the rule type typeFuzzyRule is that the consequence part of the rules is represented by fuzzy expressions and therefore contains a fuzzyProposition element based on the corresponding type. This makes it possible to formulate statements about attributes – for example about an intermediate quantity when representing multilevel decision hierarchies – with the help of a set of rule systems that are hierarchically related to each other (a sketch of such a rule instance is given below).
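A hypothetical rule instance of this type, with invented ids, could read "if urgency is high then priority is high":

  <rule id="401">
    <antecedent>
      <fuzzyProposition>
        <linguisticVariableId>310</linguisticVariableId> <!-- e.g. "urgency" -->
        <linguisticTermId>311</linguisticTermId>         <!-- e.g. "high" -->
      </fuzzyProposition>
    </antecedent>
    <consequence>
      <fuzzyProposition>
        <linguisticVariableId>320</linguisticVariableId> <!-- e.g. "priority" -->
        <linguisticTermId>321</linguisticTermId>         <!-- e.g. "high" -->
      </fuzzyProposition>
    </consequence>
  </rule>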

4.6 Fuzzy Functions at Decision Points

The essential extension of EPML finally rests on the XML-schema definition of fuzzy functions; its basis was built with the schema modules already described. The extension consists in allocating at most one ruleSystem element to the corresponding process function. Technically, this happens with the choice construct, which declares such an element as admissible inside the function type in the schema. The element is based on the type typeRuleSystem, which – as already shown – is specialized technically and syntactically into the variants "fuzzy system" and "control-flow fuzzy classification". Both types, typeFuzzySystem and typeFlowFuzzyClassificationSystem, have already been introduced.

  <xs:complexType name="typeFunction">
    <xs:sequence>
      ..
      <xs:choice minOccurs="0" maxOccurs="1">
        <xs:element name="ruleSystem" type="typeRuleSystem"/>
      </xs:choice>
    </xs:sequence>
    ..
  </xs:complexType>

Besides the special types of fuzzy-rule systems derived within the framework of the Fuzzy-EPC approach, it would also be possible to embed an "exact" rule-based decision support approach, as contemplated, for example, by ROZINAT and VAN DER AALST [30]. This is not the focus of this research, but it emphasizes the connectivity of the chosen approach. With fuzzy-EPML it is now possible to represent Fuzzy-EPC models in a consistent XML format, so that this format can serve as a storage and interchange format for fuzzy-modeling tools. Thanks to the consistent use of XML, XSLT scripts can be used to generate other representation formats for fuzzy software tools or fuzzy application systems; a sketch of such a transformation is given below.
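As a minimal sketch of such a transformation – the target format here is chosen freely for illustration and namespaces are omitted for brevity – the following XSLT fragment extracts all linguistic variables of a fuzzy-EPML document together with their terms:

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>
    <!-- emit one variable element per linguisticVariable in the document -->
    <xsl:template match="/">
      <variableList>
        <xsl:for-each select="//linguisticVariable">
          <variable name="{@label}">
            <xsl:for-each select="linguisticTerm">
              <term><xsl:value-of select="@label"/></term>
            </xsl:for-each>
          </variable>
        </xsl:for-each>
      </variableList>
    </xsl:template>
  </xsl:stylesheet>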

5 Application Scenario

In the following, the use of the Fuzzy-EPC markup language is demonstrated in an application scenario. The listings show excerpts from the Fuzzy-EPML representation of the demo example for sales order checks. The focus is on the specification of the control-flow fuzzy classification system, with which decisions about the acceptance or rejection of orders are made. A first excerpt from the EPML representation is given in the following listing. It shows, by way of example, how the linguistic variable "order value" from Fig. 1 is represented in EPML.

  <linguisticVariable label="Order Value" id="345" attributeIdRef="200">
    ..
    <linguisticTerm label="high" id="346" number="4">
      <membershipFunction>
        <trapezoidFuzzySet>
          <zeroLeftX>52500</zeroLeftX>
          <oneLeftX>67500</oneLeftX>
          <oneRightX>77500</oneRightX>
          <zeroRightX>92500</zeroRightX>
        </trapezoidFuzzySet>
      </membershipFunction>
    </linguisticTerm>
    <linguisticTerm label="very high" id="347" number="5">
    ..
    <domain xsi:type="epml:continuousInterval">
      <min>0</min>
      <max>100000</max>
      <units>euro</units>
    </domain>
  </linguisticVariable>

According to the definition of such a variable, information about its name, its set of terms with their membership functions, and its domain has to be available. While the name of the variable "order value" is stored in the label attribute, the terms are defined consecutively as sub-elements of the XML tree. Characteristic is the direct assignment of the semantic information of a term in the form of a membership function, which is given as part of the term specification. Here, the membership function of the term "high" is recognizable in the form of a trapezoid function. This frequently occurring case can, as shown, be specified by a few parameters, so that the function shown in Fig. 1 finds a simple equivalent in fuzzy-EPML. Another semantic component of the linguistic variable "order value" is expressed in the shape of the domain in EPML: the use case is based on a continuous domain from 0 to 100,000 euro. The assignment of id information for the linguistic variables, attributes and terms makes it possible to reference the corresponding information within the XML document. Via the attributeIdRef field, the connection between the definition of the linguistic variable "order value" and the corresponding attribute definition of the homonymous process element is established. On the basis of these keys for the linguistic variables and terms, a redundancy-free storage of the fuzzy rules composed from them becomes possible.

Fig. 6 Representation of the decision hierarchy "deciding on order acceptance or rejection" with the help of a Fuzzy-EPC model



The fuzzy rules are of very high significance: they make the existing decision knowledge explicit on the basis of the linguistic variables and terms. Fig. 6 shows, in the form of a Fuzzy-EPC model, the representation of the use case and the underlying decision hierarchy, which was already indicated by the two-staged decision tree in Fig. 1. In the chosen example, a division of the fuzzy-rule systems is necessary in order to handle complexity. At the decision level, an aggregation first takes place in the fuzzy systems "define evaluation" and "define value", based on the customer and order attributes "customer estimation" and "order value", respectively "urgency" and "feasibility", on the first hierarchy level. The resulting intermediate quantities "evaluation" and "priority" are afterwards aggregated on the second hierarchy level, in a control-flow fuzzy classification system, into a final decision about order acceptance or rejection. Every step in the decision hierarchy corresponds to an inference on the basis of a rule system. Consequently, as intended in Fig. 6, a rule system with a corresponding rule base has to be specified. It is worth mentioning that such a decision hierarchy is not – as is often usual in fuzzy control – represented with the help of only one system and a set of rule blocks connected inside the system. Rather, the single aggregation points and the path through the decision hierarchy from bottom to top are stored as model parts of the function "check customer order". Thus an aggregation point finds its equivalent as a (fuzzy) EPC function in association with a fuzzy-system specification. The advantage of this approach is that in many cases it becomes possible to describe complex decision-finding processes through the decision hierarchy as an EPC model; in comparison to a more technically oriented approach based on abstract rule blocks, the comprehensibility of this approach is assured. The following listing shows an excerpt from the fuzzy-EPC representation of the demonstration example for checking a customer order. The focus here is on the specification of the control-flow fuzzy classification system, on the basis of which the decision about order acceptance or rejection is made.

  <epml>
    ..
    <epc epcId="1" name="Scenario Customer Order Check">
      ..
      <function id="6">
        <name>Check Customer Order</name>
        <toProcess linkToEpcId="2"/>
      </function>
      ..
    </epc>
    <epc epcId="2" name="Check Customer Order">
      ..
      <function id="23">
        <name>Decide on order acceptance or rejection</name>
        ..
        <ruleSystem xsi:type="epml:typeFlowFuzzyClassificationSystem" id="24">
          <defuzzMethod>linear</defuzzMethod>
          <fuzzyOperators>
            <accumulationOperator>
              <parameterizedOperator name="compensatoryAnd">
                <parameter name="gamma" value="0.5"/>
              </parameterizedOperator>
            </accumulationOperator>
            <aggregationOperator>
              <standardOperator name="maximum"/>
            </aggregationOperator>
            <implicationOperator>
              <standardOperator name="minimum"/>
            </implicationOperator>
          </fuzzyOperators>
          <variableBase>
            <linguisticVariable label="Estimation" id="300" attributeIdRef="40">
            ..
          </variableBase>
          <ruleBase id="361">
            <rule id="362">
              <antecedent>
                <fuzzyProposition>
                  <linguisticVariableId>300</linguisticVariableId>
                  <linguisticTermId>301</linguisticTermId>
                </fuzzyProposition>
              </antecedent>
              <consequence decisionToEvent="8"/>
            </rule>
          </ruleBase>
          <eventShortlist idRefs="8 9"/>
        </ruleSystem>
      </function>
      ..
      <event id="58" defRef="008">
        <name>Customer Order is to be accepted</name>
        ..
      </event>
      <event id="59" defRef="009">
        <name>Customer Order is to be rejected</name>
        ..
      </event>
      ..
    </epc>
  </epml>

It can be seen that the scenario "check customer order" from Fig. 1 is stored in EPML as an EPC model with epcId="1". The function "check customer order" is part of a hierarchy and is linked with the second EPC model in the EPML code, which represents the decision hierarchy about order acceptance or rejection as pictured in Fig. 6. Here, the specification of the fuzzy function "decide on order acceptance or rejection" of this EPC model is emphasized, which finally leads to the mentioned decision. The related control-flow fuzzy classification system is stored in fuzzy-EPML in the element ruleSystem. So that the model can act as the basis for inference at runtime, various fuzzy parameters have to be chosen, and it can be seen that these settings are stored in fuzzy-EPML through the declaration of the corresponding parameters. For example, linear defuzzification was chosen as the defuzzification method; likewise, the choice of operators needed for evaluating the rules is documented in the corresponding component fuzzyOperators. For instance, the output fuzzy sets of the single rules are to be linked on the basis of the compensatory conjunction with parameter γ = 0.5 (accumulation operator). While Fig. 6 shows a linguistic excerpt of the rule base, the listing clarifies the representation in fuzzy-EPML. Following the schema introduced for the control-flow-fuzzy-classification rules, the rules are split into antecedent and consequence. The fuzzy expressions contained in the present scenario, for example "rating is high" from rule 9 (Fig. 6), are expressed in fuzzy-EPML by linking the corresponding references to the underlying linguistic variable (rating, id=300) and its term high (id=301); the corresponding definitions are stored in the variableBase. For clarity, this is only indicated for the specified system. The consequence of a rule is the link to an event – in the case of rule 9, the event with id=58, which carries the identifier "customer order is to be accepted". Through the described components, the Fuzzy-EPC model is completely represented in fuzzy-EPML.

6 Summary and Outlook

This paper presented an XML-schema-based specification of the Fuzzy-EPC using the EPML format. We reported on the design of the Fuzzy-EPC compliant schema and showed the major syntactical extensions. Furthermore, we sketched a realistic example (sales order checks) showing that Fuzzy-EPML is able to serve as an adequate interchange format for fuzzy business process models. Since fuzzy business process models can now be transferred and stored in a standardized, platform-independent form as XML documents, several further tasks will be approached in the near future. First, there is not yet adequate tool support for the modeling of Fuzzy-EPCs; as the schema is now available, we are currently working on respective tool support. Furthermore, the task of transforming Fuzzy-EPC models for execution or for further analysis will be approached. Here we emphasize the need for further processing of model parts in fuzzy modeling tools, combining process improvement with learning algorithms. Since the lack of integration between fuzzy applications and common business application systems is thereby softened, the integration of other soft computing techniques to automatically discover process model parts and decision rules looks promising and is currently being evaluated.

A prospective challenge is mainly the question of whether the creation of adequate linguistic variables and rule bases is economically reasonable in fuzzy business process management. Particularly the assembly of fuzzy rules seems to be challenging in practice: any misbehaviour has to be analyzed and manually corrected by the developer. By optimizing fuzzy systems with neural networks, fuzzy sets can be adapted and the rule base can be learned or corrected. The importance of artificial neural networks for discovering business logic in processes ("process mining"), as well as for the advancement of business processes by learning, is discussed in the literature [2; 45]. Market-driven tools are required which allow the presented concepts to be implemented cost-effectively in business practice.

Using fuzzy logic it was possible to consider fuzzy and vague information in business process models. Thus an adequate representation of business processes is possible, and decision support during the execution of business processes can be enhanced. The activities and data objects occurring in business processes are, however, subject to a fuzziness-afflicted interpretation. This arises during model construction, where a model constructor explicates those aspects of the business process in a process model that are relevant to him; in doing so, he allocates identifiers to the model elements, which are used for communication with the model users. The model users in turn interpret the identifiers contained in the models and associate them with corresponding terms or data. Due to the fuzziness that comes with the use of natural language, the usefulness of semiformal process models as a means of communication between model constructors and model users is limited. Further problems arise regarding the feasibility of process models: feasibility is usually not directly given but requires a new interpretation in the sense of implementation activities. The mentioned problem of semantically inconclusive process models is currently addressed with the help of formal process documentation. Here, a formalization both of the functional description of a process and of the specification of the information system components available for implementation is pursued, for example via so-called semantic web services. The basic idea of most of these formalization efforts is that, with the use of ontology languages, an explicit specification of model elements can be reached which also allows automatic processing. It is often disregarded that ontologies, too, have to be interpreted by the model constructor – a fact which is again connected with problems. Thus a model constructor may not be able to judge whether the facts of a specific case, represented in a process model, conform to an ontologically defined concept. The two-valued logics currently dominating the field of semantic web and semantic web services do not allow such doubts or uncertainty: facts can only be "true" or "false"; intermediate levels are not possible. While this exact representation of circumstances is beneficial for the automatic processing of semantics by machines, it does not generally accord with the thinking and the connected knowledge of humans: "In fact, computers require precise definitions but humans normally work better without" [31, preamble]. The consolidation of fuzzy logic with the technologies and languages of the semantic web ensures that fuzzy aspects can also be used in the formal specification of ontologies, in order to dissolve the outlined dichotomy of "exact" semantics, processed by machines, and "fuzzy" semantics, interpreted only by humans. A corresponding ontology language has been proposed [36]. The general discussion about strategies for integrating fuzziness into semantic web technologies is at an early stage. An essential question is whether fuzzy aspects have to be integrated directly into the technologies and languages of the semantic web, or whether they are to be treated as a special case. The direct integration of fuzziness into ontology languages has extensive consequences for the inference machines used, because fuzziness propagates through the type hierarchies and relations, which has to be taken into account accordingly.


254

O. Thomas and T. Dollmann

References

1. Adam, O., Thomas, O.: A Fuzzy Based Approach to the Improvement of Business Processes. In: Castellanos, M., Weijters, T. (eds.) BPI 2005: Workshop on Business Process Intelligence, Nancy, France, September 5, pp. 25–35 (2005)
2. Adam, O., Thomas, O., Loos, P.: Soft Business Process Intelligence – Verbesserung von Geschäftsprozessen mit Neuro-Fuzzy-Methoden. In: Lehner, F., Nösekabel, H., Kleinschmidt, P. (eds.) Multikonferenz Wirtschaftsinformatik 2006: Band 2, GITO, Berlin, pp. 57–69 (2006)
3. Adam, O., Thomas, O., Martin, G.: Fuzzy Enhanced Process Management for the Industrial Order Handling. In: Scheer, A.-W. (ed.) Proceedings: 5th International Conference; The Modern Information Technology in the Innovation Processes of the Industrial Enterprises, MITIP 2003, German Research Center for Artificial Intelligence, Saarbruecken/Germany, September 4-6, Universität des Saarlandes, Saarbrücken, pp. 15–20 (2003)
4. Adam, O., Thomas, O., Martin, G.: Fuzzy Workflows – Enhancing Workflow Management with Vagueness. In: EURO/INFORMS Istanbul 2003 Joint International Meeting, Istanbul, July 06-10 (2003)
5. Adam, O., Thomas, O., Vanderhaeghen, D.: Fuzzy-Set-Based Modeling of Business Process Cases. In: Richter, M., et al. (eds.) ICCBR 2005: 6th International Conference on Case-Based Reasoning, Workshop 4: Similarities – Processes – Workflows, Chicago, Illinois, IL, August 23-26, pp. 251–260 (2005)
6. Becker, J., Rehfeldt, M., Turowski, K., Vering, O.: A Fuzzy Approach to a Customer-Oriented Sales Workflow. In: Steele, N.C. (ed.) ISFL 1997: Second International ICSC Symposium on Fuzzy Logic and Applications, Swiss Federal Institute of Technology Zurich, Switzerland, February 12-14, pp. 370–376. ICSC Academic Press, Zürich (1997)
7. Cao, T., Sanderson, A.C.: Task sequence planning using fuzzy Petri nets. In: International Conference on Systems, Man and Cybernetics, Conference Proceedings: Decision Aiding for Complex Systems, Charlottesville, VA, October 13-16, pp. 349–354. IEEE Computer Society Press, Los Alamitos (1991)
8. de Soto, A.R., Capdevila, C.A., Fernández, E.C.: Fuzzy Systems and Neural Networks XML Schemas for Soft Computing. Mathware and Soft Computing 10(2-3), 43–56 (2003)
9. Forte, M.: Unschärfen in Geschäftsprozessen. Weißensee, Berlin (2002)
10. Gaurav, A., Alhajj, R.: Incorporating fuzziness in XML and mapping fuzzy relational data into fuzzy XML. In: Haddad, H. (ed.) Proceedings of the 2006 ACM Symposium on Applied Computing (SAC), Dijon, France, April 23-27, pp. 456–460. ACM, New York (2006)
11. Günther, R., Lipp, H.-P.: A Fuzzy Petri Net Concept for Complex Decision Making Processes in Production Control. In: Zimmermann, H.-D. (ed.) Proceedings of the 1st European Congress on Fuzzy and Intelligent Technologies, EUFIT 1993, Aachen, Germany, September 7-10, Verl. der Augustinus Buchhandlung, Aachen, pp. 290–295 (1993)
12. Hüsselmann, C.: Fuzzy-Geschäftsprozessmanagement. Eul, Lohmar (2003)



13. Hüsselmann, C., Adam, O., Thomas, O.: Gestaltung und Steuerung wissensintensiver Geschäftsprozesse durch die Nutzung unscharfen Wissens. In: Reimer, U., et al. (eds.) WM 2003: Professionelles Wissensmanagement – Erfahrungen und Visionen: Beiträge der 2. Konferenz Professionelles Wissensmanagement – Erfahrungen und Visionen, Luzern, April 2-4, pp. 343–350. Köllen, Bonn (2003)
14. IBM (ed.): ABLE Rule Language: User's Guide and Reference, Version 2.3.0. IBM Corporation (2005)
15. Jiwani, A., Alimohamed, Y., Spence, K., Özyer, T., Alhajj, R.: Fuzzy XML Model for Representing Fuzzy Relational Databases in Fuzzy XML Format. In: Manolopoulos, Y., et al. (eds.) ICEIS 2006 – Proceedings of the Eighth International Conference on Enterprise Information Systems: Databases and Information Systems Integration, Paphos, Cyprus, May 23-27, pp. 163–168 (2006)
16. Kindler, E.: On the semantics of EPCs: Resolving the vicious circle. Data & Knowledge Engineering 56(1), 23–40 (2006)
17. Lee, J., Fanjiang, Y.Y.: Modeling imprecise requirements with XML. Information and Software Technology 45(7), 445–460 (2003)
18. Lipp, H.-P.: Anwendung eines Fuzzy Petri Netzes zur Beschreibung von Koordinationssteuerungen in komplexen Produktionssystemen. Wissenschaftliche Zeitschrift der Technischen Universität Karl-Marx-Stadt 24(5), 633–639 (1982)
19. Ma, Z.: Fuzzy Database Modeling with XML. Springer, Berlin (2005)
20. Ma, Z., Yan, L.: Fuzzy XML data modeling with the UML and relational data models. Data & Knowledge Engineering 63(3), 970–994 (2007)
21. Mendling, J., Nüttgens, M.: EPC Markup Language (EPML) – An XML-Based Interchange Format for Event-Driven Process Chains (EPC). ISeB 4(3), 245–263 (2006)
22. Rehfeldt, M.: Koordination der Auftragsabwicklung: Verwendung von unscharfen Informationen. DUV, Wiesbaden (1998)
23. Rehfeldt, M., Turowski, K.: Impact on Integrated Information Systems through Fuzzy Technology. In: Zimmermann, H.-J. (ed.) Proceedings / EUFIT 1994, Second European Congress on Intelligent Techniques and Soft Computing, Aachen, September 20-23, pp. 1637–1645. ELITE Foundation, Augustinus, Aachen (1994)
24. Rehfeldt, M., Turowski, K.: A Fuzzy Distributed Object-Oriented Database System as a Basis for a Workflow Management System. In: Proceedings of the Sixth International Fuzzy Systems Association World Congress, IFSA 1995, São Paulo, Brazil, July 21-28. NTIS, Springfield (1995)
25. Rehfeldt, M., Turowski, K.: Anticipating Coordination in Distributed Information Systems through Fuzzy Information. In: Zimmermann, H.-J. (ed.) Proceedings / EUFIT 1995, Third European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany, pp. 1774–1779. Verl. Mainz, Wissenschaftsverlag, Aachen (1995)
26. Rehfeldt, M., Turowski, K.: A Tool-supported Distributed Application of Fuzzy Logic in Order Processing. In: Jamshidi, M., Junku, Y., Dauchez, P. (eds.) Proceedings of the World Automation Congress (WAC 1996), Intelligent Automation and Control: Recent Trends in Development and Applications, Montpellier, France, May 28-30, pp. 585–589. TSI Press, Albuquerque (1996)
27. Rehfeldt, M., Turowski, K.: Fuzzy Objects in Production Planning and Control. In: Zimmermann, H.-J. (ed.) Proceedings / EUFIT 1996, Fourth European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany, September 2-5, pp. 1985–1989. ELITE Foundation, Verl. Mainz, Wissenschaftsverlag, Aachen (1996)



28. Rehfeldt, M., Turowski, K.: A Flexible Java-based Fuzzy Kernel for Business Applications. In: Alpaydin, E. (ed.) International ICSC Symposium on Engineering of Intelligent Systems: EIS 1998, pp. 204–209. ICSC Academic Press, London (1998)
29. Rosemann, M., van der Aalst, W.M.P.: A configurable reference modelling language. Information Systems 32(1), 1–23 (2007)
30. Rozinat, A., van der Aalst, W.M.P.: Decision Mining in Business Processes. BPM Center Report BPM-06-10, Eindhoven University of Technology (2006)
31. Sanchez, E. (ed.): Fuzzy Logic and the Semantic Web. Elsevier, Amsterdam (2006)
32. Scheer, A.-W.: ARIS – Business Process Frameworks, 2nd, completely rev. and enl. ed. Springer, Berlin (1998)
33. Scheer, A.-W.: ARIS – Business Process Modeling, 2nd, completely rev. and enl. ed. Springer, Berlin (1999)
34. Scheer, A.-W., Thomas, O., Adam, O.: Process Modeling Using Event-driven Process Chains. In: Dumas, M., van der Aalst, W.M.P., ter Hofstede, A.H.M. (eds.) Process-aware Information Systems: Bridging People and Software through Process Technology, pp. 119–145. Wiley, Hoboken (2005)
35. Srinivasan, P., Gracanin, D.: Approximate Reasoning with Fuzzy Petri Nets. In: Second IEEE International Conference on Fuzzy Systems, San Francisco, California, March 28–April 1, pp. 396–401. IEEE Computer Society, Piscataway (1993)
36. Straccia, U.: A Fuzzy Description Logic for the Semantic Web. In: Sanchez, E. (ed.) Fuzzy Logic and the Semantic Web, pp. 73–90. Elsevier, Amsterdam (2006)
37. Thomas, O., Adam, O., Leyking, K., Loos, P.: A Fuzzy Paradigm Approach for Business Process Intelligence. In: IEEE Joint Conference on E-Commerce Technology (CEC 2006) and Enterprise Computing, E-Commerce and E-Services (EEE 2006), San Francisco, California, June 26-29, pp. 206–213. IEEE Computer Society Press, Los Alamitos (2006)
38. Thomas, O., Adam, O., Loos, P.: Using Reference Models for Business Process Improvement: A Fuzzy Paradigm Approach. In: Abramowicz, W., Mayr, H.C. (eds.) Business Information Systems: 9th International Conference on Business Information Systems (BIS 2006), Klagenfurt, Austria, May 31–June 2, pp. 47–57. Köllen, Bonn (2006)
39. Thomas, O., Adam, O., Seel, C.: A fuzzy based approach to the management of agile processes. In: Althoff, K.-D., et al. (eds.) WM 2005: Professional Knowledge Management – Experiences and Visions: Contributions to the 3rd Conference Professional Knowledge Management – Experiences and Visions, Kaiserslautern, April 10-13. DFKI GmbH, Kaiserslautern (2005)
40. Thomas, O., Adam, O., Seel, C.: Business Process Management with Vague Data. In: Proceedings: DEXA 2005: Sixteenth International Workshop on Database and Expert Systems Applications, Copenhagen, Denmark, August 22-26, pp. 962–966. IEEE Computer Society Press, Los Alamitos (2005)
41. Thomas, O., Dollmann, T.: Towards the Interchange of Fuzzy-EPCs: An XML-based Approach for Fuzzy Business Process Engineering. In: Bichler, M., et al. (eds.) Multikonferenz Wirtschaftsinformatik 2008, pp. 1999–2010. GITO, Berlin (2008)
42. Thomas, O., Dollmann, T., Loos, P.: Towards Enhanced Business Process Models Based on Fuzzy Attributes and Rules. In: Proceedings of the 13th Americas Conference on Information Systems, Keystone, Colorado, USA, August 09-12. AIS, Atlanta (2007)



43. Thomas, O., Dollmann, T., Loos, P.: Rules Integration in Business Process Models – A Fuzzy Oriented Approach. Enterprise Modelling and Information Systems Architectures 3(2), 18–30 (2008)
44. Tietze, M.: Einsatzmöglichkeiten der Fuzzy-Set-Theorie zur Modellierung von Unschärfe in Unternehmensplanspielen. Unitext, Göttingen (1999)
45. Tiwari, A., Turner, C.J., Majeed, B.: A review of business process mining: state-of-the-art and future trends. Business Process Management Journal 14(1), 5–22 (2008)
46. Tseng, C., Khamisy, W.: XML Schema for Fuzzy Systems. Computational Intelligence Lab, SJSU (2006), http://mh213d.cs.sjsu.edu/webintelligence/fuzzyschema/Schema.html#Schema (accessed 3 March 2006)
47. Tseng, C., Khamisy, W., Vu, T.: Universal fuzzy system representation with XML. Computer Standards & Interfaces 28(2), 218–230 (2005)
48. Turowski, K., Weng, U.: Representing and processing fuzzy information: an XML-based approach. Knowledge-Based Systems 15(1-2), 67–75 (2002)
49. Valette, R., Courvoisier, M.: Petri nets and Artificial Intelligence. In: Zurawski, R., Dillon, T.S. (eds.) IEEE International Workshop on Emerging Technologies and Factory Automation: Technology for the Intelligent Factory, World Congress Centre, Melbourne, Australia, August 11-14, pp. 218–238. CRL Publishing, Aldershot (1992)
50. von Uthmann, C.: Improving the Use of Petri Nets for Business Process Modeling. Westfälische Wilhelms-Universität, Münster (1999)
51. Witte, R.: Architektur von Fuzzy-Informationssystemen. Books on Demand, Norderstedt (2002)
52. Zimmermann, H.-J., Angenstenberger, J., Lieven, K., Weber, R. (eds.): Fuzzy-Technologien: Prinzipien, Werkzeuge, Potentiale. VDI-Verl., Düsseldorf (1993)



An XML Based Framework for Merging Incomplete and Inconsistent Statistical Information from Clinical Trials

Jianbing Ma, Weiru Liu, Anthony Hunter, and Weiya Zhang

Abstract. Meta-analysis is a vital task for systematically summarizing statistical results from clinical trials that are carried out to compare the effect of one medication (or other treatment) against another. Currently, most meta-analysis activities are done by manually pooling data. This is a very time-consuming and expensive task, so an automated, or even semi-automated, tool that can support some of the processes underlying meta-analysis is greatly needed. Statistical results from clinical trials are usually represented as sampling distributions (i.e., with the mean value and the SEM). When collecting statistical information from reports on clinical trials, not all reports contain full statistical information (i.e., some do not provide SEMs), whilst traditional meta-analysis excludes trial reports that contain incomplete information, which inevitably ignores many trials that could be valuable. Furthermore, some trial results can be significantly inconsistent with the rest of the trials that address the same problem. Highlighting (resp. removing) such inconsistencies is therefore also very important, to reveal (resp. reduce) any potential flaws in some of the trial results. In this paper, we aim to design and develop a framework that tackles the above three issues. We first present an XML-based merging framework that aims to merge statistical information automatically, with the potential to add a component that extracts clinical trials information automatically. This framework considers any valid clinical trial, including trials with partial information.

Jianbing Ma · Weiru Liu
Computer Science, Queen's University Belfast, Belfast, Co Antrim BT7 1NN, UK
e-mail: {jma03,w.liu}@qub.ac.uk

Anthony Hunter
Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK
e-mail: a.hunter@cs.ucl.ac.uk

Weiya Zhang
Academic Rheumatology, Medical & Surgical Sciences, City Hospital, Nottingham, NG5 1PB, UK
e-mail: Weiya.Zhang@nottingham.ac.uk



We then develop a method to analyze inconsistencies among a collection of clinical trials and, if necessary, to exclude any trials that are deemed to be ineligible. Finally, we use two sets of clinical trials, trials on Type 2 diabetes and on neurocognitive outcomes after off-pump versus on-pump coronary revascularisation, to illustrate our framework.

1 Introduction

Clinical trials are widely used to test new drugs or to compare the effect of different drugs. A clinical trial is a study that compares the effect of one medication (or other treatment) against another [16]. Trial results are a summary of the underlying statistical analysis. A huge number of clinical trials have been carried out in the last few decades, and new trials are constantly being designed and implemented. For example, many clinical trials have been carried out to investigate issues including: the intraocular pressure (IOP) lowering efficacy of drugs such as travoprost, bimatoprost, timolol, and latanoprost (e.g., [4, 7, 15, 21, 30, 32, 34, 37, 40]); oral medications for adults with Type-2 diabetes (e.g., [3, 9, 27, 29, 35, 41]); and the neurocognitive outcomes after off-pump versus on-pump coronary revascularisation (e.g., [14, 25, 26, 31, 43]).

Given the huge number of clinical trials and the fact that clinical trial reports are time consuming to read and understand, systematic reviews of related trials are needed by medical practitioners and other health care professionals to assess drugs/therapies of interest. Meta-analysis is the technique commonly used in clinical trial research to summarize related trial results, that is, to merge multiple sampling distributions into a single distribution. Meta-analysis is a very important step in the development of evidence-based medicine, and there are various tools supporting this task, such as SAS, STATA, MetaWin, WEasyMa, etc. However, there are still difficulties in carrying out meta-analysis with these tools when a large number of clinical trials need to be considered regularly and when new trials are being completed. First, the current meta-analysis technique requires input data to be extracted from clinical trial reports manually. This is a very time-consuming task, particularly when the number of related reports is very large. Second, before inputting data into meta-analysis tools, it is necessary to systematically preprocess the semantic heterogeneity of the data. This includes, for example, manually checking whether the data is about the same issue, whether the data uses the same unit of measurement (and, if not, performing some conversion), and whether the clinical trials are of the same duration, etc. So there are a number of low-level but important steps of standardizing the format and checking correspondences. Therefore, some kind of automated process that can extract information from clinical trial reports and can verify, to some degree, whether a set of trials are eligible for meta-analysis together would be very useful.

In a clinical trial, patients are divided into treatment groups, with each group receiving one of the drugs under study. Specific outcomes are measured, and the differences between the measurements at the start of the trial and at the end are compared for each group. By convention, clinical trial results are described using sampling distributions. When the full details of the sampling distributions are available, merging the results from these trials entails systematic use of established techniques from statistics, as done in current meta-analyses. However, in reality, some trials reported in the literature are statistically incomplete; for instance, the standard error of the mean (SEM) can be missing from a sampling distribution. Traditionally, it is difficult to make use of such clinical trials in meta-analysis; in fact, a clinical trial with incomplete information is often abandoned. However, in [28], a prognostic method and an interval method are proposed to deal with meta-analysis with incomplete information. These two methods are useful alternatives to traditional meta-analysis.

When a set of clinical trials on the same issue is collected for meta-analysis, some clinical trials may present statistical results that are in high conflict with the results from the other clinical trials. For such inconsistent trial results, it is very likely that the trials were done on different populations, and they should hence be excluded to achieve a better meta-analysis result.

As the popularity of XML in dynamic data exchange increases, a variety of tools to store and retrieve data in/from XML documents have been developed. Since a clinical trial result may be used on different occasions and in different meta-analyses, storing the main statistical results of clinical trials in XML documents is an appealing idea. In this paper, we present an XML based framework for supporting meta-analysis by defining merging rules for combining complete and incomplete clinical trials data, with the longer-term objective of completely automating this process, e.g., extracting clinical trials information and pre-processing the semantic heterogeneity automatically. More specifically, this paper makes the following contributions.

1. We present a general XML based merging framework that extends the fusion rule technique developed in [17] especially for clinical trials data.
2. We show how our framework can deal with clinical trials with incomplete information, where current meta-analysis tools cannot.
3. We show how our framework can analyze inconsistent information and remove highly conflicting information by excluding a trial of this nature.
4. We provide a brief discussion on semantic heterogeneity in statistical information merging and on automated information extraction, which highlights two further important aspects that we will develop in order to realize an automated meta-analysis tool.
5. We illustrate our framework with two case studies (Type-2 diabetes and neurocognitive outcomes after off-pump versus on-pump coronary revascularisation), showing the whole process and its efficacy.

The remainder of this paper is organized as follows. In Section 2, we give a brief introduction to XML, define the XML document structure for representing the information contained in clinical trial reports, and discuss automatic information extraction and semantic heterogeneity processing. In Section 3, we formally describe the XML-based merging framework, including basic definitions and clinical-trials-oriented restrictions on tags. Section 4 discusses how to manage the possibly incomplete and inconsistent information contained in XML documents to perform a meta-analysis. Section 5 provides two case studies, one on Type 2 diabetes and the other on neurocognitive outcomes after off-pump versus on-pump coronary revascularisation; we use these studies to illustrate our framework. Finally, in Section 6, we conclude the paper. In addition, we give the full DTD description of the XML document structure in the Appendix.

2 XML Document

In this section, we introduce some basic concepts of XML as well as the XML document structure we will use in this paper. We also discuss issues related to semantic heterogeneity and information extraction from clinical trials.

2.1 Introduction to XML

Extensible Markup Language (XML) has become an important part of the Semantic Web, due to its simple and flexible format. An XML document is constructed based on a DTD or an XML Schema that specifies how the tags in the XML should be arranged. Initially XML was mainly used to store and exchange static data, such as the metadata standards by Dublin Core, but XML is now playing an increasingly important role in the exchange of a wide variety of dynamic data too, that is, data that are retrieved or obtained upon request. Typical examples of this kind are [11], [39], and [45], where the first constructs an XML document from a collection of multimedia data about a patient and the latter two generate XML documents that store probabilistic query results and predictive models obtained from data mining or intelligent analysis tools, respectively. To facilitate the modelling of various types of data in XML, the need to represent uncertain data has emerged too, just as numerous approaches were proposed to create and manipulate probabilistic databases in the traditional database setting (e.g., [2, 12]). Because XML documents are structured, uncertain information associated with data must be assigned, interpreted and structured in a natural way. Uncertainty can occur at different levels of granularity and can be interpreted in different ways, such as in terms of probabilities, probability intervals, reliabilities, or beliefs. Furthermore, integrating XML documents whose data values are certain may create an XML document with uncertain data. Therefore, managing uncertain data in XML raises many challenging issues.

2.2 XML Document Structure

XML based frameworks for representing and managing uncertain and incomplete information have been proposed in many papers, e.g., [1, 17, 18, 19, 23, 33]. In [17], a general XML based framework was proposed to merge XML documents with uncertain information such as probabilities, possibilities, and belief functions. In [18], the proposed XML based framework focused on merging uncertain information that is defined at different levels of granularity of XML textentries. In [19], the framework paid special attention to dealing with reliabilities in different XML documents. In [17, 18, 19], structured reports with uncertain weather information were studied. The following two reports are examples.

  <report>
    <source> TV1 </source>
    ...
    <temperature>
      <probability>
        <prob value="0.4"> 8°C </prob>
        <prob value="0.8"> 12°C </prob>
      </probability>
    </temperature>
  </report>

  <report>
    <source> TV3 </source>
    ...
    <temperature>
      <probability>
        <prob value="0.2"> 8°C </prob>
        <prob value="0.6"> 12°C </prob>
      </probability>
    </temperature>
  </report>

However, to our knowledge, there are no papers focusing on representing and managing possibly incomplete and inconsistent statistical information from clinical trials in XML frameworks. The need to represent and combine clinical knowledge raises some important and interesting technical issues. In this paper, we extend the ideas of [17, 18, 19] to create an XML based framework to deal with such information. We investigated many clinical trial reports in order to ensure that our XML structure would cover a wide range of examples. That is, to accommodate our special needs of recording clinical trials information, the DTD of the XML documents has to be adapted. The full XML document structure (DTD adaptation) is given in the Appendix. Here we only give the DTD adaptation of the Result element, which contains information about clinical trial results (Fig. 1). Most of the time, clinical trial results are reported in the form of sampling distributions, and sometimes they are given in the form of confidence intervals. In the latter case, we transform confidence intervals to sampling distributions before putting the data into the XML documents (the transformation process will be introduced in Section 4.2). Sampling distributions are represented in the form of intervals, i.e., MeanInv and SEMInv, only when we use the interval method, to be introduced in Section 4.4, for dealing with trials with incomplete information. The value attribute of the SampleDist element indicates the target of a trial result, such as "level of LDL cholesterol", etc. In addition, the Unit element is optional because in some cases there is no unit child in the Result element, e.g., an odds ratio does not have a unit of measurement.



Fig. 1 DTD adaptation of the Result element
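Since the figure is not reproduced here, the following DTD fragment is a plausible reconstruction based on the description above and on Example 1 below; in particular, the exact content models (e.g., the grouping of Mean/SEM versus MeanInv/SEMInv) are our assumptions:

  <!-- Result: one trial result for one drug over one duration -->
  <!ELEMENT Result (Drug+, Duration, Unit?, PatientsNum, SampleDist)>
  <!ELEMENT Drug EMPTY>
  <!ATTLIST Drug ref IDREF #REQUIRED>
  <!ELEMENT Duration (#PCDATA)>
  <!ELEMENT Unit (#PCDATA)>
  <!ELEMENT PatientsNum (#PCDATA)>
  <!-- a sampling distribution: point values, or intervals for the interval method -->
  <!ELEMENT SampleDist ((Mean, SEM?) | (MeanInv, SEMInv))>
  <!ATTLIST SampleDist value CDATA #REQUIRED>
  <!ELEMENT Mean (#PCDATA)>
  <!ELEMENT SEM (#PCDATA)>
  <!ELEMENT MeanInv (#PCDATA)>
  <!ELEMENT SEMInv (#PCDATA)>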

Example 1. The following is a Result element.

  <Result>
    <Drug ref="l1"/>
    <Duration> 3 month </Duration>
    <Unit> mmol/L </Unit>
    <PatientsNum> 247 </PatientsNum>
    <SampleDist value="Intraocular Pressure Reduction">
      <Mean> 4.1 </Mean>
      <SEM> 3.8 </SEM>
    </SampleDist>
  </Result>

2.3 Extracting Clinical Trials Information to Build XML Documents

A vital aspect of building up a large collection of XML documents containing clinical trials information is to use an existing information extraction tool to extract the relevant information. Information extraction (IE) technology (or, synonymously, text mining technology) aims to "read" text and pick out the bits of information that are needed. IE systems tend to be developed for focused applications where there is some regularity in the information being presented in the text. For example, in papers on clinical trials, there are some regularities in the information being presented: such a paper is likely to include the patient class of the trial, the treatment classes to which the patients were assigned, and the comparative outcomes of the treatments. Hence, an information extraction system for an application is built around the idea of a template that specifies the information sought by the system. A number of viable information extraction systems have been developed [8]. For example, the GATE system provides an implemented architecture for managing textual data storage and exchange, visualization of textual data structures, and plug-in modularity of text processing components [10]. The text processing components include LaSIE, which performs information extraction tasks including named entity recognition, coreference resolution, template element filling, and scenario template filling. Furthermore, a number of natural-language parsers have been developed that can be incorporated in information extraction systems (for a comparison for biological applications see [13]). Since the main task of this paper is not information extraction, but rather how to represent and merge such extracted information, below we focus on what information we need to extract from clinical trials. In our study of clinical trials and in consultation with clinicians, we need to extract the following information from a trial report in order to make efficient use of each clinical trial (an XML sketch of a record capturing such items follows the list):

1. The outcomes being measured and compared, including the name of the outcome and its unit.
2. The trial duration, including the total length of the study and any intermediate period intervals, e.g., a 12-month trial report may also provide results at 3 months, 6 months, etc.
3. For each trial group:
   a. The drug(s) used in that group.
   b. The number of patients in that group.
   c. The outcome measurements made for that group, namely:
      i. The mean and standard error of the mean at baseline and at each endpoint specified by the testing schedule, or alternatively, the difference of the two.
      ii. The p-value, if given.
      iii. The confidence interval (CI), if given.

Certainly, there are other items of information that are valuable and useful as well, such as the main conclusion of a trial (e.g., Drug A is more effective than Drug B, or Drug A has severe side effects on patients with condition C, etc.). In our current merging framework, we have not considered these types of information yet. So although our XML documents will contain such types of information, for now we mainly concentrate on the statistical information provided in clinical trials and any additional information that is needed when using such statistical information.
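As a small illustration of the items above, a trial group that reports an outcome at an intermediate endpoint as well as at the end of the study could be stored as two Result elements in the style of Example 1; all values here are invented:

  <!-- hypothetical group reporting LDL cholesterol reduction at 3 and 6 months -->
  <Result>
    <Drug ref="d1"/>
    <Duration> 3 month </Duration>
    <Unit> mmol/L </Unit>
    <PatientsNum> 128 </PatientsNum>
    <SampleDist value="LDL Cholesterol Reduction">
      <Mean> 0.9 </Mean>
      <SEM> 0.2 </SEM>
    </SampleDist>
  </Result>
  <Result>
    <Drug ref="d1"/>
    <Duration> 6 month </Duration>
    <Unit> mmol/L </Unit>
    <PatientsNum> 124 </PatientsNum>
    <SampleDist value="LDL Cholesterol Reduction">
      <Mean> 1.2 </Mean>
      <SEM> 0.2 </SEM>
    </SampleDist>
  </Result>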



2.4 Heterogeneous Information Management through Ontologies As clinical trials reports come from different sources, semantic heterogeneity occurs frequently. For example, some reports use phrase “Low density lipoprotein cholesterol” while some other reports prefer it by the abbreviation “LDL-cholesterol”; some reports refer to “NF-kappa B” while others may write “p50/p60” as an equivalent term. Not only are different words used for the same meaning, but different reports may also use different units of measurement which are interchangeable. For instance, with regard to a trial duration, 1 year is equivalent to 12 months, 12 weeks is approximately equivalent to 3 months, etc. As another example, LDL cholesterol measurement in diabetes research has two different measurement units mmol/L and mg/L, and so clinicians interested in those reports must manually translate x mmol/l into y mg/L using formula y = x ∗ 39, or vice versa. Therefore, with knowledge and information fusion, semantic heterogeneity becomes a complex and multi-faced topic, and it is central to the merging approach we are discussing here. From the perspective of merging, we consider information to be merged in context. This means we undertake logical reasoning with the information to be merged to determine what it means. For example, in merging two reports on drug trials, we want to use any available information in the reports and background knowledge (e.g., NF-kappa B is equivalent to p50/p60) about the underlying assumptions in the experiments, the stages of the disease, etc, to determine whether merging is appropriate, and if so what kind of aggregation should be used on the constituent parts of this information. For determining whether information in two or more reports are referring to the same issue, we are investigating the use of ontological knowledge, e.g., [38, 42], to assist the selection of clinical trials for possible merging. The notion of ontology has had a long history in science. Once an ontology incorporates a large number of concepts and relationships, it gives us the ability to standardize the terminology, thereby minimizing ambiguities and facilitating communication. This is particularly important in a distributed environment where one may have numerous users who need to feel confident about the terms and concepts being used. Recourse to an ontology can ameliorate the complexity inherent in content in many applications by providing a common framework for structuring the content. In terms of clinical trials, we need to have an ontology to describe relevant concepts and their relationships used in each category of clinical trials, such as trials on diabetes etc. Such an ontology for example shall contain information about translation of words with the same meaning in the context, conversion of one measure into another when different trials use different measures, etc. In fact, there are already some known ontologies related to biomedical knowledge, e.g., SNOMED CT, Gene Ontology, etc. SNOMED CT is a clinical terminology (the Systematised Nomenclature of Medicine Clinical Terms) that provides a very large and wideranging common computerised language that can be used by all applications in a healthcare system to facilitate communications between healthcare professionals in clear and unambiguous terms. Further important ontological resources for medical



Further important ontological resources for medical science include the Unified Medical Language System and the framework for sharing ontologies offered by the Open Biomedical Ontologies (OBO) Foundry. One of our next research steps is to build an ontology for clinical trials in selected application domains. This will be done based on SNOMED CT and other related, publicly available ontologies. We will use Protégé [36], a free, open source ontology editor developed by Stanford University, to complete this task.

3 XML-Based Merging Frameworks

In this section, we introduce an XML-based merging framework. The framework follows the idea of merging uncertain information in structured reports in [17]. First, we present a general definition of the XML-based merging framework, for which we define a selection function to select a set of "compatible" trials (in terms of XML documents) and a merging rule to combine the selected trials into a new XML document. Furthermore, we impose some clinical trials oriented constraints on the general framework.

3.1 Basics of XML-Based Merging Framework

We use XML to represent clinical trials reports. For convenience, we will call them XML reports from now on. Following [17], we define an XML report formally as follows.

Definition 1. (XML report) If ψ is a tagname (i.e., an element name) and φ is a textentry, then <ψ>φ</ψ> is an XML report. If ψ is a tagname, φ is a textentry, θ is an attribute name and κ is an attribute value, then <ψ θ = κ>φ</ψ> is an XML report. If ψ is a tagname and φ1, ..., φn are XML reports, then <ψ>φ1 ... φn</ψ> is an XML report.

This definition of an XML report is very general (similar to Def. 1 in [17] for structured news reports). In practice, we would use the DTD defined in the Appendix to restrict this definition. For example, we may restrict the root element of an XML report to be a Trial element. Furthermore, if there is a DTD element <!ELEMENT A (S)>, where A is an element name and S is a set of children names, then for an element named A in the XML report, if B is a child element of A, we require that B ∈ S. Since these kinds of application oriented adaptations are not the main topic of this paper, and in fact are fully implied in the DTD definitions, we will not consider these issues further here. However, in Section 3.2 we will impose some constraints on XML reports to support the handling of uncertainty. For convenience, hereafter we use L to denote the set of all XML reports.

To define a general merging framework, we first define a mergeable relation.

Definition 2. A mergeable relation R is a reflexive, symmetric and transitive relation on L × L.



This definition of a mergeable relation is also very general. In real applications, specific criteria should be introduced to instantiate the relation; in the following sections, clinical trials oriented criteria will be given to adapt R. Given a mergeable relation R, two XML reports α1, α2 ∈ L are said to be mergeable iff R(α1, α2) holds.

Before performing the merging of XML reports, the XML-based merging framework should first select the mergeable XML reports.

Definition 3. (Selection function) A selection function S is a mapping from a set of XML reports to a mergeable subset of it, such that if A is a set of XML reports, then S(A) ⊆ A and ∀α1, α2 ∈ S(A), we have R(α1, α2).

This definition of a selection function is instantiated when the mergeable relation R is adapted in practice. Once we have a set of mergeable XML reports, we need to combine them into a new XML document.

Definition 4. A merging rule is a total function M associating a set of mergeable XML reports with an XML report, such that if α1, ..., αn ∈ L and ∀1 ≤ i, j ≤ n, R(αi, αj), then M(α1, ..., αn) ∈ L.

Generally, the set of mergeable XML reports α1, ..., αn in Def. 4 comes from the result of a selection function S. To summarize, an XML-based merging framework is a pair (S, M) that applies to sets of XML reports, where S is a selection function and M is a merging rule.
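The pair (S, M) can be prototyped directly. The following Python sketch is a generic skeleton under our own naming, not the chapter's implementation: reports are arbitrary objects, and the concrete relation R and rule M of the later sections would be plugged in.

# Generic skeleton of an XML-based merging framework (S, M).

def make_selection(R):
    """Build a selection function S from a mergeable relation R:
    greedily keep reports that are pairwise mergeable with everything
    already selected (one simple policy satisfying Definition 3)."""
    def S(reports):
        selected = []
        for r in reports:
            if all(R(r, s) for s in selected):
                selected.append(r)
        return selected
    return S

def merge_framework(reports, S, M):
    """Apply a framework (S, M): select mergeable reports, then
    combine them with the merging rule M (Definition 4)."""
    selected = S(reports)
    return M(selected) if selected else None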

3.2 Representing Statistical Information in XML Frameworks

In this section, we introduce some constraints on clinical trials. These constraints are focused on representing and managing statistical information in clinical trials reports.

Definition 5. The key statistical tagnames for this paper are PatientsNum and SampleDist. The subsidiary statistical tagnames are Mean, SEM, MeanInv and SEMInv. The auxiliary statistical tagnames are Drug, Duration and Unit.

Now we define the representation of the SampleDist element, which contains the most important statistical information.

Definition 6. The XML report <SampleDist>σ1 ... σn</SampleDist> is a valid SampleDist element iff one of the following conditions is satisfied.

1. n = 1 and σ1 is of the form <Mean>φ</Mean>, where φ is a textentry.
2. n = 2, σ1 is of the form <Mean>φ1</Mean> and σ2 is of the form <SEM>φ2</SEM>, where φ1, φ2 are two textentries.
3. n = 2, σ1 is of the form <MeanInv>ψ1</MeanInv> and σ2 is of the form <SEMInv>ψ2</SEMInv>, where ψ1, ψ2 are of the form <Min>φ1</Min><Max>φ2</Max> such that φ1, φ2 are two textentries.



All textentries φi in the above definition can only be numerical values.

Example 2. The following is a valid SampleDist element.

<SampleDist value="intraocular pressure reduction">
  <Mean>4.1</Mean>
  <SEM>3.8</SEM>
</SampleDist>

Definition 7. An XML report <Result>σ1 ... σn</Result> is a valid Result element iff

1. the σi are distinct auxiliary or key statistical tagnames, and
2. all key statistical tagnames occur among the σi, and the SampleDist element is valid.

In this paper, the main task is to merge the sampling distributions contained in multiple XML documents when they refer to the same issue. Therefore, we define the following constraint for merging two valid Result elements.

Definition 8. Given two valid Result elements

<Result>
  <Drug ref = id1/>
  <Duration>x1</Duration>
  <Unit>y1</Unit>
  <PatientsNum>z1</PatientsNum>
  <SampleDist ref = purpose1>
    ...
  </SampleDist>
</Result>

and

<Result>
  <Drug ref = id2/>
  <Duration>x2</Duration>
  <Unit>y2</Unit>
  <PatientsNum>z2</PatientsNum>
  <SampleDist ref = purpose2>
    ...
  </SampleDist>
</Result>

they are said to be mergeable iff id1 = id2, x1 ≈ x2, y1 = y2 and purpose1 = purpose2. That is, two clinical trials results can be merged iff they refer to the same drug, have approximately the same duration, use the same unit of measurement and serve the same clinical purpose. This definition is a clinical specific restriction applied before using the merging rule in Def. 4. The restriction can be carried out with the assistance of ontologies tailored to the application, as discussed in Section 2.4. In addition, this definition is a clinical trial oriented instantiation of Def. 2, hence we can use it to select mergeable XML reports. More specific merging rules for statistical information are introduced in the next section.
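For concreteness, a possible encoding of the Definition 8 test in Python, over Result elements parsed with xml.etree.ElementTree, is sketched below; the helper names and the duration-comparison callback are our assumptions (the chapter delegates "approximately the same duration" to the ontology layer).

import xml.etree.ElementTree as ET

def _child_text(result, tag):
    node = result.find(tag)
    return node.text.strip() if node is not None and node.text else None

def mergeable_results(r1, r2, same_duration):
    """Sketch of the mergeability test of Definition 8: same drug,
    same unit, same purpose, approximately equal durations. How the
    durations are normalized and compared is left to the caller,
    e.g. to the ontology layer of Section 2.4."""
    return (r1.find("Drug").get("ref") == r2.find("Drug").get("ref")
            and _child_text(r1, "Unit") == _child_text(r2, "Unit")
            and r1.find("SampleDist").get("ref")
                == r2.find("SampleDist").get("ref")
            and same_duration(_child_text(r1, "Duration"),
                              _child_text(r2, "Duration")))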



4 Managing Statistical Information in XML Documents

In this section, we first recall some basic concepts of statistical information and then discuss how to model such information in XML documents. We define an instantiated selection function (in terms of an algorithm) to exclude inconsistent information and provide two instantiated merging rules to deal with meta-analysis with incomplete information.

4.1 Preliminaries

In statistics, a normal distribution associated with a random variable is denoted X ∼ N(μ, σ²). For the convenience of further calculations in the rest of the paper, we use the notation X ∼ N(μ, σ) instead of X ∼ N(μ, σ²) for a normal distribution of a variable X.

In statistics, random samples of individuals are often used as representatives of an entire group of individuals (often denoted a population) to estimate the values of some parameters of the population. The mean of a variable X over the samples, when the sample size is reasonably large, follows a normal distribution. This distribution is typically referred to as a sampling distribution. We use X ∼ N(μ, SEM) to denote a sampling distribution with mean value μ and standard error of the mean SEM. Conventionally, let Xi ∼ N(μi, SEMi), 1 ≤ i ≤ k, and ωi = 1/SEMi²; then the meta-analysis result X ∼ N(μ, SEM) is as follows.

μ = (∑_{i=1}^{k} μ_i ω_i) / (∑_{i=1}^{k} ω_i),   ω = ∑_{i=1}^{k} ω_i.   (1)
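A direct Python transcription of Equation 1 is given below for reference. Since ω_i = 1/SEM_i², the merged SEM is recovered as 1/√ω; the function name and data layout are our own.

from math import sqrt

def meta_analysis(trials):
    """Fixed-effect meta-analysis, Equation 1.
    trials: list of (mu_i, SEM_i) pairs with complete information.
    Returns (mu, SEM), with omega_i = 1/SEM_i**2 and SEM = 1/sqrt(omega)."""
    omegas = [1.0 / sem ** 2 for _, sem in trials]
    omega = sum(omegas)
    mu = sum(m * w for (m, _), w in zip(trials, omegas)) / omega
    return mu, 1.0 / sqrt(omega)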

4.2 Obtaining Sampling Distributions from Clinical Trials

In this subsection, we show how we obtain sampling distributions from clinical trials. The statistical results in clinical trials reports fall into three categories:

• Category I: A sampling distribution can be identified directly, since both μ and SEM are given.
• Category II: A sampling distribution can be identified with a missing SEM, since only μ is given.
• Category III: A sampling distribution can be constructed from a given confidence interval.

After looking through a large collection of papers of clinical trials on IOP reductions and on comparing drugs for type-2 diabetes, we believe that these three categories cover a significant proportion of statistical information (e.g., [4, 7, 15, 21, 30, 32, 34, 37, 40], etc.). For each category of statistical information, we interpret it in terms of a sampling distribution and then put the distribution into the corresponding XML document. We use X to denote the sample mean implied in the context of each clinical report.



For the first category, a sampling distribution is explicitly given; for example, X ∼ N(9.3, 2.9) gives

<SampleDist value="LDL-C">
  <Mean>9.3</Mean>
  <SEM>2.9</SEM>
</SampleDist>

For the second category, a sampling distribution can be defined with a missing SEM; for instance, X ∼ N(5.9, SEM), so we have an XML segment such as

<SampleDist value="LDL-C">
  <Mean>5.9</Mean>
</SampleDist>

For the third category of information, a confidence interval [a, b] is given. It is then possible to convert this confidence interval into a sampling distribution as follows:

μ = (a + b) / 2,   SEM = (b − a) / (2k).

By convention, the presented analysis of clinical trials results usually uses the 95% confidence interval, in which case k = 1.96. However, if a given confidence interval is not the usual 95% confidence interval (say, it is a p-confidence interval), the standardization of the normal distribution gives P(Z ∈ [−k, k]) = p, and the value k can be found by looking up the standard normal distribution table. Therefore, in this case too we get an XML representation in the same way as for Category I. To summarize, from our investigation we can obtain sampling distributions (some with missing SEMs) from all three categories of information.
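The conversion can be written down directly; in the sketch below, the inverse normal CDF from Python's statistics module replaces the table lookup.

from statistics import NormalDist

def ci_to_sampling_dist(a, b, p=0.95):
    """Turn a p-confidence interval [a, b] into (mu, SEM).
    k solves P(Z in [-k, k]) = p for a standard normal Z, i.e.
    k = Phi^{-1}((1 + p) / 2); k is about 1.96 for p = 0.95."""
    k = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return (a + b) / 2.0, (b - a) / (2.0 * k)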

4.3 Inconsistency Analysis among Trials

In this subsection, we investigate how to analyze potential inconsistencies among trials. Based on this analysis, we are able to remove some highly conflicting trials from a given set of trials with full statistical information. We only want to identify and remove those trials which may have been conducted on a different population. In fact, clinicians believe that only this type of inconsistency should result in a trial being excluded from a meta-analysis: if a trial is drawn from a different population, it should not be considered together with the other trials.

We make the assumption that the same or a similar population has the same or a similar standard deviation. That is, for k given trials with full statistical information, we want to measure whether the values σi = SEMi √ni, 1 ≤ i ≤ k, are approximately equal. In other words, these σi = SEMi √ni values should all lie in a reasonably tight interval, and this is the principle of our method for identifying inconsistent trials. Note that both SEMi and ni can be extracted from the XML documents (i.e., from the elements SEM and PatientsNum).



Assume that we have a set of values σ1, ..., σk from k trials. Without loss of generality, we can also assume that the list is already sorted, i.e., σ1 ≤ ... ≤ σk. First, we find the median md of the list: if k is odd, then md = σ_{(k+1)/2}, else md = (σ_{k/2} + σ_{k/2+1}) / 2. We choose the median value instead of the mean of the list because inconsistent σi values may distort the mean too much, while the median is more stable. For example, if a given list has the values {19, 27, 40, 400}, then md is 33.5, but the mean is 121.5. Obviously, 33.5 is closer to most of the σi values than 121.5 is, hence 33.5 can be used to identify the inconsistent trial(s) while 121.5 cannot.

Second, we check each σi against md to see to what extent it diverges from md. For this purpose, we set a threshold t and generate an interval MDT = [md/t, md * t]. If σi ∈ MDT, then trial i is consistent with most of the other trials and should be kept; otherwise, it is identified as an inconsistent trial. Note that the σi values always vary, so the interval MDT should not be too narrow, otherwise it would reject too many σi values even if some are acceptable. Conversely, MDT should not be too broad, otherwise some highly conflicting σi values would be included. After looking through a large number of trials results, at the moment we consider t = 4 an applicable threshold. Formally, we define the algorithm as follows.

Algorithm IncRemover
Begin
  Input: k trials with (SEM1, n1), ..., (SEMk, nk), and threshold t.
  For i = 1 to k, let σi = SEMi √ni;
  Sort σ1, ..., σk into σ1^s, ..., σk^s in ascending order;
  If k is odd, let md = σ^s_{(k+1)/2}, else let md = (σ^s_{k/2} + σ^s_{k/2+1}) / 2;
  For i = 1 to k, if md/t ≤ σi ≤ md * t then keep trial i, otherwise remove trial i.
  Output: All remaining trials.
End

This algorithm is in fact an instantiated selection function (Def. 3) for which R(trial_i, trial_j) iff σi, σj ∈ MDT.

Proposition 1. Given k trials, let σi be the standard deviation of trial i, and define a selection function S for which R(trial_i, trial_j) iff σi, σj ∈ MDT; then the output of algorithm IncRemover is the same as the subset selected by S.

The proof is straightforward and omitted.
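A Python transcription of IncRemover (using the same t = 4 as a default) could read as follows; the function name and data layout are ours.

from math import sqrt
from statistics import median

def inc_remover(trials, t=4.0):
    """Algorithm IncRemover. trials: list of (SEM_i, n_i) pairs.
    Keeps trial i only if sigma_i = SEM_i * sqrt(n_i) lies in
    MDT = [md / t, md * t], where md is the median of the sigma_i."""
    sigmas = [sem * sqrt(n) for sem, n in trials]
    md = median(sigmas)
    return [trial for trial, s in zip(trials, sigmas)
            if md / t <= s <= md * t]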

4.4 The Prognostic Method and the Interval Method

In this subsection, we introduce the methods proposed in [28] to simulate meta-analysis when some trials results do not have complete information. It should be noted that these methods are also applicable to the between-group difference of two drugs/therapies [28]. Below we present the two methods as merging rules, based on Def. 4, in the XML framework.



Assume there are k + l trials altogether, where k trials have full information, i.e., (μ1, SEM1, n1), ..., (μk, SEMk, nk), and l trials have partial information, i.e., (μ_{k+1}, n_{k+1}), ..., (μ_{k+l}, n_{k+l}). The task of meta-analysis is to obtain the merged result of these k + l trials. The prognostic method [28] uses the following equation to predict the missing SEM_j value of trial j (k < j ≤ k + l) with sample size n_j, given that each of the k complete trials has its SEM_i value and sample size n_i:

SEM_j = (∑_{i=1}^{k} SEM_i √n_i) / (k √n_j)   (2)

When all SEM_j, k < j ≤ k + l, have been calculated, the standard meta-analysis method, i.e., Equation 1, can be used to merge all k + l trials. This prognostic method can be defined as an instantiation of an XML merging rule as follows.

Definition 9. Given k + l mergeable Result elements such that for 1 ≤ i ≤ k, the SampleDist element in the i-th Result element has both Mean and SEM sub-tags,

<Result>
  ...
  <PatientsNum>z_i</PatientsNum>
  <SampleDist ref = purpose>
    <Mean>μ_i</Mean>
    <SEM>SEM_i</SEM>
  </SampleDist>
</Result>

and for k < j ≤ k + l, the SampleDist element in the j-th Result element has only the Mean sub-tag,

<Result>
  ...
  <PatientsNum>z_j</PatientsNum>
  <SampleDist ref = purpose>
    <Mean>μ_j</Mean>
  </SampleDist>
</Result>

the meta-analysis result by the prognostic method is

<Result>
  ...
  <PatientsNum>∑_{i=1}^{k+l} z_i</PatientsNum>
  <SampleDist ref = purpose>
    <Mean>μ</Mean>
    <SEM>SEM</SEM>
  </SampleDist>
</Result>

where μ and SEM are obtained from Equation 1, in which the SEM_j, k < j ≤ k + l, are obtained from Equation 2.
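A self-contained sketch of the prognostic method in Python follows; the function name and data layout are our own assumptions.

from math import sqrt

def prognostic_merge(full, partial):
    """Prognostic method: complete each missing SEM with Equation 2,
    then merge everything with Equation 1.
    full: list of (mu_i, SEM_i, n_i); partial: list of (mu_j, n_j)."""
    k = len(full)
    s = sum(sem * sqrt(n) for _, sem, n in full)  # sum of SEM_i * sqrt(n_i)
    mus = [mu for mu, _, _ in full] + [mu for mu, _ in partial]
    sems = [sem for _, sem, _ in full] + \
           [s / (k * sqrt(n)) for _, n in partial]  # Equation 2
    omegas = [1.0 / sem ** 2 for sem in sems]       # Equation 1
    omega = sum(omegas)
    mu = sum(m * w for m, w in zip(mus, omegas)) / omega
    return mu, 1.0 / sqrt(omega)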



In contrast, instead of estimating a single value for each missing SEM as the prognostic method does, the interval method [28] estimates a reliable interval for each missing SEM. Let

μ¹_{k+l} = ( ∑_{i=1}^{k} μ_i ω_i + ∑_{i=k+1}^{k+l} n_i μ_i / σ̲_i² ) / ( ∑_{i=1}^{k} ω_i + ∑_{i=k+1}^{k+l} n_i / σ̲_i² )

and

μ²_{k+l} = ( ∑_{i=1}^{k} μ_i ω_i + ∑_{i=k+1}^{k+l} n_i μ_i / σ̄_i² ) / ( ∑_{i=1}^{k} ω_i + ∑_{i=k+1}^{k+l} n_i / σ̄_i² ),

where, ∀i, k + 1 ≤ i ≤ k + l, we let

σ̲_i = σ_min, σ̄_i = σ_max, if μ_i ≤ μ̄_k, and
σ̲_i = σ_max, σ̄_i = σ_min, if μ_i > μ̄_k

(here μ̄_k is the merged mean of the k complete trials obtained from Equation 1, and σ_min, σ_max are the smallest and largest standard deviations σ_i = SEM_i √n_i among the k complete trials). Then the interval method gives the following result. Let X_i ∼ N(μ_i, SEM_i), 1 ≤ i ≤ k + l, denote the i-th sampling distribution with sample size n_i, such that SEM_i is assumed missing when i > k; then the merged result N(μ, SEM) of applying the interval method to these k + l trials is

μ ∈ [μ¹_{k+l}, μ²_{k+l}],   SEM² ∈ [ 1 / (∑_{i=1}^{k} ω_i + ∑_{i=k+1}^{k+l} n_i / σ_min²), 1 / (∑_{i=1}^{k} ω_i + ∑_{i=k+1}^{k+l} n_i / σ_max²) ].   (3)

The interval method is represented as an instantiation of an XML merging rule as follows.

Definition 10. Given k + l mergeable Result elements such that for 1 ≤ i ≤ k, the SampleDist element in the i-th Result element has both Mean and SEM sub-tags, and for k < j ≤ k + l, the SampleDist element in the j-th Result element has only the Mean sub-tag (as displayed in Definition 9), the meta-analysis result by the interval method is

<Result>
  ...
  <PatientsNum>∑_{i=1}^{k+l} z_i</PatientsNum>



  <SampleDist ref = purpose>
    <MeanInv>μ</MeanInv>
    <SEMInv>SEM</SEMInv>
  </SampleDist>
</Result>

where μ and SEM are described by Equation 3.

Recall that an XML merging framework is represented by a pair of a selection function and a merging rule. By now, with Def. 8 and the algorithm of the last subsection as two selection functions S1 and S2, respectively (i.e., S1 is used to select mergeable trials and S2 is used to select consistent trials), and with Def. 9 and Def. 10 as two merging rules M1 and M2, respectively, we have created two instantiated XML merging frameworks (S2 ◦ S1, M1) and (S2 ◦ S1, M2), where S2 ◦ S1 is the composition of S1 and S2, meaning that S1 is applied first and S2 is then applied to the result of S1.
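A possible Python transcription of the interval method (Equation 3) is sketched below, under our reading of the σ̲/σ̄ assignments reconstructed above; the function name and data layout are assumptions.

from math import sqrt

def interval_merge(full, partial):
    """Interval method, Equation 3.
    full: list of (mu_i, SEM_i, n_i); partial: list of (mu_j, n_j).
    Returns ((mu_low, mu_high), (sem_low, sem_high)). sigma_min and
    sigma_max range over the complete trials; mu_k is their merged
    mean from Equation 1."""
    omegas = [1.0 / sem ** 2 for _, sem, _ in full]
    w = sum(omegas)
    mu_k = sum(m * o for (m, _, _), o in zip(full, omegas)) / w
    sigmas = [sem * sqrt(n) for _, sem, n in full]
    s_min, s_max = min(sigmas), max(sigmas)

    def bound(lower):
        # lower bound: small means get heavy weight (underlined sigma);
        # upper bound: large means get heavy weight (overlined sigma)
        num = sum(m * o for (m, _, _), o in zip(full, omegas))
        den = w
        for mu_j, n_j in partial:
            s = s_min if (mu_j <= mu_k) == lower else s_max
            num += n_j * mu_j / s ** 2
            den += n_j / s ** 2
        return num / den

    sem_low = sqrt(1.0 / (w + sum(n / s_min ** 2 for _, n in partial)))
    sem_high = sqrt(1.0 / (w + sum(n / s_max ** 2 for _, n in partial)))
    return (bound(True), bound(False)), (sem_low, sem_high)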

5 Case Studies

5.1 A Case Study of Diabetes Medications

In this subsection, we use data on oral diabetes medication for adults with Type-2 diabetes as our first case study. Many research papers and reports have been published on the effectiveness of various oral medications for Type-2 diabetes (e.g., [27, 9, 41, 35, 29], etc.). Clinicians and patients need a thorough comparison of these oral medications with respect to different aspects of Type-2 diabetes. Meta-analysis is the most frequently used technique for this purpose: it systematically reviews and compares each pair of drugs or therapies from different perspectives. For oral medication of Type-2 diabetes, the meta-analysis in [3] compares each pair of drugs on systolic blood pressure (SBP for short), diastolic blood pressure (DBP), low density lipoprotein cholesterol (LDL-C) and high density lipoprotein cholesterol (HDL-C), etc. In this section, we create the XML documents for clinical trials reports and then merge the information contained in these XML documents. Here the meta-analysis is on the between-group differences in the effectiveness of pairs of drugs for LDL-C.

Example 3. (Thiazolidinedione versus second generation Sulfonylureas) The low density lipoprotein effect is studied in many papers, which compare LDL-C between different trial groups. For example, to compare Thiazolidinedione and second generation Sulfonylureas, we obtained five clinical trials reports, i.e., [27, 9, 41, 35, 29]. Due to space limitations, we only provide a simplified XML document for [41].

<Trial>
  <Source>
    <URL>http://www3.interscience.wiley.com/cgi-bin/fulltext/118782480/PDFSTART</URL>



    <Title>Sustained effects of pioglitazone vs. glibenclamide on insulin sensitivity, glycaemic control, and lipid profiles in patients with Type 2 diabetes</Title>
    <Author>M. H. Tan, D. Johns, J. Strand, et al</Author>
  </Source>
  <Objective>
    <Drug id="P1">
      <Name>Pioglitazone</Name>
      <DrugCategory id="Tria">Thiazolidinedione</DrugCategory>
    </Drug>
    <Drug id="G1">
      <Name>Glibenclamide</Name>
      <DrugCategory id="sgs">second generation Sulfonylureas</DrugCategory>
    </Drug>
    <Aim>To compare effects of different oral hypoglycemic drugs as first-line therapy on lipoprotein subfractions in type 2 diabetes</Aim>
    <PatientsType>type 2 Diabetes</PatientsType>
  </Objective>
  <MainOutcome>
    ...
    <Result>
      <Drug ref="P1vsG1"/>
      <Duration>52 weeks</Duration>
      <Unit>mg/dl</Unit>
      <PatientsNum>100</PatientsNum>
      <SampleDist value="P1vsG1">
        <Mean>6.63</Mean>
      </SampleDist>
    </Result>
    ...
  </MainOutcome>
  <Conclusion>
    <CompareEfficacy>
      <Drug ref="P1"/>
      <Degree>more sustained</Degree>
      <Drug ref="G1"/>
      <Duration>52 weeks</Duration>
    </CompareEfficacy>
  </Conclusion>
</Trial>



Note that in [41] the mean change for drug P1 is 0.14 and for G1 is −0.03 in the unit of mmol/L. After using ontologies to relate mmol/L and mg/dl, we converted the unit of measurement to mg/dl and obtained the between-group difference of P1 and G1 as 6.63 = (0.14 + 0.03) * 39 mg/dl. The sampling distributions (in mg/dl) from the five trials are as follows:

[27]: XLT ∼ N(10.5, 14.44) with n = 20.
[9]: XCM ∼ N(11.31, 1.59) with n = 620.
[41]: XTJ ∼ N(6.63, SEM_TJ) with n = 100.
[35]: XPM ∼ N(5, 6.04) with n = 86.
[29]: XMC ∼ N(14.6, SEM_MC) with n = 315.

Here n is the sample size (number of patients) in each group of a trial, and SEM_TJ and SEM_MC stand for the missing SEM values of their respective trials; there are thus two missing SEM values. Applying the prognostic method, we get the between-group difference in LDL-C as XP ∼ N(11.35, 1.32). Alternatively, if we use the interval method, we get XI ∼ N([10.91, 11.88], [1.20, 1.38]). In [3], meta-analysis with known SEM_TJ and SEM_MC gives XBW ∼ N(10.4, 1.61) from these five trials; XP is reasonably close to XBW.

The XML output by the prognostic method (Definition 9) is as follows.

...
<Result>
  ...
  <PatientsNum>1141</PatientsNum>
  <SampleDist ref="Thiazolidinedione vs second generation Sulfonylureas">
    <Mean>11.35</Mean>
    <SEM>1.32</SEM>
  </SampleDist>
</Result>
...

The XML output by the interval method (Definition 10) is as follows.

...
<Result>
  ...
  <PatientsNum>1141</PatientsNum>
  <SampleDist ref="Thiazolidinedione vs second generation Sulfonylureas">
    <MeanInv>
      <Min>10.91</Min>
      <Max>11.88</Max>



/MeanInv SEMInv Min 1.20 /Min Max 1.38 /Max /SEMInv /SampleDist /Result ...

5.2 A Case Study on Neurocognitive Outcomes

In this subsection, we use data on neurocognitive outcomes after off-pump versus on-pump coronary revascularisation as our second case study. Off-pump (beating heart) coronary artery bypass grafting (CABG) is very popular, as it is considered to have numerous theoretical benefits, including a lower incidence of stroke and neurocognitive dysfunction. Considerable attention has therefore been devoted to this area (e.g., [14, 25, 26, 43], etc.). We focus on a meta-analysis paper [31] on this topic, which undertook quantitative systematic reviews to assess whether there were significant differences in neurocognitive outcomes in patients after undergoing off-pump versus on-pump CABG.

As [31] provided a set of trials with full statistical information, i.e., both the mean and the SEM values, in order to apply the methods of the last section we deleted an SEM value from a randomly selected trial and applied the prognostic and interval methods to predict the missing value. We then applied the meta-analysis method to merge the trial with the predicted SEM value together with the rest of the trials in the group, to see how close this new result is to the original meta-analysis result. Furthermore, as the traditional method always abandons trials with incomplete information, we also compared our methods with this traditional method. In the following example, we create the XML documents for clinical trials reports and then merge the information contained in these XML documents.

Example 4. (Neurocognitive outcomes after off-pump versus on-pump coronary revascularisation) Neurocognitive outcomes after off-pump or on-pump coronary revascularisation have been studied in many papers. Here, we take four clinical trials reports, i.e., [14, 25, 26, 43], to survey whether there were significant differences in neurocognitive outcomes in patients after undergoing off-pump versus on-pump CABG. Due to space limitations, we only provide a simplified XML document for [14].

<Trial>
  <Source>
    <URL>http://ats.ctsnetjournals.org/cgi/content/abstract/81/6/2105</URL>
    <Title>Neurocognitive Outcomes in Off-Pump Versus On-Pump Bypass



    Surgery: A Randomized Controlled Trial</Title>
    <Author>Ernest C S, Worcester M U, Tatoulis J, Elliott P C, Murphy B M, Higgins R O, LeGrande M R, Goble A J</Author>
  </Source>
  ...
  <MainOutcome>
    ...
    <Result>
      <Drug ref="off-pump vs on-pump coronary revascularisation"/>
      <PatientsNum>47</PatientsNum>
      <Duration>47</Duration>
      <SampleDist value="off-pump vs on-pump coronary revascularisation">
        <Mean>-0.34</Mean>
        <SEM>2.49</SEM>
      </SampleDist>
    </Result>
    ...
  </MainOutcome>
  ...
</Trial>

The sampling distributions (no unit) from these four trials are as follows:

[14]: XEW ∼ N(−0.34, 2.49) with n = 47.
[25]: XLL ∼ N(1.00, 3.71) with n = 27.
[26]: XLS ∼ N(4.40, 1.82) with n = 54.
[43]: XVJ ∼ N(−4.00, 1.26) with n = 130.

Here we delete the SEM value of XEW (the other cases are similar). Applying the prognostic method, we get the merged sampling distribution XP ∼ N(−0.99, 0.91). Alternatively, using the interval method, we get XI ∼ N([−1.07, −0.92], [0.89, 0.94]). The traditional meta-analysis with full statistical data, i.e., with SEM_EW = 2.49 known, gives Xfull ∼ N(−1.01, 0.93) from these four trials, while traditional meta-analysis for trials with incomplete information, i.e., abandoning XEW, gives Xtrad ∼ N(−1.11, 1.00). Clearly, XP is closer to Xfull than Xtrad is.

The XML output by the prognostic method (Definition 9) is as follows.

...
<Result>



  ...
  <PatientsNum>258</PatientsNum>
  <SampleDist ref="off-pump vs on-pump coronary revascularisation">
    <Mean>-0.99</Mean>
    <SEM>0.91</SEM>
  </SampleDist>
</Result>
...

The XML output by the interval method (Definition 10) is as follows.

...
<Result>
  ...
  <PatientsNum>258</PatientsNum>
  <SampleDist ref="off-pump vs on-pump coronary revascularisation">
    <MeanInv>
      <Min>-1.07</Min>
      <Max>-0.92</Max>
    </MeanInv>
    <SEMInv>
      <Min>0.89</Min>
      <Max>0.94</Max>
    </SEMInv>
  </SampleDist>
</Result>
...

6 Conclusion

In this paper, we proposed an XML-based framework to represent clinical trials information and to merge it, which makes the framework an automatic tool for meta-analysis. The main task is to represent and merge the statistical information in XML documents. Moreover, we used two case studies, the Type-2 diabetes case and the case of neurocognitive outcomes after off-pump versus on-pump coronary revascularisation, to verify our framework.

Dealing with missing data in statistics, especially in meta-analysis, is a very important topic (e.g., [24], [6], [44]). However, there are hardly any papers focusing on missing standard errors; [28] proposed some important results on how to deal with this situation. This paper used the methods in [28] to create a formal XML merging framework.

There are a number of issues we will look at further. First, improvements can be made to the XML document structure to cover a wider range of clinical trials reports. Second, although we briefly discussed dealing with semantic heterogeneity in Section 2.4, the role of ontologies, indexing schemes, and restricted



vocabularies, etc., for both the definitions of the XML tags and for the text entries should be studied further. The creation of an application oriented ontology should facilitate the automated merging. Third, we will examine information extraction tools to see how information from clinical trials reports can be efficiently extracted, in order to generate XML documents automatically.

Appendix

In this appendix, we provide the full structure of our XML documents. To accommodate our special needs in recording clinical trials information, we investigated many clinical trials reports and adapted the DTD of the XML documents accordingly, as follows. First, the information in each clinical report is put in a Trial element. We define the Trial element and its children as in Fig. 2. Here the ? sign shows that the

Fig. 2 DTD adaptation of the Trial element

SideEffect child-element is optional. The Source element is used to provide some general information for a clinical trial report. We define it as in Fig. 3.

Fig. 3 DTD adaptation of the Source element

Example 5. The following is a Source element.

<Source>
  <URL>http://www.neurology.org/cgi/content/abstract/65/9/1415</URL>
  <Title>Prevalence and size of directly detected patent foramen ovale in migraine with aura</Title>
  <Author>Schwerzmann M, Nedeltchev K, Lagger F, Mattle HP, Windecker S, Meier B, et al</Author>
</Source>

The Objective element tells the objective of a clinical trial. We define it as in Fig. 4. Here the DrugCategory element is for a category of drugs; e.g., Glibenclamide is a kind of second generation Sulfonylureas, which is a category of drugs. In addition,



Fig. 4 DTD adaptation of the Objective element

so far the PatientsType element is a leaf element. In further work it may need to be changed to a composite element containing sub-elements such as AverageAge, Nationality, etc.

Example 6. The following is an Objective element.

<Objective>
  <Drug id="l1">
    <Name>latanoprost</Name>
  </Drug>
  <Aim>To test the drug efficacy for intraocular pressure reduction</Aim>
  <PatientsType>Black American with Glaucoma</PatientsType>
</Objective>

The MainOutcome element is defined as in Fig. 5.

Fig. 5 DTD adaptation of the MainOutcome element

Here the Result element is the one defined in Section 3.2. The reason why we put the Duration element as a child of the Result element, instead of as a child of the MainOutcome element, is that a clinical trial can have more than one duration period. For example, a 12-month trial may provide results at the 3rd, 6th, 9th and 12th month.

Example 7. The following is a MainOutcome element.

<MainOutcome>
  <Result>
    <Drug ref="l1"/>
    <Duration>3 month</Duration>
    <Unit>mmol/L</Unit>
    <PatientsNum>247</PatientsNum>
    <pValue>0.016</pValue>
    <SampleDist value="intraocular pressure reduction">
      <Mean>4.1</Mean>



      <SEM>3.8</SEM>
    </SampleDist>
  </Result>
</MainOutcome>

The SideEffect element contained by the Trial element is defined as in Fig. 6.

Fig. 6 DTD adaptation of the SideEffect element

The adverse event may concern a single drug or compare two drugs, and the descriptions for side effects include (but are not limited to)

• may (cause)
• * more conjunctival (than)
• * increased incident (than)
• * higher percentage (than)
• well tolerated
• not severe

Here words marked with * are used for comparisons between two drugs.

Example 8. The following is a SideEffect element.

<SideEffect>
  <Report>
    <AdverseEvent>cough</AdverseEvent>
    <Compare>
      <Drug ref="p1"/>
      <Degree>increased incident</Degree>
      <Drug ref="l1"/>
    </Compare>
  </Report>
</SideEffect>



Fig. 7 DTD adaptation of the Conclusion element

The Conclusion element is defined as in Fig. 7. The descriptions for efficacies include (but are not limited to)

• * significantly (better than)
• * no significant difference (with)
• * greater hypotensive (efficacy) (than)
• * more effective (than)
• * not more effective (than)
• * equivalent (to)
• * significantly greater (than)
• significantly
• * superior (than)

Similarly, words marked with * are used for comparisons between two drugs.

Example 9. The following is a Conclusion element.

<Conclusion>
  <Efficacy>
    <Drug ref="p1"/>
    <Degree>significantly</Degree>
    <Duration>52 weeks</Duration>
  </Efficacy>
</Conclusion>

Finally, we provide an example of a clinical trial report and its full XML document as follows.



Example 10. A clinical trial report entitled "The effects of prostaglandin analogues on the blood aqueous barrier and corneal thickness of phakic patients with primary open-angle glaucoma and ocular hypertension" can be found at the following link: http://www.ncbi.nlm.nih.gov/pubmed/16936646?dopt=Abstract. The authors are Arcieri ES, Pierre Filho PT, Wakamatsu TH, Costa VP. The main summary is in its abstract:

PURPOSE: To evaluate the effects of topical latanoprost, travoprost, and bimatoprost on the blood-aqueous barrier and central corneal thickness (CCT) of patients with primary open-angle glaucoma (POAG) and ocular hypertension (OHT).

DESIGN: Prospective, randomized, masked-observer, crossover clinical trial.

METHODS: A total of 34 phakic patients with POAG or OHT with no previous history of intraocular surgery or uveitis completed the study. Patients were randomized to use latanoprost 0.005%, travoprost 0.004%, or bimatoprost 0.03% once daily (2000 hours) for 1 month, followed by a washout period of 4 weeks between each drug. Aqueous flare was measured with a laser flare metre. CCT was calculated as the average of five measurements using ultrasound pachymetry. All measurements were performed by a masked observer (1000 h).

RESULTS: There were no statistically significant differences between baseline mean IOP, mean CCT, and mean flare values among the groups. There was no statistically significant increase in mean flare values from baseline in all groups (P > 0.05). There were no statistically significant differences between mean flare values among the groups (P > 0.05). All medications significantly reduced the mean IOP from baseline (P < 0.0001). The IOP reduction obtained with travoprost (7.3+/-3.8 mmHg) was significantly higher than that obtained with latanoprost (4.7+/-4.2 mmHg) (P=0.01). A statistically significant reduction in mean CCT (0.6+/-1.3%) from baseline was observed when patients instilled bimatoprost (P=0.01).

CONCLUSIONS: Latanoprost, travoprost, and bimatoprost had no statistically significant effect on the blood-aqueous barrier of phakic patients with POAG or OHT. Bimatoprost may be associated with a clinically irrelevant reduction in mean CCT.

The corresponding full XML document is extracted as follows.

<Trial>
  <Source>
    <URL>http://www.ncbi.nlm.nih.gov/pubmed/16936646?dopt=Abstract</URL>
    <Title>The effects of prostaglandin analogues on the blood aqueous barrier and corneal thickness of phakic patients with primary open-angle glaucoma and ocular hypertension</Title>
    <Author>Arcieri ES, Pierre Filho PT, Wakamatsu TH, Costa VP</Author>
  </Source>
  <Objective>
    <Drug id="drug-a">
      <Name>latanoprost 0.005%</Name>
    </Drug>
    <Drug id="drug-b">



      <Name>travoprost 0.004%</Name>
    </Drug>
    <Drug id="drug-c">
      <Name>bimatoprost 0.03%</Name>
    </Drug>
    <Aim>Blood-aqueous barrier and central corneal thickness</Aim>
    <PatientsType>primary open-angle glaucoma (POAG) and ocular hypertension (OHT)</PatientsType>
  </Objective>
  <MainOutcome>
    <Result>
      <Drug ref="drug-a"/>
      <Duration>1 month</Duration>
      <Unit>mmHg</Unit>
      <PatientsNum>34</PatientsNum>
      <pValue>0.01</pValue>
      <SampleDist value="IOP Reduction">
        <Mean>4.7</Mean>
        <SEM>4.2</SEM>
      </SampleDist>
    </Result>
    <Result>
      <Drug ref="drug-b"/>
      <Duration>1 month</Duration>
      <Unit>mmHg</Unit>
      <PatientsNum>34</PatientsNum>
      <pValue>0.0001</pValue>
      <SampleDist value="IOP Reduction">
        <Mean>7.3</Mean>
        <SEM>3.8</SEM>
      </SampleDist>
    </Result>
  </MainOutcome>
  <SideEffect>
    <Report>
      <AdverseEvent>irrelevant reduction in mean CCT</AdverseEvent>
      <Drug ref="drug-c"/>
      <Degree>may</Degree>
    </Report>
  </SideEffect>
  <Conclusion>
    <Efficacy>
      <Drug ref="drug-a"/>



      <Degree>significantly</Degree>
      <Duration>1 month</Duration>
      <FromBaseline>yes</FromBaseline>
    </Efficacy>
    <Efficacy>
      <Drug ref="drug-b"/>
      <Degree>significantly</Degree>
      <Duration>1 month</Duration>
      <FromBaseline>yes</FromBaseline>
    </Efficacy>
    <Efficacy>
      <Drug ref="drug-c"/>
      <Degree>significantly</Degree>
      <Duration>1 month</Duration>
      <FromBaseline>yes</FromBaseline>
    </Efficacy>
    <CompareEfficacy>
      <Drug ref="drug-a"/>
      <Degree>no significant difference</Degree>
      <Drug ref="drug-b"/>
      <Duration>1 month</Duration>
    </CompareEfficacy>
  </Conclusion>
</Trial>

References

1. Abiteboul, S., Segoufin, L., Vianu, V.: Representing and querying XML with incomplete information. ACM Trans. Database Syst. 31(1), 208–254 (2006)
2. Barbara, D., Garcia-Molina, H., Porter, D.: The management of probabilistic data. IEEE Trans. on Knowledge and Data Engineering 4(5), 487–502 (1992)
3. Bolen, S., Wilson, L., Vassy, J., Feldman, L., Yeh, J., Marinopoulos, S., Wilson, R., Cheng, D., Wiley, C., Selvin, E., Malaka, D., Akpala, C., Brancati, F., Bass, E.: Comparative effectiveness and safety of oral diabetes medications for adults with type 2 diabetes. Comparative effectiveness review (8) (2007)
4. Chiselita, D., Antohi, I., Medvichi, R., Danielescu, C.: Comparative analysis of the efficacy and safety of latanoprost, travoprost and the fixed combination timolol-dorzolamide; a prospective, randomized, masked, cross-over design study. Oftalmologia 49(3), 39–45 (2005)
5. Crangle, C.E., Cherry, J.M., Hong, E.L., Zbyslaw, A.: Mining experimental evidence of molecular function claims from the literature. Bioinformatics 23, 3232–3240 (2007)
6. Copas, J.B., Eguchi, S.: Local model uncertainty and incomplete-data bias. J. R. Statist. Soc. B 67(4), 459–513 (2005)



7. Cantor, L.B., Hoop, J., Morgan, L., Wudunn, D., Catoira, Y.: Bimatoprost-Travoprost Study Group: Intraocular pressure-lowering efficacy of bimatoprost 0.03% and travoprost 0.004% in patients with glaucoma or ocular hypertension. Br. J. Ophthalmol. 90(11), 1370–1373 (2006)
8. Cowie, J., Lehnert, W.: Information extraction. Communications of the ACM 39, 81–91 (1996)
9. Charbonnel, B.H., Matthews, D.R., Schernthaner, G., Hanefeld, M., Brunetti, P., for the QUARTET Study Group: A long-term comparison of pioglitazone and gliclazide in patients with Type 2 diabetes mellitus: a randomized, double-blind, parallel-group comparison trial. Diabetic Medicine 22, 399–405 (2004)
10. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, ACL 2002 (2002)
11. Combi, C., Oliboni, B., Rossato, R.: Merging multimedia presentations and semistructured temporal data: a graph-based model and its application to clinical information. Artificial Intelligence in Medicine (2005)
12. Cavallo, R., Pittarelli, M.: The theory of probabilistic databases. In: Proc. of VLDB 1987, pp. 71–81 (1987)
13. Clegg, A., Shepherd, A.: Benchmarking natural-language parsers for biological applications using dependency graphs. BMC Bioinformatics 8, 24 (2007)
14. Ernest, C.S., Worcester, M.U., Tatoulis, J., Elliott, P.C., Murphy, B.M., Higgins, R.O., LeGrande, M.R., Goble, A.J.: Neurocognitive outcomes in off-pump versus on-pump bypass surgery: a randomized controlled trial. Ann. Thorac. Surg. 81(6), 2105–2114 (2006)
15. Gracia-Feijo, J., Martinez-de-la-Casa, J.M., Castillo, A., Mendez, C., Fernandez-Vidal, A., Garcia-Sanchez, J.: Circadian IOP-lowering efficacy of travoprost 0.004% ophthalmic solution compared to latanoprost 0.005%. Curr. Med. Res. Opin. 22(9), 1689–1697 (2006)
16. Greenhalgh, T.: How to Read a Paper: The Basics of Evidence-Based Medicine. BMJ Press (1997)
17. Hunter, A., Liu, W.: Fusion rules for merging uncertain information. Information Fusion 7, 97–114 (2006)
18. Hunter, A., Liu, W.: Merging uncertain information with semantic heterogeneity in XML. Knowledge and Information Systems 9(2), 230–258 (2006)
19. Hunter, A., Liu, W.: A logical reasoning framework for modelling and merging uncertain semi-structured information. In: Bouchon-Meunier, B., Coletti, G., Yager, R.R. (eds.) Modern Information Processing: From Theory to Applications, pp. 345–356. Elsevier, Amsterdam (2006)
20. Hunter, L., Lu, Z., Firby, J., Baumgartner Jr., W.A., Johnson, H.L., Ogren, P.V., Cohen, K.B.: An open-source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-specific gene expression. BMC Bioinformatics 9(1), 78 (2008)
21. Howard, S., Silvia, O.N., Brian, E., John, S., Sushanta, M., Theresa, A., Michael, V.: The safety and efficacy of travoprost 0.004%/timolol 0.5% fixed combination ophthalmic solution. Am. J. Ophthalmology 140(1), 1–8 (2005)
22. Hirschman, L., Yeh, A., Blaschke, C., Valencia, A.: Critical assessment of information extraction for biology. BMC Bioinformatics 6(suppl. 1), S11 (2005)
23. van Keulen, M., de Keijzer, A., Alink, W.: A probabilistic XML approach to data integration. In: Proceedings of ICDE 2005, pp. 459–470 (2005)



24. Lu, G., Copas, J.B.: Missing at random, likelihood ignorability and model completeness. The Annals of Statistics 32(2), 754–765 (2004)
25. Lee, J.D., Lee, S.J., Tsushima, W.T., Yamauchi, H., Lau, W.T., Popper, J., Stein, A., Johnson, D., Lee, D., Petrovitch, H., Dang, C.R.: Benefits of off-pump bypass on neurologic and clinical morbidity: a prospective randomized trial. Ann. Thorac. Surg. 76(1), 18–25 (2003)
26. Lund, C., Sundet, K., Tennoe, B., Hol, P.K., Rein, K.A., Fosse, E., Russell, D.: Cerebral ischemic injury and cognitive impairment after off-pump and on-pump coronary artery bypass grafting surgery. Ann. Thorac. Surg. 80, 2126–2131 (2005)
27. Lawrence, J., Reid, J., Taylor, G., Stirling, C., Reckless, J.: Favorable effects of pioglitazone and metformin compared with gliclazide on lipoprotein subfractions in overweight patients with early type 2 diabetes. Diabetes Care 27(1), 41–46 (2004)
28. Ma, J., Liu, W., Hunter, A., Zhang, W.: Performing meta-analysis with incomplete statistical information in clinical trials. BMC Medical Research Methodology 8(1), 56 (2008)
29. Matthews, D.R., Charbonnel, B.H., Hanefeld, M., Brunetti, P., Schernthaner, G.: Long-term therapy with addition of pioglitazone to metformin compared with the addition of gliclazide to metformin in patients with type 2 diabetes: a randomized, comparative study. Diabetes Metab. Res. Rev. 21, 167–174 (2005)
30. Michael, T., David, W., Alan, L.: Projected impact of travoprost versus timolol and latanoprost on visual field deficit progression and costs among black glaucoma subjects. Trans. Am. Ophthalmol. Soc. 100, 109–118 (2002)
31. Marasco, S.F., Sharwood, L.N., Abramson, M.J.: No improvement in neurocognitive outcomes after off-pump versus on-pump coronary revascularisation: a meta-analysis. European Journal of Cardio-thoracic Surgery 33, 961–970 (2008)
32. Noecker, R.J., Earl, M.L., Mundorf, T.K., Silvestein, S.M., Phillips, M.: Comparing bimatoprost and travoprost in black Americans. Curr. Med. Res. Opin. 22(11), 2175–2180 (2006)
33. Nierman, A., Jagadish, H.: ProTDB: Probabilistic data in XML. In: Proc. of VLDB 2002. LNCS, vol. 2590, pp. 646–657. Springer, Heidelberg (2002)
34. Nicola, C., Michele, V., Tiziana, T., Francesco, C., Carlo, S.: Effects of travoprost eye drops on intraocular pressure and pulsatile ocular blood flow: a 180-day, randomized, double-masked comparison with latanoprost eye drops in patients with open-angle glaucoma. Curr. Ther. Res. 64(7), 389–400 (2003)
35. Pfützner, A., Marx, N., Lüben, G., Langenfeld, M., Walcher, D., Konrad, T., Forst, T.: Improvement of cardiovascular risk markers by pioglitazone is independent from glycemic control: results from the Pioneer study. Journal of the American College of Cardiology 45(12), 1925–1931 (2005)
36. http://protege.stanford.edu/
37. Parmarksiz, S., Yuksel, N., Karabas, V.L., Ozkan, B., Demirci, G., Caglar, Y.: A comparison of travoprost, latanoprost and the fixed combination of dorzolamide and timolol in patients with pseudoexfoliation glaucoma. Eur. J. Ophthalmol. 16(1), 73–80 (2006)
38. Qi, G., Hunter, A.: Measuring incoherence in description logic-based ontologies. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ISWC 2007. LNCS, vol. 4825, pp. 381–394. Springer, Heidelberg (2007)
39. Radev, D., Fan, W., Qi, H., Wu, H., Grewal, A.: Probabilistic question answering on the Web. In: Proc. of WWW 2002, pp. 408–419 (2002)
40. Stefan, C., Nenciu, A., Malcea, C., Tebeanu, E.: Axial length of the ocular globe and hypotensive effect in glaucoma therapy with prostaglandin analogs. Oftalmologia 49(4), 47–50 (2005)



41. Tan, M.H., Johns, D., Strand, J., Halse, J., Madsbad, S., Eriksson, J.W., Clausen, J., Konkoy, C.S., Herz, M., for the GLAC Study Group: Sustained effects of pioglitazone vs. glibenclamide on insulin sensitivity, glycaemic control, and lipid profiles in patients with Type 2 diabetes. Diabetic Medicine 21, 859–866 (2004)
42. Wang, Y., Liu, W., Bell, D.A.: Combining uncertain outputs from multiple ontology matchers. In: Prade, H., Subrahmanian, V.S. (eds.) SUM 2007. LNCS (LNAI), vol. 4772, pp. 201–214. Springer, Heidelberg (2007)
43. van Dijk, D., Jansen, E.W.L., Hijman, R., Nierich, A.P., Diephuis, J.C., Moons, K.G.M., Lahpor, J.R., Borst, C., Keizer, A.M.A., Grobbee, D.E., de Jaegere, P.P., Kalkman, C.J.: Cognitive outcome after off-pump and on-pump coronary artery bypass graft surgery: a randomized trial. JAMA 287, 1405–1412 (2002)
44. White, I.: Missing data and departures from randomised treatment in pragmatic trials, http://www.mrc-bsu.cam.ac.uk/BSUsite/Research/Section11.shtml
45. Zupan, B., Demsar, J., Katten, M., Ohori, M., Graefen, M., Bojanec, M., Beck, R.: Orange and decisions-at-hand: bridging predictive data mining and decision support. In: Proc. of ECML/PKDD 2001 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, September 2001, pp. 151–162 (2001)
46. http://en.wikipedia.org/wiki/Sampling_distribution


Aliança: A Proposal for a Fuzzy Database Architecture Incorporating XML

Raquel D. Rodrigues, Adriano J. de O. Cruz, and Rafael T. Cavalcanti

Abstract. This chapter presents and discusses the main characteristics of the new fuzzy database Aliança (Alliance)¹. This name reflects the fact that the system is the union of fuzzy logic techniques, a relational database management system and a fuzzy meta-knowledge base defined in XML. Aliança accepts a wide range of data types, including all information already handled by traditional databases, and also incorporates different forms of representing fuzzy data. Despite this, the system is simple, owing to its use of XML to represent meta-knowledge. An additional advantage of using XML is that it makes the structure of imprecise information easy to maintain and understand. Aliança was designed to allow easy upgrading of traditional database systems. The fuzzy database architecture Aliança brings the interaction with databases closer to the usual way in which humans reason.

1 Introduction

Human beings are immersed in a sea of information. Our senses are continuously absorbing and processing external data. Most of this information is intrinsically vague or imprecise. We are able to process such imprecise data and, based on it, take actions that guide our daily activities and interactions with other

Raquel D. Rodrigues
Universidade Federal do Rio de Janeiro, IM and NCE, CCMN, Cx Postal: 2324, CEP: 20010-974, Cidade Universitária, Rio de Janeiro, Brazil
e-mail: raquel.defelippo@gmail.com

Adriano J. de O. Cruz
Universidade Federal do Rio de Janeiro, IM and NCE
e-mail: adriano@nce.ufrj.br

Rafael T. Cavalcanti
Universidade Federal do Rio de Janeiro, IM and NCE
e-mail: rafaelcavalcantiufrj@gmail.com

¹ A shorter version of this chapter appeared in Fuzzy Sets and Systems 160 (2009) 269–279.




human beings. Most pieces of information that the brain receives through our senses, such as images, sounds and tastes, are of an imprecise nature, and despite that fact we reason, plan, solve complex problems, think abstractly, communicate, learn from experience and so on. In short, we possess what can be understood as an important aspect of intelligence, which is evidently more than a large capacity for memorization or for solving complex mathematical operations precisely.

Everything in the Universe is in constant flux, and the borders between one state and another are fluid and vary continuously. However, science, and particularly computing science, is based on a logic system that deals only with two states: true or false [12]. There is no place for degrees of truth and imprecision. Computers process strings of ones and zeroes and are entirely based on a logic system in which every statement is true or false.

At the beginning of the twentieth century, vagueness and imprecision slowly started to emerge as important points to be considered in a wide range of scientific problems. The Heisenberg uncertainty principle, which states that certain pairs of physical properties, such as position and momentum, cannot be known to arbitrary precision, was a shock concerning the limits of physical knowledge. The logician Bertrand Russell identified vagueness at the level of symbolic logic. The concept of fuzziness comes from the multivalued logic studied at the beginning of the twentieth century [13]. In the 1920s the Polish mathematician Jan Lukasiewicz developed fundamental concepts of a multivalued logic. Finally, in 1965, Lotfi Zadeh, of the University of California at Berkeley, published the founding article "Fuzzy Sets", where for the first time the word "fuzzy" was used in place of "vague".

Fuzzy logic has been successfully applied to control and decision making systems, and many examples are available in the literature. Industries worldwide are embedding fuzzy logic in all sorts of products and services. For example, fuzzy logic has been used in the control of cement manufacture, in water purification processes and in management information systems. One of its most famous applications is the fuzzy control of the subway system in the Japanese city of Sendai, which opened in 1987 and was developed by Hitachi [21]. Consumer goods include television sets that adjust volume and contrast depending on noise level and lighting conditions, and fuzzy washing machines that select the optimal washing cycle on the basis of the quantity and type of dirt and the load size. Photo and video cameras use fuzzy logic to map image data to lens settings. Most car manufacturers use fuzzy logic in some of their components, such as anti-skid braking systems and fuel injection. In Japan, the term "fuzzy" came to be presented as synonymous with "efficient operation requiring minimal human intervention".

Since fuzzy logic expands the domain of information that computers are able to process, it is only natural that researchers seek ways to incorporate it into management systems and databases. The original database systems were designed for the efficient treatment of large quantities of precisely defined data. These databases try to model real world data using precise structures, but unfortunately we are surrounded by uncertainty and imprecise information. Human beings manipulate imprecise information very well and, in fact, we act based on such pieces of information. We decide what to buy based on information such as "The interest



rate is too high" or apply the car brakes because "The car is going too fast". These facts may be only partially true (or false), and computers are not prepared to process them. Fuzzy logic extends the computer processing domain, providing tools to operate on facts that are neither right nor wrong, but lie in a gray area.

Some of the most important and common operations in databases are queries for information. For example, one might ask, "Which male suspects are between 55 and 65 years old?". On many occasions it would be more important for the solution of a problem, or more reasonable given the available information, if we could ask the same question as a human would: "Which male suspects are about 60 years old?". The idea of "about" hints that if a suspect is 54 years old then he should still be included in the list of persons to be considered. The label "about" in a traditional database would not even be considered, because only precise numerical data is stored in the attribute age. The conclusion is that vital information is lost due to the lack of tools to treat fuzzy information. Imprecise information is important in many contexts and provides solutions that would not be obtained if only precise data were considered.

Considering the importance of treating this kind of information in an efficient way, fuzzy databases were developed. This new type of database can store, handle and respond to queries about vague and imprecise information in a very flexible way. However, there are not many examples of large fuzzy databases in production. Some of the reasons for this can be traced back to the costs of replacing or modifying costly legacy systems and the lack of efficient ways to incorporate fuzzy knowledge into traditional relational databases.

This chapter presents a fuzzy database architecture called Aliança (Alliance). This name reflects the fact that the system is the union of fuzzy logic techniques, a relational database management system and a fuzzy meta-knowledge base (FMB) defined in XML, brought together in order to handle and represent vague information [17]. Aliança was designed to easily incorporate fuzzy knowledge and also to allow easy upgrading of traditional database systems. The Aliança architecture can be used by old and new database systems, expanding their applications and the domain of the processed information.

2 Architecture Overview

In this section, we present the architecture of Aliança and discuss its basic elements and their relationships. The aim of this proposal is to provide a system that stores and handles imprecise information efficiently and that offers a simple path for upgrading traditional databases. The system uses a traditional database system to store information and a modified SQL language. The main goal of Aliança is to provide an efficient way to create and modify traditional databases so that they can incorporate fuzzy information. Previous proposals, such as the GEFRED model, created by [14] and improved by [7], which was a fusion of all previous proposals to represent fuzzy information,



presented problems with the way such information was described and incorporated into the database. Fuzzy databases need to incorporate extra information in order to process fuzzy data. For instance, when storing the information that someone is "young", the semantics of this label must be described so that it can be compared with other fuzzy attributes and also with the usual numeric information about age. In Aliança we call this aggregate of information used by the database system to handle fuzzy data the Fuzzy Meta-Knowledge Base (FMB).

Previous proposals kept the FMB as an extension of the system catalogue, so the FMB was organized as sets of tables or relations, in the same way as precise information. These tables are of a complicated nature, and their creation and maintenance requires considerable effort from database administrators and users. When the user issues a query that needs to retrieve fuzzy information from a database, the database management system needs to access these extra tables, possibly reducing performance. As will be described in section 4.1, in Aliança these extra tables are not necessary because the information is stored in XML files, which are easier to maintain and support. Aliança does not require the addition of extra tables, only the addition of extra columns to the original tables.

Figure 1 shows the general architecture of the fuzzy database architecture Aliança. The main modules of the system are:

• RDBMS (Relational Database Management System): This is a traditional relational database manager. Therefore, all fuzzy operations, or operations that involve fuzzy data, must be translated by the FSQL Server module into classical SQL operations before being sent to the RDBMS. Differently from previous proposals, for example FIRST (in GEFRED) [7], in Aliança the RDBMS does not have any direct relationship with the fuzzy meta-knowledge base and is unaware of its existence. From the RDBMS point of view, the only changes to the database are the

Fig. 1 General Fuzzy Database Architecture Aliança




Only the FSQL Server module knows how to put together the pieces needed to process a fuzzy query.
• DB (Database): The database, like any traditional relational database, is a collection of tables and the relationships among their data. However, Aliança expands the capabilities of traditional databases by allowing the storage of fuzzy information in its tables. This new kind of information is stored using a set of hidden attributes that, together with the FMB, define all relevant characteristics of the fuzzy data. It is important to observe that these hidden attributes are stored in the same tables, side by side with the traditional data, and are transparent to the end user. Therefore, no extra tables are required, which simplifies the incorporation of fuzzy information. This fact gives rise to one of the main advantages of Aliança: the ease of upgrading from a traditional database.
• FMB (Fuzzy Meta-Knowledge Base): The information necessary to define and describe data of a fuzzy nature is stored in the FMB. This information is organized in XML format, and only the FSQL Server can access it. The FMB does not store data, but information about the structure of the data stored in the tables of the database. For instance, the system must retrieve from the XML text files information such as the labels of fuzzy attributes and the parameters that define their semantics. A special kind of data stored in the FMB is the degree of similarity between concepts.
• FSQL Server: This is the main part of the system, because it handles the relationship between the FMB and the RDBMS. One of its objectives is to transform FSQL queries into traditional SQL so that the database management system can process them and return an answer. This server was developed in Java. The FSQL Server receives queries, identifies the fuzzy parts and searches the FMB for the meta-information describing them. Based on the results of this search, the FSQL Server constructs a classical SQL query that is submitted to the RDBMS.
• User’s Interface: The User’s Interface is a program that allows communication between the users and the FSQL Server. Users can submit queries and receive answers through the interface, which also checks for syntax errors.

It is important to note that the proposed architecture is not restricted to the relational model and can easily be extended to an object-oriented database. This is mainly because the Fuzzy Meta-Knowledge Base is not stored in tables but in a text-based XML document that is external to the relational database. Comparing the two ways of implementing databases, relational and object-oriented, we observe that while the relational model is based on tuples, the OO model deals directly with objects and their persistence. Therefore, it would be simple to give objects a treatment similar to the one presented here. In addition to the fuzzy attribute of the object, it would be necessary to add two other attributes, in much the same way as will be discussed for relational databases in section 4.1.




Besides that, the structure of the Object Query Language (OQL) is similar to the structure of SQL; therefore, the transformation of a fuzzy query into an object query would follow the same methodology presented in section 6.

3 Representation of Vague Knowledge in Aliança

In this section we discuss the different types of information that can be stored in the database and their forms of representation. Aliança accepts 8 different data types, each one receiving a numeric label from 0 to 7. This is a very rich set: it includes all information already stored in traditional databases, and it incorporates different forms of representing fuzzy data.

• Crisp Data: Crisp data is the usual precise data handled by traditional databases; from the RDBMS point of view it receives the same treatment as the imprecise data. This kind of data is classified as type 0 and does not need any additional information in the Fuzzy Meta-Knowledge Base (FMB). Strings, real and natural numbers and dates are examples of usual crisp data formats. Aliança acts as a traditional database when processing information of this type.
• Unknown (but applicable): An attribute gets the value Unknown when it may receive any value from its domain, but it is impossible to define exactly what its value is. This kind of data is classified as type 1. The type Unknown is represented by the possibility distribution $\{1/u, \forall u \in U\}$, where $U$ is the domain. Generally speaking, a possibility distribution $D$ can be defined through enumeration using the expression

$$D = \sum_i \mu_D(x_i)/x_i \qquad (1)$$

where the summation and addition stand for the union of the $(x_i, \mu_D(x_i))$ pairs and “/” is only a separating mark; for instance, $0.7/20 + 1/30$ denotes a distribution in which 20 has possibility 0.7 and 30 has possibility 1. Figure 2 shows the distribution of the type Unknown.

Fig. 2 Possibility distribution for the type Unknown





• Undefined (not applicable): An attribute gets the value Undefined when none of the values from the domain is applicable. This kind of data is classified as type 2. The type Undefined is represented by the possibility distribution $\{0/u, \forall u \in U\}$, where $U$ is the domain. Figure 3 shows this distribution.
• Null (absolute ignorance): An attribute gets the value Null when no information about it is available, either because we do not know it (Unknown) or because it is not applicable (Undefined). This kind of data is of type 3.
• Linguistic Label with a Possibility Distribution: When an attribute is associated with a vague value, it receives a linguistic label with a possibility distribution. This kind of data is of type 4, and it has an associated trapezoidal possibility distribution whose definition is stored in the Fuzzy Meta-Knowledge Base. Figure 4 shows an example of a linguistic label. Trapezoids are frequently used in fuzzy systems to represent vague values. Other functions, such as triangles and Gaussians, could be used; however, in Aliança it was decided to use trapezoids, which represent the semantics of vague concepts satisfactorily and are simple to manipulate.
• Possibility Interval [m, n]: This kind of data is of type 5 and is associated with an interval possibility distribution. It represents the fact that the only available knowledge about a value is that it lies within an interval, every point of which is equally possible. Figure 5 shows an example of a possibility interval. This kind of data also needs additional data stored in the FMB.
• Approximate Value (approximately d): If the value d is in the domain, the vague concept “approximately d” is defined by a triangular possibility distribution centred on d with a margin a, as shown in Figure 6. The margin indicates the degree of certainty available about the value of the attribute. This is the type 6 kind of data, and it also needs additional data stored in the FMB to define the margin used.
• Linguistic Label with Similarity: This kind of data is defined on a non-ordered domain, over which a similarity relationship is defined between the linguistic labels. The relation is represented by a table showing the strength of the relations between all pairs of values belonging to the domain. This is the type 7 kind of data, and it needs additional data stored in the FMB. Table 3 shows an example of a similarity relationship.

Fig. 3 Possibility distribution for the type Undefined




4 Fuzzy Meta-knowledge Base

As we have seen in section 3, some data types need additional information in order to be correctly manipulated. The fuzzy meta-knowledge base (FMB) contains, in an efficient and organized way, the additional information required by the system. Unlike what was proposed in FIRST (in GEFRED, by [14] and [7]), the Aliança database does not store its fuzzy meta-knowledge using tables and relations within the database. The FMB in Aliança is described in XML format, which makes it easier to understand and maintain. The information stored for each of the types previously presented is shown in Table 1. As can be seen from Table 1, data types 0, 1, 2 and 3 do not store any additional information in the FMB.

Table 1 Information stored in the FMB

Type of Data                                     Type  Information Stored
Crisp Data                                       0     None
Unknown                                          1     None
Undefined                                        2     None
Null                                             3     None
Linguistic Label with Possibility Distribution   4     Labels and their defining characteristics
Possibility Interval                             5     Minimum and maximum
Approximate Value                                6     Margin
Linguistic Label with Similarity                 7     Pairs (a, b) and the degree of similarity between a and b

Fig. 4 An example of a linguistic label for the concept “Small” (trapezoid with a=20, b=30, c=40, d=60)

4.1 Structure of the FMB

The fuzzy meta-knowledge base is where all the additional information necessary to handle the database transactions is stored. Aliança defines for the FMB a directory structure, where the root directory is named after the database.




Fig. 5 An example of a possibility interval (m=10, n=30)

The root directory contains one subdirectory for each table, and each subdirectory contains one XML file for each attribute. We will describe, through an example, the internal structure of the fuzzy fields and the way data related to the fuzzy attributes are stored in the FMB of Aliança. The example is a database called Real Estate that stores a relation containing information about apartments for sale in Copacabana, a beach district of Rio de Janeiro, Brazil. Figure 7 shows the FMB directory structure, while Table 2 lists the apartments used as examples.

Table 2 List of Apartments

Id  Bedrooms  Price      Size           Conservation
01  1         95000.00   33             Bad
02  2         600000.00  #140 ᵃ         Unknown
03  4         650000.00  Large          Regular
04  1         145000.00  Small          Regular
05  2         270000.00  78             Bad
06  4         800000.00  [130, 150] ᵇ   Bad
07  2         480000.00  Large          Excellent
08  3         360000.00  Unknown        Good

ᵃ The symbol # means “approximately”
ᵇ [m, n] is an interval and means between m and n

The description of the attributes of Table 2 is given below, according to the classification criteria used in Aliança for the different fuzzy types.

• Id: An integer serial numeric field automatically filled in by the DBMS; it is the table’s primary key.
• Bedrooms: The number of bedrooms of each apartment. It is a numeric field of crisp type.
• Price: The price of each apartment. It is an attribute amenable to fuzzy treatment, so it can be filled with values of the types presented in Table 1. This attribute needs extra information stored in the FMB. The value of the margin (type 6) was defined as 5000.



Fig. 6 An example of an approximate value (value 20 with margin 10)

Fig. 7 FMB structure

The definitions of the linguistic labels that use possibility distributions (type 4) are shown in Figure 8. It is therefore possible to store data according to the information available and, depending on the need or the possibility, users may reach conclusions based on this data. For instance, it is possible to store precise prices, should they be available, or imprecise prices, such as “the apartment of average price” or “the price is approximately R$ 500000.00”. For this kind of attribute, the ability to deal with fuzzy information may be very important during the negotiation process.
• Size: Stores the size of each apartment. It is also an attribute that can store fuzzy data, so all types defined in Table 1 can be used. The value of the margin for this attribute was defined as 5. The linguistic labels are presented in Figure 9.

Fig. 8 Definition of the labels for the attribute Price (Low: a=100, b=150, c=250, d=300; Average: a=250, b=400, c=550, d=700; High: a=600, b=750, c=850, d=1000; in R$ 1000.00)




Fig. 9 Definition of the labels for the attribute Size (Small: a=20, b=30, c=40, d=60; Medium: a=50, b=60, c=70, d=80; Large: a=70, b=90, c=120, d=150; in m²)

• Conservation: Stores the state of conservation of each apartment. It is another attribute that can be treated as a fuzzy quantity. Unlike the previous attributes, this one is defined by a similarity relation, which corresponds to type 7; this is the only label type that can be used for it. The linguistic labels and the similarities between all possible pairs of labels for the attribute Conservation are presented in Table 3. Note that this similarity relation must be symmetric and reflexive.

Table 3 Similarity relation sr defined over the attribute Conservation

sr(d, d')   Bad   Regular  Good  Excellent
Bad         1     0.8      0.5   0.1
Regular     0.8   1        0.7   0.5
Good        0.5   0.7      1     0.8
Excellent   0.1   0.5      0.8   1

The attributes Price and Size are defined over an ordered domain; therefore they can be filled with values of types 0, 1, 2, 3, 4, 5 and 6 from Table 1. In order to discuss the contents of the XML files, we start with the contents of the file Size.xml, shown in Example 1.

Example 1 (File Size.xml).

<?xml version="1.0"?>
<SIZE>
  <DOMAIN A="20" B="150" />
  <TYPE T="4">
    <LABELS>
      <SMALL A="20" B="30" C="40" D="60" />
      <MEDIUM A="50" B="60" C="70" D="80" />
      <LARGE A="70" B="90" C="120" D="150" />
    </LABELS>
  </TYPE>
  <TYPE T="5">
    <INTERVAL MIN="5" MAX="30" />
  </TYPE>
  <TYPE T="6">
    <MARGIN M="5" />
  </TYPE>
</SIZE>




It is important to note that tags belonging to the application domain, for instance the label SMALL, vary according to the attribute and its domain. Tags that define the data structure, such as DOMAIN and TYPE, are fixed. The XML text files are divided into sections, and each section defines a characteristic of the attribute, for example its domain. For an attribute of type 4, the fixed XML tags are related to the way type 4 fuzzy variables are defined. A fuzzy variable V is defined by a quadruple V = {N, L, D, S}, where N is the variable name, for instance Size; L is the set of labels that can be applied to the attribute (for Size, the labels are SMALL, MEDIUM and LARGE); D is the variable domain, or universe of discourse, which for sizes goes from 20 to 150 square meters; and S is the set of semantic rules that define the meaning of the labels. The file Size.xml also illustrates the way attributes of types 5 and 6 are described. The main items of the file are:

• Tag <DOMAIN A="20" B="150" />: Presents the domain [A, B] over which the fuzzy attribute is defined.
• Tag <TYPE T="4">: This section of the file defines the characteristics that represent the possibility distributions (type 4) of each label.
  – Tag <LABELS>: This subsection defines all the labels used by this attribute (see Figure 9), where A, B, C and D correspond, respectively, to a, b, c and d in Figure 4.
    · Tag <SMALL A="20" B="30" C="40" D="60" />: Presents the parameters that describe the trapezoidal distribution SMALL.
    · Tag <MEDIUM A="50" B="60" C="70" D="80" />: Presents the parameters that describe the trapezoidal distribution MEDIUM.
    · Tag <LARGE A="70" B="90" C="120" D="150" />: Presents the parameters that describe the trapezoidal distribution LARGE.
• Tag <TYPE T="5">: This section presents the characteristics of the possibility intervals.
  – Tag <INTERVAL MIN="5" MAX="30" />: Establishes the minimum and maximum possible sizes for an interval.
• Tag <TYPE T="6">: Presents the characteristics required to represent an approximate value.
  – Tag <MARGIN M="5" />: Defines the margin size M for approximate values as 5.

In a very similar way, the file Price.xml, shown in Example 2, stores the additional information for the attribute Price. Although the attribute Price (see Table 2) presents only crisp data, it has additional information in the fuzzy meta-knowledge base in order to make vague treatment possible.




Example 2 (File Price.xml).

<?xml version="1.0"?>
<PRICE>
  <DOMAIN A="500" B="4000" />
  <TYPE T="4">
    <LABELS>
      <LOW A="100000" B="150000" C="250000" D="300000" />
      <AVERAGE A="250000" B="400000" C="550000" D="700000" />
      <HIGH A="600000" B="750000" C="850000" D="1000000" />
    </LABELS>
  </TYPE>
  <TYPE T="5">
    <INTERVAL MIN="5000" MAX="20000" />
  </TYPE>
  <TYPE T="6">
    <MARGIN M="5000" />
  </TYPE>
</PRICE>

To present an example of a type 7 XML file, let us consider the attribute Conservation, which is defined over an unordered domain and can only be filled with data of this type. As explained before, the XML file must be called Conservation.xml and contains the information presented in Example 3.

Example 3 (File Conservation.xml).

<?xml version="1.0"?>
<CONSERVATION>
  <DOMAIN A="bad" B="regular" C="good" D="excellent" />
  <TYPE T="7">
    <LABELS>
      <BAD bad="1" regular="0.8" good="0.5" excellent="0.1" />
      <REGULAR bad="0.8" regular="1" good="0.7" excellent="0.5" />
      <GOOD bad="0.5" regular="0.7" good="1" excellent="0.8" />
      <EXCELLENT bad="0.1" regular="0.5" good="0.8" excellent="1" />
    </LABELS>
  </TYPE>
</CONSERVATION>

As in the previous examples, the tag DOMAIN presents the domain over which the fuzzy attribute is described; note that here the domain is discrete and composed of a set of labels:

• Tag <DOMAIN A="bad" B="regular" C="good" D="excellent" />: Presents the discrete domain over which the fuzzy attribute is described.

Since the attribute Conservation is based on similarities, it is necessary to describe the strength of each relation.

• Tag <TYPE T="7">: Presents the characteristics necessary to represent a linguistic label based on similarities.
  – Tag <LABELS>: Presents the similarity degrees among the labels.




For example, the line describing the tag BAD shows the degrees of similarity of this label to all the others; the same applies to the other lines:

• Tag <BAD bad="1" regular="0.8" good="0.5" excellent="0.1" />.

Given the definitions just discussed, Table 4 presents the real internal structure of Table 2. It is important to note that this internal representation is transparent to the user. To explain these new attributes, let us again consider the fuzzy attribute Size as an example. Since Size is defined over an ordered domain, it can be filled with values of types 0, 1, 2, 3, 4, 5 and 6 from Table 1. In order to take all these types into account, three new attributes are added to the database: SizeT, Size1 and Size2, as shown in Table 5. SizeT defines the type of the stored data; Size1 and Size2 receive information according to the type stored. For example, a value of type 6 (approximate value) requires one extra value, a margin.

Table 4 Internal representation of the relation Real Estate and the proposed extension

Id  BRsᵃ  Price   PriceT  Price1  Price2  Size        SizeT  Size1  Size2  Conservation  CTᵇ
01  1     95000   0       95000   -       33          0      33     -      $$Bad         7
02  2     600000  0       600000  -       #140        6      140    5      Unknown       1
03  4     650000  0       650000  -       $Large      4      -      -      $$Regular     7
04  1     145000  0       145000  -       $Small      4      -      -      $$Regular     7
05  2     270000  0       270000  -       78          0      78     -      $$Bad         7
06  4     800000  0       800000  -       [130, 150]  5      130    150    $$Bad         7
07  2     480000  0       480000  -       $Large      4      -      -      $$Excellent   7
08  3     360000  0       360000  -       Unknown     1      -      -      $$Good        7

ᵃ Bedrooms   ᵇ ConservationT

Table 5 Description of the new attributes used in the internal representation of fuzzy values

Size        SizeT  Size1  Size2  Description
33          0      33     -      33 is a crisp value, therefore SizeT = 0 and Size1 = 33
Unknown     1      -      -      The label Unknown is of type 1
Undefined   2      -      -      The label Undefined is of type 2, therefore SizeT = 2
Null        3      -      -      The label Null is of type 3
$Small      4      -      -      $Small is a trapezoidal label, therefore SizeT = 4; the remaining information is stored in the file Size.xml
[130, 150]  5      130    150    [130, 150] is of interval type, therefore SizeT = 5, Size1 = 130 and Size2 = 150
#140        6      140    5      #140 is of approximate type, therefore SizeT = 6, Size1 = 140 and Size2 holds the margin

Let us now consider the fuzzy attribute Conservation. Since this attribute is defined over an unordered domain, it can only be filled with values of types 1, 2, 3 and 7. This kind of attribute implies the addition of only one extra column,




called ConservationT. All these new attributes are used when translating a fuzzy query into a traditional query. The use of a structure of directories and XML files facilitates the creation and maintenance of fuzzy databases.
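As a minimal illustration of how such XML descriptions can be consumed, the sketch below loads the type 4 labels of an attribute using the standard Java DOM API. This is our own sketch under the quoted-attribute form of the examples above, not code from Aliança; the class and method names are hypothetical.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

public final class FmbReader {
    // Prints each trapezoidal label (a, b, c, d) found in a TYPE T="4" section,
    // e.g. for the file FMB/RealEstate/Size.xml shown in Example 1.
    public static void printLabels(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);
        NodeList types = doc.getElementsByTagName("TYPE");
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            if (!"4".equals(type.getAttribute("T"))) continue;  // only possibility labels
            NodeList labels = type.getElementsByTagName("LABELS").item(0).getChildNodes();
            for (int j = 0; j < labels.getLength(); j++) {
                if (!(labels.item(j) instanceof Element)) continue;
                Element label = (Element) labels.item(j);
                System.out.printf("%s: a=%s b=%s c=%s d=%s%n",
                        label.getTagName(), label.getAttribute("A"),
                        label.getAttribute("B"), label.getAttribute("C"),
                        label.getAttribute("D"));
            }
        }
    }
}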

4.2 Converting from Classical to Fuzzy Databases

As the previous examples show, in order to store and process fuzzy information it is necessary to create a series of XML files that store descriptions of the attributes. Besides that, new columns that complete the definition of the data must be added to the tables. One of the main advantages of the Aliança architecture is that these modifications are easy to implement and maintain, because the XML files are simple descriptions of the attributes. They can be created automatically or directly by the user, and can easily be extracted from the graphs of the functions that define the fuzzy attributes. The addition of the new columns and the change of the type of data stored in the table are also simple to implement and do not require major modifications of the database. It is important to remember that in Aliança a primary key cannot be a fuzzy attribute.

As an example of the SQL commands required to modify a table, consider the column Size of the relation Real Estate shown in Table 2. Let us assume that this column originally stored the size of the apartment as a number. The attribute needs to be modified so that it can store fuzzy data, with the final internal representation shown in Table 4. Modifying the column Size takes two steps, which can be executed in either order. One step is to add three new columns (SizeT, Size1 and Size2) of type INT. The other is to modify the column so that it can store values like $LARGE and #140; one possible way is to make it a char-type column. The set of commands to add the three new columns would be:

ALTER TABLE Real Estate ADD COLUMN SizeT INT AFTER Size;
ALTER TABLE Real Estate ADD COLUMN Size1 INT AFTER SizeT;
ALTER TABLE Real Estate ADD COLUMN Size2 INT AFTER Size1;

To modify the column so that it stores char-type values, the command could be the following:

ALTER TABLE Real Estate MODIFY COLUMN Size CHAR(20);

This is a simple example, but it shows the main steps required by the Aliança architecture to incorporate fuzzy attributes into traditional databases. The required modifications are not very extensive and do not create complex structures of new tables to represent the semantics of the fuzzy knowledge.
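To illustrate how rows of Table 4 would then be stored, the statements below are hypothetical inserts (using MySQL-style backquotes because of the space in the table name); the column values are taken from Table 4.

-- Apartment 03: size given as the linguistic label $Large (type 4);
-- its semantics live in Size.xml, so Size1 and Size2 remain NULL.
INSERT INTO `Real Estate`
  (Id, Bedrooms, Price, PriceT, Price1, Size, SizeT, Conservation, ConservationT)
VALUES (3, 4, '650000', 0, 650000, '$Large', 4, '$$Regular', 7);

-- Apartment 02: size given as "approximately 140" (type 6),
-- so Size1 holds the value and Size2 the margin.
INSERT INTO `Real Estate`
  (Id, Bedrooms, Price, PriceT, Price1, Size, SizeT, Size1, Size2, Conservation, ConservationT)
VALUES (2, 2, '600000', 0, 600000, '#140', 6, 140, 5, 'Unknown', 1);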




5 FSQL Server

The FSQL Server is the main module of the system, because it is responsible for the relationship between the RDBMS and the FMB. One of its objectives is to transform the FSQL queries it receives into classical SQL. The functions of the modules of which the Aliança proposal is made can be explained by following the six steps indicated in Figure 10.

Fig. 10 Six steps for execution of a query in FSQL

1. The User Interface module receives queries written in FSQL and sends them to the FSQL Server;
2. The Parser sub-module in the FSQL Server receives the FSQL queries and performs lexical and syntactical analysis on their contents. This process verifies their correctness and extracts the tokens;
3. The identified tokens trigger the XML Interpreter module, which accesses the FMB in order to search for the meta-information relative to the tokens;
4. The search results are sent to the Translator module, which assembles the classical SQL query equivalent to the original FSQL query submitted to the system by the user;
5. The classical queries can then be sent to the RDBMS, which processes them and obtains the results;
6. Finally, these results are shown to the user by the User Interface module.

The server was written in Java and uses JavaCC [18] to generate the FSQL parser. The resulting query compares the given fuzzy type to all types stored in the database. It is important to note that this query is composed of n queries, where n is the number of different types shown in section 3. Besides that, each of these queries can be expanded according to the membership functions used to represent the fuzzy values.
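The outline below sketches how these six steps could fit together in code. All class and method names are hypothetical and the bodies are stubs; the actual server relies on a JavaCC-generated parser rather than the naive tokenization shown here.

import java.util.List;

public final class FsqlServerSketch {
    public static void main(String[] args) {
        String fsql = "SELECT Id FROM Real Estate WHERE Price FEQ $High (0.55)";
        List<String> tokens = parse(fsql);      // step 2: lexical/syntactic analysis
        String meta = lookupFmb(tokens);        // step 3: meta-information from the FMB
        String sql = translate(tokens, meta);   // step 4: assemble the classical query
        System.out.println(sql);                // steps 5-6: submit to the RDBMS, show results
    }

    private static List<String> parse(String q) { return List.of(q.split("\\s+")); }

    private static String lookupFmb(List<String> tokens) {
        // Would read Price.xml; here the trapezoid of High is hard-coded.
        return "HIGH A=600000 B=750000 C=850000 D=1000000";
    }

    private static String translate(List<String> tokens, String meta) {
        return "SELECT ... ;  -- one subquery per data type, see Section 6";
    }
}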




6 FSQL - Fuzzy SQL

FSQL assumes a conventional database and adds external tools to handle fuzzy data. This approach is also used by [1], [7] and [11]. FSQL introduces new features to facilitate the handling of imprecise information. FSQL is based on similarity relations [22] and the theory of possibility introduced by Zadeh [23]. We now describe some of the new operators and characteristics added to FSQL in order to facilitate the construction of database queries.

• Linguistic Labels: When an attribute receives a linguistic label, the label is preceded by a symbol that makes its identification easy. There are two kinds of linguistic labels: labels with possibility distributions and labels with similarity relations. Labels with possibility distributions are preceded by the symbol $, and labels with similarity relations by $$.
• Possibility Intervals [m, n]: They are represented by two numerical values m and n between brackets, [m, n].
• Approximate Values: They are preceded by the symbol #. For example, #10 expresses “approximately 10”.
• Acceptance Degree: For every fuzzy condition used in a query, a degree similar to an alpha cut can be defined [13]. This degree appears between parentheses after the condition. It defines a threshold below which answers will not be presented to the user, which improves performance and produces uncluttered output.
• Fuzzy Comparators: Besides the classical comparison operators (>, <, =, ...), the proposed FSQL offers fuzzy comparators (FEQ - possibly fuzzy equal, FGEQ - possibly fuzzy greater than or equal, NFEQ - necessarily fuzzy equal, NFLEQ - necessarily fuzzy less than or equal, ...). Some of these comparators are described in the next section in order to explain how the translation to usual SQL is done.

6.1 Fuzzy Comparators

In this section we present some of the fuzzy comparators as examples of how Aliança processes comparisons between all sorts of data, including fuzzy data. It is important to note that type 7 (Linguistic Labels with Similarity) is a special case: only the equality comparator applies to it, and it can only be compared to other type 7 elements.

1. Possibly equal to (FEQ): This operator models the concept of “possibly equal to” for data of fuzzy nature. The degree to which X is possibly equal to Y is formally defined by:

$$\mu_{FEQ}(X,Y) = \sup_{x_i \in \Omega} \left[ \min\left(X(x_i), Y(x_i)\right) \right] \qquad (2)$$




where Ω is the universe of discourse of both fuzzy data X and Y. Figure 11 presents graphically an example of how the degree of equality between two linguistic labels X and Y can be obtained. To obtain the degree of similarity between type 7 data, it suffices to look it up in the similarity matrix:

$$\mu_{FEQ}(a,b) = s_r(a,b) \qquad (3)$$

where a and b are type 7 data and $s_r(a,b)$ is their degree of similarity.

2. Possibly greater than or equal to (FGEQ): This operator models the concept of “possibly greater than or equal to” for fuzzy data. Its membership function, the degree to which X is possibly greater than or equal to Y, is formally described by equation (4):

$$\mu_{FGEQ}(X,Y) = \sup_{x_i \in \Omega} \left[ \min\left(X(x_i), \geq(Y)(x_i)\right) \right] \qquad (4)$$

where Ω is the universe of discourse of the fuzzy data X and Y, which may be of types 0 to 6, and ≥ is the extended operator “greater than or equal to”. To explain this operator, let us consider ≥(X) for a generic trapezoidal function X. The operator is described by equation (5),

Fig. 11 Example of the operator FEQ

Fig. 12 Extended comparator ≥(Xtrap)




where Ω is the universe of discourse of Xtrap and xi ∈ Ω; its membership function is shown in Figure 12:

$$\geq(X_{trap}) = \mu_{\geq(X_{trap})}(x_i) = \begin{cases} 0, & \text{if } x_i \leq a \\ \dfrac{x_i - a}{b - a}, & \text{if } a < x_i < b \\ 1, & \text{if } x_i \geq b \end{cases} \qquad (5)$$

Now, let us consider that we need to find whether X is greater than Y. First we compute the comparator ≥(Y) as defined in equation (5); the next step is to obtain the final result as defined in equation (4). Figure 13 presents an approximate value X and a linguistic label Y used as an example, while Figure 14 shows graphically how we obtain the membership degree to which the approximate value X is possibly greater than or equal to the linguistic label Y.

3. Possibly less than or equal to (FLEQ): This operator models the concept of “possibly less than or equal to” for fuzzy data. Its membership function, the degree to which X is possibly less than or equal to Y, is formally described as

$$\mu_{FLEQ}(X,Y) = \sup_{x_i \in \Omega} \left[ \min\left(X(x_i), \leq(Y)(x_i)\right) \right] \qquad (6)$$

where Ω is the universe of discourse of the fuzzy data X and Y, which may be of types 0 to 6, and ≤ is the extended operator “less than or equal to”.

Fig. 13 Approximate value X and linguistic label Y




Fig. 14 μFGEQ(X,Y)

This operator works in a very similar way to the FGEQ operator. To present a different kind of example, Figure 15 shows a precise value X and a possibility interval Y to be compared, and Figure 16 shows graphically how the membership degree to which the precise value X is possibly less than or equal to the possibility interval Y is obtained.

Fig. 15 Precise value X and possibility interval Y

Fig. 16 μFLEQ(X,Y)




4. Possibly greater than and Possibly less than (FGT and FLT): These operators are used to make direct comparisons between fuzzy data. The degree to which X is possibly greater than Y is formally defined as:

$$\mu_{FGT}(X,Y) = 1 - \mu_{FLEQ}(X,Y) \qquad (7)$$

Similarly, the degree to which X is possibly less than Y can be obtained by the equation:

$$\mu_{FLT}(X,Y) = 1 - \mu_{FGEQ}(X,Y) \qquad (8)$$
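As a concrete illustration of equation (2), the sketch below approximates μFEQ for two trapezoidal distributions by sampling the shared domain; this discretization is our own simplification for illustration, not necessarily how Aliança evaluates the comparator.

public final class FeqSketch {
    // Trapezoidal membership with the (a, b, c, d) convention of Figure 4.
    static double trapezoid(double x, double a, double b, double c, double d) {
        if (x <= a || x >= d) return 0.0;
        if (x < b) return (x - a) / (b - a);  // rising edge
        if (x <= c) return 1.0;               // plateau
        return (d - x) / (d - c);             // falling edge
    }

    public static void main(String[] args) {
        // mu_FEQ(MEDIUM, LARGE) over the Size domain [20, 150], unit steps:
        // MEDIUM = (50, 60, 70, 80), LARGE = (70, 90, 120, 150) from Size.xml.
        double sup = 0.0;
        for (double x = 20; x <= 150; x += 1.0) {
            double deg = Math.min(trapezoid(x, 50, 60, 70, 80),
                                  trapezoid(x, 70, 90, 120, 150));
            sup = Math.max(sup, deg);
        }
        System.out.printf("mu_FEQ(MEDIUM, LARGE) = %.2f%n", sup);  // about 0.3
    }
}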

6.2 FSQL Examples

To illustrate the use of FSQL we present Examples 4 and 5. Both use the relation Real Estate shown in Table 2.

Example 4. Find all expensive apartments (with degree 0.55).

SELECT Id, Bedrooms, Price
FROM Real Estate
WHERE Price FEQ $High (0.55)

When the query from Example 4 is submitted to Aliança, the six steps described in Figure 10 are executed. In the second step, when the fuzzy operator FEQ is discovered, the data type that follows it is also identified; in this example it is the fuzzy quantity High (type 4). During the third step, the parameters that define the membership function of the linguistic label High are read from the file Price.xml. For the classical SQL query to give a correct result, it must be assembled so that it compares all types accepted by the attribute Price (types 0 to 6) to the value High (type 4).

Example 5. Select the number of bedrooms, price and conservation of the apartments whose size is possibly between approximately 100 square meters and the size Large, with an acceptance degree of at least 0.7.

SELECT Id, Bedrooms, Price, Conservation
FROM Real Estate
WHERE Size FGEQ #100 AND Size FLEQ $Large (0.7)

When the query in Example 5 is submitted to Aliança, it is broken into two queries (one for each condition). Each query is treated separately, and the six steps described in Figure 10 are executed; after that, a join is made and the final result is presented to the user. An acceptance degree of 0.7 implies that values with a smaller degree will not be returned to the user who issued the query.
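The chapter does not show the SQL emitted by the Translator, so the fragment below is only a plausible sketch of the type 0 branch of the query generated for Example 4: crisp prices (PriceT = 0) matched against the trapezoid of High taken from Price.xml (a=600000, b=750000, c=850000, d=1000000), with the 0.55 threshold applied. The full translation would contain analogous branches for types 1 to 6.

SELECT Id, Bedrooms, Price
FROM `Real Estate`
WHERE PriceT = 0
  AND (CASE
         WHEN Price1 <= 600000 OR Price1 >= 1000000 THEN 0
         WHEN Price1 <  750000 THEN (Price1 - 600000) / 150000.0
         WHEN Price1 <= 850000 THEN 1
         ELSE (1000000 - Price1) / 150000.0
       END) >= 0.55;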




7 Bibliographical and Historical Notes

Fuzzy logic techniques have been applied to the database and information retrieval areas for years. One of the first proposals for the treatment of imprecise information in databases was presented by Codd [4] and further developed in [5] and [6], although the model did not use fuzzy logic; it proposed the use of the value NULL to indicate that an attribute can take any value of the domain. The model presented by Buckles and Petry [2, 3] used the similarity measure defined by Zadeh [22]. The proposal of Prade and Testemale [16] went further, allowing attributes to receive fuzzy values; attributes with precise and partial (imprecise and unknown) values were represented by the possibility distributions proposed by Zadeh [23]. One of the first fuzzy relational database models was presented by Umano and Fukami [20]. This proposal also used possibility distributions to represent knowledge about the information, in a way similar to the Prade-Testemale model. The model presented by Zemankova and Kandel [24] created a structure to represent imprecise information and to manipulate uncertainty or vagueness in the query language; this proposal uses a new measure called the certainty measure. The GEFRED model (a GEneralised model for Fuzzy RElational Databases), described in [14] and improved by Galindo [7], is a fusion of the main proposals that existed at the time for manipulating and storing imprecise information; this model ([8], [9], [15]) therefore has the advantage of incorporating all these previous ideas within its framework. Turowski and Weng [19] presented a formal syntax, based on DTDs, to describe fuzzy data types used by business systems. The work of Gaurav and Alhajj [10] describes how to map data from a fuzzy relational database into an XML document; their goal was limited to publishing fuzzy data on the Web as XML documents.

References

[1] Bosc, P., Duval, L., Pivert, O.: An initial approach to the evaluation of possibilistic queries addressed to possibilistic databases. FSS 140, 151–166 (2003)
[2] Buckles, B.P., Petry, F.E.: A fuzzy representation of data for relational databases. FSS 7, 213–226 (1982)
[3] Buckles, B.P., Petry, F.E.: Extending the fuzzy databases with fuzzy numbers. Information Sciences 34(2), 145–155 (1984)
[4] Codd, E.F.: Extending the database relational model to capture more meaning. ACM Transactions on Database Systems 4(4), 397–434 (1979)
[5] Codd, E.F.: Missing information (applicable and inapplicable) in relational databases. ACM SIGMOD Record 15(4), 53–53 (1986)
[6] Codd, E.F.: More commentary on missing information in relational databases (applicable and inapplicable information). ACM SIGMOD Record 16(1), 42–50 (1987)
[7] Galindo, J.G.: Tratamiento de la Imprecisión en Bases de Datos Relacionales: Extensión del Modelo y Adaptación de los SGBD Actuales. PhD thesis, Universidad de Granada, España (1999)
[8] Galindo, J.G., Aranda, M.C., Caro, J.L., Guevara, A., Aguayo, A.: Applying fuzzy databases and FSQL to the management of rural accommodation. Tourism Management 24(4), 457–463 (2003)




[9] Galindo, J.G., Medina, J.M., Aranda-Garrido, M.C.: Fuzzy division in fuzzy relational databases: An approach. FSS 121(3), 471–490 (2001)
[10] Gaurav, A., Alhajj, R.: Incorporating fuzziness in XML and mapping fuzzy relational data into fuzzy XML. In: Proceedings of the 2006 ACM Symposium on Applied Computing, pp. 456–460. ACM, New York (2006)
[11] Kacprzyk, J., Zadrozny, S.: Computing with words in intelligent database querying: standalone and internet-based applications. Information Sciences 134, 71–109 (2001)
[12] Kosko, B.: Fuzzy Thinking: The New Science of Fuzzy Logic. Harper and Collins, USA (1994)
[13] Kosko, B.: Fuzzy Engineering. Prentice-Hall, New Jersey (1997)
[14] Medina, J.M.: Bases de Datos Relacionales Difusas. Modelo Teórico y Aspectos de su Implementación. PhD thesis, Universidad de Granada, España (1994)
[15] Medina, J.M., Vila, M.A., Cubero, J.C., Pons, O.: Towards the implementation of a generalized fuzzy relational database model. FSS, 273–289 (1995)
[16] Prade, H., Testemale, C.: Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences 34, 115–143 (1984)
[17] Rodrigues, R.D., de Oliveira Cruz, A.J., de Siqueira, T.C.R.: Aliança: A proposal for a fuzzy database architecture incorporating XML. Fuzzy Sets and Systems 160, 269–279 (2009)
[18] JavaCC. java.net web site, https://javacc.dev.java.net/ (retrieved September 22, 2009)
[19] Turowski, K., Weng, U.: Representing and processing fuzzy information - an XML-based approach. Knowledge-Based Systems 15, 67–75 (2002)
[20] Umano, M., Fukami, S.: Fuzzy relational algebra for possibility-distribution-fuzzy-relational model of fuzzy data. Journal of Intelligent Information Systems 3, 7–28 (1994)
[21] Yasunobu, S., Miyamoto, S., Ihara, H.: Fuzzy control for automatic train operation system. In: Proceedings of the 4th IFAC Conference on Control in Transportation Systems, Baden-Baden, Germany, pp. 33–39 (1983)
[22] Zadeh, L.A.: Similarity relations and fuzzy orderings. Information Sciences 3(2), 177–200 (1971)
[23] Zadeh, L.A.: Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems 1, 3–28 (1978)
[24] Zemankova, M., Kandel, A.: Implementing imprecision in information systems. Information Sciences 37, 107–141 (1985)



Leveraging Semantic Approximations in Heterogeneous XML Data Sharing Networks: The SUNRISE Approach

Federica Mandreoli, Riccardo Martoglia, Wilma Penzo, Simona Sassatelli, and Giorgio Villani

Abstract. In recent years, the huge amount of data available from Internet information sources has focused much attention on the sharing of distributed information through P2P and, in line with the Semantic Web vision, through Peer Data Management Systems (PDMSs). On the other hand, XML is without doubt the most popular data representation and exchange format on the Web, and more and more Internet applications are conforming to this de facto standard for data sharing. In this chapter we present SUNRISE (System for Unified Network Routing, Indexing and Semantic Exploration) for XML data sharing.

Federica Mandreoli
DII - University of Modena and Reggio Emilia, via Vignolese 905/b - 41125 Modena - Italy
IEIIT - BO/CNR, viale Risorgimento 2, 40136 Bologna - Italy
e-mail: federica.mandreoli@unimore.it

Riccardo Martoglia
DII - University of Modena and Reggio Emilia, via Vignolese 905/b - 41125 Modena - Italy
e-mail: riccardo.martoglia@unimore.it

Wilma Penzo
DEIS - University of Bologna, viale Risorgimento 2 - 40136 Bologna - Italy
IEIIT - BO/CNR, viale Risorgimento 2, 40136 Bologna - Italy
e-mail: wilma.penzo@unibo.it

Simona Sassatelli
DII - University of Modena and Reggio Emilia, via Vignolese 905/b - 41125 Modena - Italy
e-mail: simona.sassatelli@unimore.it

Giorgio Villani
DII - University of Modena and Reggio Emilia, via Vignolese 905/b - 41125 Modena - Italy
e-mail: giorgio.villani@unimore.it

This work is partially supported by the Italian Council co-funded Project NeP4B.





SUNRISE is a complete PDMS infrastructure aiming at semantic interoperability in heterogeneous networks. Decentralized data sharing is supported by a set of autonomous peers which model their local data through schemas and which are locally connected through semantic mappings. SUNRISE leverages the semantic approximations originating from schemas’ heterogeneity for an effective and efficient organization and exploration of the network. For these purposes, SUNRISE implements soft computing techniques which cluster peers in Semantic Overlay Networks according to their own contents, and promote the routing of queries towards the semantically best directions in the network.

1 Overview and Motivation

In recent years, the enormous success of the Internet has stressed the importance of a general agreement on the format for data exchange. For this purpose, XML proved to be a widely accepted standard, both for its flexible machine-readable form and for the semantic support it provides for data representation. This last feature, in line with the Semantic Web vision [7], has led XML to be extensively and successfully used by several applications dealing with semantically rich data. Its applicability is primarily evident in distributed realities, where the actors are heterogeneous data sources which interact with each other for data sharing purposes. This is exactly the scenario envisioned by Peer Data Management Systems (PDMSs) [21], a recent evolution of peer-to-peer (P2P) systems towards a more semantics-based description of peers’ contents and relationships. In a PDMS, peers are autonomous sources which model their local data according to a schema and are connected in a peer-to-peer network by means of pairwise semantic mappings between the peers’ own schemas. Because there is no common understanding of the vocabulary used at each peer’s schema, the semantic mappings established between the peers implement a decentralized schema mediation [21]. Hence, joining a PDMS is inherently a more heavyweight operation than joining a P2P file-sharing system, since some semantic relationships need to be specified. Several application scenarios can benefit from a PDMS architecture, ranging from ad hoc sharing scenarios of collaborative networks to ones in which the membership changes infrequently or is restricted due to security or consistency requirements [23]. In these contexts peers are likely to stay available the majority of the time, while still being able to join (or add new data) very easily.

One of the main challenges in such a semantically heterogeneous environment concerns query processing. A query is routed through the network by means of a sequence of reformulations, according to the semantic mappings encountered along the routing path. As reformulations may lead to semantic approximations, thus inducing information loss, the linkage closeness of a given peer to semantically similar peers is a crucial issue. This matter has also been evidenced recently by works on Semantic Overlay Networks (SONs) [12] for P2P systems, where peers with semantically similar content are clustered together in logical subnetworks. The main aim of a SON is to improve the efficiency of query processing by limiting the number of contacts only to relevant peers. Nevertheless, in a more complex environment like




the PDMS one, SON principles substantially yield a network organization which reduces the semantic degradation due to the traversal of irrelevant peers [30]. However, the problem of answering queries efficiently and effectively is only partially solved by simply relying on a carefully designed network organization. Indeed, an important aspect to be considered is that a PDMS underlies a potentially very large network able to handle huge amounts of data. In this context, any relevant peer may add new answers to a given query, and different paths to the same peer may yield different answers [21]. Also for this reason, SONs would largely benefit from a support for query routing, i.e. a mechanism for selecting a small subset of relevant peers to forward a query to. In particular, in a semantically rich context like XML data sharing, a query posed over a given peer should be forwarded to the most relevant peers that offer semantically related results, first among its immediate neighbors, then among their immediate neighbors, and so on.

In this chapter we present an instantiation of SUNRISE (System for Unified Network Routing, Indexing and Semantic Exploration) for XML data sharing. SUNRISE is a complete PDMS infrastructure which enables efficient and effective semantic data sharing among peers, and whose main strength is the decoupling of the data sharing support platform from the specific formats of the data to be shared. In particular, SUNRISE offers specific functionalities to peers in the following stages which characterize a PDMS’s life:

Network construction. Techniques and index structures for selecting the best SONs to join, as well as for efficiently locating the semantically closest neighbors to connect to, are provided for each peer entering the network. This is achieved thanks to a suite of protocols and algorithms for managing the update and evolution of the infrastructure in an incremental fashion;

Network exploration. Routing algorithms and a specifically devised indexing mechanism are at peers’ disposal for wise query answering which selectively locates the most relevant peers to be contacted.

The overall process of network management and usage is semantics-driven, in that it is founded on the semantic approximations originating from the heterogeneity of the peers’ schemas. These semantic approximations are interpreted in a fuzzy-oriented setting which models the imprecision of schema mappings and uses this notion to summarize the semantic vagueness of entire subnetworks. Leveraging our previous works [30, 32, 33, 34, 35, 36, 39], the contribution of this chapter is twofold. Primarily, we present a PDMS infrastructure for XML data sharing as an instantiation of the SUNRISE system. Since the SUNRISE architecture has been conceived to be completely independent from the specific format of the data to be shared, the infrastructure we present in this chapter is founded on the implementation of ad hoc software modules which specifically deal with the peculiarities of managing XML data. As a further contribution, we present the fuzzy foundations which model the semantic misalignment originating from schema heterogeneity between peers in a PDMS, and we show how the presence of such semantic approximations can indeed be exploited for a semantics-driven, effective and efficient network organization and query routing.




We start by introducing the basics of PDMSs, together with the related query processing issues in this context, and discuss related work (Section 2). We then provide an overview of how the system works (Section 3) and of the fuzzy setting it implements (Section 4). Afterwards, we go into the details of the SUNRISE architecture by presenting a thorough description of the modules composing the system (Section 5). Then, we discuss a rich set of experiments we conducted with SUNRISE, demonstrating the benefits provided by the network construction and exploration techniques (Section 6). We finally conclude and discuss the future research directions we intend to follow (Section 7).

2 Related Work

PDMSs are a recent evolution of original P2P systems, synthesizing the semantic expressiveness of the database world and the flexibility of P2P networks. They intend to offer a decentralized and easily extensible architecture for advanced data management, in which at any time every node can act freely on its own data while accessing data stored by other participants. The core idea is to allow peers to use different schemas to represent their own data, and to manage this heterogeneity by means of a semantic mediation layer which relates data above the actual overlay network. The most successful PDMSs, such as Hyperion [4] or Piazza [21, 22], base their data interoperability on the concept of semantic mappings, which represent the correspondences between semantically equivalent schema portions. This is crucial for query processing: a query posed at a given peer is usually answered with the local data but, most importantly, it may be propagated through the network to retrieve further useful information possibly owned by other peers. In this context, semantic mappings allow the query to be appropriately reformulated in order to be compatible with the different schemas.

Query reformulation is thus a fundamental task, since the ability to obtain relevant data from another node in the network depends on the existence of a semantic path of mappings from the queried peer to that node [50]. The semantic path needs to relate the terms used in the query to the terms specified by the node providing the data. Because of the heterogeneity of the peers’ local schemas, each reformulation performed in the traversal of the semantic path may lead to semantic approximations, thus possibly inducing information loss because of missing (or incomplete) mappings. This issue has been partially considered in the literature, e.g. in [13, 44, 50, 54]. Although Edutella [44] allows peers to autonomously define their local schemas, its schema mediation perspective is limited to the reconciliation of different data formats, without properly considering semantics. GridVine [13] instead addresses the problem of piecing together peer mappings to traverse a sequence of semantically related but syntactically heterogeneous schemas; however, it does not distinguish among the available paths as to the answers they may yield. This aspect is considered for efficiency purposes in Piazza [50], where semantic paths containing redundant reformulations are pruned, and in the Stanford Peers project [54], whose aim is to retrieve quality results in terms of expected number of answers, response time, and so on.




As opposed to viewing semantic misalignment between peers as a burden, we leverage the presence of semantic approximations as a means for effective and efficient network organization and query routing. Being able to improve the retrieval of the required data while maintaining a high degree of node autonomy is exactly the main goal pursued by one of the most recent evolutions in network organization strategies: Semantic Overlay Networks (SONs). SONs represent a further evolution in the management of the semantic mediation layer: the key idea is to exploit self-organization principles in order to “cluster” together semantically similar nodes and, thus, similar contents. There are several proposals dealing with different aspects of SONs. [12] proposed the original SON idea, which is based on a classification method for assigning peers to SONs through hierarchies shared in the network. Other works address the problem of building SONs. The papers [1, 9, 29] adopt gossip-based membership protocols to derive neighborhoods; SONs then correspond to the sets of peers which are logically interconnected. In the schema-based P2P scenario considered in [1], semantic gossiping is applied to derive as neighbors the peers that have annotated their data according to the same schema. A similar approach is also adopted in [9], where the resulting unstructured network coexists with a DHT-based structured one in a hybrid topology. In [29], a metric distance for Language Models is adopted to compare the peers’ local collections of documents in order to decide whether two peers should become neighbors. Other works [6, 17, 28, 52] are founded on clustering principles in order to address the issue of how to actually create SONs of autonomous peers for efficient IR search. In [17] the clustering method is guided by the description of peers in terms of processing power, storage capacity, etc., and the network of peers results in a hierarchical organization. Instead, [6, 28] propose to cluster together peers sharing objects with similar representations, maintaining intra- and inter-cluster connections. In [52], the main focus is to ensure load balancing in a network where nodes are logically organized into a set of clusters and where a set of super-peer nodes maintains information about which documents are stored by which cluster nodes. In [45] a different perspective is considered: the approach relies on caching high-quality peers to be remembered as “friends” to connect to. Semantic similarity between peers is measured following an IR-style approach based on term frequency distributions; querying histories and the levels of trust of each participant are also considered when establishing peer connections.

However, the problem of answering queries efficiently is only partially solved by simply relying on a carefully designed network organization. Indeed, SONs would largely benefit from a support for query routing, i.e. a mechanism for selecting a small subset of relevant peers to forward a query to. While simple P2P systems basically make use of inefficient flooding techniques (e.g. Gnutella), much research work in the P2P area has focused on this issue with the aim of cutting off the negative effects of query flooding, which both overwhelms the network with messages and often returns lots of irrelevant results [10, 11, 14, 20, 25, 27, 41, 43, 49, 51]. Some of these works discuss id/keyword-based search of documents [10, 41, 49], some assume a common vocabulary/ontology shared by peers in the network




[11, 14, 20], some address the scalability of query routing by means of a properly tailored super-peer topology for the network [43], or by adapting their own semantic topology according to the observation of query answering [51]. Most of these proposals are based on IR-style and machine-learning techniques exploiting quantitative information about the retrieved results [10, 11, 25, 27, 41, 51]. Basically, they utilize measures that rely on keyword statistics, on the probability of keywords appearing in documents, on the number of documents that can be found along a path of peers, or on caching/learning from the number of results returned for a query. However, all these methods fail to take into account effectively the presence of heterogeneous semantic knowledge about the contents of the peers in the network: for instance, in [25] peers are assumed to share the same set of keywords they store data about, whereas in [51] each peer approximates the query concepts with any concept (through the use of wildcards), thus disregarding the semantic similarity between them. Other approaches (such as [20]) do exploit such semantic information. However, all these works (except [11]) provide routing techniques which either assume distributed indices that are in fact conceptually global [41, 49], or support completely decentralized search algorithms which, nevertheless, exploit information about neighboring peers only. More precisely, [11] is the only work that proposes a routing mechanism which does not limit a peer’s capability of selecting peers to the information available at a 1-hop horizon. However, the specific data structures proposed in [11], although providing a summary of the subnetworks’ content to suggest a direction to send a query to, are limited to answering only IR-style queries.

In particular, as to the sharing of XML data, different approaches have been proposed in the literature [2, 3, 8, 47, 53]; all these works rely on DHT-based indexing techniques, limiting their applicability to structured P2P systems. Moreover, they do not deal with heterogeneous data and, consequently, they do not cope with semantic approximation issues. Other approaches, e.g. [48], are founded on super-peer network organizations which are too restrictive for modeling our P2P context. The only work which deals with XML data in a PDMS is [50], which focuses on optimization techniques for query processing.

In this chapter we present an instantiation of SUNRISE specifically tailored to XML data sharing. SUNRISE is a complete PDMS infrastructure which enables efficient and effective semantic data sharing among peers along two specific directions: network organization and query routing. Besides the specific differences with respect to the related approaches discussed above, SUNRISE further differs from them in that it founds its network management and query routing strategies on the presence of semantic approximations between peers. In line with approaches which have successfully applied fuzzy set theory in contexts where the uncertainty of description is intrinsic to the nature of the data [18, 46], we deem that these principles can provide valid support for dealing with the semantic approximation originated by the heterogeneity of the peers’ schemas in a PDMS. For this reason, SUNRISE implements a fuzzy setting for its query processing purposes.



3 The SUNRISE Infrastructure The SUNRISE infrastructure relies on a PDMS architecture where different autonomous peers model their local data through schemas and are pairwise connected through semantic mappings. As a reference example, let us consider the network depicted in Figure 1 whose peers are concerned with cinema-related data. item

CinemasSearch (Peer A)

cinema

film

cinema name address

credits production

name

story

city

rooms

HolidayInRome (Peer F) location

direction

movietheater

restaurant

event

address

Google Movies (Peer D)

awards

scenario

director

motion

title story-line

description name surname

movie

IMDb title (Peer B)

film

FindATheater (Peer C)

film plot

director

prizes

name story city

year

Yahoo! Movies (Peer E)

Fig. 1 Reference network

SUNRISE supports entering peers in the selection of their neighbors by exploiting a flexible network organization mechanism which virtually clusters together, in SONs, heterogeneous peers which are semantically related. Figure 2 shows two sample SONs for the network depicted in Figure 1.

Fig. 2 Sample SONs for the reference network: the APS with the Semantic Features of SON1 (Movies, clustroid film at Peer A) and SON2 (Theaters, clustroid movie-theatre at Peer C); the entering peer Movies.aol (Peer N), with concepts {film, name, actors, prizes, ...}, selects its neighbors and connects within SON1

Each peer is represented by a set of concepts {c1, ..., cm} describing its main topics of interest, derived from its local schema. In the case of tree-based structures, like XML schemas, these concepts correspond to the abstract elements which can be obtained by applying a schema summarization technique like the one proposed in [55]. Notice that some peers of the network, such as the Internet Movie Database (IMDb, Peer B) and the web site HolidayInRome (Peer F), are "monothematic", i.e. they only deal with movies and movie theaters, respectively. Other peers, instead, are concerned with both themes, e.g. FindATheater (Peer C). Peers react to the events issued to the network by interacting on the basis of a message exchange protocol. Basically, three kinds of events are supported: the NeighborSelection and Connection events, which are managed in the network construction stage, and the Query event, which characterizes the network exploration stage. The NeighborSelection event is devoted to assisting each newly entering peer in the selection of the semantically closest peers, among the available ones, as its neighbors. On the other hand, a Connection event allows for



the actual connection establishment between pairs of peers by triggering all the required operations. Then, Query events are posed at peers in order to be answered by the most semantically related peers in the network. To implement the message exchange protocol, each peer maintains appropriate data structures and specific software modules, as depicted in Figure 3. In particular, as to the data structures, besides the already mentioned XML schema, each peer maintains a list of semantic mappings that provide the connections to its neighbors by specifying how to represent its XML schema in terms of the neighbors' XML schemas. Two index structures are maintained too: a Semantic Clustering Index (SCI), which is used in the network construction phase for neighbor selection, and a Semantic Routing Index (SRI), which is exploited in the network exploration phase for query routing purposes. Notice that, while the SUNRISE architecture has been conceived to be completely independent of the data model adopted for schema representation and query formulation, the data model peculiarities are supported by the actual implementation of the software modules. This chapter focuses on the XML data model in order to provide a PDMS with an infrastructure for XML data sharing. In the following, we will show how such modules interact and access the data structures, whereas functionality details will be given in Section 5.


Fig. 3 Peer's internal architecture in SUNRISE: software modules (Annotation, Matching, Network Organization with its SCI Management and SON Management sub-modules, Query Routing with its SRI Management and Routing Management sub-modules, and Query Reformulation) and data structures (SCI, SRI, semantic mappings, XML schema)
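To fix ideas, the following minimal Python sketch summarizes the per-peer state suggested by Figure 3; all class and field names are illustrative assumptions, not part of the actual SUNRISE implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Concept = str   # a schema element, e.g. "/movie/title"
Grade = float   # a fuzzy membership grade in [0, 1]

@dataclass
class SemanticPeer:
    # the local XML schema, seen as a set of semantic concepts
    schema: List[Concept]
    # semantic mappings towards each neighbor: (local concept, neighbor concept) -> grade
    mappings: Dict[str, Dict[Tuple[Concept, Concept], Grade]] = field(default_factory=dict)
    # SCI: for each (neighbor, SON) pair, a summary such as (clustroid concept, radius)
    sci: Dict[Tuple[str, str], Tuple[Concept, float]] = field(default_factory=dict)
    # SRI: for each neighbor, one membership grade per local schema concept
    sri: Dict[str, Dict[Concept, Grade]] = field(default_factory=dict)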

3.1 Network Construction

As to network construction, every time a new peer joins the PDMS, it first activates the annotation module, which makes the semantics of its schema explicit by disambiguating each schema term. This operation corresponds to associating each term with the sense, among all the available senses offered by a reference thesaurus or dictionary, which is closest to its interpretation in the schema. This step is fundamental, since schemas often contain polysemous words whose meanings can differ widely. Then, the peer has to choose its neighbors and a NeighborSelection event is generated. SUNRISE assists each newly entering peer in the selection of its neighbors in a two-fold fashion (see Figure 4): first, in a coarse-grained choice of the semantically closest overlay networks; then, within each overlay network, in a fine-grained selection of the best neighbors among the most semantically related peers [30]. Peers are assigned to one or more SONs on the basis of their own concepts. In a PDMS, this operation is particularly challenging because of the lack of a common understanding among the peers' local dictionaries. This means that similar or even identical contents in different peers are not usually described by the same concepts. Our proposal is to solve such heterogeneity by clustering together in the same SON nodes with semantically similar concepts. Semantic similarity is also at the basis of the approach we propose to guide the selection of the neighbors within each SON. As to the way of quantifying semantic similarity, our approach is general, in that it can be measured by means of any knowledge-based distance function between concepts, under the only requirement of being a metric, i.e. satisfying the non-negativity, symmetry, and triangle inequality properties. In particular, in the SUNRISE framework, different distance functions are available (see [30]), all of them taking advantage of the WordNet external knowledge source. SON selection relies on a "light" and scalable structure, the Access Point Structure (APS), which maintains summarized descriptions about the SONs available in



the network in order to help newly entering peers decide which SONs to join or whether to form new SONs. The APS ignores the linkage among peers and provides an abstraction of the SONs as clusters of concepts (e.g. film, movie, picture, etc. for SON1 in Figure 2). In order for the APS to be a "light" structure which scales to the large, we do not keep all concepts at the APS level and follow an approach similar to the one adopted in [19] for clustering large datasets. For each SON, the APS treats its concepts collectively through a summarized representation called Semantic Feature (SF) which expresses the main characteristics of the SON, such as its clustroid (i.e. the centrally located concept according to the adopted distance function between the concepts) with the identifier of the peer it belongs to, some sample concepts, and other properties. Figure 2 shows a portion of the APS for the reference scenario. The APS contains the SFs of two SONs, SON1 and SON2, whose clustroids are the concepts film and movie-theatre owned by peers CinemasSearch (Peer A) and FindATheater (Peer C), respectively. In the third column, some sample representative concepts for the two SONs are listed. Each newly entering peer exploits the information provided by the APS in order to decide which SONs to join or whether to form new SONs. As a first step, the peer computes the semantic distances between its own concepts and the clustroids of the SFs in the APS. Then, each peer's concept is associated with the SON whose clustroid is the closest one (this complies with classical proposals in the field of incremental clustering [19]). Then, the peer enters each SON it is associated with through at least one concept. Further, in order to avoid cluster distortions, a threshold T is used: concepts having a distance greater than T are clustered together and the peer originates a new SON for each such cluster.

Fig. 4 Actions performed following a NeighborSelection event: the new peer N sends its list of topics to the APS, which returns the selected SONs with their clustroid peers (Peer A for SON1, Peer C for SON2); the peer then issues (k-NN/range) NeighborSelection messages, handled by the SON Management module through the SCI, and finally connects to the selected neighbors

As a running example for the network construction phase, let us consider the network join request of Movies.aol (Peer N) in Figure 2, which we suppose to be "monothematic". In this case, the concepts extracted from the XML schema of the entering peer, {film, name, actors, prizes, ...}, are more concerned with movies than with movie theaters, thus the peer enters SON1. After SON selection, the peer starts to navigate the link structure within each selected SON. To select neighbors in a SON, the entering peer starts from the clustroid peer and navigates the link structure by visiting (some of) the peer's immediate neighbors, then their immediate neighbors, and so on.
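The SON selection step just described can be summarized by a minimal Python sketch: each concept of the entering peer is assigned to the SON with the closest clustroid, unless its distance exceeds the threshold T, in which case the concept contributes to originating a new SON. The function names and the distance function are illustrative assumptions.

def select_sons(peer_concepts, sons, distance, T):
    """sons: dict mapping a SON identifier to its clustroid concept (from the APS);
    distance: any metric between concepts (e.g. a WordNet-based one)."""
    assignments = {}        # SON id -> concepts of the peer assigned to it
    new_son_concepts = []   # concepts farther than T from every clustroid
    for c in peer_concepts:
        son, d = min(((s, distance(c, clustroid)) for s, clustroid in sons.items()),
                     key=lambda pair: pair[1])
        if d <= T:
            assignments.setdefault(son, []).append(c)
        else:
            new_son_concepts.append(c)   # will originate a new SON
    return assignments, new_son_concepts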



Fig. 5 Actions performed following a Connection event: the new neighbor N sends its annotated XML schema to Peer B; the Matching module derives the semantic mappings, and the SCI Management and SRI Management modules update the SCI and SRI index structures accordingly

Two neighbor selection policies are supported: 1) a range-based selection, where the selected peers are those within a semantic distance bounded by a given threshold t, and 2) a k-NN selection, which finds the k semantically nearest peers (it is worth noting that the topology of the network is heavily influenced by the kind of neighbor selection policy each peer chooses when it joins the network). Each peer receiving a (k-NN or range) NeighborSelection message activates the SON management module, which determines whether the peer belongs to the required selection and, in the NeighborSelection message forwarding phase, exploits its SCI to prune out non-relevant neighbors. SCIs are indeed used to lighten the neighbor selection process, with the objective of reducing the network load, i.e. the number of accessed peers and the computational effort each accessed peer is required to make. Figure 5 shows the actions which are performed when a Connection event occurs, i.e. every time a new connection is established in the network. Notice that each connection is a pairwise operation and consequently, as shown in the figure, basically involves two peers. Each peer receiving a Connection event (Peer B in the figure) also receives the annotated XML schema, which is used by the matching module to find the semantic correspondences with its own schema. Then, the SRI management module and the SCI management module add a reference to the new neighbor in the corresponding index structures.

3.2 Network Exploration

In a PDMS, a query is posed at a peer and answers can come from any peer in the PDMS which is connected through a semantic path of mappings. Broadly speaking, the PDMS starts from the querying peer and reformulates the query over its immediate neighbors, then over their immediate neighbors, and so on [21]. Thus, when a query is forwarded through a semantic path, it undergoes a multi-step reformulation which may involve a chain of semantic approximations. Due to the heterogeneity of the schemas, each reformulation step may lead to some semantic approximation and, consequently, the returned data may not exactly fit the query conditions. SUNRISE avoids query broadcasting and exploits such approximations for selecting the direction which is more likely to provide the best results to a given query



[32, 34]. As a reference example, we consider the request of an IMDb (Peer B) user asking for "the plot of the movie titled Indiana Jones IV and directed by Steven Spielberg". Figure 6 depicts the actions a peer performs when it receives a Query event.

Fig. 6 Actions performed following a Query event: given a query Q, the Routing Management sub-module consults the SRI to select the next peer P′, and the Query Reformulation module uses the semantic mappings towards P′ to produce the reformulated query Q′

After the execution of the query on the local data set and the collection of local results, the query routing module is activated. In particular, the Routing Management sub-module analyzes the received query and accesses the peer's SRI in order to select the neighbor (P′ in the figure) rooting the most relevant subnetwork among the unvisited ones. Neighbor selection is done on the basis of the policies presented in Section 5.4. Then, the Query Reformulation module uses the semantic mappings towards P′ to reformulate the received query Q into Q′. Q′ can now be sent to P′, which will manage the Query message it receives in a similar way.

4 Modeling Semantic Approximations in a Fuzzy Settlement

SUNRISE leverages the semantic approximations originating from the heterogeneity of peers' schemas as a means for effective and efficient query processing. To this end, the system is founded on a fuzzy theoretical framework which defines the semantics of pairwise peer connections by conveniently extending the notion of mapping with scores, thus giving a measure of the semantic compatibility occurring between the involved portions of schemas. These scores, interpreted as fuzzy values, are used to model the semantic approximation given by each semantic path and each subnetwork queries can be forwarded through.

4.1 Mappings

We denote with P a set of peers. Each peer pi ∈ P stores local data, modeled upon an XML schema Si. This makes pi a semantic peer, in that its local schema Si describes the semantic content of its underlying data. Without loss of generality, we consider a peer schema Si as a set of semantic concepts {Ci1, ..., Cimi}, each one corresponding to an XML schema element.



Peers are pairwise connected in a semantic network through semantic mappings between peers' schemas. This theoretical framework abstracts from the specific format that semantic mappings may have. For this reason, we consider a simplified scenario, and we assume directional, pairwise and one-to-one semantic mappings. The approach we propose can be straightforwardly applied to more complex mappings relying on query expressions, as proposed in [4, 21]. A semantic mapping can be established from a source schema Sj to a target schema Si, and it defines how to represent Si in terms of Sj's vocabulary. In particular, it associates each concept in Si with a corresponding concept in Sj according to a score, denoting the degree of semantic similarity between the two concepts. A formal definition of semantic mapping can be given according to a fuzzy interpretation, and it relies on the concept of fuzzy relation [26].

Definition 1 (Semantic Mapping). A semantic mapping from a source schema Sj to a target schema Si, not necessarily distinct, is a fuzzy relation M(Si, Sj) ⊆ Si × Sj where each instance (C, C′) has a membership grade, denoted as μ(C, C′) ∈ [0, 1], indicating the strength of the relation between C and C′. This fuzzy relation satisfies the following properties: 1) it is a 0-function, i.e., for each C ∈ Si there exists exactly one C′ in Sj such that μ(C, C′) ≥ 0; 2) it is reflexive, i.e., given Si = Sj, for each C ∈ Si, μ(C, C) = 1.

Without loss of generality, we assume that the self mapping M(S, S) is the identity relation. Notice that a non-mapped concept has membership grade 0. M(SB, SA)(movie, film) = 0.97 is a sample tuple of the semantic mapping which could be established between SB and SA. Semantic mappings are used for query reformulation: when a querying peer pi forwards the query q to one of its neighbors, say pj, q must be reformulated into q′ so that it refers to concepts in pj's schema. To this end, pi uses the semantic mapping M(Si, Sj). In this context, reformulation amounts to unfolding [50].
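For illustration, such a mapping can be represented as a dictionary of graded concept pairs; the following minimal Python sketch also shows the one-to-one reformulation it supports. Apart from the 0.97 grade of the sample tuple above, the grades are illustrative assumptions.

# sample tuples of M(S_B, S_A); grades other than 0.97 are made up for illustration
M_BA = {("movie", "film"): 0.97, ("title", "name"): 0.81, ("plot", "story"): 0.82}

def reformulate(concept, mapping):
    """Return the corresponding concept and its grade (grade 0 if non-mapped)."""
    for (source_concept, target_concept), grade in mapping.items():
        if source_concept == concept:
            return target_concept, grade
    return None, 0.0

print(reformulate("movie", M_BA))   # ('film', 0.97)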

4.2 Semantic Paths A semantic path is a chain of semantic mappings connecting a given pair of peers. PDMSs access data on remote peers by reformulating queries along the mappings a semantic path is made up. As local semantic mappings may involve semantic approximations, the semantic approximation given by a semantic path can be obtained by composing the fuzzy relations understood by the involved mappings. This relies on the notion of generalized composition of binary fuzzy relations [26]. Definition 2 (Composition of Mappings). Given a t-norm3 I and the semantic mappings, M(Si , S j ) ⊆ Si × S j and M(S j , Sk ) ⊆ S j × Sk , the I-composition of M(Si , S j ) and M(S j , Sk ) is the semantic mapping M(Si , S j ) ◦I M(S j , Sk ) ⊆ Si × Sk defined by: [M(Si , S j ) ◦I M(S j , Sk )](C,C ) =I[M (Si , S j )(C,C ), M(S j , Sk )(C ,C )], ∀C ∈ Si ,C ∈ Sk , with C ∈ S j . 3

A t-norm I is a binary operation on [0, 1] that is monotone, commutative, associative, and it satisfies the boundary condition I(a, 1) = a for all a in [0, 1].



The composition of more complex mappings requires specific algorithms [50]. However, as we will see later, we are not properly interested in the instances of the resulting semantic mapping but rather in their membership grades.

Definition 3 (Semantic Path). Given a t-norm I and a sequence of mappings ⟨M(S1, S2), ..., M(Sk−1, Sk)⟩ connecting peer p1 with peer pk, the path Pp1...pk ⊆ S1 × Sk is the semantic mapping M(S1, S2) ∘I ... ∘I M(Sk−1, Sk).

The composition function should capture the intuition that the longer the chain of mappings, the lower the grades, thus denoting the accumulation of semantic approximations given by a sequence of connected peers. In order to obtain such an effect of semantic attenuation due to the chain of mappings from C1 to Ck in the schema of a peer pk which is far away from p1, several alternatives exist for the t-norm I [26]. For instance, a possible choice for the t-norm I is the algebraic product I(μ, μ′) = μ · μ′. In fact, given that the arguments are grades in [0, 1], their algebraic product is still in [0, 1], and it is lower than or at most equal to its arguments. Given M(SB, SA)(movie, film) = 0.97 and M(SA, SC)(film, motion) = 0.36 in the reference example, their composition based on the algebraic product yields the following instance of the semantic path: PPeerB,PeerA,PeerC ⊆ SB × SC: (movie, motion) = 0.35.
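A minimal Python sketch of Definitions 2 and 3, assuming the algebraic-product t-norm, reproduces the example above:

from functools import reduce

def t_norm_product(a, b):
    return a * b

def path_grade(mapping_grades, t_norm=t_norm_product):
    """Grade of a semantic path, composed from the grades of its mappings."""
    return reduce(t_norm, mapping_grades)

# M(S_B, S_A)(movie, film) = 0.97 and M(S_A, S_C)(film, motion) = 0.36
print(round(path_grade([0.97, 0.36]), 2))   # 0.35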

4.3 Generalized Semantic Mappings

The query execution process starts from the querying peer, which reformulates the query over its immediate neighbors, then over their immediate neighbors, and so on. Thus, from a multi-step reformulation point of view, whenever a query posed over peer pi is reformulated over peer pj, the query is moving from pi to the subnetwork rooted at pj and it might follow any of the semantic paths originating at pj. In order to model the semantic approximation of pj's subnetwork w.r.t. pi's schema, the semantic approximations given by each path in pj's subnetwork are aggregated into a measure reflecting the relevance of the subnetwork as a whole. To this end, the notion of semantic mapping is generalized as follows. Let p̄j denote the set of peers in the subnetwork rooted at pj, S̄j the set of schemas {Sjk | pjk ∈ p̄j}, and Ppi...pj the set of paths from pi to any peer in p̄j. The generalized mapping relates each concept C in Si to a set of concepts C′ in S̄j taken from the mappings in Ppi...pj, according to an aggregated score which expresses the semantic similarity between C and C′. In this context, a concept in Si can be associated with more than one concept in a schema Sjk in S̄j, since more than one path may exist between pi and pjk. The following definition formalizes the notion of aggregation of the semantic paths starting from pi and ending in any peer in p̄j.



Definition 4 (Generalized Semantic Mapping). Let pi and pj be two peers, not necessarily distinct, and g an aggregation function. A generalized semantic mapping between pi and p̄j is a fuzzy relation M(Si, S̄j) where each instance (C, C′) is such that:
• C′ is the set of concepts {C′1, ..., C′h} associated with C in Ppi...pj, and
• μ(C, C′) = g(μ(C, C′1), ..., μ(C, C′h)).

As to the function g, the following properties, which express the essence of the notion of aggregation [26], must hold: 1) g is monotonically increasing in all its arguments; 2) g is a continuous function; 3) g respects the boundary conditions g(0, ..., 0) = 0 and g(1, ..., 1) = 1. Moreover, aggregating operations on fuzzy sets are usually expected to satisfy two additional requirements: 4) g is a symmetric function of all its arguments, that is, g(a1, ..., am) = g(aπ(1), ..., aπ(m)) for any permutation π on [1, m]; 5) g is an idempotent function, that is, g(a, ..., a) = a for all a ∈ [0, 1]. The aggregation function g should be chosen so as to conveniently model the aggregation of semantic grades. In fact, each resulting grade for a given concept should be representative of the semantic approximation given by the peer and its own subnetwork. Several choices are possible for g, for instance functions such as the min, the max, any generalized mean (e.g., harmonic and arithmetic means), or any ordered weighted averaging (OWA) function (e.g., a weighted sum) [26]. For instance, the generalized semantic mapping M(SB, S̄A) between Peer B and Peer A of the reference example relates movie with the concepts associated with the paths PeerB–PeerA, PeerB–PeerA–PeerC, and PeerB–PeerA–PeerF, i.e. {film, motion} plus a missing correspondence through Peer F (notice that Peer F is in the subnetwork rooted at Peer A but, since Peer F only belongs to SON2, movie has no correspondence in Peer F through Peer A and, thus, its associated membership grade is 0); the resulting membership grade based on the arithmetic mean is (0.97 + 0.35 + 0.0)/3 = 0.44. The lemma below easily follows from the properties that an aggregation function must satisfy.

Lemma 1. A generalized semantic mapping between pi and p̄j is a fuzzy relation M(Si, S̄j) which satisfies the following properties: 1) it is a 0-function, i.e., for each C ∈ Si there exists exactly one tuple C′ in the range such that μ(C, C′) ≥ 0; 2) it is reflexive, i.e., given Si = Sj, for each C ∈ Si, μ(C, C) = 1.
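A minimal Python sketch of Definition 4, assuming the arithmetic mean as the aggregation function g, reproduces the grade computed above:

def generalized_grade(path_grades, g=lambda grades: sum(grades) / len(grades)):
    """Aggregate the grades of all paths reaching a subnetwork into one measure."""
    return g(path_grades)

# paths PeerB-PeerA, PeerB-PeerA-PeerC, PeerB-PeerA-PeerF (no correspondence -> 0.0)
print(round(generalized_grade([0.97, 0.35, 0.0]), 2))   # 0.44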

5 SUNRISE for XML Data Sharing

In SUNRISE, the internal architecture of each peer is enhanced with the software modules depicted in Figure 3. The main aim of this section is to show how these modules act to create a P2P network for XML data sharing and to explore the network for processing XQuery queries. In particular, we will show how SUNRISE implements the fuzzy settlement described in Section 4 by means of a


matching module, which extends mappings with similarity scores, and a query routing module, which incrementally computes generalized semantic mappings as the network evolves and uses them for wise query propagation.

5.1 Annotation Module

SUNRISE's annotation module overcomes the ambiguity of natural language schema terms, as it makes explicit the meanings of the words employed in the peers' schemas. Indeed, schemas often contain polysemous words whose meanings can differ widely. Let us examine, for instance, some of the terms in Peer B's XML schema (consider again Figure 2) along with some of their meanings extracted from WordNet: plot could be "a secret scheme to do something" (sense 1), "the story that is told in a movie" (sense 2), or many others; title could be a "statute title" (sense 1), "the name of a work of art" (sense 2), "the status of being a champion" (sense 3), and seven others. In order for the schema matching and, consequently, the query processing phase to be effective, it is fundamental to determine the right meaning of the employed terminology. To this end, the annotation module exploits the versatile structural disambiguation approach we proposed in [37, 38] and automatically annotates the schemas with the most probable senses extracted from WordNet. The idea behind our annotation approach is to disambiguate the terms occurring in the nodes' labels by analysing their schema context and by using WordNet as an external knowledge source. The process is composed of the following phases.

Schema Pre-processing. Starting from the original XML schema, the annotation module first derives a tree structure representing the underlying conceptual organization. Indeed, the original schema contains, along with the element definitions whose importance is definitely central, also many additional building blocks (i.e. schema components) like type definitions, regular expression keywords, and so on, which may interfere with or even distort the discovery of the real underlying tree structure. For this reason, XML Schema components are rewritten in a more explicit way: the content of each complexType is appended as a child of each element of that type, and the original type definition, along with all type references, is discarded; the references to elements or attributes are substituted by the referred nodes; a final filtration is performed on the nodes of the resulting schema, keeping only element and attribute nodes and discarding all other nodes. In the resulting expanded schema, every element or attribute node has a name. Middle elements have children, and these can be deduced immediately from the new explicit structure. Figure 7-a depicts a fragment of such structure for both Peer B and Peer A: the trees abstract away the many complexities of the XML Schema syntax and only represent the fundamental concepts (nodes) together with their relations (e.g. title is an attribute of movie and, therefore, is one of its children).


Fig. 7 A portion of Peer A and Peer B schemas and some details about their matching process: (a) the schema trees of Peer B (movie, with children title, director, plot, awards) and Peer A (film, with children name, credits/direction, story, production), where corresponding nodes are marked with the same number 1–4; (b) an excerpt of the resulting network of correspondences (e.g. movie–film 0.97, title–name 0.81, plus weaker candidates involving awards, production, and direction); (c) the final match scores: 1 → 0.97, 2 → 0.81, 3 → 0.82, 4 → 0.69

Context Extraction. Starting from each given node, several ways of navigating the obtained schema tree are supported in order to extract its context. In its simplest form, each term's context contains the terms labeling all the nodes reachable from the node considered. This is the case for very specific and homogeneous schemas, where all nodes belong to the same semantic area (as in our example). However, different applications require different contexts: for instance, very large and heterogeneous schemas benefit from a more carefully selected context. For these reasons, it is possible to limit the context by excluding distant and, therefore, unrelated and potentially misleading terms. This is achieved by means of different crossing settings: starting from the term's node, the set of crossable arc labels and the corresponding crossing directions can be restricted, that is, it is possible to specify which kinds of arcs are crossable, in which direction, and the maximum number of crossings (distance from the term's node) allowed to reach the context terms' nodes. Moreover, as we deal with trees, it is also possible to include the siblings of the term's node in the context. In this way, context extraction can easily be tailored to the specific application needs.

Word Sense Disambiguation. Finally, specifically devised disambiguation algorithms make use of the hypernymy/hyponymy hierarchy, as suggested by most of the classic WSD studies, in order to determine each term's most probable meanings w.r.t. its context. The underlying principle is that the confidence in choosing one of the senses associated with each term is directly proportional to the semantic relatedness between that term and each term in the context. For example, in Peer B's schema, the term plot has many movie-related terms, such as movie and director, in its context, thus the resulting confidence in choosing sense 2 (the right one) will be much higher than, for instance, for sense 1, which is about secret plans. Further, additional information coming from the thesaurus, such as the nouns used in the definitions and usage examples of each term's senses, can be compared against the schema context so as to automatically identify the meaning closest to the one in the schema. For instance, for plot, the most significant to our context are the nouns from sense 2's examples, containing terms like movie and character.



The outcome of the disambiguation process is a ranking of the plausible senses for each term. The overall ranking approach is quite versatile and supports two types of graph disambiguation services: an assisted one and a completely automatic one. In the former case, the disambiguation task is entrusted to a human expert, whom the disambiguation service assists by providing useful suggestions. In the latter case, there is no human intervention and the top-ranked sense is selected.
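As a simplified illustration of this principle (not the actual algorithm of [37, 38]), the following Python sketch, assuming NLTK's WordNet interface, scores each candidate noun sense of a term by its aggregate relatedness to the context terms and returns the ranking:

from nltk.corpus import wordnet as wn

def rank_senses(term, context_terms):
    ranking = []
    for sense in wn.synsets(term, pos=wn.NOUN):
        score = 0.0
        for ctx in context_terms:
            # best relatedness between this sense and any sense of the context term
            sims = [sense.wup_similarity(s) or 0.0 for s in wn.synsets(ctx, pos=wn.NOUN)]
            score += max(sims, default=0.0)
        ranking.append((score, sense))
    return sorted(ranking, key=lambda pair: pair[0], reverse=True)

# with a movie-related context, the movie-related sense of "plot" should rank high
print(rank_senses("plot", ["movie", "director"])[0])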

5.2 Matching Module

The matching module automatically generates semantic matches between the annotated schemas of the current peer (source) and of a newly connected neighboring peer (target). Each mapping is extended with a score whose semantics has been presented in Section 4. In particular, it associates each source concept with a corresponding target concept according to a score denoting the degree of semantic similarity between the two concepts. Consider, for instance, the two XML schema fragments from Peer A's and Peer B's schemas shown in Figure 7-a: while their structure and element names are different, they clearly represent similar concepts, and the correspondences resulting from their matching are marked with the same number (see Figure 7-c for the final match scores). Since these matching scores have to be exploited for network organization and network exploration purposes, the main characteristic our matching module should satisfy is the ability to capture the semantic approximation originating from schema heterogeneity and to quantify it by means of comparable scores. Among the several existing schema matching approaches [5, 16, 24, 31], we draw inspiration from the similarity flooding algorithm, originally proposed in [40]. Similarity flooding is a generic graph matching algorithm which uses a fixpoint computation to determine corresponding nodes in the graphs; the principle of the algorithm is that the similarity between two nodes depends on the similarity between their adjacent nodes. As has been shown, this method is one of the most versatile and also provides realistic metrics for match accuracy [15]. Our approach goes beyond the similarity flooding algorithm by considering both the structure of the corresponding trees and the semantics of the involved terms, as extracted by the annotation module. Indeed, in our opinion, the meanings of the terms used in the XML schemas cannot be ignored, as they represent the semantics of the actual content of the XML documents. On the other hand, the structural part of XML documents cannot be treated as a plain set of terms, as the position of each node in the tree provides the context of the corresponding term. In particular, in order to identify the "best" matches, the SUNRISE matching module operates according to the following steps (see [39] for an in-depth explanation):

• the involved schemas are first converted into directed labeled graphs following the RDF specifications (http://www.w3.org/RDF/), where each entity represents an element or attribute of the schema identified by its full path (e.g. /movie/title) and each literal represents a particular name (e.g. title) or a primitive type



(e.g. xsd:string), which more than one element or attribute of the schema can share. As to the arcs, we mainly employ two kinds of labels: child, which captures the structure of the involved schema, and name. This label set can optionally be extended for further flexibility in the matching process: the extra labels are type, which can be used to make the primitive type of leaf elements or attributes relevant to the matching, and tag, which allows for an explicit distinction between an element and an attribute;
• from the RDF graphs of each pair of schemas, a pairwise connectivity graph (PCG) [40], involving node pairs, is constructed, in which a labeled edge connects two pairs of nodes, one for each RDF graph, if such a labeled edge connects the involved nodes in the RDF graphs. Figure 8 shows a portion of the RDF graph and of the pairwise connectivity graph for Peer A's and Peer B's schema fragments;
• an initial similarity score is computed for each node pair contained in the PCG. Similarly to the annotation approach, we follow a linguistic approach in the computation of the similarities between terms. Specifically, the score for each pair of annotated terms (t1, t2) is obtained by computing their depths in the WordNet hypernym hierarchy and the depth of the least common ancestor connecting them, as follows [39]:

sim(t1, t2) = (2 · depth of the least common ancestor) / (depth of t1 + depth of t2);

• such similarities, reflecting the semantics of the single node pairs, are refined by an iterative fixpoint calculation [40], which brings the structural information of the schemas into the computation by propagating the similarity of the elements to their adjacent nodes. The fixpoint computation is iterated until the similarities converge or a maximum number of iterations is reached;
• finally, a stable marriage filter and a threshold filter are applied to the resulting network of correspondences (Figure 7-b shows an excerpt of such a network for our example). The stable marriage filter guarantees that, for each matched pair of nodes (x, y), no other pairs (x, y′) and (x′, y) exist such that x is more similar to y′ than to y and y is more similar to x′ than to x; the threshold filter, on the other hand, ensures that very loose (and, thus, potentially wrong) matches do not appear in the final result. For instance, from the example in Figure 7, the two filters extract the right correspondences for Peer B's movie and title (shown in bold lines in Figure 7-b and with numbers "1" and "2", respectively, in Figures 7-a and 7-c), while the awards node is not assigned to any corresponding node in Peer A's schema.

Fig. 8 RDF graphs and corresponding pairwise connectivity graph (PCG) for a portion of Peer A and Peer B schemas: e.g. the RDF entities /movie and /film, connected via name and child arcs to their labels and children, give rise to the PCG node pairs (/movie, /film) and (/movie/title, /film/name)

Generally speaking, schema matching is the first step towards mappings, which define how to represent the source schema's concepts in terms of the target schema's vocabulary [42]. Obviously, the quality of mappings influences the effectiveness of query processing in a PDMS, but the techniques we propose for network construction and exploration are completely independent of the specific format that semantic mappings may have. Indeed, our main concern is the approximation originating from vocabulary heterogeneity. For this reason, we consider a simplified scenario



where the outcome of the matching module actually corresponds to the directional, pairwise and one-to-one semantic mappings each peer stores in its local folder.
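A much-simplified Python sketch of this pipeline follows; the depth values and PCG edges are illustrative assumptions, and the propagation step is only a loose variant of the actual scheme of [40]:

def initial_score(depth_lca, depth_t1, depth_t2):
    # the linguistic score above: 2 * depth(lca) / (depth(t1) + depth(t2))
    return 2.0 * depth_lca / (depth_t1 + depth_t2)

def flood(pcg_edges, sigma, iterations=10):
    """pcg_edges: (pair, adjacent pair) edges of the PCG;
    sigma: initial similarity per node pair. A simple propagation variant."""
    for _ in range(iterations):
        updated = dict(sigma)
        for pair, neighbor in pcg_edges:
            updated[pair] = sigma[pair] + sigma[neighbor]   # add neighbors' similarity
        top = max(updated.values())
        sigma = {p: v / top for p, v in updated.items()}    # normalize into [0, 1]
    return sigma

edges = [(("movie", "film"), ("title", "name")), (("title", "name"), ("movie", "film"))]
sigma0 = {("movie", "film"): initial_score(6, 6, 6), ("title", "name"): initial_score(5, 6, 6)}
print(flood(edges, sigma0))   # adjacent pairs reinforce each other's similarity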

5.3 Network Organization Module

The Network Organization Module supports peers in the construction of the network by providing functionalities for neighbor selection. It is made up of two sub-modules: the SCI Management module, which is devoted to the maintenance of the specifically devised data structures, and the SON Management module, which manages the actual neighbor selection process.

5.3.1 SCI Management Module

The SCI management module provides each peer with the functionalities for the creation and management of the indexing structures used in the neighbor selection process: the Semantic Clustering Indices (SCIs). Indeed, in order to guide a peer joining the network towards its best position in the selected SONs, each peer maintains a SCI which contains summarized information about the concepts which can be reached in each available direction. In particular, the SCI SCIP of a peer P is a matrix where each cell SCIP[i, j] refers to the set of concepts in the j-th SON (SONj) which are reachable in the sub-SON rooted at the i-th neighbor. Each column j contains non-null values in correspondence with the peers belonging to SONj. Each cell stores a summarized description similar to an SF (i.e. the clustroid of the sub-SON SONi,j; the radius, i.e. the maximum distance between the clustroid and SONi,j's concepts; and other information). Figure 9-a shows the SCI of Peer A (CinemasSearch). The concept in each cell is the clustroid of the corresponding sub-SON, while the score is the radius. Notice that the first row refers to Peer A itself, and thus its two cells refer to the sets of concepts through which Peer A joined SON1 and SON2, respectively. A SCI SCIP provides an abstraction of the SONs P belongs to as trees. More precisely, if P belongs to SONj then P is the root node and it has as many children as the number of its neighbors in SONj, i.e. the number of non-null cells in SCIP[∗, j]. Figure 9-b depicts this tree-based abstraction of SON1 from Peer A's point of view: all concepts in the sub-SON rooted at Peer B (resp., Peer C) are within a semantic distance of 0.5 (resp., 0.3) from the clustroid concept movie (resp., picture).


Fig. 9 Peer A's SCI and tree abstraction of SON1:

(a) Peer A's SCI

          SON1                     SON2
Peer A    (film, 0.2, ...)         (cinema, 0.1, ...)
Peer B    (movie, 0.5, ...)        null
Peer C    (short-film, 0.3, ...)   (movie-theater, 0.4, ...)
Peer F    null                     (cinema, 0.35, ...)

(b) Tree abstraction of SON1: Peer A (film) is the root, with children Peer B, at distance d(film, movie) and with clustroid movie and radius 0.5, and Peer C, at distance d(film, short-film) and with clustroid picture and radius 0.3

Since SCIs maintain summarized information about SONs, they need to change whenever the SONs themselves are modified by the joining or leaving of peers. In particular, SCI creation and evolution are managed by the SCI management module in an incremental fashion as follows. As a base case, the SCI of an isolated peer P has a single row, referring to the peer P itself. This row expresses the information about the local knowledge of P and contains in each cell SCIP[0, j] the description of the SONj which only contains peer P. When an entering peer Pi joins SONj with its concepts Cpt_j^Pi and selects Pk with its concepts Cpt_j^Pk as its neighbor, a new connection is created. In particular, each of the connecting peers (say Pi) informs the other peer (say Pk) of the sub-SON that can be accessed through it. To this end, Pi aggregates its own SCI by rows and sends the j-th column to Pk. It represents the sub-SON SONi,j and is obtained by merging the clusters of concepts represented in each cell of the j-th column. Details on the actual way of computing this merge are given in [30], where the protocol regulating SCI creation/updating is presented. After Pk receives such knowledge, it puts the received information in the cell SCIPk[i, j] (i.e. it adds peer Pi to its SONj neighbors), after having extended its SCI with a new row for Pi if Pi is a newly added neighbor. Afterwards, both peers Pi and Pk need to inform their own reverse neighbors that a change occurred in the network and that they thus have to update their SCIs accordingly. To this end, each peer, say Pi, sends to each reverse neighbor Ph which belongs to the SON SONj, i.e. for which SCIPi[h, j] is not empty, an aggregate of its j-th column excluding Ph's cell. When Ph receives such aggregated information, it updates the cell SCIPh[i, j] with the received information. Disconnections are treated similarly to connections. When a node disconnects from the network, each of its neighbors must delete the row of the disconnected peer from its own SCI and then inform the remaining neighbors that a change in its own subnetwork has occurred by sending new aggregates of its SCI to them.

5.3.2 SON Management Module

Each peer receiving a (k-NN or range) NeighborSelection message activates the SON management module. Such module exploits the peer’s SCI to lighten the



neighbor selection process. The objective is to reduce the network load, i.e. the number of accessed peers and the computational effort each accessed peer is required to make. More precisely, an entering peer Pnew starts the exploration and follows each path which cannot be excluded from leading to peers satisfying the (range or k-NN) selection condition. In the range-based selection, for each contacted peer P, the distance between Pnew's concepts and P's concepts is computed and, if the threshold condition is satisfied, P is chosen as one of Pnew's neighbors. For k-NN selection, a branch-and-bound technique is applied, based on a priority queue of pointers to active sub-SONs, i.e. subnetworks where the k nearest neighbors of Pnew can possibly be found, and on a k-element array which at the end of the process contains the k selected neighbors. The detailed algorithms which implement these neighbor selection policies can be found in [30]. For both policies, the information stored in SCIP is used to prune out non-relevant subnetworks and to avoid useless distance computations by exploiting the triangle inequality property of the distance function used. In particular, let ri,j be the radius of sub-SON SONi,j and consider the case of range-based selection. All the sub-SONs in SCIP whose clustroids are at a distance d from Pnew's concepts such that d > ri,j + t can be safely pruned. In fact, this condition guarantees that all concepts in the sub-SON SONi,j have a distance greater than the threshold t. A similar condition can be exploited for k-NN-based selection, where the distance dk from the current k-th nearest neighbor is used as a dynamic threshold (i.e. the test condition is d > ri,j + dk). Going back to our reference example, according to the APS in Figure 2, the entering peer Peer N (Movies.aol) starts the exploration of SON1 from the clustroid peer Peer A (CinemasSearch). Then, let us consider Figure 9 and suppose the peer wants to find its neighbors within a range of 0.2 and that the semantic distances from Peer N's concepts to Peer B's and Peer C's concepts are 0.4 and 0.8, respectively. Since 0.4 < 0.5 + 0.2, the sub-SON rooted at Peer B (IMDb) is explored, whereas the one rooted at Peer C (FindATheater) is pruned, since 0.8 > 0.3 + 0.2.
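A minimal Python sketch of this pruning rule for range-based selection, with illustrative data from the running example, is the following:

def subsons_to_explore(sci_entries, distance_to_clustroid, t):
    """sci_entries: (neighbor, clustroid, radius) triples from one SON column of the SCI.
    A sub-SON is pruned when d > radius + t: by the triangle inequality, every
    concept in it is then farther than t from the entering peer's concepts."""
    explore = []
    for neighbor, clustroid, radius in sci_entries:
        d = distance_to_clustroid(clustroid)
        if d <= radius + t:   # cannot be excluded: explore it
            explore.append(neighbor)
    return explore

# running example: d(movie) = 0.4 <= 0.5 + 0.2, while d(picture) = 0.8 > 0.3 + 0.2
dist = {"movie": 0.4, "picture": 0.8}.get
print(subsons_to_explore([("PeerB", "movie", 0.5), ("PeerC", "picture", 0.3)], dist, 0.2))
# ['PeerB']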

5.4 Query Routing Module

The role of the query routing module is to provide the PDMS with advanced semantic query routing functionalities. It consists of two parts: the SRI Management module, which manages the creation and evolution of the index structures involved in the routing process (the SRIs), and the Routing Management module, which helps each peer receiving a query to route it towards the best subnetworks originating at its neighbors.

5.4.1 SRI Management Module

Each peer maintains cumulative information which summarizes the semantic approximation capabilities, w.r.t. its XML schema, of the whole subnetworks rooted at each of its neighbors. In particular, each peer keeps such information in a local data structure which we call the Semantic Routing Index (SRI).


Fig. 10 A portion of Peer B's SRI for the reference example:

SRI_PeerB    movie    title    plot    ...
Peer B       1.00     1.00     1.00    ...
Peer A       0.91     0.88     0.83    ...
Peer E       0.77     0.70     0.41    ...
Peer D       0.72     0.68     0.30    ...

Thus, a peer P having n neighbors and m concepts in its XML schema stores an SRI structured as a matrix with m columns and n + 1 rows, where the first row refers to the knowledge about the local schema of peer P.

Definition 5 (Semantic Routing Index). Let p be a peer with schema S = {C1, ..., Cm} and neighbors p1, ..., pn. p's Semantic Routing Index is a matrix SRI of n + 1 rows and m columns such that each entry SRI[i, j], for i = 1...n, j = 1...m, is the membership grade μ(Cj, C′j) of the instance (Cj, C′j) of the generalized semantic mapping M(S, S̄i).

A sample portion of Peer B's SRI for the reference example is shown in Figure 10. Notice that the scores of the first row represent the scores of Peer B's concepts in the self mapping, which we assume to be the identity relation. The operations for the creation and update of the SRIs are the SRI management module's responsibility. In particular, each peer creates its own SRI when entering the network by executing a specifically devised protocol. The same protocol regulates SRI evolution in an incremental way as follows. As a base case, the SRI of an isolated peer p having schema S is made of the single row [1, ..., 1], i.e., it contains the membership grades of the concepts in S in the self mapping M(S, S). This row expresses the semantic approximation offered by the subnetwork rooted at p, yet made of the only peer p. When a peer connects to another peer, each one aggregates its own SRI by rows, according to an aggregation function g. The result of this aggregation operation is a tuple SRI^g = [μ1, ..., μm], where each μj is the membership grade of concept Cj in the schema S of the peer to the fuzzy relation obtained by the aggregation of the SRI's rows, i.e., μj = g(SRI[0][j], ..., SRI[n][j]) for j = 1...m. The so-obtained fuzzy relation SRI^g and the schema S are then sent to the other peer. After a peer, say pi, receives such knowledge from the other peer, say pj, a semantic mapping M(Si, Sj) is established between Si and Sj. Then, pi extends its SRI SRIi with a new row for pj. The membership grades of this newly created row are obtained in two steps: 1) M(Si, Sj) is composed with the aggregated SRI SRI_j^g provided by pj to obtain a fuzzy relation which expresses the extension of the semantic paths originating from pj (represented by the aggregated SRI) with the connection between pi and pj; 2) the so-obtained fuzzy relation is then aggregated



with M(Si, Sj) to include the semantic path connecting pi with pj. More precisely, SRIi[j][k] = g(M(Si, Sj)(Ck, C′k), M(Si, Sj)(Ck, C′k) ∘I SRI_j^g[k]), for k = 1...m. (Notice that, because of the boundary condition of the t-norm I used for composition, and g being an idempotent function, the base case of the connection of an isolated peer pj to a peer pi results in SRIi[j][k] = M(Si, Sj)(Ck, C′k).) Afterwards, both peers pi and pj need to inform their own reverse neighbors that a change occurred in the network and that they thus have to update their SRIs accordingly. To this end, each peer, say pi, sends to each reverse neighbor pik an aggregate of its SRI excluding pik's row, i.e., μj = g(SRIi[0][j], ..., SRIi[k−1][j], SRIi[k+1][j], ..., SRIi[n][j]). When pik receives such aggregated information, it updates the i-th row of its SRI by recomputing the membership values as discussed above. As the values stored in the SRIs are computed incrementally, we have to show that they actually correspond to the membership values of the generalized semantic mappings.

Theorem 1. Whenever the aggregation function g is associative and the composition function I is distributive over the aggregation function g, the process described above is correct, i.e., when applied to the Semantic Routing Index of peer pi, SRIi[j] is the generalized semantic mapping M(Si, S̄j) between pi and p̄j.

The proof is given in the Appendix. Disconnections are treated similarly to connections. When a node disconnects from the network, each of its neighbors must delete the row of the disconnected peer from its own SRI and then inform the remaining neighbors that a change in its own subnetwork has occurred by sending new aggregates of its SRI to them. A similar procedure applies in case of modifications of the semantic knowledge maintained at each peer, for instance when a new concept is added to the peer's schema. When many changes occur in the PDMS, a careful policy of update propagation may be adopted. For instance, when changes have little impact on its SRI, a peer may also decide not to notify the network. This would reduce the amount of exchanged messages as well as the computational costs due to SRI manipulation. We are aware that the definition of such policies, recommended for highly dynamic PDMSs, is a fundamental issue, so we plan to deal with it in future work.
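A minimal Python sketch of this incremental row computation, assuming the algebraic product as the t-norm I and, purely for illustration, the arithmetic mean as the aggregation function g, is:

def t_norm(a, b):
    return a * b

def g(*grades):
    return sum(grades) / len(grades)   # illustrative choice of aggregation

def new_sri_row(mapping_grades, aggregated_sri):
    """mapping_grades[k]: grade of the mapping instance for concept k;
    aggregated_sri[k]: the row-aggregate SRI^g received from the new neighbor.
    Implements SRI_i[j][k] = g(M(C_k, C'_k), I(M(C_k, C'_k), SRI_j^g[k]))."""
    return [g(m, t_norm(m, s)) for m, s in zip(mapping_grades, aggregated_sri)]

# e.g. mapping grade 0.9 for a concept, neighbor's aggregated grade 0.6:
print(new_sri_row([0.9], [0.6]))   # [0.72], i.e. g(0.9, 0.54)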

5.4.2 Routing Management Module

With SRIs at the PDMS's disposal, in the query forwarding phase a peer P accesses its own index to determine the neighboring peers which are most semantically related to the concepts in q. For example, if a query q refers to a single concept C, the choice of the semantically best neighboring peers can be done by evaluating the column of its SRI corresponding to C: this means that Peer A would be the selected neighbor for the concept plot in Figure 10. In general, each more realistic and thus complex XQuery query involving several concepts can be interpreted as a formula of predicates specifying the query conditions and combined through logical connectives. In this case, the choice of the best neighbor can be done by applying scoring rules which, for each neighboring peer Pi, combine the corresponding



grades in the SRI for all the corresponding concepts in q. Specifically, the fuzzy logic approach presented in [34] is adopted. Going back to our example (Figure 10), the score of Peer A for a query involving the concepts plot and title connected through an AND operator would be min(0.83, 0.88), conjunction being dealt with through the minimum. Assuming that an overall score is somehow obtained for a complex query, different routing strategies can be executed, each having effectiveness, efficiency or a trade-off between the two as its priority. In particular, the routing strategies the Routing Management Module can implement belong to two main families of navigation policies: the Depth First (DF) one, which pursues efficiency as its objective (i.e. its main objective is to minimize the query path), and the Global (G), or Goal-based, model, which is designed for effectiveness (i.e. its main objective is to maximize the relevance of the retrieved results). The DF model provides an SRI-oriented depth-first visiting criterion: it progresses deeper and deeper in the network following, at each forwarding step, the path toward the neighbor characterized by the highest SRI value; backtracking is only performed when a "blind alley" is reached. Based on the DF model, the two following routing policies are implemented:

• DF policy: the "standard" depth-first policy, directly implementing the DF model;
• DFF (Depth-First Fan) policy: a variation of DF, performing a depth-first visit with an added twist. Specifically, at each node, DFF performs a "fan" by exploring all the neighbors, then it proceeds in depth to the best subnetwork, as DF does. DFF is an attempt to enhance DF, as it tries to capture in fewer hops more answers coming from short semantic paths, which are thus potentially more relevant than those retrieved by DF.

In order to better explain how the DF policies work, let us consider our reference network and see how a query posed on Peer B and involving the only concept plot would be routed (see Figures 2 and 10). We use the following notation: we present the routing sequence of hops as an ordered list, where each entry P means peer P is accessed and queried, while (P) denotes a backtracking hop through peer P. We consider the navigation until all the peers have been queried. For the DF policy this would be the behavior: Peer B, Peer A (most promising subnetwork rooted at Peer B), Peer C (assuming that Peer C is more relevant than Peer F w.r.t. Peer A), (Peer A), Peer F, (Peer A), (Peer B), Peer E, Peer D. For DFF: Peer B, Peer A, (Peer B), Peer E, (a fan is performed before exploring the best subnetwork), (Peer B), Peer D, (Peer B), (Peer A), Peer C, (Peer A), Peer F. Differently from the DF model, in the G one each peer chooses the best peer to forward the query to in a "global" way: it does not limit its choice to the neighbors but considers all the peers already "discovered" (i.e. for which a navigation path leading to them has been found) during network exploration and that have still not been visited. This is mainly achieved by managing and passing along the network an additional structure, called Goal List (GL), which is a globally ordered list of goals. Each goal contains information useful for next peer selection. In particular, it represents an arc in the network topology, starting from an already queried



peer and going to a destination (and still unvisited) one. The GL is always kept ordered on the basis of the goals' semantic relevances, which are calculated by means of an appropriate function taking into account the whole path originating from the querying peer. The G model simply progresses by selecting the top goal in the GL as the next peer to be queried. In this way, the G model constantly exploits backtracking in order to reach back to potentially distant goals. Obviously, going back to potentially distant goals (peers) has a cost in terms of efficiency, but it always ensures the highest possible effectiveness, since the most relevant discovered peers are always selected. Based on the G model, two routing policies are implemented, which differ in the function used for the goal relevance computation:

• G policy: the function only considers the semantic relevance of the goals;
• GH (Global Hybrid) policy: this "hybrid" policy chooses goals following a trade-off between effectiveness and efficiency. This is achieved by introducing an ad-hoc parameterizable function f, which considers not only a goal's semantic relevance semRel but also its distance hops (expressed in number of hops) from the current peer: f(semRel) = semRel/(hops)^k, with k ∈ [0, ∞). By simply adjusting the value of k, the GH policy can easily be tuned more towards efficiency (k → ∞) or towards effectiveness (k → 0).

Going back to our reference example, let us assume that the scores of the goal destinations (incrementally computed during navigation) are in the following order: Peer A, Peer E, Peer C, Peer D, Peer F. Then, the routing sequence for the G policy would be: Peer B, Peer A, (Peer B), Peer E, (Peer B), (Peer A), Peer C, (Peer A), (Peer B), Peer D, (Peer B), (Peer A), Peer F.
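A minimal Python sketch of the scoring involved in these policies, assuming the minimum for conjunctive queries (as in the example above) and the GH weighting f = semRel/hops^k, is:

def neighbor_score(sri_row, query_concepts):
    # conjunctive (AND) combination of SRI grades: the minimum
    return min(sri_row[c] for c in query_concepts)

def gh_goal_score(sem_rel, hops, k=1.0):
    # Global Hybrid weighting: distant goals are discounted more as k grows
    return sem_rel / (hops ** k)

sri_peer_b = {"PeerA": {"movie": 0.91, "title": 0.88, "plot": 0.83},
              "PeerE": {"movie": 0.77, "title": 0.70, "plot": 0.41}}
scores = {p: neighbor_score(row, ["plot", "title"]) for p, row in sri_peer_b.items()}
print(max(scores, key=scores.get), scores)   # PeerA, with min(0.83, 0.88) = 0.83
print(gh_goal_score(0.83, hops=2, k=1.0))    # 0.415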

5.5 Query Reformulation Module

The semantic mappings produced by the Matching module are exploited to reformulate the source query into a target one, compatible with the target peer's schema. As our mappings relate the target peer's schema concepts with the source ones, reformulation translates to unfolding [39]. At present, we support XQuery FLWOR conjunctive queries with standard variable use, predicates and wildcards. In particular, after having substituted each path in the WHERE and RETURN clauses with the corresponding full paths, and having then discarded the variable introduced in the FOR clause, all the full paths in the query are reformulated by using the best matches between the nodes of the given source schema and the target schema (e.g. the path /movie/director of Peer B is automatically reformulated into its corresponding best match, /film/credits/direction of Peer A, as can be seen from Figure 7-a). Let us consider, for instance, the XQuery representation of our simple running example's query:

FOR $x IN /movie
WHERE $x/title = "Indiana Jones IV"
AND $x/director = "Steven Spielberg"
RETURN $x/plot



The Reformulation Module transforms it into the following target query, compatible with Peer A's schema:

FOR $x IN /film
WHERE $x/name = "Indiana Jones IV" AND $x/credits/direction = "Steven Spielberg"
RETURN $x/story
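At its core, the unfolding step behaves like a substitution of source full paths with their best matches on the target schema. The following Python sketch, with a hypothetical mapping table taken from the running example, illustrates the idea; the real module naturally operates on parsed XQuery, not on raw path strings.

    # Illustrative best-match table between Peer B's and Peer A's
    # schemas (the dict itself is an assumed stand-in for the output
    # of the Matching Module).
    BEST_MATCH = {
        "/movie":          "/film",
        "/movie/title":    "/film/name",
        "/movie/director": "/film/credits/direction",
        "/movie/plot":     "/film/story",
    }

    def reformulate_path(full_path):
        # Replace a source full path with its best match on the target
        # schema; paths with no match are left untouched.
        return BEST_MATCH.get(full_path, full_path)

    # e.g. reformulate_path("/movie/plot") -> "/film/story"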

6 Experiments

This section describes the empirical evaluation of SUNRISE, performed by means of its simulation environment [35], through which we were able to reproduce the main conditions characterizing a PDMS environment where autonomous peers freely decide when to enter the system. The simulation engine is based on SimJava 2.0, a discrete, event-based, general-purpose simulator. Through this framework we modeled scenarios corresponding to networks of semantic peers, each with its own XML schema describing a particular reality. As in [50], the schemas are derived from real-world data sets, collected from many different available web sites, such as IMDb and the DBLP Computer Society Bibliography, and enlarged with new schemas created by introducing structural and terminological variations on the original ones; in this way we were able to fully test the potential of SUNRISE with large PDMSs of semantically related peers. The schemas differ in complexity and size (their mean size is on the order of dozens of elements), and belong to three main domains: movies, publications and sport. The networks are automatically produced by the SUNRISE network organization algorithms, which establish the connections among peers according to the semantic similarity between peers' schemas. The mean size of our networks is on the order of some hundreds of nodes.

In order to evaluate the benefits provided by the network construction and routing techniques, and thus the effectiveness and efficiency of SUNRISE query answering, we instantiated different queries on randomly selected peers, where each query is a combination, through logical connectives, of a small number of predicates specifying conditions on concepts. More precisely, we quantified the advantages on query processing by propagating each query until a stopping condition is reached, considering two alternatives: stopping the querying process when a given number of hops (hops) has been performed and measuring the quality of the results (satisfaction) or, dually, stopping when a given satisfaction is obtained and measuring the required number of hops. Satisfaction is a specifically introduced quantity that grows proportionally to the goodness of the results returned by each queried peer [32]. In particular, we compared the routing strategies presented in Section 5.4 together with the Global IP-based (GIP) policy, which is a variation of the Global (G) mechanism: a direct connection is established between the current peer and the peer chosen for the following step, avoiding the hops needed to reach it in the original network topology. This policy cannot be considered a real P2P strategy, but it provides an interesting upper bound.
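The two dual stopping conditions can be sketched as a pair of driver loops. In this hypothetical Python sketch, propagate stands in for the simulator's routing policy (yielding peers in visit order), peer.answer for query execution at a peer, and satisfaction for the quality measure of [32]; none of these is the actual SUNRISE API.

    def run_until_hops(query, start_peer, max_hops, propagate, satisfaction):
        # First alternative: fix the hop budget, measure result quality.
        results, hops = [], 0
        for peer in propagate(query, start_peer):
            results.append(peer.answer(query))   # assumed peer interface
            hops += 1
            if hops >= max_hops:
                break
        return satisfaction(results)

    def run_until_satisfaction(query, start_peer, goal, propagate, satisfaction):
        # Dual alternative: fix the satisfaction goal, count the hops.
        results, hops = [], 0
        for peer in propagate(query, start_peer):
            results.append(peer.answer(query))
            hops += 1
            if satisfaction(results) >= goal:
                break
        return hops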


Satisfaction by number of hops (5 / 10 / 15 / 20 / 25 / 30):

R:    3,417  4,938  6,038  6,653  6,788  6,843
GIP:  4,620  7,015  7,443  7,443  7,443  7,443
G:    4,245  5,629  6,588  7,079  7,232  7,283
DF:   4,300  6,427  7,025  7,025  7,156  7,156
DFF:  3,653  5,357  6,886  7,256  7,264  7,264
R*:   2,879  2,843  3,367  4,074  4,074  4,074
DF*:  3,479  4,129  4,353  4,394  4,394  4,394

Fig. 11 Satisfaction reached by routing policies given a maximum number of hops (first scenario)

We considered two significant scenarios differing in the degree of semantic heterogeneity characterizing each peer's schema: in the first one, most peers' XML schemas are monothematic, while in the second one many are multithematic. Figure 11 shows the trend of the obtained satisfaction when we gradually vary the stopping condition on hops for the first scenario. As the figure shows, the SUNRISE network organization techniques allow for a relevant improvement of query processing effectiveness, even when no routing capabilities are available. In particular, comparing the R* and R curves (the '*' suffix denotes a test performed on a randomly constructed network topology), we can appreciate the improvement provided by the SUNRISE network organization algorithms. Moreover, by exploiting the routing techniques we can achieve even better effectiveness. Specifically, the obtained network organization allows the DF policy to achieve results near the upper bound (GIP). Further, the DFF mechanism is initially less effective than DF, since it uses a large number of hops for performing its "fan" exploration; nevertheless, for higher stopping conditions, it becomes increasingly more effective until it outperforms the DF policy. This is due to the fact that it visits nearer peers, which have a higher probability of providing better results.

On the other hand, Figure 12 shows the results of the experiments aimed at verifying the efficiency of SUNRISE query routing: it represents the trend of the number of required hops for a given satisfaction goal. As expected, the DF policy outperforms the others, since its priority is to minimize the query path. The DFF policy is instead the least efficient one, due to the number of hops spent visiting all the neighboring peers. In the more complex second scenario, routing becomes even more relevant for query processing, since interesting data are spread among a larger number of peers.



Number of hops needed per satisfaction goal (2,0 / 2,5 / 3,0 / 3,5 / 4,0 / 4,5 / 5,0):

R:    2,65   3,84   4,91    7,27    9,40   11,31  12,77
GIP:  2,00   3,00   3,00    4,00    4,00    5,00   6,00
G:    2,00   3,44   3,44    4,85    4,85    7,94   9,74
DF:   2,00   3,00   3,29    4,31    4,83    6,51   7,73
DFF:  2,00   4,00   4,82    6,63    8,16    9,22  10,36
R*:   8,67  12,11  14,39   15,70   16,09   16,32  16,57
DF*:  2,25   3,55   4,02    6,36    8,18   10,65  11,89

Fig. 12 Mean number of hops needed to reach a given satisfaction (first scenario)

Satisfaction by number of hops (5 / 10 / 15 / 20 / 25 / 30):

R:     2,869  3,232  3,444  3,487  3,418  3,478
GIP:   4,076  6,409  7,704  8,594  9,196  9,453
G:     3,761  5,184  5,843  6,594  7,072  7,429
DF:    3,887  5,274  5,676  5,891  5,952  5,990
DFF:   3,269  4,521  5,586  6,659  6,724  6,931
GH-1:  3,898  5,439  5,999  6,513  6,902  7,231
GH-2:  3,887  5,291  5,721  6,017  6,313  6,502

Fig. 13 Satisfaction reached by routing policies given a maximum number of hops (second scenario)

In this scenario, the selection of the best peers to forward a query to is a fundamental challenge, as Figure 13 shows. Unlike in the first scenario, the G policy shows the best behavior, as it selects at each step the available peer with the highest semantic relevance approximation. The curves for the GH strategy are also represented: notice that by tuning the k parameter we can handle the trade-off between efficiency and effectiveness of the query routing. For clarity of presentation, we omitted the results for randomly constructed networks, which are similar to those of the first scenario.


[Line chart plotting the number of queried peers (y-axis) against satisfaction (x-axis) for the G, DF, DFF, GH-1 and GH-2 policies]

Fig. 14 Effectiveness vs. efficiency of routing policies (second scenario)

Figure 14 gives a visual summary of the achieved results, showing the relation between the number of queried peers (efficiency) and the satisfaction SUNRISE reaches (effectiveness) given a maximum number of hops. In particular, each policy's performance is represented as a line connecting six points: from left to right, they represent the results obtained at the different stopping conditions, i.e. limiting the simulation to 5, 10, 15, 20, 25, and 30 hops. The results are visualized in a combined effectiveness/efficiency plane: the further to the right (top) a point lies, the higher the achieved effectiveness (efficiency, respectively). As expected, we observe that the G policy is the most effective one (for instance, the satisfaction achieved in 30 hops is nearly 7.5). In contrast, the DF policy appears to be the most efficient one (nearly 26 actually queried peers in 30 hops, as opposed, for instance, to almost 13 for the G policy). Moreover, we can see the effect of k in the GH policy: increasing k makes the GH policy more efficient, but less effective. Finally, notice that the DFF policy can reach satisfaction goals similar to the ones reached by the G strategy, but in a more efficient way.

7 Conclusions and Future Research Directions

Semantic support for data representation, together with a flexible machine-readable format, has made XML the de facto standard for semantic interoperability among Internet applications. Its applicability is primarily evident in settings where the actors are heterogeneous peers, connected by means of pairwise semantic mappings between their schemas, which interact with each other for data sharing purposes. One of the main challenges in such a semantically heterogeneous environment is concerned



with query processing when dealing with the inherent semantic approximations occurring in the data. In this chapter we demonstrated that a P2P network for XML data sharing can benefit from a semantics-aware infrastructure like SUNRISE. In particular, the main contributions of the chapter are the following:

• we described SUNRISE's PDMS infrastructure and showed how it extends each peer with functionalities for capturing the semantic approximation originating from schema heterogeneity, and for exploiting it for a semantically driven network organization and query routing. SUNRISE completely supports the construction of a PDMS semantic layer and offers a series of techniques for an effective and efficient exploration of the network. The system also includes a visual simulation environment and an easy-to-use GUI that can reproduce and visualize the various performed operations;
• we described how the peculiarities of XML are supported by the actual implementation of the SUNRISE software modules. Though the SUNRISE architecture has been conceived to be completely independent of the adopted data model, we showed that it allows for the instantiation of a P2P infrastructure for XML data sharing;
• we showed through a rich series of experiments how the techniques provided by SUNRISE allow for a semantics-driven exploration of the network. The experiments evaluated SUNRISE with all its features working together and demonstrated how peers interact for effective and efficient query processing.

As for future work, several research directions could be explored. As to network organization, the adopted approach could be enhanced by adding merging and splitting mechanisms to the SON management. This could improve not only the effectiveness and efficiency of our approach, but also its scalability and its ability to gracefully cope with the changes occurring in the network. Further, the problem of range and k-NN selection could be studied in more depth. This is a fundamental aspect, since the topology of the resulting network is heavily influenced by the value each peer chooses for the number of neighbors k or the similarity threshold t when it joins the network, with consequences for query processing performance. On the other hand, as to network exploration, SRIs could be integrated in a more general framework together with other approaches, such as [11, 41], which are orthogonal to ours and cover complementary aspects, such as knowledge on quantitative information as well as on the novelty of results, so as to blend the different dimensions a peer can be queried on. Moreover, as also stated in [51], the best peer has so far been understood as the peer that has the most knowledge; other aspects one might include in the evaluation of peers are properties like latency, costs, etc.

A further research direction is concerned with the well-known object fusion problem. Peers in the network might be able to answer a query only partially, thus returning pieces of information to be consistently collected into an integrated result on the basis of the different contributions of the answering peers. A similar situation might also occur in the case of optimization policies which intentionally split a complex



query into subqueries to be answered by different peers, depending on the skills of each peer with respect to the conditions expressed by each specific subquery.

Appendix

Proof of Theorem 1

Theorem 1 is proved by induction as follows.

Base case. This is the case of a network made up of one isolated peer $p$ having schema $S$. Its SRI consists of a single row, which represents the self mapping $M(S,S)$. This row expresses the semantic approximation offered by the subnetwork rooted in $p$, which is made of the only peer $p$; on the other hand, it also represents the generalized semantic mapping of $p$ with itself.

Induction step. To prove this step we rely on the associativity of the aggregation function $g$ and on the distributivity of the composition function $\circ_I$ over $g$. Let $p_i$ and $p_j$, having schemas $S_i$ and $S_j$ respectively, be two peers that want to connect to each other. Before the connection, by hypothesis, the rows of $SRI_i$ and $SRI_j$ represent the generalized semantic mappings that $p_i$ and $p_j$ have with respect to their own neighbors. When establishing the connection, $p_i$ receives from $p_j$ its aggregated SRI $SRI_j^g$ and then applies the process described in Section 5.4 to complete the connection. After this process, the entry $SRI_i[j][k]$ for concept $C_k$ in $S_i$ results as follows (see Section 5.4), where $C'_k$ is the concept in $S_j$ which corresponds to $C_k$:

$SRI_i[j][k] = g\big(M(S_i,S_j)(C_k,C'_k),\; M(S_i,S_j)(C_k,C'_k) \circ_I SRI_j^g[k]\big)$   (1)

Let us consider the last operand of $g$ in the right-hand side of Eq. 1, i.e., $M(S_i,S_j)(C_k,C'_k) \circ_I SRI_j^g[k]$. By definition of $SRI_j^g[k]$, we can expand it by substituting $SRI_j^g[k] = g(SRI_j[0][k], \ldots, SRI_j[n_j][k])$, where $n_j$ is the number of $p_j$'s neighbors, thus obtaining:

$M(S_i,S_j)(C_k,C'_k) \circ_I g\big(SRI_j[0][k], \ldots, SRI_j[n_j][k]\big)$

By exploiting the distributivity of $\circ_I$ over $g$, we obtain:

$g\big(M(S_i,S_j)(C_k,C'_k) \circ_I SRI_j[0][k],\; \ldots,\; M(S_i,S_j)(C_k,C'_k) \circ_I SRI_j[n_j][k]\big)$   (2)

Each $SRI_j[l][k]$, with $l = 0..n_j$, represents by hypothesis the generalized semantic mapping between $p_j$ and $p_l$, i.e.:

$SRI_j[l][k] = g\big(\mu(C'_k, C_k^1), \ldots, \mu(C'_k, C_k^h)\big)$

where $\{C_k^1, \ldots, C_k^h\}$ are the concepts associated with $C'_k$ in $P_{p_j \ldots p_l}$.

By exploiting again the distributivity of $\circ_I$ over $g$ on each operand of $g$ in Eq. 2, we obtain (for the sake of simplicity, we show only the component related to $p_l$; the others are identical):

$M(S_i,S_j)(C_k,C'_k) \circ_I SRI_j[l][k] = g\big(M(S_i,S_j)(C_k,C'_k) \circ_I \mu(C'_k,C_k^1), \ldots, M(S_i,S_j)(C_k,C'_k) \circ_I \mu(C'_k,C_k^h)\big)$   (3)

where $M(S_i,S_j)(C_k,C'_k) \circ_I \mu(C'_k,C_k^q)$, with $q = 1..h$, represents the score of $C_k$ in the mapping formed by the path connecting $C_k$ with $C_k^q$ through $C'_k$. The right-hand side of Eq. 3 thus denotes the aggregation of the scores of $C_k$ in the mappings formed by the paths connecting $p_i$ to any peer in $P_{p_i \ldots p_l}$.

By the associativity of the aggregation function $g$, Eq. 2 thus represents the aggregation of the scores of $C_k$ in $P_{p_i \ldots p_r}$, with $r = 0..n_j$. Then, by exploiting again the associativity of $g$, the further aggregation with $M(S_i,S_j)(C_k,C'_k)$ in Eq. 1 represents the generalized semantic mapping between $p_i$ and $p_j$. The same process applies for $p_j$.
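As a concrete sanity check of the distributivity step used above, the following Python sketch instantiates $g$ with max and $\circ_I$ with min, one common fuzzy choice consistent with the chapter's setting (the proof itself holds for any pair satisfying associativity and distributivity), and verifies Eq. 2 numerically on assumed scores.

    # Sanity check: min distributes over max, i.e.
    #   min(a, max(b1, ..., bn)) == max(min(a, b1), ..., min(a, bn)),
    # which is the step taking Eq. 1 to Eq. 2 for g = max, o_I = min.

    def compose(a, b):          # plays the role of o_I
        return min(a, b)

    def aggregate(*xs):         # plays the role of g
        return max(xs)

    a = 0.83                    # assumed mapping score M(Si,Sj)(Ck, C'k)
    rows = [0.88, 0.40, 0.65]   # assumed SRIj[l][k] values, l = 0..nj

    lhs = compose(a, aggregate(*rows))
    rhs = aggregate(*(compose(a, r) for r in rows))
    assert lhs == rhs == 0.83   # both sides agree, as Eq. 2 requires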

References

1. Aberer, K., Cudré-Mauroux, P., Hauswirth, M., Pelt, T.V.: GridVine: Building Internet-Scale Semantic Overlay Networks. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 107–121. Springer, Heidelberg (2004)
2. Abiteboul, S., Allard, T., Chatalic, P., Gardarin, G., Ghitescu, A., Goasdoué, F., Manolescu, I., Nguyen, B., Ouazara, M., Somani, A., Travers, N., Vasile, G., Zoupanos, S.: WebContent: Efficient P2P Warehousing of Web Data. In: Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), vol. 1(2), pp. 1428–1431 (2008)
3. Abiteboul, S., Manolescu, I., Polyzotis, N., Preda, N., Sun, C.: XML Processing in DHT Networks. In: Proceedings of the 24th International Conference on Data Engineering (ICDE), pp. 606–615 (2008)
4. Arenas, M., Kantere, V., Kementsietsidis, A., Kiringa, I., Miller, R., Mylopoulos, J.: The Hyperion Project: From Data Integration to Data Coordination. SIGMOD Record 32(3), 53–58 (2003)
5. Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and Ontology Matching with COMA++. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 906–908 (2005)
6. Bawa, M., Manku, G., Raghavan, P.: SETS: Search Enhanced by Topic Segmentation. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 306–313 (2003)
7. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (2001)
8. Bonifati, A., Cuzzocrea, A.: Storing and Retrieving XPath Fragments in Structured P2P Networks. Data & Knowledge Engineering 59(2), 247–269 (2006)
9. Comito, C., Patarin, S., Talia, D.: PARIS: A Peer-to-Peer Architecture for Large-Scale Semantic Data Integration. In: Moro, G., Bergamaschi, S., Joseph, S., Morin, J.-H., Ouksel, A.M. (eds.) DBISP2P 2005 and DBISP2P 2006. LNCS, vol. 4125, pp. 163–170. Springer, Heidelberg (2007)
10. Cooper, B.: Using Information Retrieval Techniques to Route Queries in an InfoBeacons Network. In: Ng, W.S., Ooi, B.-C., Ouksel, A.M., Sartori, C. (eds.) DBISP2P 2004. LNCS, vol. 3367, pp. 46–60. Springer, Heidelberg (2005)
11. Crespo, A., Garcia-Molina, H.: Routing Indices for Peer-to-Peer Systems. In: Proceedings of the 22nd IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 23–33 (2002)
12. Crespo, A., Garcia-Molina, H.: Semantic Overlay Networks for P2P Systems. In: Moro, G., Bergamaschi, S., Aberer, K. (eds.) AP2PC 2004. LNCS (LNAI), vol. 3601, pp. 1–13. Springer, Heidelberg (2005)
13. Cudré-Mauroux, P., Agarwal, S., Aberer, K.: GridVine: An Infrastructure for Peer Information Management. IEEE Internet Computing 11(5), 36–44 (2007)
14. Cuenca-Acuna, F., Peery, C., Martin, R., Nguyen, T.: PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities. In: Proceedings of the 12th International Symposium on High-Performance Distributed Computing (HPDC), pp. 236–249 (2003)
15. Do, H., Melnik, S., Rahm, E.: Comparison of Schema Matching Evaluations. In: Aksit, M., Mezini, M., Unland, R. (eds.) NODe 2002. LNCS, vol. 2591, pp. 221–237. Springer, Heidelberg (2003)
16. Do, H., Rahm, E.: COMA – A System for Flexible Combination of Schema Matching Approaches. In: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), pp. 610–621 (2002)
17. Doulkeridis, C., Nørvåg, K., Vazirgiannis, M.: DESENT: Decentralized and Distributed Semantic Overlay Generation in P2P Networks. IEEE Journal on Selected Areas in Communications 25(1), 25–34 (2007)
18. Fagin, R.: Combining Fuzzy Information: An Overview. SIGMOD Record 31(2), 109–118 (2002)
19. Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A., French, J.: Clustering Large Datasets in Arbitrary Metric Spaces. In: Proceedings of the 15th International Conference on Data Engineering (ICDE), pp. 502–511 (1999)
20. Haase, P., Siebes, R., van Harmelen, F.: Peer Selection in Peer-to-Peer Networks with Semantic Topologies. In: Proceedings of the 1st International Conference on Semantics of a Networked World (ICSNW), pp. 108–125 (2004)
21. Halevy, A., Ives, Z., Madhavan, J., Mork, P., Suciu, D., Tatarinov, I.: The Piazza Peer Data Management System. IEEE Transactions on Knowledge and Data Engineering 16(7), 787–798 (2004)
22. Halevy, A., Ives, Z., Mork, P., Tatarinov, I.: Piazza: Data Management Infrastructure for Semantic Web Applications. In: Proceedings of the 12th International World Wide Web Conference (WWW), pp. 556–567 (2003)
23. Halevy, A., Ives, Z., Suciu, D., Tatarinov, I.: Schema Mediation for Large-Scale Semantic Data Sharing. VLDB Journal 14(1), 68–83 (2005)
24. Hernández, M.A., Miller, R.J., Haas, L.M.: Clio: A Semi-Automatic Tool for Schema Mapping. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), p. 607 (2001)
25. Joseph, S.: NeuroGrid: Semantically Routing Queries in Peer-to-Peer Networks. In: Gregori, E., Cherkasova, L., Cugola, G., Panzieri, F., Picco, G.P. (eds.) NETWORKING 2002. LNCS, vol. 2376, pp. 202–214. Springer, Heidelberg (2002)
26. Klir, G.J., Yuan, B.: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall, Englewood Cliffs (1995)
27. Koloniari, G., Pitoura, E.: Content-Based Routing of Path Queries in Peer-to-Peer Systems. In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 29–47. Springer, Heidelberg (2004)
28. Li, M., Lee, W., Sivasubramaniam, A.: Semantic Small World: An Overlay Network for Peer-to-Peer Search. In: Proceedings of the 12th IEEE International Conference on Network Protocols (ICNP), pp. 228–238 (2004)
29. Linari, A., Weikum, G.: Efficient Peer-to-Peer Semantic Overlay Networks Based on Statistical Language Models. In: Proceedings of the Information Retrieval in Peer-to-Peer Networks Workshop (P2PIR), in conjunction with the 15th ACM Conference on Information and Knowledge Management (CIKM), pp. 9–16 (2006)
30. Lodi, S., Mandreoli, F., Martoglia, R., Penzo, W., Sassatelli, S.: Semantic Peer, Here are the Neighbors You Want! In: Proceedings of the 11th International Conference on Extending Database Technology (EDBT), pp. 26–37 (2008)
31. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic Schema Matching with Cupid. In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pp. 49–58 (2001)
32. Mandreoli, F., Martoglia, R., Penzo, W., Sassatelli, S.: SRI: Exploiting Semantic Information for Effective Query Routing in a PDMS. In: Proceedings of the 8th ACM International Workshop on Web Information and Data Management (WIDM), in conjunction with the 15th ACM Conference on Information and Knowledge Management (CIKM), pp. 19–26 (2006)
33. Mandreoli, F., Martoglia, R., Penzo, W., Sassatelli, S.: Data-Sharing P2P Networks with Semantic Approximation Capabilities. IEEE Internet Computing 13(5), 60–70 (2009)
34. Mandreoli, F., Martoglia, R., Penzo, W., Sassatelli, S., Villani, G.: SRI@work: Efficient and Effective Routing Strategies in a PDMS. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., Godart, C. (eds.) WISE 2007. LNCS, vol. 4831, pp. 285–297. Springer, Heidelberg (2007)
35. Mandreoli, F., Martoglia, R., Penzo, W., Sassatelli, S., Villani, G.: SUNRISE: Exploring PDMS Networks with Semantic Routing Indexes. In: Proceedings of the 4th European Semantic Web Conference (ESWC) (2007)
36. Mandreoli, F., Martoglia, R., Penzo, W., Sassatelli, S., Villani, G.: Building a PDMS Infrastructure for XML Data Sharing with SUNRISE. In: Proceedings of DATAX, in conjunction with EDBT (2008)
37. Mandreoli, F., Martoglia, R., Ronchetti, E.: Versatile Structural Disambiguation for Semantic-aware Applications. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM), pp. 209–216 (2005)
38. Mandreoli, F., Martoglia, R., Ronchetti, E.: STRIDER: A Versatile System for Structural Disambiguation. In: Ioannidis, Y., Scholl, M.H., Schmidt, J.W., Matthes, F., Hatzopoulos, M., Böhm, K., Kemper, A., Grust, T., Böhm, C. (eds.) EDBT 2006. LNCS, vol. 3896, pp. 1194–1197. Springer, Heidelberg (2006)
39. Mandreoli, F., Martoglia, R., Tiberio, P.: Approximate Query Answering for a Heterogeneous XML Document Base. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds.) WISE 2004. LNCS, vol. 3306, pp. 337–351. Springer, Heidelberg (2004)
40. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching. In: Proceedings of the 18th International Conference on Data Engineering (ICDE), pp. 117–128 (2002)
41. Michel, S., Bender, M., Triantafillou, P., Weikum, G.: IQN Routing: Integrating Quality and Novelty in P2P Querying and Ranking. In: Ioannidis, Y., Scholl, M.H., Schmidt, J.W., Matthes, F., Hatzopoulos, M., Böhm, K., Kemper, A., Grust, T., Böhm, C. (eds.) EDBT 2006. LNCS, vol. 3896, pp. 149–166. Springer, Heidelberg (2006)
42. Miller, R., Haas, L., Hernández, M.: Schema Mapping as Query Discovery. In: Proceedings of the 26th International Conference on Very Large Data Bases (VLDB), pp. 77–88 (2000)
43. Nejdl, W., Wolpers, M., Siberski, W., Schmitz, C., Schlosser, M.T., Brunkhorst, I., Löser, A.: Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-to-Peer Networks. Journal of Web Semantics 1(2), 177–186 (2004)
44. Nejdl, W., Wolpers, M., Siberski, W., Schmitz, C., Schlosser, M., Brunkhorst, I., Löser, A.: Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-to-Peer Networks. In: Proceedings of the 12th World Wide Web Conference (WWW), pp. 536–543 (2003)
45. Parreira, J., Michel, S., Weikum, G.: P2PDating: Real Life Inspired Semantic Overlay Networks for Web Search. Information Processing and Management 43(3), 643–664 (2007)
46. Penzo, W.: Rewriting Rules to Permeate Complex Similarity and Fuzzy Queries within a Relational Database System. IEEE Transactions on Knowledge and Data Engineering 17(2), 255–270 (2005)
47. Rao, P., Moon, B.: An Internet-Scale Service for Publishing and Locating XML Documents. In: Proceedings of the 25th International Conference on Data Engineering (ICDE), pp. 1459–1462 (2009)
48. Sartiani, C., Manghi, P., Ghelli, G., Conforti, G.: XPeer: A Self-Organizing XML P2P Database System. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds.) EDBT 2004. LNCS, vol. 3268, pp. 456–465. Springer, Heidelberg (2004)
49. Stoica, I., Morris, R., Karger, D., Kaashoek, M., Balakrishnan, H.: Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In: Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures and Protocols for Computer Communication (SIGCOMM), pp. 149–160 (2001)
50. Tatarinov, I., Halevy, A.: Efficient Query Reformulation in Peer Data Management Systems. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 539–550 (2004)
51. Tempich, C., Staab, S., Wranik, A.: REMINDIN': Semantic Query Routing in Peer-to-Peer Networks Based on Social Metaphors. In: Proceedings of the 13th International Conference on World Wide Web (WWW), pp. 640–649 (2004)
52. Triantafillou, P., Xiruhaki, C., Koubarakis, M., Ntarmos, N.: Towards High Performance Peer-to-Peer Content and Resource Sharing Systems. In: Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR) (2003)
53. Winter, J., Drobnik, O.: SPIRIX: A Peer-to-Peer Search Engine for XML-Retrieval. In: Advances in Focused Retrieval, pp. 237–242 (2009)
54. Yang, B., Garcia-Molina, H.: Improving Search in Peer-to-Peer Networks. In: Proceedings of the 22nd IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 5–14 (2002)
55. Yu, C., Jagadish, H.: Schema Summarization. In: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), pp. 319–330 (2006)


Author Index

Alhajj, Reda 55, 165
Alimohamed, Yasin 55
Barker, Ken 165
Calado, Pável 193
Cavalcanti, Rafael T. 291
Chan, Allan 165
Cruz, Adriano J. de O. 291
Dollmann, Thorsten 227
Fazzinga, Bettina 107
Goncalves, Marlene 133
Herschel, Melanie 193
Hunter, Anthony 259
Jida, Jamal 55, 165
Jiwani, Alnaar 55
Keijzer, Ander de 79
Kianmehr, Keivan 55, 165
Leitão, Luís 193
Liu, Jian 35
Liu, Weiru 259
Lo, Anthony 55
Ma, Jianbing 259
Ma, Z.M. 35
Mandreoli, Federica 315
Martoglia, Riccardo 315
Oliboni, Barbara 3
Özyer, Tansel 55
Penzo, Wilma 315
Pozzani, Gabriele 3
Rodrigues, Raquel D. 291
Rokne, Jon 165
Sassatelli, Simona 315
Situ, Nancy 165
Spence, Krista 55
Thomas, Oliver 227
Tineo, Leonid 133
Villani, Giorgio 315
Wong, Kim 165
Yan, Li 35
Zhang, Weiya 259

