ETPL DM - 001
K Nearest Neighbour Joins for Big Data on MapReduce: a Theoretical and Experimental Analysis
Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a computationally intensive task with a large range of applications such as knowledge discovery or data mining. However, as the volume and the dimension of data increase, only distributed approaches can perform such a costly operation in a reasonable time. Recent works have focused on implementing efficient solutions using the MapReduce programming model because it is suitable for distributed large scale data processing. Although these works provide different solutions to the same problem, each one has particular constraints and properties. In this paper, we compare the different existing approaches for computing kNN on MapReduce, first theoretically, and then by performing an extensive experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN computation on MapReduce: data pre-processing, data partitioning and computation. We then analyze each step from the load balancing, accuracy and complexity aspects. Experiments in this paper use a variety of datasets, and analyze the impact of data volume, data dimension and the value of k from many perspectives like time and space complexity, and accuracy. The experimental evaluation reveals new advantages and shortcomings that are discussed for each algorithm. To the best of our knowledge, this is the first paper that compares kNN computing methods on MapReduce both theoretically and experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based practical problems in the context of big data.
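The three generic steps above (pre-processing, partitioning, computation) can be sketched as a single in-process simulation of a MapReduce kNN join. This is an illustrative sketch only, not any of the surveyed algorithms: the round-robin partitioning, the function name, and the squared-Euclidean distance are all assumptions, and the "map" and "reduce" phases are simulated with plain loops.

```python
import heapq
from collections import defaultdict

def knn_join_mapreduce(queries, data, k, num_partitions=3):
    """Simulate the generic kNN-join pattern: partition the reference set S,
    compute local kNN per partition (the map phase), then merge candidate
    lists per query point into the global kNN (the reduce phase)."""
    # Partitioning step: split S across "workers" (round-robin here).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    intermediate = defaultdict(list)
    for part in partitions:                      # map phase, one task per partition
        for qid, q in enumerate(queries):
            local = heapq.nsmallest(
                k, part, key=lambda s: sum((a - b) ** 2 for a, b in zip(q, s)))
            for s in local:                      # emit (query_id, (dist, point))
                dist = sum((a - b) ** 2 for a, b in zip(q, s))
                intermediate[qid].append((dist, s))
    # Reduce phase: merge the per-partition candidates into the global kNN.
    return {qid: [s for _, s in heapq.nsmallest(k, cands)]
            for qid, cands in intermediate.items()}
```

The key property the sketch illustrates is that each map task only needs its own partition of S, and correctness is restored in the reduce step because every partition contributes its local top-k candidates.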
ETPL DM - 002
Efficient Algorithms for Mining Top-K High Utility Itemsets
High utility itemset (HUI) mining is an emerging topic in data mining, which refers to discovering all itemsets having a utility meeting a user-specified minimum utility threshold min_util. However, setting min_util appropriately is a difficult problem for users. Generally speaking, finding an appropriate minimum utility threshold by trial and error is a tedious process for users. If min_util is set too low, too many HUIs will be generated, which may cause the mining process to be very inefficient. On the other hand, if min_util is set too high, it is likely that no HUIs will be found. In this paper, we address the above issues by proposing a new framework for top-k high utility itemset mining, where k is the desired number of HUIs to be mined. Two types of efficient algorithms named TKU (mining Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase) are proposed for mining such itemsets without the need to set min_util. We provide a structural comparison of the two algorithms with discussions on their advantages and limitations. Empirical evaluations on both real and synthetic datasets show that the performance of the proposed algorithms is close to that of the optimal case of state-of-the-art utility mining algorithms.
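The core idea shared by top-k miners such as TKU and TKO — maintain the k best itemsets seen so far and raise an internal min_util border as candidates arrive — can be shown with a deliberately naive enumerator. This is a minimal sketch, not the paper's algorithms: it enumerates all itemsets (real miners prune using utility upper bounds against the raised border), and the function name and transaction encoding (dict of item to utility contribution) are assumptions.

```python
import heapq
from itertools import combinations

def top_k_hui(transactions, k):
    """Naive top-k high utility itemset miner: keep a min-heap of the k
    best utilities and raise the internal border threshold as it fills.
    Each transaction maps item -> utility contribution."""
    items = sorted({i for t in transactions for i in t})
    heap = []                  # min-heap of (utility, itemset)
    min_util = 0               # internal border, raised as the heap fills
    for r in range(1, len(items) + 1):
        for itemset in combinations(items, r):
            # Exact utility: sum contributions over covering transactions.
            u = sum(sum(t[i] for i in itemset)
                    for t in transactions if all(i in t for i in itemset))
            if len(heap) < k:
                heapq.heappush(heap, (u, itemset))
            elif u > min_util:
                heapq.heapreplace(heap, (u, itemset))
            if len(heap) == k:
                min_util = heap[0][0]   # border raising: prune hook in real miners
    return sorted(heap, reverse=True)
```

Note that utility is not anti-monotone (a superset can have higher utility than its subsets), which is exactly why the real algorithms need dedicated upper-bound structures rather than Apriori-style pruning.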
ETPL DM - 003
Mining User-Aware Rare Sequential Topic Patterns in Document Streams
Textual documents created and distributed on the Internet are ever changing in various forms. Most existing works are devoted to topic modelling and the evolution of individual topics, while sequential relations of topics in successive documents published by a specific user are ignored. In this paper, in order to characterize and detect personalized and abnormal behaviours of Internet users, we propose Sequential Topic Patterns (STPs) and formulate the problem of mining User-aware Rare Sequential Topic Patterns (URSTPs) in document streams on the Internet. They are rare on the whole but relatively frequent for specific users, so they can be applied in many real-life scenarios, such as real-time monitoring of abnormal user behaviours. We present a group of algorithms to solve this innovative mining problem through three phases: pre-processing to extract probabilistic topics and identify sessions for different users, generating all the STP candidates with (expected) support values for each user by pattern-growth, and selecting URSTPs by making user-aware rarity analysis on derived STPs. Experiments on both real (Twitter) and synthetic datasets show that our approach can indeed discover special users and interpretable URSTPs effectively and efficiently, which significantly reflect users' characteristics.
ETPL DM - 004
Pattern Based Sequence Classification
Sequence classification is an important task in data mining. We address the problem of sequence classification using rules composed of interesting patterns found in a dataset of labelled sequences and accompanying class labels. We measure the interestingness of a pattern in a given class of sequences by combining the cohesion and the support of the pattern. We use the discovered patterns to generate confident classification rules, and present two different ways of building a classifier. The first classifier is based on an improved version of the existing method of classification based on association rules, while the second ranks the rules by first measuring their value specific to the new data object. Experimental results show that our rule based classifiers outperform existing comparable classifiers in terms of accuracy and stability. Additionally, we test a number of pattern feature based models that use different kinds of patterns as features to represent each sequence as a feature vector. We then apply a variety of machine learning algorithms for sequence classification, experimentally demonstrating that the patterns we discover represent the sequences well, and prove effective for the classification task.
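The interestingness measure described above — combining a pattern's support with its cohesion — can be illustrated with a simplified rendering: cohesion here is pattern size divided by the mean length of the shortest windows covering the pattern's items, and interestingness is support times cohesion. This is an assumption-laden sketch of the general idea, not the paper's exact definitions, and the function names are invented for illustration.

```python
def min_window(seq, items):
    """Length of the shortest contiguous window of seq containing all items,
    or None if seq lacks some item (brute force, fine for a sketch)."""
    if not set(items) <= set(seq):
        return None
    best = None
    for i in range(len(seq)):
        need = set(items)
        for j in range(i, len(seq)):
            need.discard(seq[j])
            if not need:
                w = j - i + 1
                best = w if best is None else min(best, w)
                break
    return best

def interestingness(items, sequences):
    """Interestingness = support * cohesion. Cohesion rewards patterns
    whose items occur tightly together in the sequences that contain them."""
    windows = [w for w in (min_window(s, items) for s in sequences) if w]
    if not windows:
        return 0.0
    support = len(windows) / len(sequences)
    cohesion = len(items) / (sum(windows) / len(windows))
    return support * cohesion
```

A pattern whose items always appear adjacently gets cohesion 1; scattered items shrink it, so frequent-but-diffuse patterns score lower than tight ones of equal support.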
ETPL DM - 005
ATD: Anomalous Topic Discovery in High Dimensional Discrete Data
We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies; i.e., sets of points which collectively exhibit abnormal patterns. In many applications this can lead to better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns are exhibited on only a small (salient) subset of the very high dimensional feature space. Individual AD techniques and techniques that detect anomalies using all the features typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and salient features (words) under each such topic in a synthetic data set and two real-world text corpora and achieves better performance compared to both standard group AD and individual AD techniques. All required code to reproduce our experiments is available from https://github.com/hsoleimani/ATD.
ETPL DM - 006
Crowd sourced Data Management: A Survey
Some important data management and analytics tasks cannot be completely addressed by automated processes. These "computer-hard" tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Human Computation is an effective way to address such tasks by harnessing the capabilities of crowd workers (i.e., the crowd). Thus, crowdsourced data management has become an area of increasing interest in research and industry. There are three important problems in crowdsourced data management. (1) Quality Control: Workers may return noisy results and effective techniques are required to achieve high quality; (2) Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost; (3) Latency Control: The human workers can be slow, particularly in contrast to computing time scales, so latency control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies on crowdsourced data management. Based on this analysis we then outline key factors that need to be considered to improve crowdsourced data management.
ETPL DM - 007
A Survey of General-Purpose Crowdsourcing Techniques
Since Jeff Howe introduced the term Crowdsourcing in 2006, this human-powered problem-solving paradigm has gained a lot of attention and has been a hot research topic in the field of Computer Science. Even though a lot of work has been conducted on this topic, so far we do not have a comprehensive survey of the most relevant work done in the crowdsourcing field. In this paper, we aim to offer an overall picture of the current state of the art techniques in general-purpose crowdsourcing. According to their focus, we divide this work into three parts, which are: incentive design, task assignment and quality control. For each part, we start with different problems faced in that area followed by a brief description of existing work and a discussion of pros and cons. In addition, we also present a real scenario on how the different techniques are used in implementing a location-based crowdsourcing platform, gMission. Finally, we highlight the limitations of the current general-purpose crowdsourcing techniques and present some open problems in this area.
ETPL DM - 008
Mining Health Examination Records — A Graph-based Approach
General health examination is an integral part of healthcare in many countries. Identifying the participants at risk is important for early warning and preventive intervention. The fundamental challenge of learning a classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected dataset. Particularly, the unlabeled data describes the participants in health examinations whose health conditions can vary greatly from healthy to very-ill. There is no ground truth for differentiating their states of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health (Semi-supervised Heterogeneous Graph on Health) for risk predictions to classify a progressively developing situation with the majority of the data unlabeled. An efficient iterative algorithm is designed and the proof of convergence is given. Extensive experiments based on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.
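The graph-based semi-supervised setting described above — a few labeled participants, a majority unlabeled, and classification driven by graph structure — can be illustrated with plain label propagation. This is a generic stand-in for the family of methods SHG-Health belongs to, not the SHG-Health algorithm itself; the function name, the clamping of seed labels, and the uniform neighbour averaging are all assumptions of the sketch.

```python
def propagate_labels(adj, labels, iters=50):
    """Label propagation on a graph: seed nodes keep their class scores,
    unlabeled nodes repeatedly take the average of their neighbours'
    scores, and each node is finally assigned its highest-scoring class.
    adj maps node -> list of neighbours; labels maps seed node -> class."""
    nodes = sorted(adj)
    classes = sorted(set(labels.values()))
    score = {n: [1.0 if labels.get(n) == c else 0.0 for c in classes]
             for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            nbrs = adj[n]
            if n in labels or not nbrs:     # clamp seeds; isolated nodes keep scores
                new[n] = score[n]
                continue
            new[n] = [sum(score[m][i] for m in nbrs) / len(nbrs)
                      for i in range(len(classes))]
        score = new
    return {n: classes[max(range(len(classes)), key=lambda i: score[n][i])]
            for n in nodes}
```

The sketch shows why such methods suit health examination data: unlabeled participants receive risk estimates purely through their graph connections to the few labeled cases.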
ETPL DM - 009
TopicSketch: Real-time Bursty Topic Detection from Twitter
Twitter has become one of the largest microblogging platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short period of time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research work on topic modelling and analysis in Twitter, it remains a challenge to detect bursty topics in real-time. As existing methods can hardly scale to handle the task with the tweet stream in real-time, we propose in this paper TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experiment results show both the efficiency and effectiveness of our approach. In particular, we demonstrate that TopicSketch on a single machine can potentially handle hundreds of millions of tweets per day, which is on the same scale as the total number of daily tweets in Twitter, and present bursty events at finer granularity.
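The "sketch" in sketch-based models like the one above refers to fixed-size probabilistic summaries of an unbounded stream. A count-min sketch is the classic example of this family and conveys the key trade-off, constant memory in exchange for bounded overestimation; it is an illustrative stand-in here, not the paper's exact data structure, and the md5-based hashing is an assumption for the sake of a self-contained example.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency sketch: memory stays constant regardless of
    stream length, and estimates can only overestimate the true count."""
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        # One salted hash per row; md5 used only for portability.
        for d in range(self.depth):
            h = hashlib.md5(f"{d}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, item, count=1):
        for d, c in enumerate(self._cols(item)):
            self.table[d][c] += count

    def estimate(self, item):
        # Minimum across rows; never below the true count.
        return min(self.table[d][c] for d, c in enumerate(self._cols(item)))
```

A burst detector built on such a sketch watches for items whose estimated counts accelerate sharply within a short window, rather than storing every tweet.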
ETPL DM - 010
SPIRIT: A Tree Kernel-based Method for Topic Person Interaction Detection
The development of a topic in a set of topic documents is constituted by a series of person interactions at a specific time and place. Knowing the interactions of the persons mentioned in these documents helps readers to better comprehend the documents. In this paper, we propose a topic person interaction detection method called SPIRIT, which classifies the text segments in a set of topic documents that convey person interactions. We design the rich interactive tree structure to represent syntactic, contextual, and semantic information of text, and this structure is incorporated into a tree-based convolution kernel to identify interactive segments. Experimental results based on real world topics demonstrate that the proposed rich interactive tree structure effectively detects the topic person interactions and that our method outperforms many well-known relation extraction and protein-protein interaction methods.
ETPL DM - 011
Truth Discovery in Crowdsourced Detection of Spatial Events
The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the detection of spatial events when smartphone users move around in their daily lives. However, the credibility of those detected events can be negatively impacted by unreliable participants with low-quality data. Consequently, a major challenge in mobile crowdsourcing is truth discovery, i.e., to discover true events from diverse and noisy participants' reports. This problem is uniquely distinct from its online counterpart in that it involves uncertainties in both participants' mobility and reliability. Decoupling these two types of uncertainties through location tracking will raise severe privacy and energy issues, whereas simply ignoring missing reports or treating them as negative reports will significantly degrade the accuracy of truth discovery. In this paper, we propose two new unsupervised models, i.e., Truth finder for Spatial Events (TSE) and Personalized Truth finder for Spatial Events (PTSE), to tackle this problem. In TSE, we model location popularity, location visit indicators, truths of events, and three-way participant reliability in a unified framework. In PTSE, we further model personal location visit tendencies. These proposed models are capable of effectively handling various types of uncertainties and automatically discovering truths without any supervision or location tracking. Experimental results on both real-world and synthetic datasets demonstrate that our proposed models outperform existing state-of-the-art truth discovery approaches in the mobile crowdsourcing environment.
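The generic truth-discovery loop underlying this line of work alternates two estimates: event truth from reliability-weighted votes, and participant reliability from agreement with the current truths. The sketch below shows only that generic loop; it deliberately ignores the mobility and location-visit modelling that distinguishes TSE and PTSE, and the report encoding (source, event, boolean claim) and smoothing constants are assumptions.

```python
def discover_truths(reports, iters=20):
    """Iterative truth discovery: weighted voting per event, then
    re-estimation of each source's reliability from its agreement rate.
    reports is a list of (source, event, claimed_truth) triples."""
    sources = {s for s, _, _ in reports}
    events = {e for _, e, _ in reports}
    rel = {s: 0.8 for s in sources}       # optimistic initial reliability
    truth = {}
    for _ in range(iters):
        for e in events:                   # truth step: weighted majority vote
            yes = sum(rel[s] for s, ev, v in reports if ev == e and v)
            no = sum(rel[s] for s, ev, v in reports if ev == e and not v)
            truth[e] = yes >= no
        for s in sources:                  # reliability step: smoothed accuracy
            own = [(e, v) for src, e, v in reports if src == s]
            agree = sum(1 for e, v in own if truth[e] == v)
            rel[s] = (agree + 1) / (len(own) + 2)
    return truth, rel
```

With two consistent sources and one contrarian, the loop both recovers the majority truths and drives the contrarian's reliability down, which is the basic mechanism the paper's richer models build on.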
ETPL DM - 012
Graph Regularized Feature Selection with Data Reconstruction
Feature selection is a challenging problem for high dimensional data processing, which arises in many real applications such as data mining, information retrieval, and pattern recognition. In this paper, we study the problem of unsupervised feature selection. The problem is challenging due to the lack of label information to guide feature selection. We formulate the problem of unsupervised feature selection from the viewpoint of graph regularized data reconstruction. The underlying idea is that the selected features not only preserve the local structure of the original data space via graph regularization, but also approximately reconstruct each data point via linear combination. Therefore, the graph regularized data reconstruction error becomes a natural criterion for measuring the quality of the selected features. By minimizing the reconstruction error, we are able to select the features that best preserve both the similarity and discriminant information in the original data. We then develop an efficient gradient algorithm to solve the corresponding optimization problem. We evaluate the performance of our proposed algorithm on text clustering. The extensive experiments demonstrate the effectiveness of our proposed approach.
ETPL DM - 013
Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks
The last few years have witnessed the emergence and evolution of a vibrant research stream on a large variety of online social media network (SMN) platforms. Recognizing anonymous, yet identical users among multiple SMNs is still an intractable problem. Clearly, cross-platform exploration may help solve many problems in social computing in both theory and applications. Since public profiles can be duplicated and easily impersonated by users with different purposes, most current user identification resolutions, which mainly focus on text mining of users' public profiles, are fragile. Some studies have attempted to match users based on the location and timing of user content as well as writing style. However, the locations are sparse in the majority of SMNs, and writing style is difficult to discern from the short sentences of leading SMNs such as Sina Microblog and Twitter. Moreover, since online SMNs are quite symmetric, existing user identification schemes based on network structure are not effective. The real-world friend cycle is highly individual and virtually no two users share a congruent friend cycle. Therefore, it is more accurate to use a friendship structure to analyze cross-platform SMNs. Since identical users tend to set up partially similar friendship structures in different SMNs, we propose the Friend Relationship-Based User Identification (FRUI) algorithm. FRUI calculates a match degree for all candidate User Matched Pairs (UMPs), and only UMPs with top ranks are considered as identical users. We also develop two propositions to improve the efficiency of the algorithm. Results of extensive experiments demonstrate that FRUI performs much better than current network structure-based algorithms.
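The friendship-based matching idea can be sketched as a greedy loop: score each candidate pair by how many of their friends are already identified as the same person, match the best pair, and repeat so new matches propagate. This is an illustrative simplification of the FRUI approach, not the paper's algorithm; the function name, the seed format, and the greedy one-pair-at-a-time strategy are assumptions.

```python
def match_users(net_a, net_b, seeds):
    """Greedy friendship-based user matching across two networks.
    net_a/net_b map user -> set of friends; seeds are known identical
    (a_user, b_user) pairs that bootstrap the process."""
    matched = dict(seeds)                 # a-user -> b-user
    matched_b = set(matched.values())
    while True:
        best, best_deg = None, 0
        for ua, fa in net_a.items():
            if ua in matched:
                continue
            for ub, fb in net_b.items():
                if ub in matched_b:
                    continue
                # Match degree: friends of ua already identified as friends of ub.
                deg = sum(1 for f in fa if matched.get(f) in fb)
                if deg > best_deg:
                    best, best_deg = (ua, ub), deg
        if best is None:                  # no remaining pair has any support
            break
        matched[best[0]] = best[1]
        matched_b.add(best[1])
    return matched
```

Each accepted match strengthens the evidence for neighbouring pairs, which is why a handful of seeds can identify users across otherwise anonymous networks.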
ETPL DM - 014
TaxoFinder: A Graph-Based Approach for Taxonomy Learning
Taxonomy learning is an important task for knowledge acquisition, sharing, and classification as well as application development and utilization in various domains. To reduce human effort to build a taxonomy from scratch and improve the quality of the learned taxonomy, we propose a new taxonomy learning approach, named TaxoFinder. TaxoFinder takes three steps to automatically build a taxonomy. First, it identifies domain-specific concepts from a domain text corpus. Second, it builds a graph representing how such concepts are associated together based on their co-occurrences. As the key method in TaxoFinder, we propose a method for measuring associative strengths among the concepts, which quantify how strongly they are associated in the graph, using similarities between sentences and spatial distances between sentences. Lastly, TaxoFinder induces a taxonomy from the graph using a graph analytic algorithm. TaxoFinder aims to build a taxonomy in such a way that it maximizes the overall associative strengths among the concepts in the graph. We evaluate TaxoFinder using gold-standard evaluation on three different domains: emergency management for mass gatherings, autism research, and disease domains. In our evaluation, we compare TaxoFinder with a state-of-the-art subsumption method and show that TaxoFinder is an effective approach significantly outperforming the subsumption method.
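The second step above, building a concept graph whose edge weights reflect how strongly concepts are associated, can be illustrated with a toy corpus. The sketch uses an assumed decay of 1/(1+distance) between the sentence positions of two concept occurrences; the paper's actual associative-strength measure also uses sentence similarities, which are omitted here, and the function name and substring-based concept spotting are simplifications.

```python
from itertools import combinations
from collections import defaultdict

def concept_graph(sentences, concepts):
    """Build a weighted concept association graph: each co-occurrence of
    two concepts contributes weight that decays with the distance (in
    sentences) between their occurrences."""
    occur = defaultdict(list)          # concept -> sentence indices
    for i, sent in enumerate(sentences):
        for c in concepts:
            if c in sent:              # crude concept spotting for the sketch
                occur[c].append(i)
    graph = defaultdict(float)
    for a, b in combinations(sorted(occur), 2):
        for i in occur[a]:
            for j in occur[b]:
                graph[(a, b)] += 1.0 / (1 + abs(i - j))   # assumed decay
    return dict(graph)
```

Concepts mentioned in the same or adjacent sentences accumulate heavier edges, and a taxonomy-induction step (a graph algorithm such as a maximum spanning tree) would then keep the strongest associations.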
ETPL DM - 015
Clustering Data Streams Based on Shared Density between MicroClusters
As more and more applications produce streaming data, clustering data streams has become an important technique for data and knowledge engineering. A typical approach is to summarize the data stream in real-time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-clusters are used as pseudo points with the density estimates used as their weights. However, information about density in the area between micro-clusters is not preserved in the online process and reclustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on actual density between adjacent micro-clusters. We discuss the space and time complexity of maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets highlight that using shared density improves clustering quality over other popular data stream clustering methods which require the creation of a larger number of smaller micro-clusters to achieve comparable results.
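The shared-density idea can be shown with a tiny online step: points within radius r of an existing micro-cluster increase its weight, and a point falling in the overlap of two micro-clusters increments a shared-density edge between them. This is an illustrative sketch only, not DBSTREAM itself: it omits the time-based fading of weights, center adjustment, and cleanup that the real algorithm performs, and the class and method names are invented.

```python
import math

class SharedDensityClusterer:
    """Minimal online step of a shared-density stream clusterer."""
    def __init__(self, r):
        self.r = r
        self.centers = []          # micro-cluster centers
        self.weights = []          # per-micro-cluster weights
        self.shared = {}           # (i, j) -> shared density count

    def learn_one(self, p):
        hits = [i for i, c in enumerate(self.centers)
                if math.dist(c, p) <= self.r]
        if not hits:               # point opens a new micro-cluster
            self.centers.append(p)
            self.weights.append(1)
            return
        for i in hits:
            self.weights[i] += 1
        # Point in an overlap region: bump the shared-density edge(s).
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                e = (hits[a], hits[b])
                self.shared[e] = self.shared.get(e, 0) + 1
```

The offline reclustering step can then merge exactly those micro-cluster pairs whose shared-density edge is strong, instead of guessing from center distances alone.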
ETPL DM - 016
NATERGM: A Model for Examining the Role of Nodal Attributes in Dynamic Social Media Networks
Social media networks are dynamic. As such, the order in which network ties develop is an important aspect of the network dynamics. This study proposes a novel dynamic network model, the Nodal Attribute-based Temporal Exponential Random Graph Model (NATERGM) for dynamic network analysis. The proposed model focuses on how the nodal attributes of a network affect the order in which the network ties develop. Temporal patterns in social media networks are modeled based on the nodal attributes of individuals and the time information of network ties. Using social media data collected from a knowledge sharing community, empirical tests were conducted to evaluate the performance of the NATERGM on identifying the temporal patterns and predicting the characteristics of the future networks. Results showed that the NATERGM demonstrated an enhanced pattern testing capability and an increased prediction accuracy of network characteristics compared to benchmark models. The proposed NATERGM model helps explain the roles of nodal attributes in the formation process of dynamic networks.
ETPL DM - 017
Quality-Aware Subgraph Matching Over Inconsistent Probabilistic Graph Databases
Resource Description Framework (RDF) has been widely used in the Semantic Web to describe resources and their relationships. The RDF graph is one of the most commonly used representations for RDF data. However, in many real applications such as the data extraction/integration, RDF graphs integrated from different data sources may often contain uncertain and inconsistent information (e.g., uncertain labels or ones that violate facts/rules), due to the unreliability of data sources. In this paper, we formalize the RDF data by inconsistent probabilistic RDF graphs, which contain both inconsistencies and uncertainty. With such a probabilistic graph model, we focus on an important problem, quality-aware subgraph matching over inconsistent probabilistic RDF graphs (QA-gMatch), which retrieves subgraphs from inconsistent probabilistic RDF graphs that are isomorphic to a given query graph and with high quality scores (considering both consistency and uncertainty). In order to efficiently answer QA-gMatch queries, we provide two effective pruning methods, namely adaptive label pruning and quality score pruning, which can greatly filter out false alarms of subgraphs. We also design an effective index to facilitate our proposed pruning methods, and propose an efficient approach for processing QA-gMatch queries. Finally, we demonstrate the efficiency and effectiveness of our proposed approaches through extensive experiments.
ETPL DM - 018
Joint Structure Feature Exploration and Regularization for Multi-Task Graph Classification
Graph classification aims to learn models to classify structured data. To date, all existing graph classification methods are designed to target one single learning task and require a large number of labeled samples for learning good classification models. In reality, each real-world task may only have a limited number of labeled samples, yet multiple similar learning tasks can provide useful knowledge to benefit all tasks as a whole. In this paper, we formulate a new multi-task graph classification (MTG) problem, where multiple graph classification tasks are jointly regularized to find discriminative subgraphs shared by all tasks for learning. The niche of MTG stems from the fact that with a limited number of training samples, subgraph features selected for one single graph classification task tend to overfit the training data. By using additional tasks as evaluation sets, MTG can jointly regularize multiple tasks to explore high quality subgraph features for graph classification. To achieve this goal, we formulate an objective function which combines multiple graph classification tasks to evaluate the informativeness score of a subgraph feature. An iterative subgraph feature exploration and multi-task learning process is further proposed to incrementally select subgraph features for graph classification. Experiments on real-world multi-task graph classification datasets demonstrate significant performance gain.
ETPL DM - 020
Semantic-Aware Blocking for Entity Resolution
In this paper, we propose a semantic-aware blocking framework for entity resolution (ER). The proposed framework is built using locality-sensitive hashing (LSH) techniques, which efficiently unifies both textual and semantic features into an ER blocking process. In order to understand how similarity metrics may affect the effectiveness of ER blocking, we study the robustness of similarity metrics and their properties in terms of LSH families. Then, we present how the semantic similarity of records can be captured, measured, and integrated with LSH techniques over multiple similarity spaces. In doing so, the proposed framework can support efficient similarity searches on records in both textual and semantic similarity spaces, yielding ER blocking with improved quality. We have evaluated the proposed framework over two real-world data sets, and compared it with the state-of-the-art blocking techniques. Our experimental study shows that the combination of semantic similarity and textual similarity can considerably improve the quality of blocking. Furthermore, due to the probabilistic nature of LSH, this semantic-aware blocking framework enables us to build fast and reliable blocking for performing entity resolution tasks in a large-scale data environment.
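The LSH blocking mechanism at the heart of the framework can be sketched with MinHash banding: records whose signatures agree on any band land in the same block and become comparison candidates, avoiding all-pairs comparison. This is a generic illustration of LSH-based blocking over a single (textual) similarity space, not the paper's multi-space framework; the md5-based hashing, band/row counts, and function names are assumptions.

```python
import hashlib
from collections import defaultdict

def minhash(tokens, num_hashes=16):
    """MinHash signature of a token set: one salted hash per position."""
    return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for i in range(num_hashes)]

def lsh_blocks(records, bands=4, rows=4):
    """Group record ids into candidate blocks: records whose MinHash
    signatures agree on any band share a bucket, so similar records
    are compared without examining all pairs.
    records maps record id -> set of tokens."""
    buckets = defaultdict(set)
    for rid, tokens in records.items():
        sig = minhash(tokens, bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rid)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Tuning bands and rows trades recall against block size, which is the probabilistic knob the abstract alludes to when it calls LSH-based blocking fast and reliable at scale.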
ETPL DM - 021
On Learning of Choice Models with Interactive Attributes
Introducing recent advances in machine learning techniques to state-of-the-art discrete choice models, we develop an approach to infer the unique and complex decision-making process of a decision-maker (DM), which is characterized by the DM's priorities and attitudinal character, along with interactions among attributes, among other factors. On the basis of exemplary preference information in the form of pairwise comparisons of alternatives, our method seeks to induce a DM's preference model in terms of the parameters of recent discrete choice models. To this end, we reduce our learning task to a constrained non-linear optimization problem. Our learning approach is a simple one that takes into consideration the interaction among the attributes along with the priorities and the unique attitudinal character of a DM. The experimental results on standard benchmark datasets suggest that our approach is not only intuitively appealing and easily interpretable but also competitive to state-of-the-art methods.
ETPL DM - 022
OMASS: One Memory Access Set Separation
In many applications, there is a need to identify to which of a group of sets an element x belongs, if any. For example, in a router, this functionality can be used to determine the next hop of an incoming packet. This problem is generally known as set separation and has been widely studied. Most existing solutions make use of hash-based algorithms, particularly when a small percentage of false positives is allowed. A known approach is to use a collection of Bloom filters in parallel. Such schemes can require several memory accesses, a significant limitation for some implementations. We propose an approach using Block Bloom Filters, where each element is first hashed to a single memory block that stores a small Bloom filter that tracks the element and the set or sets the element belongs to. In a naïve solution, when an element x in a set S is stored, it necessarily increases the false positive probability for finding that x is in another set T. In this paper, we introduce our One Memory Access Set Separation (OMASS) scheme to avoid this problem. OMASS is designed so that for a given element x, the corresponding Bloom filter bits for each set map to different positions in the memory word. This ensures that the false positive rates for the Bloom filters for element x under other sets are not affected. In addition, OMASS requires fewer hash functions compared to the naïve solution.
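The central property described above — one memory word per element, with each set's Bloom bits confined to its own region of that word — can be demonstrated in a few lines. This is a simplified sketch of the idea, not the OMASS construction itself: real OMASS derives per-element permutations of bit positions rather than fixed disjoint regions, and all parameters, names, and the md5-based hashing here are assumptions.

```python
import hashlib

class OneBlockSetSeparator:
    """Each element hashes to one memory word; within that word, each
    set draws its Bloom bits from a disjoint region, so storing x for
    set S cannot raise the false positive rate of x for another set T."""
    def __init__(self, num_words=1024, num_sets=4, word_bits=64, k=2):
        self.words = [0] * num_words
        self.k = k
        self.region = word_bits // num_sets   # disjoint bits per set

    def _positions(self, x, s):
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
        word = h % len(self.words)            # single memory access
        bits = [s * self.region + (h >> (8 * (j + 1))) % self.region
                for j in range(self.k)]       # bits confined to set s's region
        return word, bits

    def add(self, x, s):
        word, bits = self._positions(x, s)
        for b in bits:
            self.words[word] |= 1 << b

    def query(self, x, s):
        word, bits = self._positions(x, s)
        return all(self.words[word] >> b & 1 for b in bits)
```

Because the regions for sets S and T are disjoint, adding x under S sets no bit that a query for x under T inspects, which is exactly the interference problem the scheme is built to avoid.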
ETPL DM - 023
Resolving Multi-party Privacy Conflicts in Social Media
Items shared through Social Media may affect more than one user's privacy—e.g., photos that depict multiple users, comments that mention multiple users, events to which multiple users are invited, etc. The lack of multi-party privacy management support in current mainstream Social Media infrastructures makes users unable to appropriately control with whom these items are actually shared. Computational mechanisms that are able to merge the privacy preferences of multiple users into a single policy for an item can help solve this problem. However, merging multiple users' privacy preferences is not an easy task, because privacy preferences may conflict, so methods to resolve conflicts are needed. Moreover, these methods need to consider how users would actually reach an agreement about a solution to the conflict in order to propose solutions that are acceptable to all of the users affected by the item to be shared. Current approaches are either too demanding or only consider fixed ways of aggregating privacy preferences. In this paper, we propose the first computational mechanism to resolve conflicts for multi-party privacy management in Social Media that is able to adapt to different situations by modelling the concessions that users make to reach a solution to the conflicts. We also present results of a user study in which our proposed mechanism outperformed other existing approaches in terms of how many times each approach matched users' behaviour.
ETPL DM - 024
SEDEX: Scalable Entity Preserving Data Exchange
Data exchange is the process of generating an instance of a target schema from an instance of a source schema such that the source data is reflected in the target. Generally, data exchange is performed using schema mapping, which represents high-level relations between the source and target schemas. In this paper, we argue that data exchange based solely on schema-level information limits the ability to express semantics in data exchange. We show that such schema-level mappings not only may result in entity fragmentation but are also unable to resolve some ambiguous data exchange scenarios. To address this problem, we propose Scalable Entity Preserving Data Exchange (SEDEX), a hybrid method based on data and schema mapping that employs similarities between relation trees of source and target relations to find the best relations that can host source instances. Our experiments show that SEDEX outperforms other methods in terms of the quality and scalability of data exchange.
ETPL DM - 025
DiploCloud: Efficient and Scalable Management of RDF Data in the Cloud
Despite recent advances in distributed RDF data management, processing large amounts of RDF data in the cloud is still very challenging. In spite of its seemingly simple data model, RDF actually encodes rich and complex graphs mixing both instance and schema-level data. Sharding such data using classical techniques or partitioning the graph using traditional min-cut algorithms leads to very inefficient distributed operations and to a high number of joins. In this paper, we describe DiploCloud, an efficient and scalable distributed RDF data management system for the cloud. Contrary to previous approaches, DiploCloud runs a physiological analysis of both instance and schema information prior to partitioning the data. In this paper, we describe the architecture of DiploCloud, its main data structures, as well as the new algorithms we use to partition and distribute data. We also present an extensive evaluation of DiploCloud showing that our system is often two orders of magnitude faster than state-of-the-art systems on standard workloads.
ETPL DM - 026
A Survey on Trajectory Data Mining: Techniques and Applications
The rapid advance of location-acquisition technologies has boosted the generation of trajectory data, which track the traces of moving objects. A trajectory is typically represented by a sequence of time-stamped geographical locations. A wide spectrum of applications can benefit from trajectory data mining. While bringing unprecedented opportunities, large-scale trajectory data also pose great challenges. In this paper, we survey various applications of trajectory data mining, e.g., path discovery, location prediction, movement behaviour analysis, and so on. Furthermore, this paper reviews an extensive collection of existing trajectory data mining techniques and discusses them within a framework of trajectory data mining. This framework and the survey can be used as a guideline for designing future trajectory data mining solutions.
ETPL DM - 027
Insider Collusion Attack on Privacy-Preserving Kernel-Based Data Mining Systems
In this paper, we consider a new insider threat to the privacy-preserving work of distributed kernel-based data mining (DKBDM), such as distributed support vector machines. Among several known data-breaching problems, those associated with insider attacks have been rising significantly, making this one of the fastest-growing types of security breach. Once considered a negligible concern, insider attacks have risen to be one of the top three causes of data violations. Insider-related research involving distributed kernel-based data mining is limited, resulting in substantial vulnerabilities when designing protection against collaborating organizations. Prior works often fall short by addressing a multifactorial model that is more limited in scope and implementation than the case of insiders within an organization colluding with outsiders. A faulty system allows such collusion to go unnoticed when an insider shares data with an outsider, who can then recover the original data from the message transmissions (intermediary kernel values) among organizations. This attack requires access to only a few data entries within the organizations, rather than the encrypted administrative privileges typically required in distributed data mining scenarios. To the best of our knowledge, we are the first to explore this new insider threat in DKBDM. We also analytically demonstrate the minimum amount of insider data necessary to launch the attack. Finally, we follow up by introducing several proposed privacy-preserving schemes to counter the described attack.
ETPL DM - 028
Probabilistic Static Load-Balancing of Parallel Mining of Frequent Sequences
Frequent sequence mining is a well-known and well-studied problem in data mining. The output of the algorithm is used in many other areas such as bioinformatics, chemistry, and market basket analysis. Unfortunately, frequent sequence mining is computationally quite expensive. In this paper, we present a novel parallel algorithm for mining frequent sequences based on static load-balancing. The static load-balancing is done by measuring the computational time using a probabilistic algorithm. For instances of reasonable size, the algorithm achieves speedups of up to P, where P is the number of processors. In the experimental evaluation, we show that our method performs significantly better than the current state-of-the-art methods. The presented approach is very general: it can be used for static load-balancing of other pattern mining algorithms such as itemset/tree/graph mining algorithms.
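The two-phase idea sketched in the abstract (probabilistically estimate per-task cost, then assign tasks statically before mining begins) can be approximated with a standard greedy heuristic. The sketch below is not the paper's algorithm: it assumes the per-task cost estimates have already been measured and applies the classic longest-processing-time (LPT) rule.

```python
import heapq

def static_load_balance(task_costs, num_procs):
    """Greedy LPT assignment: given estimated per-task costs (e.g. measured
    by timing each mining subtask on a random sample of the data), assign
    tasks to processors so the maximum load is approximately minimized."""
    # Min-heap of (current_load, processor_id).
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    # Place the most expensive tasks first (the LPT heuristic),
    # always onto the currently least-loaded processor.
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(task)
        heapq.heappush(heap, (load + cost, p))
    return assignment
```

Because the assignment is computed once up front, no work stealing or dynamic rebalancing is needed at mining time, which is the appeal of the static approach.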
ETPL DM - 029
Clustering Data Streams Based on Shared Density between MicroClusters
As more and more applications produce streaming data, clustering data streams has become an important technique for data and knowledge engineering. A typical approach is to summarize the data stream in real time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-clusters are used as pseudo-points, with the density estimates used as their weights. However, information about density in the area between micro-clusters is not preserved in the online process, and reclustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on the actual density between adjacent micro-clusters. We discuss the space and time complexity of maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets highlight that using shared density improves clustering quality over other popular data stream clustering methods, which require the creation of a larger number of smaller micro-clusters to achieve comparable results.
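A minimal sketch of reclustering over a shared density graph, assuming the online phase has already produced micro-cluster ids and pairwise shared-density values: micro-clusters are merged whenever the density in their overlapping region exceeds a threshold, and final clusters are the connected components. DBSTREAM's actual rule additionally normalizes shared density by micro-cluster weights; the threshold name `alpha` here is our invention.

```python
def recluster_shared_density(micro_ids, shared_density, alpha):
    """Connected-component reclustering over a shared-density graph:
    micro-clusters are joined whenever the density observed in their
    overlapping region reaches the threshold alpha."""
    # Union-find with path compression over micro-cluster ids.
    parent = {m: m for m in micro_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair whose shared density clears the threshold.
    for (a, b), dens in shared_density.items():
        if dens >= alpha:
            parent[find(a)] = find(b)

    # Collect the components as the final clusters.
    clusters = {}
    for m in micro_ids:
        clusters.setdefault(find(m), []).append(m)
    return sorted(sorted(c) for c in clusters.values())
```

The point of the sketch is that reclustering uses measured density between adjacent micro-clusters rather than a distributional assumption about the space between them.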
ETPL DM - 030
FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce
Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent items ultrametric tree, rather than conventional FP trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, the reducers perform combination operations by constructing small ultrametric trees, and these trees are then mined separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data distribution and dimensions, because itemsets with different lengths have different decomposition and construction costs. To improve FiDoop's performance, we develop a workload balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD, an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution is efficient and scalable.
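The overall map/shuffle/reduce structure of a frequent itemset counting job can be sketched as below, simulated in plain Python. This is only the generic MapReduce skeleton, not FiDoop itself, which avoids this kind of candidate enumeration by building frequent-items ultrametric trees instead.

```python
from itertools import combinations
from collections import defaultdict

def mine_frequent_itemsets(transactions, min_support, k):
    """One simulated MapReduce pass over the transactions:
    mappers emit (itemset, 1) for every k-subset of each transaction;
    the shuffle groups by itemset; reducers sum the counts and keep
    itemsets whose support reaches min_support."""
    # Map phase: emit candidate k-itemsets with a count of 1.
    emitted = []
    for t in transactions:
        for itemset in combinations(sorted(set(t)), k):
            emitted.append((itemset, 1))
    # Shuffle phase: group values by key.
    groups = defaultdict(int)
    for key, val in emitted:
        groups[key] += val
    # Reduce phase: filter by minimum support.
    return {iset: c for iset, c in groups.items() if c >= min_support}
```

In a real Hadoop job the three phases run on different nodes; the value of FiDoop's ultrametric trees is precisely that they sidestep the combinatorial blow-up of the naive map phase shown here.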
ETPL DM - 031
Fast and accurate mining the community structure: integrating center locating and membership optimization
Mining communities or clusters in networks is valuable for analyzing, designing, and optimizing many natural and engineered complex systems, e.g., protein networks, power grids, and transportation systems. Most existing techniques view the community mining problem as an optimization problem based on a given quality function (e.g., modularity); however, none of them is grounded in a systematic theory for identifying the central nodes in the network. Moreover, how to reconcile mining efficiency and community quality remains an open problem. In this paper, we attempt to address the above challenges by introducing a novel algorithm. First, a kernel function with a tunable influence factor is proposed to measure the leadership of each node; the nodes with the highest local leadership can be viewed as candidate central nodes. Then, we use a discrete-time dynamical system to describe the dynamic assignment of community membership, and formulate several conditions that guarantee the convergence of each node's dynamic trajectory, by which the hierarchical community structure of the network can be revealed. The proposed dynamical system is independent of the quality function used, so it could also be applied in other community mining models. Our algorithm is highly efficient: the computational complexity analysis shows that the execution time is nearly linearly dependent on the number of nodes in sparse networks. We finally give demonstrative applications of the algorithm to a set of synthetic benchmark networks as well as real-world networks to verify its performance.
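A density-peak-style sketch of the "center locating" step, under the assumption that leadership is a Gaussian-kernel density estimate and that a candidate center must dominate its near neighbours. The kernel choice, the `sigma` influence factor, and the dominance rule are illustrative stand-ins, not the paper's exact kernel function or conditions.

```python
import math

def find_centers(points, sigma):
    """Kernel-based leadership scores: each point's 'leadership' is a
    Gaussian-kernel density estimate over all other points, and points
    at least as dense as every neighbour within distance sigma become
    candidate centers."""
    def dist(a, b):
        return math.dist(a, b)

    # Leadership of each point: sum of Gaussian kernel contributions.
    density = []
    for p in points:
        density.append(sum(math.exp(-(dist(p, q) / sigma) ** 2)
                           for q in points if q is not p))

    # A candidate center locally dominates its near neighbours.
    centers = []
    for i, p in enumerate(points):
        neighbours = [j for j, q in enumerate(points)
                      if q is not p and dist(p, q) <= sigma]
        if all(density[i] >= density[j] for j in neighbours):
            centers.append(i)
    return centers
```

The tunable `sigma` plays the role of the influence factor: a larger value smooths the density landscape and yields fewer candidate centers.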
ETPL DM - 032
Practical Approximate k Nearest Neighbour Queries with Location and Query Privacy
In mobile communication, spatial queries pose a serious threat to user location privacy because the location of a query may reveal sensitive information about the mobile user. In this paper, we study approximate k nearest neighbour (kNN) queries where the mobile user queries the location-based service (LBS) provider about approximate k nearest points of interest (POIs) on the basis of his current location. We propose a basic solution and a generic solution for the mobile user to preserve his location and query privacy in approximate kNN queries. The proposed solutions are mainly built on the Paillier public-key cryptosystem and can provide both location and query privacy. To preserve query privacy, our basic solution allows the mobile user to retrieve one type of POIs, for example, approximate k nearest car parks, without revealing to the LBS provider what type of points is retrieved. Our generic solution can be applied to multiple discrete type attributes of private location-based queries. Compared with existing solutions for kNN queries with location privacy, our solution is more efficient. Experiments have shown that our solution is practical for kNN queries.
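The privacy guarantees above rest on Paillier's additive homomorphism: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so a server can combine encrypted values without learning them. The following is a toy demonstration of the textbook cryptosystem with deliberately tiny primes, not the paper's full kNN protocol, and such parameters must never be used in practice.

```python
import math
import random

def paillier_demo(p=293, q=433):
    """Toy Paillier cryptosystem illustrating the additive homomorphism:
    Enc(a) * Enc(b) mod n^2 decrypts to (a + b) mod n."""
    n, n2 = p * q, (p * q) ** 2
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because we take g = n + 1

    def encrypt(m):
        # Random r coprime to n blinds the ciphertext.
        while True:
            r = random.randrange(1, n)
            if math.gcd(r, n) == 1:
                break
        return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

    def decrypt(c):
        l = (pow(c, lam, n2) - 1) // n  # the L(x) = (x - 1) / n function
        return (l * mu) % n

    a, b = 17, 25
    combined = (encrypt(a) * encrypt(b)) % n2  # homomorphic addition
    return decrypt(combined)                   # equals (a + b) % n
```

In the paper's setting it is this property that lets the LBS provider operate on encrypted query data while the user alone can decrypt the result.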
ETPL DM - 033
Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
With advances in geo-positioning technologies and geo-location services, a rapidly growing number of spatio-textual objects are collected in many applications such as location-based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). Consequently, the study of spatial keyword search, which exploits both the location and the textual description of the objects, has attracted great attention from commercial organizations and research communities. In this paper, we study two fundamental problems in spatial keyword queries: top-k spatial keyword search (TOPK-SK) and batch top-k spatial keyword search (BTOPK-SK). Given a set of spatio-textual objects, a query location, and a set of query keywords, TOPK-SK retrieves the closest k objects, each of which contains all keywords in the query. BTOPK-SK is the batch processing of sets of TOPK-SK queries. Based on the inverted index and the linear quadtree, we propose a novel index structure, called the inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit both spatial and keyword-based pruning techniques to effectively reduce the search space. An efficient algorithm is then developed to tackle top-k spatial keyword search. To further enhance the filtering capability of the signature of the linear quadtree, we propose a partition-based method. In addition, to deal with BTOPK-SK, we design a new computing paradigm that partitions the queries into groups based on both spatial proximity and the textual relevance between queries. We show that the IL-Quadtree technique can also efficiently support BTOPK-SK. Comprehensive experiments on real and synthetic data clearly demonstrate the efficiency of our methods.
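A linear quadtree addresses cells by their Z-order (Morton) key, obtained by interleaving the bits of the quantized coordinates; pairing each keyword with a sorted list of such keys is the essence of an inverted linear quadtree. A minimal sketch of the key computation follows (the 16-bit default is an arbitrary choice, not a parameter from the paper).

```python
def morton_code(x, y, bits=16):
    """Interleave the bits of quantized x and y coordinates to obtain the
    Z-order key used by a linear quadtree: cells that are close in space
    tend to get close keys, so spatial pruning becomes key-range pruning."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x supplies the even bits
        code |= ((y >> i) & 1) << (2 * i + 1)   # y supplies the odd bits
    return code
```

An inverted linear quadtree can then be modelled as a mapping from each keyword to a sorted list of `(morton_code, object_id)` pairs, so a TOPK-SK query intersects the postings of its keywords while pruning by key range around the query location.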
ETPL DM - 034
A Novel Recommendation Model Regularized with User Trust and Item Ratings
We propose TrustSVD, a trust-based matrix factorization technique for recommendations. TrustSVD integrates multiple information sources into the recommendation model in order to reduce the data sparsity and cold-start problems and their degradation of recommendation performance. An analysis of social trust data from four real-world data sets suggests that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. TrustSVD therefore builds on top of a state-of-the-art recommendation algorithm, SVD++ (which uses the explicit and implicit influence of rated items), by further incorporating both the explicit and implicit influence of trusted and trusting users on the prediction of items for an active user. The proposed technique is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that TrustSVD achieves better accuracy than ten other counterpart recommendation techniques.
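As a hedged sketch of the model family the abstract describes, the SVD++-style prediction extended with a trust term can be written as follows. The notation is recalled from the recommender-systems literature rather than taken from this text, so treat it as indicative only:

```latex
\hat{r}_{uj} = \mu + b_u + b_j + q_j^{\top}\Big( p_u
  + |I_u|^{-1/2} \sum_{i \in I_u} y_i
  + |T_u|^{-1/2} \sum_{v \in T_u} w_v \Big)
```

where $\mu$ is the global rating mean, $b_u$ and $b_j$ are user and item biases, $p_u$ and $q_j$ are latent user and item factors, $I_u$ is the set of items rated by user $u$ with implicit-feedback factors $y_i$ (the SVD++ part), and $T_u$ is the set of users $u$ trusts, with latent factors $w_v$ contributing the trust-based extension.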
ETPL DM - 036
Cold-Start Recommendation with Provable Guarantees: A Decoupled Approach
Although the matrix completion paradigm provides an appealing solution to the collaborative filtering problem in recommendation systems, some major issues, such as data sparsity and cold-start problems, still remain open. In particular, when the rating data for a subset of users or items is entirely missing, commonly known as the cold-start problem, the standard matrix completion methods are inapplicable due to the non-uniform sampling of available ratings. In recent years, there has been considerable interest in dealing with cold-start users or items that are principally based on the idea of exploiting other sources of information to compensate for this lack of rating data. In this paper, we propose a novel and general algorithmic framework based on matrix completion that simultaneously exploits the similarity information among users and items to alleviate the cold-start problem. In contrast to existing methods, our proposed recommender algorithm, dubbed DecRec, decouples the following two aspects of the cold-start problem to effectively exploit the side information: (i) the completion of a rating sub-matrix, which is generated by excluding cold-start users/items from the original rating matrix; and (ii) the transduction of knowledge from existing ratings to cold-start items/users using side information. This crucial difference prevents the error propagation of completion and transduction, and also significantly boosts the performance when appropriate side information is incorporated. The recovery error of the proposed algorithm is analyzed theoretically and, to the best of our knowledge, this is the first algorithm that addresses the cold-start problem with provable guarantees on performance. Additionally, we also address the problem where both cold-start user and item challenges are present simultaneously. We conduct thorough experiments on real datasets that complement our theoretical results. These experiments demonstrate the effectiveness of the proposed algorithm in handling the cold-start users/items problem and mitigating the data sparsity issue.
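Step (ii), the transduction of knowledge to cold-start users, can be caricatured in a few lines: predict a cold-start user's rating as a similarity-weighted average of warm users' ratings on the completed sub-matrix. This is only a naive stand-in for DecRec's transduction step; the function name and data shapes are invented for illustration.

```python
def transduce_cold_start(sub_ratings, sim, item):
    """Predict a cold-start user's rating of `item` from the completed
    sub-matrix `sub_ratings` (warm_user -> {item -> rating}) and the
    side-information similarities `sim` (warm_user -> similarity to
    the cold-start user), as a similarity-weighted average."""
    num = den = 0.0
    for warm_user, row in sub_ratings.items():
        if item in row and warm_user in sim:
            num += sim[warm_user] * row[item]
            den += sim[warm_user]
    return num / den if den else None  # None if no warm user rated item
```

The decoupling the abstract describes means errors made while completing `sub_ratings` are confined to step (i) and are not amplified by this transduction step.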