INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY
ISBN: 378 - 26 - 138420 - 5
Similarity Search in Information Networks using Meta-Path Based between Objects

Abstract – Real-world physical and abstract data objects are interconnected, forming enormous, interconnected networks. By structuring these data objects and the interactions between them into multiple types, such networks become semi-structured heterogeneous information networks. The quality analysis of large heterogeneous information networks therefore poses new challenges. In the current system, a generalized flow-based method is introduced for measuring relationships on Wikipedia by reflecting all three concepts: distance, connectivity, and co-citation. With this approach we measure only the relationships between objects rather than their similarities. To address this problem, we introduce a meta-path based similarity search approach for heterogeneous information networks. Under this framework, similarity search and other mining tasks can make use of the network structure.

Index terms – similarity search, information network, meta-path, clustering.

I. INTRODUCTION

Similarity search, which aims at locating the most relevant information for a query in large collections of datasets, has been widely studied in many applications. For example, in spatial databases, people are interested in finding the k nearest neighbors of a given spatial object. Object similarity is also one of the most primitive concepts for object clustering and many other data mining functions.

In a similar context, it is critical to provide effective search functions in information networks, to find similar entities for a given entity. In a network of tagged images such as Flickr, a user may be interested in searching for the pictures most similar to a given picture. In an e-commerce system, a user may be interested in searching for the products most similar to a given product. Unlike in attribute-based similarity search, links play an important role in similarity search in information networks, especially when full attribute information for objects is difficult to obtain.

There are a few studies leveraging link information in networks for similarity search, but most of these studies focus on homogeneous or bipartite networks, such as P-PageRank and SimRank. These similarity measures disregard the subtlety of different types among objects and links. Adopting such measures in heterogeneous networks has significant drawbacks: even if we just want to compare objects of the same type, going through link paths of different types leads to rather
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT
317
www.iaetsd.in
different semantic meanings, and it makes little sense to mix them up and measure the similarity without distinguishing their semantics.

To systematically distinguish the semantics among paths connecting two objects, we introduce a meta-path based similarity framework for objects of the same type in a heterogeneous network. A meta-path is a sequence of relations between object types, which defines a new composite relation between its starting type and ending type. The meta-path framework provides a powerful mechanism for a user to select appropriate similarity semantics, either by choosing a proper meta-path or by learning it from a set of training examples of similar objects.

We present the meta-path based similarity framework and relate it to two well-known existing link-based similarity functions for homogeneous information networks. We define a novel similarity measure, PathSim, that is able to find peer objects that are not only strongly connected with each other but also share similar visibility in the network. Moreover, we propose an efficient algorithm to support online top-k queries for such similarity search.

II. A META-PATH BASED SIMILARITY MEASURE

The similarity between two objects in a link-based similarity function is determined by how the objects are connected in a network, which can be described using paths. In a heterogeneous information network, due to the heterogeneity of the types of links, the ways to connect two objects can be much more diverse. In a bibliographic network, for example, the schematic connections between authors represent different relationships, each with a different semantic meaning.

The question now is: given an arbitrary heterogeneous information network, is there any way to systematically identify all the possible connection types between two object types? To do so, we propose two important concepts in the following.

a) Network Schema And Meta-Path

First, given a complex heterogeneous information network, it is necessary to provide its meta-level description for a better understanding of the network. Therefore, we propose the concept of network schema to describe the meta structure of a network.

The concept of network schema is similar to that of the Entity-Relationship model in database systems, but it captures only the entity types and their binary relations, without considering the attributes of each entity type. The network schema serves as a template for a network, and tells how many types of objects there are in the network and where the possible links exist.

b) Bibliographic Schema and Meta-Path

In the bibliographic network schema, an arrow explicitly shows the direction of a relation.

III. META-PATH BASED SIMILARITY FRAMEWORK

Given a user-specified meta-path, several similarity measures can be defined for a pair of objects, according to the path instances
between them following the meta-path. Several straightforward measures are listed in the following.

Path count: the number of path instances between the two objects.

Random walk: s(x, y) is the probability of the random walk that starts from x and ends at y following meta-path P, which is the sum of the probabilities of all the path instances.

Pairwise random walk: for a meta-path P that can be decomposed into two shorter meta-paths of the same length, s(x, y) is the pairwise random walk probability of starting from objects x and y and reaching the same middle object.

In general, we can define a meta-path based similarity framework for two objects x and y. Note that P-PageRank and SimRank, two well-known network similarity functions, are weighted combinations of the random walk measure and the pairwise random walk measure, respectively, over meta-paths with different lengths in homogeneous networks. To use P-PageRank and SimRank in heterogeneous information networks, the meta-paths of interest must be specified.

a) A Novel Similarity Measure

The similarity measures presented above are partial to either highly visible objects or highly concentrated objects, and they cannot capture the semantics of peer similarity. However, in many scenarios, finding similar objects in networks means finding similar peers, such as finding similar authors based on their fields and reputation, finding similar actors based on their movie styles and productivity, and finding similar products.

This motivated us to propose a new meta-path based similarity measure, called PathSim, that captures the subtlety of peer similarity. The insight behind it is that two similar peer objects should not only be strongly connected, but also share comparable visibility. As the peer relation should be symmetric, we confine PathSim to symmetric meta-paths. The calculation of PathSim between any two objects of the same type given a certain meta-path involves matrix multiplication.

In this paper, we only consider meta-paths in the round-trip form, to guarantee their symmetry and therefore the symmetry of the PathSim measure.

Properties of PathSim
1. Symmetric
2. Self-maximum
3. Balance of visibility

With meta-path based similarity, we can define the similarity between two objects given any round-trip meta-path. As primary eigenvectors can be used as an authority ranking of objects, the similarity between two objects under an infinite meta-path can be viewed as a measure defined on their rankings. Two objects with more similar ranking scores will have higher similarity. In the next section we discuss online query processing for a single meta-path.
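The measures of this section can be compared on a toy example. The sketch below assumes a hypothetical author-venue relation matrix W with made-up counts and the round-trip meta-path author-venue-author, whose path counts form M = W W^T. The PathSim formula it uses, s(x, y) = 2 M[x][y] / (M[x][x] + M[y][y]), is the definition commonly given for PathSim in the literature; it is not spelled out in this paper, so it should be read as an assumption of the sketch. All function and variable names are illustrative.

```python
# Toy author-venue relation matrix (made-up data, for illustration only).
# W[i][j] = number of papers author i published in venue j.
W = [
    [2, 1, 0],    # author 0 (the query)
    [50, 20, 0],  # author 1: same venues as author 0, but far more visible
    [2, 0, 1],    # author 2: visibility comparable to author 0
]


def commuting_matrix(w):
    """Path counts for the round-trip meta-path author-venue-author: M = W * W^T."""
    return [[sum(a * b for a, b in zip(row_x, row_y)) for row_y in w]
            for row_x in w]


def path_count(m, x, y):
    """Path count measure: number of path instances between x and y."""
    return m[x][y]


def random_walk(w, x, y):
    """Random walk measure: probability that a walk author -> venue -> author
    starting at x ends at y, with steps weighted by link counts."""
    out_x = sum(w[x])
    prob = 0.0
    for v in range(len(w[x])):
        col_v = sum(row[v] for row in w)
        if out_x and col_v:
            prob += (w[x][v] / out_x) * (w[y][v] / col_v)
    return prob


def pathsim(m, x, y):
    """PathSim (assumed definition): 2 * M[x][y] / (M[x][x] + M[y][y])."""
    denom = m[x][x] + m[y][y]
    return 2.0 * m[x][y] / denom if denom else 0.0


M = commuting_matrix(W)
for y in (1, 2):
    print(y, path_count(M, 0, y), round(random_walk(W, 0, y), 3),
          round(pathsim(M, 0, y), 3))
```

On this toy data, path count and random walk both favor the highly visible author 1 (path count 120 versus 4), whereas PathSim favors author 2 (0.8 versus roughly 0.08), whose visibility is comparable to that of the query. This is the balance-of-visibility behavior listed among the properties above.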
IV. QUERY PROCESSING FOR SINGLE META-PATH

Compared with P-PageRank and SimRank, the calculation of PathSim is much more efficient, as it is a local graph measure. However, it still involves expensive matrix multiplication operations for top-k search, as we need to calculate the similarity between a query and every object of the same type in the network. One possible solution is to materialize all the meta-paths.

To support fast online query processing for large-scale networks, we propose a methodology that partially materializes short meta-paths and then concatenates them online to derive longer meta-path-based similarity. First, a baseline method is proposed, which computes the similarity between the query object x and all the candidate objects y of the same type. Next, a co-clustering based pruning method is proposed, which prunes candidate objects that are not promising according to their similarity upper bounds. Both algorithms return exact top-k results for the given query.

a) Baseline

Suppose we know the relation matrix for the meta-path and its diagonal vector. To get the top-k objects with the highest similarity for the query, we need to compute the similarity of each object to the query. The straightforward baseline is: (1) apply vector-matrix multiplication; (2) calculate the similarity of each object; (3) sort the objects by similarity and return the top-k list. When the relation matrix is large, the vector-matrix computation is too time consuming to check every possible object. Restricting the computation to a set of candidate objects is much more efficient than pairwise computation between the query and all the objects of that type. We call this baseline concatenation algorithm PathSim-baseline.

The PathSim-baseline algorithm is still time consuming if the candidate set is large. The time complexity of computing PathSim for each candidate is O(d) on average and O(m) in the worst case. We now propose a co-clustering based top-k concatenation algorithm, by which non-promising target objects are dynamically filtered out to reduce the search space.

b) Co-Clustering-Based Pruning

In the baseline algorithm, the computational cost involves two factors. First, the more candidates there are to check, the more time the algorithm takes; second, for each candidate, the dot product of the query vector and the candidate vector involves at most m operations, where m is the vector length. Based on this intuition, we propose a co-clustering-based path concatenation method, which first generates co-clusters of two types of objects for the partial relation matrix, then stores the necessary statistics for each of the blocks corresponding to different co-cluster pairs, and then uses the block statistics to prune the search space. For clarity, we call the clusters of one type target clusters, since their objects are the targets of the query, and the clusters of the other type feature clusters, since their objects serve as features to calculate the similarity between the query and the target objects. By partitioning the target objects into different target clusters, if a whole target cluster is not similar to
the query, then all the objects in that target cluster are unlikely to be in the final top-k list and can be pruned. By partitioning the feature objects into different feature clusters, cheaper calculations on the dimension-reduced query vector and candidate vectors can be used to derive the similarity upper bounds. PathSim-pruning can significantly improve the query processing speed compared with the baseline algorithm, without affecting the search quality.

c) Multiple Meta-Paths Combination

In the previous section, we presented algorithms for similarity search using a single meta-path. Now, we present a solution to combine multiple meta-paths. The reason to combine several meta-paths is that each meta-path provides a unique angle from which to view the similarity between objects, and the ground truth may be caused by several different factors. Some useful guidance for the weight assignment includes: longer meta-paths utilize more remote relationships and thus should be assigned smaller weights, as in P-PageRank and SimRank; and meta-paths with more important relationships should be assigned higher weights. To determine the weights automatically, users could provide training examples of similar objects, from which the weights of the different meta-paths are learned using a learning algorithm.

V. EXPECTED RESULTS

To show the effectiveness of the PathSim measure and the efficiency of the proposed algorithms, we use the bibliographic network extracted from DBLP and the network extracted from Flickr in the experiments.

The PathSim pruning algorithm is expected to significantly improve the query processing speed compared with the baseline algorithm, without affecting the search quality.

For additional case studies, we construct a Flickr network from a subset of the Flickr data, which contains four types of objects: images, users, tags, and groups. We aim to show that our algorithms improve similarity search between objects based on the potentiality and correlation between objects.

VI. CONCLUSION

In this paper we introduced a meta-path based similarity search approach, using a baseline algorithm and a co-clustering based pruning algorithm, to improve similarity search based on the strengths and relationships between objects.
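As a closing illustration of the query processing described in Section IV, the following sketch shows one way the baseline top-k search and the weighted combination of several meta-paths might look. It is not the authors' implementation. It assumes that each materialized short meta-path is stored as a sparse mapping from objects to feature counts, that candidates are found with an inverted index (objects sharing no feature with the query have similarity zero and are skipped), that PathSim is computed as 2 M[x][y] / (M[x][x] + M[y][y]), and that meta-paths are combined by a weighted linear sum with user-chosen weights. All names and data are hypothetical.

```python
import heapq


def pathsim_pair(rel, x, y):
    """PathSim of x and y for the round-trip meta-path built from `rel`."""
    dot = sum(c * rel[y].get(f, 0) for f, c in rel[x].items())  # M[x][y]
    denom = (sum(c * c for c in rel[x].values())                # M[x][x]
             + sum(c * c for c in rel[y].values()))             # M[y][y]
    return 2.0 * dot / denom if denom else 0.0


def pathsim_baseline_topk(rel, query, k):
    """Baseline top-k search: score candidate objects only, then keep the best k.

    rel maps each object to a sparse vector {feature: count}, i.e. one row of
    the materialized relation matrix of the short meta-path.
    """
    # Inverted index feature -> objects, used to find non-orthogonal candidates.
    index = {}
    for y, vec in rel.items():
        for f in vec:
            index.setdefault(f, set()).add(y)
    candidates = set().union(*(index[f] for f in rel[query])) - {query}
    scored = [(pathsim_pair(rel, query, y), y) for y in candidates]
    return heapq.nlargest(k, scored)


def combined_similarity(relations, weights, x, y):
    """Weighted linear combination of per-meta-path PathSim scores (assumed form)."""
    return sum(wt * pathsim_pair(rel, x, y)
               for rel, wt in zip(relations, weights))


# Made-up example: author-venue and author-term relations.
author_venue = {
    "a1": {"v1": 2, "v2": 1},
    "a2": {"v1": 50, "v2": 20},
    "a3": {"v1": 2, "v3": 1},
    "a4": {"v4": 3},            # shares no venue with a1, never scored
}
author_term = {
    "a1": {"t1": 1},
    "a2": {"t1": 1},
    "a3": {"t2": 1},
    "a4": {"t1": 1},
}

print(pathsim_baseline_topk(author_venue, "a1", 2))
print(combined_similarity([author_venue, author_term], [0.7, 0.3], "a1", "a3"))
```

The sparse representation keeps the cost of each dot product proportional to the number of nonzero entries of the query vector rather than to the full vector length, which is the per-candidate cost the baseline discussion is concerned with. The co-clustering based pruning step is not shown here.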