
INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY

ISBN: 378 - 26 - 138420 - 5

INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT

www.iaetsd.in

Similarity Search in Information Networks using Meta-Path Based between Objects

Abstract – Real world physical and abstract data objects are interconnected, forming enormous, interconnected networks. By structuring these data objects and the interactions between them into multiple types, such networks become semi-structured heterogeneous information networks. Therefore, the quality analysis of large heterogeneous information networks poses new challenges. In the current system, a generalized flow based method is introduced for measuring the relationship on Wikipedia by reflecting all three concepts: distance, connectivity and co-citation. By using the current approach, one can only measure relationships between objects rather than similarities. To address these problems we introduce a novel solution, a meta-path based similarity searching approach for dealing with heterogeneous information networks. This framework supports similarity search and other mining tasks on the network structure.

Index terms – similarity search, information network, meta-path, clustering.

I. INTRODUCTION

Similarity search, which aims at locating the most relevant information for a query in large collections of datasets, has been widely studied in many applications. For example, in spatial databases, people are interested in finding the k nearest neighbors for a given spatial object. Object similarity is also one of the most primitive concepts for object clustering and many other data mining functions.

In a similar context, it is critical to provide effective similarity functions in information networks, to find similar entities for a given entity. In a network of tagged images such as Flickr, a user may be interested in searching for the most similar pictures for a given picture. In an e-commerce system, a user would be interested in searching for the most similar products for a given product. Different from attribute-based similarity search, links play an important role for similarity search in information networks, especially when the full information about attributes for objects is difficult to obtain.

There are a few studies leveraging link information in networks for similarity search, but most of these studies are focused on homogeneous or bipartite networks, such as P-PageRank and SimRank. These similarity measures disregard the subtlety of different types among objects and links. Adoption of such measures to heterogeneous networks has significant drawbacks: even if we just want to compare objects of the same type, going through link paths of different types leads to rather different semantic meanings, and it makes little sense to mix them up and measure the similarity without distinguishing their semantics.

To systematically distinguish the semantics among paths connecting two objects, we introduce a meta-path based similarity framework for objects of the same type in a heterogeneous network. A meta-path is a sequence of relations between object types, which defines a new composite relation between its starting type and ending type. The meta-path framework provides a powerful mechanism for a user to select appropriate similarity semantics, by choosing a proper meta-path, or to learn it from a set of training examples of similar objects.

... connections represent different relationships between authors, each having a different semantic meaning.

Now the question is: given an arbitrary heterogeneous information network, is there any way to systematically identify all the possible connection types between two object types? In order to do so, we propose two important concepts in the following.

a) Network Schema And Meta-Path

First, for a complex heterogeneous information network, it is necessary to provide its meta-level description for better understanding of the network. Therefore, we describe the meta structure of a network.
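To make the notions of network schema and meta-path concrete, the following is a minimal sketch, assuming a small bibliographic schema. The relation names (`writes`, `published_in`, and so on) and the helper functions are hypothetical and chosen only for illustration; they do not come from the paper. The sketch checks that a sequence of relations chains correctly over object types and reports the composite relation it defines between its starting and ending type.

```python
from typing import Dict, List, Tuple

# Hypothetical schema for a bibliographic network (names are illustrative only):
# relation name -> (source object type, target object type).
SCHEMA: Dict[str, Tuple[str, str]] = {
    "writes": ("Author", "Paper"),
    "written_by": ("Paper", "Author"),
    "published_in": ("Paper", "Venue"),
    "publishes": ("Venue", "Paper"),
}


def meta_path_types(schema: Dict[str, Tuple[str, str]],
                    relations: List[str]) -> List[str]:
    """Return the object types visited by a meta-path.

    Raises ValueError if two consecutive relations do not chain,
    i.e. the target type of one is not the source type of the next.
    """
    if not relations:
        raise ValueError("a meta-path needs at least one relation")
    types = [schema[relations[0]][0]]
    for name in relations:
        source, target = schema[name]
        if source != types[-1]:
            raise ValueError(f"relation {name!r} cannot follow type {types[-1]!r}")
        types.append(target)
    return types


def composite_relation(schema: Dict[str, Tuple[str, str]],
                       relations: List[str]) -> Tuple[str, str]:
    """Starting and ending type of the composite relation defined by a meta-path."""
    types = meta_path_types(schema, relations)
    return types[0], types[-1]


# Author -> Paper -> Venue -> Paper -> Author: authors publishing in the same venue.
SHARED_VENUE = ["writes", "published_in", "publishes", "written_by"]
# Author -> Paper -> Author: co-authors.
CO_AUTHOR = ["writes", "written_by"]
```

Both example meta-paths connect authors to authors, yet they carry different semantics, which is the reason a user has to choose the meta-path explicitly.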

... relate it to two well-known existing link-based functions. Straightforward measures are the following:

Path count: the number of path instances between objects.

Random walk: s(x, y) is the probability of the random walk that starts from x and ends with y following meta-path P, which is the sum of the probabilities of all the path instances.

Pairwise random walk: for a meta-path P that can be decomposed into two shorter meta-paths with the same length, s(x, y) is the pairwise random walk probability starting from objects x and y and reaching the same middle object.

In general, we can define a meta-path based similarity framework for two objects x and y. Note that P-PageRank and SimRank, two well-known network similarity functions, are weighted combinations of the random walk measure or the pairwise random walk measure, respectively, over meta-paths with different lengths in homogeneous networks. In order to use P-PageRank and SimRank in heterogeneous information networks ...

a) A Novel Similarity Measure

Several similarity measures have been presented, and they are partial to either highly visible objects or highly concentrated objects, but cannot capture the semantics of peer similarity. However, in many scenarios, finding similar objects in networks is to find similar peers, such as finding similar authors based on their fields and reputation, or finding similar actors or similar products.

This motivated us to propose a new meta-path based similarity measure, called PathSim, that captures the subtlety of peer similarity. The insight behind it is that two similar peer objects should not only be strongly connected, but also share comparable observations. As the peer relation should be symmetric, we confine PathSim to symmetric meta-paths. The calculation of PathSim between any two objects of the same type given a certain meta-path involves matrix multiplication.

In this paper, we only consider meta-paths in the round-trip form, to guarantee their symmetry and therefore the symmetry of the PathSim measure.

Properties of PathSim

1. Symmetric.

2. Self-maximum.

3. Balance of visibility.

Using meta-path based similarity, we can define similarity between two objects given any round-trip meta-path. As primary eigenvectors can be used as authority rankings of objects, the similarity between two objects under an infinite meta-path can be viewed as a measure defined on their rankings. Two objects with more similar ranking scores will have higher similarity. In the next section we discuss online query processing for a single meta-path.
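The measures above can be illustrated with a small sketch. It assumes a round-trip meta-path built from one relation matrix W (for example authors by venues), so that path counts are M = W Wᵀ. The PathSim formula used here, 2·M[x][y] / (M[x][x] + M[y][y]), is the standard definition from the PathSim literature; this text does not spell it out, so treat it as an assumption. The matrix values are made-up toy data and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def row_normalize(a: Matrix) -> Matrix:
    """Divide each row by its sum (rows that sum to zero stay all zeros)."""
    out = []
    for row in a:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def path_count(w: Matrix) -> Matrix:
    """Path counts under the round-trip meta-path (P P^-1): M = W W^T."""
    return matmul(w, transpose(w))


def random_walk(w: Matrix) -> Matrix:
    """Random-walk probabilities along the same round-trip meta-path."""
    return matmul(row_normalize(w), row_normalize(transpose(w)))


def path_sim(m: Matrix, x: int, y: int) -> float:
    """PathSim score of objects x and y, given the path count matrix m."""
    denom = m[x][x] + m[y][y]
    return 2.0 * m[x][y] / denom if denom else 0.0


# Made-up toy data: rows are five authors, columns are four venues,
# entries are numbers of papers.
W = [
    [2, 1, 0, 0],
    [50, 20, 0, 0],
    [2, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 0, 1, 1],
]
M = path_count(W)
RW = random_walk(W)
```

In this toy data, author 1 is far more visible than author 0. The random walk measure is asymmetric between the two (`RW[0][1]` differs from `RW[1][0]`), whereas `path_sim(M, 0, 1)` is symmetric and small, and `path_sim(M, 0, 3)` reaches the maximum because authors 0 and 3 have identical publication profiles.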


IV. QUERY PROCESSING FOR SINGLE META-PATH

Compared with P-PageRank and SimRank, the calculation is much more efficient, as it is a local graph measure. But it still involves expensive matrix multiplication operations for top-k search functions, as we need to calculate the similarity between a query and every object of the same type in the network. One possible solution is to materialize all the meta-paths.

In order to support fast online query processing for large-scale networks, we propose a methodology that partially materializes short-length meta-paths and then concatenates them online to derive longer meta-path-based similarity. First, a baseline method is proposed, which computes the similarity between query object x and all the candidate objects y of the same type. Next, a co-clustering based pruning method is proposed, which prunes candidate objects that are not promising according to their similarity upper bounds. Both algorithms return exact top-k results for the given query.

a) Baseline

Suppose we know the relation matrix for the meta-path and the diagonal vector. In order to get the top-k objects with the highest similarity for the query, we need to compute the probability of objects. The straightforward baseline is: (1) first apply vector-matrix multiplication, (2) calculate the probability of objects, and (3) sort the probability of objects and return the top-k list in the final step. When the matrix is large, the vector-matrix computation will be too time consuming to check every possible object. This will be much more efficient than pairwise computation between the query and all the objects of that type. We call the baseline concatenation algorithm PathSim-baseline.

The PathSim-baseline algorithm is still time consuming if the candidate set is large. The time complexity of computing PathSim for each candidate is O(d) on average and O(m) in the worst case. We now propose a co-clustering based top-k concatenation algorithm, by which non-promising target objects are dynamically filtered out to reduce the search space.

b) Co-Clustering-Based Pruning

In the baseline algorithm, the computational costs involve two factors. First, the more candidates to check, the more time the algorithm will take; second, for each candidate, the dot product of the query vector and the candidate vector will involve at most m operations, where m is the vector length. Based on this intuition, we propose a co-clustering-based path concatenation method, which first generates co-clusters of two types of objects for the partial relation matrix, then stores necessary statistics for each of the blocks corresponding to different co-cluster pairs, and then uses the block statistics to prune the search space. For a better picture, we call the clusters of one type target clusters, since their objects are the targets for the query, and call the clusters of the other type feature clusters, since their objects serve as features to calculate the similarity between the query and the target objects. By partitioning into different target clusters, if a whole target cluster is not similar to the query, then all the objects in the target cluster are likely not in the final top-k lists and can be pruned. By partitioning into different feature clusters, cheaper calculations on the dimension-reduced query vector and candidate vectors can be used to derive the similarity upper bounds. PathSim-Pruning can significantly improve the query processing speed compared with the baseline algorithm, without affecting the search quality.

c) Multiple Meta-Paths Combination

In the previous section, we presented algorithms for similarity search using a single meta-path. Now, we present a solution to combine multiple meta-paths. The reason why we need to combine several meta-paths is that each meta-path provides a unique angle to view the similarity between objects, and the ground truth may be caused by different factors. Some useful guidance for the weight assignment includes: longer meta-paths utilize more remote relationships and thus should be assigned a smaller weight, as in P-PageRank and SimRank, and meta-paths with more important relationships should be assigned a higher weight. For automatically determining the weights, users could provide training examples of similar objects to learn the weights of different meta-paths using a learning algorithm.

V. EXPECTED RESULTS

To show the effectiveness of the PathSim measure and the efficiency of the proposed algorithms, we use the bibliographic networks extracted from DBLP and Flickr in the experiments. The PathSim algorithm significantly improves the query processing speed compared with the baseline algorithm, without affecting the search quality.

For additional case studies, we construct a Flickr network from a subset of the Flickr data, which contains four types of objects: images, users, tags, and groups. We have to show that our algorithms improve similarity search between objects based on the potentiality and correlation between objects.

VI. CONCLUSION

In this paper we introduced a novel meta-path based similarity search, using a baseline algorithm and a co-clustering based pruning algorithm to improve similarity search based on the strengths and relationships between objects.
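As a supplementary illustration of the baseline search in Section IV a) and the weight assignment in Section IV c), the following is a minimal sketch under stated assumptions. It assumes that the materialized short meta-path is given as a relation matrix `w` with a precomputed diagonal vector, that similarity is scored with the standard PathSim formula from the literature, and that multiple meta-paths are combined by a weighted sum. None of these details is fixed by the text above, and the function names and toy data are hypothetical. Co-clustering based pruning is not shown.

```python
import heapq
from typing import List, Sequence, Tuple


def diagonal(w: Sequence[Sequence[float]]) -> List[float]:
    """Precompute d[i] = sum_j w[i][j]^2, the round-trip path count from i to itself."""
    return [sum(v * v for v in row) for row in w]


def top_k_baseline(w: Sequence[Sequence[float]], diag: Sequence[float],
                   query: int, k: int) -> List[Tuple[float, int]]:
    """Return up to k (score, object index) pairs, best first.

    Step 1: multiply the query vector with the relation matrix (one dot
            product per candidate row).
    Step 2: turn each path count into a similarity score.
    Step 3: keep the k highest scores (ties broken by lower index).
    """
    q = w[query]
    scored = []
    for y, row in enumerate(w):
        count = sum(a * b for a, b in zip(q, row))
        if count == 0:
            continue  # assumption: objects sharing no neighbor are not candidates
        scored.append((2.0 * count / (diag[query] + diag[y]), y))
    return heapq.nlargest(k, scored, key=lambda item: (item[0], -item[1]))


def combine(scores_per_path: Sequence[Sequence[float]],
            weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-meta-path similarity scores (an assumed combination rule)."""
    if len(scores_per_path) != len(weights):
        raise ValueError("need exactly one weight per meta-path")
    n = len(scores_per_path[0])
    return [sum(wt * s[i] for wt, s in zip(weights, scores_per_path))
            for i in range(n)]


# Made-up toy data: five objects of the query type, four feature objects.
W = [
    [2, 1, 0, 0],
    [50, 20, 0, 0],
    [2, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 0, 1, 1],
]
TOP3 = top_k_baseline(W, diagonal(W), query=0, k=3)
```

The query object itself is kept in the result list here; a caller who does not want it can drop the entry whose index equals `query`. In `combine`, a longer meta-path would typically receive the smaller weight, following the guidance in Section IV c).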