DOULION: Counting Triangles in Massive Graphs with a Coin


Charalampos (Babis) Tsourakakis Carnegie Mellon University KDD ‘09 Paris Joint work with: U Kang, Gary L. Miller, Christos Faloutsos

DOULION, KDD 09



Outline Motivation Related Work Proposed Method Results Conclusion Extra DOULION, KDD 09

2


Why is Triangle Counting important?
• Clustering coefficient
• Transitivity ratio
• Social Network Analysis fact: “Friends of friends are friends” [WF94]
• Hidden Thematic Structure of the Web (Eckmann et al., PNAS [EM02])
• Motif Detection (e.g., [YPSB05])
• Web Spam Detection (Becchetti et al., KDD ’08 [BBCG08])
[Figure: a triangle on the vertices A, B, C]


Personal Motivation

\[ \delta(i) = \frac{\sum_{j=1}^{|V|} \lambda_j^3 \, u_{ji}^2}{2} \]

• $\lambda_j$: eigenvalues of the adjacency matrix
• $u_{ji}$: i-th entry of the j-th eigenvector

[Figure: Political Blogs [CET08]. Keep only 3!]
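As a quick sanity check of the per-node formula $\delta(i) = \frac{1}{2}\sum_j \lambda_j^3 u_{ji}^2$, here is a worked example on a single triangle ($K_3$). It assumes unit-norm eigenvectors and reads $u_{ji}$ as the i-th entry of the j-th eigenvector; the example is an illustration and is not taken from the slides.

```latex
% Adjacency matrix of K_3 has eigenvalues 2, -1, -1.
% The eigenvector for \lambda_1 = 2 is (1,1,1)/\sqrt{3}, so u_{1i}^2 = 1/3.
% Orthonormality gives \sum_j u_{ji}^2 = 1, hence u_{2i}^2 + u_{3i}^2 = 2/3.
\[
\delta(i) = \frac{1}{2}\left( 2^3 \cdot \frac{1}{3} + (-1)^3 \cdot \frac{2}{3} \right)
          = \frac{1}{2}\left( \frac{8}{3} - \frac{2}{3} \right) = 1 .
\]
% Each vertex of K_3 lies in exactly one triangle, as expected.
```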


Outline Motivation Related Work Proposed Method Results Conclusion Extra DOULION, KDD 09

5


Counting methods

Dense graphs
                    Fast             Low space
Time complexity     $O(n^{2.37})$    $O(n^3)$
Space complexity    $O(n^2)$         $O(m)$

Sparse graphs
                    Fast                                     Low space
Time complexity     $O(m^{0.7} n^{1.2} + n^{2+o(1)})$        e.g. O( n )
Space complexity    $\Theta(n^2)$ (eventually)               $\Theta(m)$

Matrix multiplication is not practical.
Source: M. Latapy, Theory and Experiments


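Naive Sampling
• r independent samples of three distinct vertices
• X = 1 if the sampled triple is of type T3 (all three edges present)
• X = 0 if it is of type T0, T1, or T2 (zero, one, or two edges present)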


Naive Sampling
• r independent samples of three distinct vertices
• Then the following holds with probability at least $1-\delta$: [equation not recovered]
• Works, but is prohibitive for graphs with $T_3 = o(n^2)$ (e.g., T3 n2logn)
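The naive sampling scheme described on these slides can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the edge-list input format, and the scaling by C(n, 3) to turn the hit fraction into a count are assumptions made for the example.

```python
import random
from itertools import combinations
from math import comb


def naive_triangle_estimate(n, edges, r, seed=None):
    """Estimate the triangle count from r uniformly sampled vertex triples."""
    rng = random.Random(seed)
    # Store each undirected edge as a frozenset for O(1) membership tests.
    edge_set = {frozenset(e) for e in edges}
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(range(n), 3)  # three distinct vertices
        # X = 1 only when all three edges are present (a T3 triple).
        if (frozenset((a, b)) in edge_set
                and frozenset((a, c)) in edge_set
                and frozenset((b, c)) in edge_set):
            hits += 1
    # Fraction of T3 triples among the samples, scaled by the C(n, 3) triples.
    return hits / r * comb(n, 3)


# Usage: in the complete graph K5 every sampled triple is a triangle.
k5_edges = list(combinations(range(5), 2))
k5_estimate = naive_triangle_estimate(5, k5_edges, r=200, seed=0)
```

On sparse graphs most sampled triples are not triangles, which is why the number of samples needed becomes prohibitive.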


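Buriol, Frahling, Leonardi, Marchetti-Spaccamela, Sohler
• Sample uniformly at random an edge (i,j) and a node k in V-{i,j}
• Check if edges (i,k) and (j,k) exist in E(G)
• [Formula for the required number of samples not recovered]
[Figure: edge (i,j) and node k, with the candidate edges (i,k) and (j,k) marked “?”]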
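The edge-plus-node sampling step can be sketched offline as below. This is a simplified, in-memory illustration and not the streaming algorithm of Buriol et al.; the function name and the scaling factor m(n-2)/3 (each triangle corresponds to 3 of the m(n-2) possible edge and node pairs) are assumptions of this sketch, since the slide does not show them.

```python
import random
from itertools import combinations


def edge_node_triangle_estimate(n, edges, r, seed=None):
    """Estimate the triangle count by sampling an edge and a third node."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    m = len(edge_list)
    hits = 0
    for _ in range(r):
        i, j = rng.choice(edge_list)  # uniform random edge (i, j)
        # Uniform random node k in V - {i, j}, drawn by rejection (needs n >= 3).
        k = rng.randrange(n)
        while k == i or k == j:
            k = rng.randrange(n)
        # Check whether edges (i, k) and (j, k) exist.
        if frozenset((i, k)) in edge_set and frozenset((j, k)) in edge_set:
            hits += 1
    # Each triangle is hit by 3 of the m * (n - 2) (edge, node) pairs.
    return hits / r * m * (n - 2) / 3


# Usage: in K5 every (edge, node) pair closes a triangle.
k5_edges = list(combinations(range(5), 2))
k5_edge_node_estimate = edge_node_triangle_estimate(5, k5_edges, r=100, seed=0)
```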


Outline Motivation Related Work Proposed Method Results Conclusion Extra DOULION, KDD 09

10


Our Sampling Approach
[Figure: graph G(V,E). The coin toss for edge (i,j) comes up HEADS: (i,j) “survives” and is reweighted to 1/p]


Our Sampling Approach
[Figure: graph G(V,E). The coin toss for edge (k,m) comes up TAILS: (k,m) “dies”]


Sampling approach

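The coin-tossing procedure illustrated on these slides can be sketched as follows: toss a coin with heads probability p for every edge, keep the survivors with weight 1/p, count triangles in the sparsified graph, and scale each surviving triangle by 1/p^3. The function names and the simple exact counter used on the sparsified graph are assumptions of this sketch, not the released implementation.

```python
import random
from itertools import combinations


def count_triangles(edges):
    """Exact triangle count by checking neighbor pairs of each node."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u, nbrs in adj.items():
        for v, w in combinations(nbrs, 2):
            if w in adj[v]:
                total += 1
    return total // 3  # each triangle is seen once from each of its 3 nodes


def doulion_estimate(edges, p, seed=None):
    """Sparsify with one coin toss per edge, then rescale the count."""
    rng = random.Random(seed)
    # HEADS (probability p): the edge survives. TAILS: the edge dies.
    kept = [e for e in edges if rng.random() < p]
    # Every surviving edge has weight 1/p, so a surviving triangle counts 1/p^3.
    return count_triangles(kept) / p ** 3


# Usage on K6, which has C(6, 3) = 20 triangles.
k6_edges = list(combinations(range(6), 2))
exact_k6 = count_triangles(k6_edges)
estimate_k6 = doulion_estimate(k6_edges, 0.5, seed=1)
```

With p = 1 every edge survives and the estimate equals the exact count; smaller p trades accuracy for a smaller graph.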


Our Sampling Approach on $K_n$
[Figure: $K_n$ is sparsified into $G_{n,0.5}$; triangle counts are shown initially, in expectation, and after weighting]
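The formulas of this figure are not in the extracted text. The following worked calculation is a reconstruction from the coin-tossing rule alone, assuming $p = 1/2$ as the label $G_{n,0.5}$ suggests.

```latex
% Initially: K_n contains every possible triangle.
\[ \Delta(K_n) = \binom{n}{3} \]
% In expectation: a triangle survives only if all three of its edges do,
% which happens with probability p^3 = 1/8.
\[ \mathbb{E}\left[\Delta(G_{n,0.5})\right] = p^3 \binom{n}{3} = \frac{1}{8}\binom{n}{3} \]
% Weighted: each surviving triangle has edge weights 1/p, so it counts 1/p^3 = 8.
\[ \frac{1}{p^3} \cdot \frac{1}{8}\binom{n}{3} = \binom{n}{3} \]
```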


Mean and Variance
• Δ = #triangles = k + (Δ - k), where k is the number of non-edge-disjoint triangles
• X: random variable, our estimate
• E[X] = Δ
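The claim E[X] = Δ can be checked exactly on a tiny graph by enumerating every heads/tails outcome. This is a brute-force sketch for illustration only (it is exponential in the number of edges); the function names and the example graph are assumptions.

```python
from fractions import Fraction
from itertools import combinations


def triangles_in(edge_subset):
    """Count triangles among the given edges by testing every node triple."""
    edge_set = {frozenset(e) for e in edge_subset}
    nodes = sorted({v for e in edge_subset for v in e})
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )


def exact_expected_estimate(edges, p):
    """E[X], computed by summing over all 2^m coin-toss outcomes."""
    m = len(edges)
    expectation = Fraction(0)
    for mask in range(2 ** m):
        kept = [edges[i] for i in range(m) if (mask >> i) & 1]
        prob = p ** len(kept) * (1 - p) ** (m - len(kept))
        # X = (number of surviving triangles) / p^3
        expectation += prob * triangles_in(kept) / p ** 3
    return expectation


# Two triangles sharing the edge (1, 2), so they are not edge-disjoint.
shared_edge_graph = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
expected_x = exact_expected_estimate(shared_edge_graph, Fraction(1, 2))
```

The shared edge does not affect the mean; it only matters for the variance.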


Outline Motivation Related Work Proposed Method Results Conclusion Extra DOULION, KDD 09

16


Doulion and Node Iterator
• Sparsify first and then use Node Iterator to count triangles.
• Node Iterator: consider each node and count how many edges exist among its neighbors.
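A minimal sketch of Node Iterator as described above: for each node, count the edges among its neighbors. The function name, the edge-list input, and the returned per-node dictionary are choices made for this example.

```python
from itertools import combinations


def node_iterator(edges):
    """Return (total triangles, triangles per node) for an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    per_node = {}
    for u, nbrs in adj.items():
        # Every edge between two neighbors of u closes a triangle through u.
        per_node[u] = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
    # Summing over nodes counts each triangle three times.
    return sum(per_node.values()) // 3, per_node


# Usage: K4 has 4 triangles, and every node lies in 3 of them.
k4_edges = list(combinations(range(4), 2))
k4_total, k4_per_node = node_iterator(k4_edges)
```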


Expected Speedup
• Expected speedup: $1/p^2$
• Proof: Let R be the running time of Node Iterator after the sparsification: [equation not recovered]. Therefore, the expected speedup is: [equation not recovered]
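The equations of the proof are missing from the extracted text. One way to see where $1/p^2$ comes from is sketched below, under an assumed cost model in which Node Iterator's work is the number of neighbor pairs it examines; this is not necessarily the argument given on the slide.

```latex
% Assumed cost model: T = work on the original graph, d_v = degree of v.
\[ T = \sum_{v \in V} \binom{d_v}{2} \]
% After sparsification the degree of v is D_v \sim \mathrm{Binomial}(d_v, p).
% A given pair of neighbors of v survives iff both edges survive: probability p^2.
\[ \mathbb{E}[R] = \sum_{v \in V} \mathbb{E}\binom{D_v}{2}
               = \sum_{v \in V} p^2 \binom{d_v}{2} = p^2 \, T \]
% Ratio of the original work to the expected work after sparsification:
\[ \frac{T}{\mathbb{E}[R]} = \frac{1}{p^2} \]
```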


Some results (I)
[Figures: results for two graphs, labeled “~3M, ~35M” and “~400K, ~2.1M”]


Some results (II)
[Figures: results for two graphs, labeled “~3.1M, ~37M” and “~3.6M, ~42M”]


Outline Motivation Related Work Proposed Method Results Conclusion Extra DOULION, KDD 09

21


Conclusions
• New sampling approach that counts triangles approximately.
• Basic analysis of the estimate (expectation, variance, expected speedup).
• Experiments on many real-world datasets showed that for constant p we get high-quality estimates and constant $1/p^2$ speedups.


Question
• Can p be smaller than constant? How small can we afford p to be and at the same time guarantee concentration?
• Could, e.g., p be as small as 1/ ???
• Motivation:

p        Speedup
0.001    $10^6$
0.005    $4 \cdot 10^4$
0.01     $10^4$
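The speedup column follows directly from the expected speedup of 1/p^2. A short check, using exact rational arithmetic to avoid floating-point noise (the function name is illustrative):

```python
from fractions import Fraction


def expected_speedup(p):
    """Expected speedup 1/p^2 for sparsification probability p."""
    return 1 / Fraction(p) ** 2


# Probabilities are given as strings so Fraction parses them exactly.
speedups = {p: expected_speedup(p) for p in ("0.001", "0.005", "0.01")}
```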


Outline Motivation Related Work Proposed Method Results Conclusion Extra DOULION, KDD 09

24


Approximate Triangle Counting
• Approximate Triangle Counting, arXiv preprint: http://arxiv.org/PS_cache/arxiv/pdf/0904/0904.3761v1.pdf
• C.E.T., M.N. Kolountzakis, G.L. Miller


Theorem (C.E.T., Kolountzakis, Miller 2009)
[Theorem statement not recovered; its annotated parts are labeled “Mildness, pick p=1” and “Concentration”]
How to choose p?


Practitioner’s Guide
• Pick p=1/ and keep doubling until concentration.
• Example: Wikipedia 2005, 1.6M nodes, 18.5M edges
[Figure: as p doubles, concentration appears and then becomes stronger]
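The doubling heuristic can be sketched as below. The slide does not give the starting value of p or a concrete test for "concentration", so both are assumptions here: the caller supplies a hypothetical `p_start`, and concentration is declared when `trials` independent estimates differ by at most a fraction `tol` of their mean.

```python
import random
from itertools import combinations


def count_triangles(edges):
    """Exact triangle count by checking neighbor pairs of each node."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = sum(
        1
        for nbrs in adj.values()
        for v, w in combinations(nbrs, 2)
        if w in adj[v]
    )
    return total // 3


def doulion_estimate(edges, p, rng):
    """Keep each edge with probability p, count, and rescale by 1/p^3."""
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3


def doubling_search(edges, p_start, trials=5, tol=0.1, seed=0):
    """Double p until repeated estimates agree (assumed concentration test)."""
    rng = random.Random(seed)
    p = p_start
    while True:
        estimates = [doulion_estimate(edges, p, rng) for _ in range(trials)]
        mean = sum(estimates) / trials
        spread = max(estimates) - min(estimates)
        # Stop when the estimates agree, or when p = 1 (the exact count).
        if p >= 1.0 or (mean > 0 and spread <= tol * mean):
            return p, mean
        p = min(1.0, 2 * p)


# Usage on K8 (56 triangles), starting from a hypothetical p = 1/16.
k8_edges = list(combinations(range(8), 2))
p_found, estimate_found = doubling_search(k8_edges, 1 / 16)
```

On a graph this small the search may run all the way to p = 1; the heuristic is meant for large graphs where concentration appears at much smaller p.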


“Bad” Instances
• Remove edge (1,2)
• Remove any weighted edge with weight w sufficiently large


Thanks!
• http://www.cs.cmu.edu/~ctsourak/projects.html
• Code and datasets available: graphminingtoolbox@gmail.com (HADOOP, MATLAB, JAVA implementations along with small real-world graphs; all datasets used are on the web)

“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software environment and the complete set of instructions which generated the figures.” Buckheit and Donoho [BD95]


References
• [BBCG08] Becchetti, Boldi, Castillo, Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs.
• [YPSB05] Ye, Peyser, Spencer, Bader. Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast.
• [EM02] Eckmann, Moses. Curvature of co-links uncovers hidden thematic layers in the World Wide Web.
• [CET08] C. Tsourakakis. Fast Counting of Triangles in Large Real-World Networks: Algorithms and Laws.
• [BD95] Buckheit, Donoho. Wavelab and reproducible research.
• [WF94] Wasserman, Faust. Social Network Analysis: Methods and Applications.
• [BFLSS06] Buriol, Frahling, Leonardi, Marchetti-Spaccamela, Sohler. Counting triangles in data streams.

