Charalampos (Babis) Tsourakakis
Carnegie Mellon University
KDD '09, Paris
Joint work with: U Kang, Gary L. Miller, Christos Faloutsos
DOULION, KDD 09
Outline
• Motivation
• Related Work
• Proposed Method
• Results
• Conclusion
• Extra
Why is Triangle Counting important?
• Clustering coefficient
• Transitivity ratio
• Social Network Analysis fact: "Friends of friends are friends" [WF94]
• Hidden Thematic Structure of the Web (Eckmann et al., PNAS [EM02])
• Motif Detection (e.g., [YPSB05])
• Web Spam Detection (Becchetti et al., KDD '08 [BBCG08])
[Figure: a triangle on nodes A, B, C]
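Personal Motivation
$$\delta(i) = \frac{1}{2} \sum_{j=1}^{|V|} \lambda_j^3 u_{ji}^2$$
where the $\lambda_j$ are the eigenvalues of the adjacency matrix and $u_{ji}$ is the i-th entry of the j-th eigenvector.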
[CET08] Political Blogs
Keep only 3!
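The per-node formula for δ(i) follows from a standard spectral identity. The derivation below is a sketch; $A$ (the adjacency matrix of a simple undirected graph) and $u_j$ (its j-th orthonormal eigenvector) are notation assumed here rather than taken from the slide.

```latex
\begin{align*}
A &= \sum_{j=1}^{|V|} \lambda_j u_j u_j^{T}
  &&\text{(eigendecomposition of the symmetric matrix $A$)}\\
(A^3)_{ii} &= \sum_{j=1}^{|V|} \lambda_j^3 u_{ji}^2
  &&\text{(closed walks of length 3 that start and end at node $i$)}\\
\delta(i) &= \tfrac{1}{2}\,(A^3)_{ii}
  = \tfrac{1}{2} \sum_{j=1}^{|V|} \lambda_j^3 u_{ji}^2
  &&\text{(each triangle at $i$ is walked in two directions)}
\end{align*}
```

Because the terms are weighted by $\lambda_j^3$, the eigenvalues of largest magnitude dominate the sum, which is why truncating to a few of them can already give a good approximation.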
Counting methods

Dense graphs
                     Fast         Low space
  Time complexity    O(n^2.37)    O(n^3)
  Space complexity   O(n^2)       O(m)

Sparse graphs
                     Fast                           Low space
  Time complexity    O(m^0.7 n^1.2 + n^(2+o(1)))    e.g. O( n )
  Space complexity   Θ(n^2) (eventually)            Θ(m)

Matrix Multiplication not practical
M. Latapy, Theory and Experiments
Naive Sampling
r independent samples of three distinct vertices
[Figure: a sampled triple with i edges belongs to class Ti; X=1 for T3 (a triangle), X=0 for T0, T1, T2]
Naive Sampling
r independent samples of three distinct vertices
Then the following holds:
with probability at least 1-δ
Works
Prohibitive for graphs with T3 = o(n^2).
e.g., T3 n^2 log n
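A minimal sketch of the naive sampler, assuming nodes are labelled 0..n-1 and the graph is given as a list of undirected edges without duplicates. The function name and the scaling by C(n, 3) (the total number of triples, T0+T1+T2+T3) are illustrative choices, not taken from the slides.

```python
import random
from math import comb


def naive_triangle_estimate(n, edges, r, seed=0):
    """Estimate T3 from r uniform samples of three distinct vertices."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(range(n), 3)
        # X = 1 only when all three edges are present (the triple is in T3)
        if (frozenset((a, b)) in edge_set
                and frozenset((a, c)) in edge_set
                and frozenset((b, c)) in edge_set):
            hits += 1
    # Fraction of sampled triples that are triangles, scaled by C(n, 3)
    return hits / r * comb(n, 3)
```

When T3 is small relative to the number of triples, almost every sample has X=0, so r must grow accordingly before the estimate is useful.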
Buriol, Frahling, Leonardi, Marchetti-Spaccamela, Sohler
Sample uniformly at random an edge (i,j) and a node k in V-{i,j}
Check if edges (i,k) and (j,k) exist in E(G)
samples
[Figure: edge (i,j) and node k, with edges (i,k) and (j,k) marked "?"]
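The two steps above can be sketched as follows. This is an illustrative, in-memory version under stated assumptions, not the streaming algorithm itself: nodes are labelled 0..n-1 with n ≥ 3, the edge list has no duplicates or self-loops, and the scaling factor m(n-2)/3 comes from counting that each triangle is hit by exactly three (edge, node) pairs out of m(n-2). The function name is hypothetical.

```python
import random


def edge_node_sampling_estimate(n, edges, r, seed=0):
    """Estimate the triangle count from r (edge, node) samples."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    m = len(edge_list)
    hits = 0
    for _ in range(r):
        i, j = rng.choice(edge_list)
        # Draw k uniformly from V - {i, j} by rejection
        k = rng.randrange(n)
        while k == i or k == j:
            k = rng.randrange(n)
        # Check whether edges (i,k) and (j,k) exist
        if frozenset((i, k)) in edge_set and frozenset((j, k)) in edge_set:
            hits += 1
    # Each triangle corresponds to 3 of the m * (n - 2) possible pairs
    return hits / r * m * (n - 2) / 3
```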
Our Sampling Approach
Toss a coin with heads probability p for each edge of G(V,E):
• HEADS! (i,j) "survives" and gets weight 1/p
• TAILS! (k,m) "dies"
Our Sampling Approach on K_n
[Figure: K_n initially; G_{n,0.5} in expectation after sampling; weighted graph]
Mean and Variance
Δ = #triangles = k + (Δ - k)
k: non-edge-disjoint triangles
X: random variable, our estimate
E[X] = Δ
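A minimal sketch of the estimator X, assuming the graph is a list of undirected edges without duplicates. Since every surviving edge carries weight 1/p, a surviving triangle contributes 1/p^3, and a triangle survives with probability p^3, which gives E[X] = Δ. The helper names are illustrative.

```python
import random


def count_triangles(edges):
    """Exact triangle count via neighbor-set intersections."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is found once per edge, hence the division by 3
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def doulion_estimate(edges, p, seed=0):
    """Sparsify with a coin per edge, then rescale the triangle count."""
    rng = random.Random(seed)
    # Keep each edge independently with probability p (0 < p <= 1)
    kept = [e for e in edges if rng.random() < p]
    # Three edges of weight 1/p each: every surviving triangle counts 1/p^3
    return count_triangles(kept) / p ** 3
```

With p = 1 nothing is discarded and the estimate equals the exact count, which is a convenient sanity check.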
Doulion and NodeIterator
Sparsify first and then use Node Iterator to count triangles.
Node Iterator: consider each node and count how many edges exist among its neighbors.
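The counting step can be sketched as below. This is an illustrative Node Iterator over an edge list, not the authors' implementation; the function name and the adjacency-set representation are assumptions.

```python
from itertools import combinations


def node_iterator(edges):
    """Count triangles by checking every pair of neighbors of every node."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for nbrs in adj.values():
        # Count the edges that exist among the neighbors of this node
        total += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # Every triangle is seen once from each of its three nodes
    return total // 3
```

The work per node is proportional to the number of neighbor pairs, so thinning the edge set before calling this function directly reduces the running time.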
Expected Speedup
Expected speedup: 1/p^2
Proof
Let R be the running time of Node Iterator after the sparsification:
Therefore, expected speedup: 1/p^2
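One way to fill in the proof is sketched below. It assumes the running time of Node Iterator is proportional to the number of neighbor pairs it examines; $d_v$ (degree of $v$ before sparsification), $D_v$ (degree after) and $T$ (original running time) are symbols introduced here for illustration.

```latex
\begin{align*}
T &= \sum_{v \in V} \binom{d_v}{2}
  &&\text{(neighbor pairs examined before sparsification)}\\
\mathbb{E}[R] &= \sum_{v \in V} \mathbb{E}\!\left[\binom{D_v}{2}\right]
  = \sum_{v \in V} p^2 \binom{d_v}{2} = p^2\, T
  &&\text{(a pair of edges at $v$ survives with probability $p^2$)}\\
\frac{T}{\mathbb{E}[R]} &= \frac{1}{p^2}
\end{align*}
```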
Some results (I)
~3M, ~35M
~400K, ~2.1M
Some results (II)
~3.1M, ~37M
~3.6M, ~42M
Conclusions
• New sampling approach that counts triangles approximately.
• Basic analysis of the estimate (expectation, variance, expected speedup).
• Experimentation on many real-world datasets, where we showed that for constant p we get high-quality estimates and constant 1/p^2 speedups.
Question
Can p be smaller than a constant? How small can we afford p to be and at the same time guarantee concentration? Could, e.g., p be as small as 1/ ???
Motivation:
p        Speedup
0.001    10^6
0.005    4*10^4
0.01     10^4
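The speedup column is just 1/p^2. A small sketch that reproduces it exactly, using `fractions.Fraction` on decimal strings to avoid floating-point rounding (the function name is illustrative):

```python
from fractions import Fraction


def expected_speedup(p):
    """Expected speedup 1/p^2, computed exactly from a decimal string."""
    return 1 / Fraction(p) ** 2


for p in ("0.001", "0.005", "0.01"):
    print(p, expected_speedup(p))
# prints:
# 0.001 1000000
# 0.005 40000
# 0.01 10000
```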
Approximate Triangle Counting
Arxiv preprint: http://arxiv.org/PS_cache/arxiv/pdf/0904/0904.3761v1.pdf
C.E.T, M.N. Kolountzakis, G.L. Miller
Theorem (C.E.T, Kolountzakis, Miller 2009)
How to choose p?
Mildness, pick p=1
Concentration
Practitioner's Guide
Wikipedia 2005: 1.6M nodes, 18.5M edges
Pick p=1/
Keep doubling until concentration
[Figure: concentration appears; concentration becomes stronger]
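The doubling heuristic could be sketched as follows. The slide does not say how concentration is detected, so the test used here (relative spread of a few repeated estimates below a tolerance) is an assumption, as are the starting value `p_start`, the parameters `trials` and `tol`, and all function names.

```python
import random


def count_triangles(edges):
    """Exact triangle count via neighbor-set intersections."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def estimate(edges, p, rng):
    """One sparsify-and-count estimate with keep probability p."""
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3


def choose_p_by_doubling(edges, p_start=0.01, trials=5, tol=0.1, seed=0):
    """Double p until repeated estimates agree to within tol (assumed test)."""
    rng = random.Random(seed)
    p = p_start
    while True:
        runs = [estimate(edges, p, rng) for _ in range(trials)]
        mean = sum(runs) / trials
        # Hypothetical concentration test: relative spread of the estimates
        if mean > 0 and (max(runs) - min(runs)) / mean <= tol:
            return p, mean
        if p >= 1.0:
            # Nothing is sampled away at p = 1, so this is the exact count
            return 1.0, mean
        p = min(1.0, 2 * p)
```

On a small graph the loop usually runs all the way to p = 1. The heuristic only pays off on graphs with enough triangles that the estimates stabilise at a small p.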
"Bad" Instances
Remove edge (1,2)
Remove any weighted edge, w sufficiently large
Thanks!
http://www.cs.cmu.edu/~ctsourak/projects.html
Code and datasets available: graphminingtoolbox@gmail.com
(HADOOP, MATLAB, JAVA implementations along with small real-world graphs; all datasets used are on the web)
"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software environment and the complete set of instructions which generated the figures." Buckheit and Donoho [BD95]
References
• Efficient semi-streaming algorithms for local triangle counting in massive graphs. Becchetti, Boldi, Castillo, Gionis [BBCG08]
• Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast. Ye, Peyser, Spencer, Bader [YPSB05]
• Curvature of co-links uncovers hidden thematic layers in the World Wide Web. Eckmann, Moses [EM02]
• Fast Counting of Triangles in Large Real-World Networks: Algorithms and Laws. C. Tsourakakis [CET08]
• Wavelab and reproducible research. Buckheit, Donoho [BD95]
• Social Network Analysis: Methods and Applications. Wasserman, Faust [WF94]
• Counting triangles in data streams. Buriol, Frahling, Leonardi, Marchetti-Spaccamela, Sohler [BFLSS06]