Charalampos (Babis) E. Tsourakakis ctsourak@math.cmu.edu
CSE 2011, Reno 1st March ‘11 CSE'11
Geoff Sanders, Lawrence Livermore
Mihail N. Kolountzakis, Math, University of Crete
Gary L. Miller, SCS, CMU
• Motivation
• Existing Work
• Spectral Family
• Combinatorial Family
• Experimental Results
• Conclusions
Friends of friends tend to become friends themselves!
[Figure: vertices A, B, C]
(Wasserman, Faust ‘94)
(left to right) Paul Erdős, Ronald Graham, Fan Chung Graham
Eckmann-Moses, Uncovering the Hidden Thematic Structure of the Web (PNAS, 2001)
Key Idea: Connected regions of high curvature (i.e., dense in triangles) indicate a common topic!
Triangles used for Web Spam Detection (Becchetti et al. KDD ‘08)
Key Idea: The triangle distribution among spam hosts is significantly different from that among non-spam hosts!
Triangles used for assessing Content Quality in Social Networks
(Welser, Gleave, Fisher, Smith, Journal of Social Structure, 2007)
Key Claim: The number of triangles in the self-centered social network of a user is a good indicator of the role of that user in the community!
(Watts, Strogatz ‘98)
Signed triangles in structural balance theory Jon Kleinberg
Triangle closing models are also used to model the microscopic evolution of social networks (Leskovec et al., KDD ‘08)
CAD applications, e.g., solving systems of geometric constraints involves triangle counting! (Fudos, Hoffmann 1997)
Numerous other applications, including:
• Motif Detection / Frequent Subgraph Mining (e.g., Protein-Protein Interaction Networks)
• Community Detection (Berry et al. ‘09)
• Outlier Detection (CET ‘08)
• Link Recommendation
Fast triangle counting algorithms are necessary.
There is no general, good definition, but typical characteristics include:
• Skewed degree distributions
• High clustering coefficients
• “Small world” characteristics (Six degrees of separation)
Alon, Yuster, Zwick
Asymptotically the fastest algorithm but not practical for large graphs.
In practice, one of the iterator algorithms is preferred.
• Node Iterator (count the edges among the neighbors of each vertex)
• Edge Iterator (count the common neighbors of the endpoints of each edge)
Both run asymptotically in O(mn) time.
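The two iterator algorithms can be sketched as follows. This is a minimal illustration, not the authors' optimized code; the helper names (`build_adjacency`, `node_iterator`, `edge_iterator`) and the small example graph are assumptions made for this sketch.

```python
from itertools import combinations

def build_adjacency(edges):
    """Adjacency sets for a simple undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def node_iterator(adj):
    """Count the edges among the neighbors of each vertex."""
    total = 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                total += 1
    # every triangle is seen once from each of its 3 vertices
    return total // 3

def edge_iterator(adj):
    """Count the common neighbors of the endpoints of each edge."""
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            total += len(adj[u] & adj[v])
    # every triangle is seen once per edge (3) and per direction (2)
    return total // 6

# Hypothetical example: the complete graph K4 plus one pendant vertex.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
adj = build_adjacency(edges)
print(node_iterator(adj), edge_iterator(adj))
```

Both functions return the same count; they differ only in whether the work is organized around vertices or around edges.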
Remarks
The idea of partitioning the vertices into “large” and “small” degree ones and treating each group appropriately appears in Alon, Yuster, Zwick.
For more work, see references in our paper:
▪ Itai, Rodeh (STOC ‘77)
▪ Papadimitriou, Yannakakis (IPL ‘81)
▪ …
r independent samples of three distinct vertices
Then the following holds:
with probability at least 1-δ
Works for dense graphs, e.g., $T_3 \ge n^2 \log n$
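A minimal sketch of the triple sampling idea: draw r independent triples of distinct vertices, record the fraction that form a triangle, and scale by the total number of triples. The function name, the edge representation (a set of `frozenset` pairs), and the scaling step are assumptions of this sketch; the slide's exact bound on r is not reproduced here.

```python
import random

def triple_sampling_estimate(n, edge_set, r, seed=0):
    """Estimate the triangle count from r uniformly sampled vertex triples.

    edge_set holds each undirected edge as frozenset((u, v)), vertices 0..n-1.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(range(n), 3)  # three distinct vertices
        if (frozenset((a, b)) in edge_set
                and frozenset((a, c)) in edge_set
                and frozenset((b, c)) in edge_set):
            hits += 1
    total_triples = n * (n - 1) * (n - 2) // 6
    # fraction of sampled triples that are triangles, scaled to all triples
    return hits / r * total_triples

# Hypothetical example: in the complete graph K6 every triple is a triangle.
n = 6
complete = {frozenset((u, v)) for u in range(n) for v in range(u + 1, n)}
estimate = triple_sampling_estimate(n, complete, r=200)
print(estimate)
```

On sparse graphs almost every sampled triple misses, which is why the approach needs many triangles to be useful.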
(Bar-Yossef, Kumar, Sivakumar ‘02) require $n^2/\mathrm{polylog}(n)$ edges
More follow-up work:
(Jowhari, Ghodsi ‘05)
(Buriol, Frahling, Leonardi, Marchetti-Spaccamela, Sohler ‘06)
(Becchetti, Boldi, Castillo, Gionis ‘08)
CET, [ICDM ’08]
[Formula labels: eigenvalues of adjacency matrix, i-th eigenvector]
Key Idea: A few top eigenvalue-eigenvector pairs typically give a good approximation to the number of triangles.
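A minimal sketch of the spectral idea, assuming the formula in question is the standard identity that the total number of triangles equals one sixth of the sum of cubed adjacency eigenvalues, and that the triangles through vertex i equal one half of the sum of cubed eigenvalues weighted by the squared i-th entries of the eigenvectors. Power iteration with deflation stands in here for the Lanczos method; all function names are hypothetical.

```python
import math
import random

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def top_eigenpairs(A, k, iters=300, seed=0):
    """Top-k eigenpairs (largest |eigenvalue| first) of a symmetric matrix,
    via power iteration with deflation (a simple stand-in for Lanczos)."""
    rng = random.Random(seed)
    n = len(A)
    B = [[float(a) for a in row] for row in A]
    pairs = []
    for _pair in range(k):
        x = [rng.gauss(0.0, 1.0) for _i in range(n)]
        for _it in range(iters):
            y = matvec(B, x)
            norm = math.sqrt(sum(v * v for v in y))
            if norm < 1e-12:  # nothing left to extract
                break
            x = [v / norm for v in y]
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
        lam = sum(a * b for a, b in zip(x, matvec(B, x)))  # Rayleigh quotient
        pairs.append((lam, x))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                B[i][j] -= lam * x[i] * x[j]
    return pairs

def total_triangles(pairs):
    return sum(lam ** 3 for lam, _u in pairs) / 6.0

def local_triangles(pairs, i):
    return sum(lam ** 3 * u[i] ** 2 for lam, u in pairs) / 2.0

# Hypothetical example: K4 has eigenvalues 3, -1, -1, -1 and 4 triangles.
K4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
pairs = top_eigenpairs(K4, 4)
rank1 = total_triangles(pairs[:1])  # uses only the top eigenvalue
full = total_triangles(pairs)       # uses the whole spectrum
print(rank1, full)
```

Plain power iteration can stall when two eigenvalues have equal magnitude and opposite sign (for example on bipartite graphs), so a production implementation would use a proper Lanczos routine.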
Keep only 3!
Political Blogs Network (1.2K, 17K) (Adamic, Glance ‘04)
The few top eigenvalues are significantly larger than the bulk of the eigenvalues (“Eigenvalue power law”).
Hence, they contribute a lot to the number of triangles, and cubing amplifies this even more.
The bulk of the eigenvalues is almost symmetrically distributed around 0, so their cubes cancel out.
The Lanczos method converges fast due to large eigengaps.
Pearson’s correlation coefficient ρ=0.9997 using a rank 10 approximation
Political Blogs Network (1.2K, 17K) (Adamic, Glance ‘04)
Note: a rank 3 approximation already gives almost perfect results.
CET, [KAIS ’11]
Key idea:
Sample the i-th column A(i) of the adjacency matrix with probability proportional to the degree of the i-th vertex and scale it “appropriately”.
Compute a low rank approximation of the sampled matrix using SVD.
Observation 1: Eigendecomposition <-> SVD when the matrix is symmetric, i.e., eigenvectors = left singular vectors and $\lambda_i = \sigma_i \, \mathrm{sgn}(u_i^\top v_i)$ (where $\lambda_i$, $\sigma_i$ are the eigenvalue and singular value respectively, and $u_i$, $v_i$ the left and right singular vectors respectively).
Observation 2: We care about a rank-k approximation $A_k$ of A, where k is small.
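One way to see why Observation 1 holds, written as a short derivation. It assumes a real symmetric A with nonzero eigenvalues; the explicit choice of right singular vectors below is an assumption made for the illustration.

```latex
\begin{align*}
A &= \sum_i \lambda_i \, u_i u_i^\top
   && \text{(eigendecomposition, $u_i$ orthonormal)} \\
\sigma_i &= |\lambda_i|, \qquad v_i = \mathrm{sgn}(\lambda_i)\, u_i
   && \text{(choice of singular values and right singular vectors)} \\
A &= \sum_i \sigma_i \, u_i v_i^\top
   && \text{(a valid SVD, since the $v_i$ are also orthonormal)} \\
u_i^\top v_i &= \mathrm{sgn}(\lambda_i)
   \;\Longrightarrow\;
   \lambda_i = \sigma_i \, \mathrm{sgn}(u_i^\top v_i)
\end{align*}
```

So the signed eigenvalues needed for the cubes can be read off an SVD by checking whether each left and right singular vector pair agrees or disagrees in sign.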
Frieze, Kannan, Vempala
(1) Pick column i with probability proportional to its squared length
(2) Use the sampled matrix to obtain a good low rank approximation to the original one
Idea: Sample c columns, obtain $\tilde{A}$ and find $\tilde{A}_k$ instead of the optimal $A_k$. Recover signs from left and right singular vectors. Use EigenTriangle!
Results: c=100, k=6 for Flickr (400k, 2M): 95.6% accuracy
Success is based on empirical properties:
Real world networks typically, but not always, satisfy the properties shown before.
Very little is known about the spectrum; most of what we know concerns the top eigenvalues.
Even less is known about the eigenvectors of real world networks.
Approximate a given graph G with a sparse graph H, such that H is close to G under a certain notion.
Examples:
Cut preserving: Benczúr-Karger
Spectral Sparsifier: Spielman-Teng
What about Triangle Sparsifiers?
Speedup: e.g., $1/p^2$ if we use any standard iterator method
Setting p optimally using the “median boosting trick” (Jerrum, Valiant, Vazirani ‘86)
Sampling in expected sublinear time O(pm)
Can justify even O(n) speedups in graphs with sufficiently many triangles.
Practice: huge speedups, high accuracy
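A minimal sketch of a triangle sparsifier, assuming the scheme is: keep each edge independently with probability p, count triangles exactly in the sampled graph, and scale the count by 1/p³. A triangle survives only if all three of its edges are kept, which happens with probability p³, so the scaled count is an unbiased estimate. The function names and the example graph are hypothetical, and this simple version touches every edge rather than sampling in O(pm) expected time.

```python
import random

def count_triangles(edges):
    """Exact count; assumes each undirected edge appears once, no self-loops."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # each triangle is found once from each of its 3 edges
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

def sparsify_and_count(edges, p, seed=0):
    """Keep each edge with probability p, count, and rescale by 1/p^3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3

# Hypothetical example: the complete graph K12 has C(12, 3) = 220 triangles.
edges = [(u, v) for u in range(12) for v in range(u + 1, 12)]
exact = count_triangles(edges)
estimate = sparsify_and_count(edges, p=0.5, seed=1)
print(exact, estimate)
```

The exact counter only sees about a p fraction of the edges, which is where the speedup comes from.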
McSherry, Achlioptas
Sparsify matrix A appropriately
Compute faster a low rank approximation which is “good” in terms of any reasonable norm (e.g., Frobenius, 2-norm)
CET et al. [ASONAM ‘09]: Speeds up spectra computations while not affecting triangle estimates
MACH: Fast Randomized Tensor Decompositions (CET, SDM ’10)
Theoretical guarantees on HOSVD decompositions for dense tensors; works great in practice for Tucker decompositions too.
Theorem
If […] then with probability $1 - 1/n^{3-d}$ the sampled graph has a triangle count that ε-approximates the true number of triangles for any 0<d<3.
Every graph on n vertices with max. degree Δ(G) = k is (k+1)-colorable with all color classes differing in size by at most 1.
[Figure: color classes 1, 2, …, k+1]
Create an auxiliary graph where each triangle is a vertex and two vertices are connected iff the corresponding triangles share an edge.
Observe: the auxiliary graph has Δ(G) = O(n), since each of the three edges of a triangle lies in at most n−2 triangles.
Invoke the Hajnal-Szemerédi theorem and apply a Chernoff bound to each chromatic class.
Finally, take a union bound. Q.E.D.
Kolountzakis, Miller, Peng, CET, Int. Math. ‘11
Given a graph G with n vertices and m edges, which graph maximizes the number of edges in the line graph L(G)?
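To experiment with this question, the number of edges of L(G) can be computed from the degree sequence alone: two edges of a simple graph G are adjacent in L(G) exactly when they share an endpoint, so each vertex of degree d contributes d(d−1)/2 edges. The sketch below only evaluates candidate graphs and does not answer the extremal question; the function name and the two example graphs are assumptions of this illustration.

```python
from collections import Counter
from math import comb

def line_graph_edge_count(edges):
    """Number of edges of L(G) for a simple graph G given as an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # pairs of edges meeting at a vertex of degree d: C(d, 2)
    return sum(comb(d, 2) for d in deg.values())

# Two hypothetical graphs, both with n = 5 vertices and m = 4 edges.
star = [(0, i) for i in range(1, 5)]
path = [(i, i + 1) for i in range(4)]
print(line_graph_edge_count(star), line_graph_edge_count(path))
```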
Orkut (3.1M, 117M)
LiveJournal (5.4M, 48M)
YouTube (1.2M, 3M)
Flickr (1.9M, 15.6M)
Web-EDU (9.9M, 46.3M)
Social networks abundant in triangles!
[Figure: running time in secs (axis from 0 to 250) of Exact, Triple Sampling and Hybrid on Orkut, Flickr, Livejournal, Wiki-2006 and Wiki-2007. Accuracy ~99%]
p was set to 0.1. More sophisticated techniques for setting p exist (CET, Kolountzakis, Miller), using a doubling procedure.
From our results, there is no clear winner, but the hybrid algorithm achieves both high accuracy and speed.
Our code, even our exact algorithm, outperforms the fastest approximate counting competitors' code; hence we compared different versions of our own code!
Real world graphs thought of as “planar graphs”: many problems can be solved more efficiently than in the general case.
Spectral algorithm designed based on special empirical spectral properties
Triangle Sparsifiers (fast, with strong theoretical guarantees)
“Interplay”: Combinatorial-Spectral approach
MACH for HOSVD
Degree based partitioning is a very practical “trick”
State of the art results for sampling based and semi-streaming triangle counting algorithms
• Triangles in Kronecker graphs [CET ICDM ’08]
• Triangle Power Laws [CET ICDM ’08]
• Random projections and counting triangles [Kolountzakis, Miller, Peng, CET ‘11]
• Semistreaming model with low space usage and only 3 passes over the graph stream [Kolountzakis, Miller, Peng, CET ‘11]
• MapReduce implementation [CET et al., KDD ’09]
• High quality code with optimized cache properties
Remove edge (1,2)
Remove any weighted edge with w sufficiently large
Spielman-Srivastava and Benczúr-Karger sparsifiers also don’t work!
THANK YOU! QUESTIONS
621,963,073
Hybrid vs. Naïve Sampling: improves accuracy, increases running time
Best method for our applications: best running time, high accuracy