Counting Triangles in Real-World Networks by Charalampos Tsourakakis

Charalampos (Babis) E. Tsourakakis ctsourak@math.cmu.edu

CSE 2011, Reno 1st March ‘11 CSE'11

Geoff Sanders, Lawrence Livermore

Mihail N. Kolountzakis, Math, University of Crete

Gary L. Miller, SCS, CMU

• Motivation
• Existing Work
• Spectral Family
• Combinatorial Family
• Experimental Results
• Conclusions

Friends of friends tend to become friends themselves!

(Wasserman, Faust ‘94)

(left to right) Paul Erdős, Ronald Graham, Fan Chung Graham

Eckmann-Moses, Uncovering the Hidden Thematic Structure of the Web (PNAS, 2001)
Key Idea: Connected regions of high curvature (i.e., dense in triangles) indicate a common topic!

Triangles used for Web Spam Detection (Becchetti et al. KDD ‘08)

Key Idea: The triangle distribution among spam hosts is significantly different from that among non-spam hosts!

Triangles used for assessing Content Quality in Social Networks
Welser, Gleave, Fisher, Smith, Journal of Social Structure 2007
Key Claim: The number of triangles in the self-centered social network of a user is a good indicator of the role of that user in the community!



(Watts, Strogatz ’98)



Signed triangles in structural balance theory Jon Kleinberg



Triangle closing models also used to model the microscopic evolution of social networks (Leskovec et al., KDD ‘08)

CAD applications, e.g., solving systems of geometric constraints involves triangle counting! (Fudos, Hoffman 1997)

Numerous other applications including:
• Motif Detection / Frequent Subgraph Mining (e.g., Protein-Protein Interaction Networks)
• Community Detection (Berry et al. ‘09)
• Outlier Detection (CET ‘08)
• Link Recommendation
Fast triangle counting algorithms are necessary.

There is no general, good definition of a real-world network, but typical characteristics include:
• Skewed degree distributions
• High clustering coefficients
• “Small world” characteristics (Six degrees of separation)


Alon, Yuster, Zwick: asymptotically the fastest algorithm, but not practical for large graphs.

In practice, one of the iterator algorithms is preferred.
• Node Iterator (count the edges among the neighbors of each vertex)
• Edge Iterator (count the common neighbors of the endpoints of each edge)
Both run asymptotically in O(mn) time.
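The two iterator methods can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the talk; the helper names (`build_adjacency`, `node_iterator`, `edge_iterator`) and the toy edge list are assumptions made for the example, and vertices are assumed to be comparable (e.g., integers).

```python
from itertools import combinations

def build_adjacency(edges):
    """Return {vertex: set(neighbors)} for an undirected simple graph."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def node_iterator(adj):
    """Count the edges among the neighbors of each vertex."""
    total = 0
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                total += 1
    return total // 3  # each triangle is seen once from each of its 3 vertices

def edge_iterator(adj):
    """Count the common neighbors of the endpoints of each edge."""
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:  # visit each undirected edge once
                total += len(adj[u] & adj[v])
    return total // 3  # each triangle is seen once from each of its 3 edges

# Hypothetical toy graph with two triangles: (0, 1, 2) and (1, 2, 3).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
adj = build_adjacency(edges)
print(node_iterator(adj), edge_iterator(adj))
```

Both functions return the same count; they differ only in whether the outer loop runs over vertices or over edges.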

Remarks
• In Alon, Yuster, Zwick appears the idea of partitioning the vertices into “large” and “small” degree and treating them appropriately.
• For more work, see references in our paper:
  ▪ Itai, Rodeh (STOC ‘77)
  ▪ Papadimitriou, Yannakakis (IPL ‘81)
  ……
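The degree-partitioning idea can be illustrated as follows. This is only a sketch under stated assumptions: the `threshold` parameter and the function name are hypothetical, and the high-degree part is handled here by plain enumeration, whereas Alon, Yuster and Zwick use fast matrix multiplication for it.

```python
from itertools import combinations

def count_triangles_by_degree_split(adj, threshold):
    """Count triangles after splitting vertices into low and high degree."""
    low = {v for v in adj if len(adj[v]) <= threshold}
    high = set(adj) - low
    count = 0

    # Triangles with at least one low-degree vertex: look at neighbor pairs
    # of each low-degree vertex, and charge every such triangle to its
    # smallest low-degree vertex so it is counted exactly once.
    for v in low:
        for a, b in combinations(adj[v], 2):
            if b in adj[a]:
                if min(x for x in (v, a, b) if x in low) == v:
                    count += 1

    # Triangles whose three vertices all have high degree: count each one
    # from its smallest vertex. (Plain enumeration, an assumption of this
    # sketch; not the matrix-multiplication step of the original algorithm.)
    for v in high:
        higher = [u for u in adj[v] if u in high and u > v]
        for a, b in combinations(higher, 2):
            if b in adj[a]:
                count += 1
    return count

# Hypothetical toy graph: K4 on {0,1,2,3}, a triangle (3,4,5), a pendant 6.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (3, 4), (4, 5), (3, 5), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(count_triangles_by_degree_split(adj, threshold=2))
```

The result does not depend on the threshold; the threshold only decides which of the two counting routines handles a given triangle.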



r independent samples of three distinct vertices

Then the following holds:

with probability at least 1-δ

Works for dense graphs, e.g., T3 ≥ n² log n
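A minimal sketch of naive triple sampling, assuming the estimator is the fraction of sampled triples that form a triangle, scaled by the number of vertex triples. The function name, the `seed` parameter, and the toy graph are assumptions for illustration; the sample-size bound from the slide is not reproduced here.

```python
import random
from math import comb

def triple_sampling_estimate(adj, r, seed=None):
    """Estimate the triangle count from r uniform samples of 3 distinct vertices."""
    rng = random.Random(seed)
    vertices = list(adj)
    n = len(vertices)
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(vertices, 3)  # three distinct vertices
        if b in adj[a] and c in adj[a] and c in adj[b]:
            hits += 1
    # The fraction of triangle-forming triples, times the number of triples.
    return hits / r * comb(n, 3)

# Hypothetical toy graph: the complete graph K6, where every triple is a triangle.
k6 = {v: {u for u in range(6) if u != v} for v in range(6)}
print(triple_sampling_estimate(k6, r=50, seed=0))
```

On sparse graphs most sampled triples contain no triangle, which is why this estimator needs many triangles (a dense graph) to be accurate with few samples.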

• (Bar-Yossef, Kumar, Sivakumar ‘02) require n²/polylog(n) edges
• More follow-up work:
  ▪ (Jowhari, Ghodsi ‘05)
  ▪ (Buriol, Frahling, Leonardi, Marchetti-Spaccamela, Sohler ‘06)
  ▪ (Becchetti, Boldi, Castillo, Gionis ‘08)


CET, [ICDM ’08]

eigenvalues of adjacency matrix i-th eigenvector

Key Idea: A few top eigenvalue-eigenvector pairs typically give a good approximation to the number of triangles.
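The key idea rests on a standard identity: the number of triangles equals one sixth of the sum of the cubed eigenvalues of the adjacency matrix, because trace(A³) counts closed walks of length 3. The sketch below assumes NumPy is available and computes the full spectrum for simplicity (the talk mentions the Lanczos method for the top eigenpairs instead); the function name is hypothetical.

```python
import numpy as np

def eigen_triangle(A, k):
    """Estimate the triangle count from the k eigenvalues of largest magnitude."""
    eigenvalues = np.linalg.eigvalsh(A)  # A is assumed symmetric
    top = sorted(eigenvalues, key=abs, reverse=True)[:k]
    return sum(lam ** 3 for lam in top) / 6.0

# Hypothetical toy graph: K4 has 4 triangles and eigenvalues 3, -1, -1, -1.
A = np.ones((4, 4)) - np.eye(4)
exact = eigen_triangle(A, 4)   # all eigenvalues: (27 - 3) / 6
approx = eigen_triangle(A, 1)  # top eigenvalue only: 27 / 6
print(exact, approx)
```

Even on this tiny graph the single top eigenvalue already accounts for most of the count, which is the effect the slide describes for real-world networks.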

Keep only 3!

Political Blogs Network (1.2K, 17K) (Adamic, Glance ‘04)

• The few top eigenvalues are significantly larger than the bulk of the eigenvalues (“Eigenvalue power law”)
• Hence, they contribute a lot to the number of triangles, and cubing amplifies this even more.
• The bulk of the eigenvalues is almost symmetrically distributed around 0, so their cubes cancel out.
• The Lanczos method converges fast due to large eigengaps.

Pearson’s correlation coefficient ρ=0.9997 using a rank 10 approximation

Political Blogs Network (1.2K, 17K) (Adamic, Glance ‘04)

Note: with a rank 3 approximation, almost perfect results

CET, [KAIS ’11]

Key idea
• Sample the i-th column A(i) of the adjacency matrix with probability proportional to the degree of the i-th vertex and scale it “appropriately”
• Compute a low rank approximation of the sampled matrix using SVD.

• Observation 1: Eigendecomposition <-> SVD when the matrix is symmetric, i.e.,
  ▪ eigenvectors = left singular vectors
  ▪ λi = σi sgn(ui^T vi), where λi and σi are the i-th eigenvalue and singular value respectively, and ui and vi are the left and right singular vectors respectively.
• Observation 2: We care about a rank-k approximation Ak of A, where k is small.
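Observation 1 can be checked numerically. The sketch below assumes NumPy and a symmetric matrix in which no eigenvalue λ occurs together with −λ (otherwise the singular vectors are not uniquely determined and the sign rule can fail); the function name is hypothetical.

```python
import numpy as np

def eigenvalues_from_svd(A):
    """Recover the eigenvalues of a symmetric matrix from its SVD."""
    U, sigma, Vt = np.linalg.svd(A)
    # Column i of U is u_i and row i of Vt is v_i, so this computes
    # sgn(u_i . v_i) for every singular triple at once.
    signs = np.sign(np.sum(U * Vt.T, axis=0))
    return sigma * signs

# Hypothetical toy graph: K4, with eigenvalues 3, -1, -1, -1.
A = np.ones((4, 4)) - np.eye(4)
lam = eigenvalues_from_svd(A)
triangles = np.sum(lam ** 3) / 6.0
print(np.sort(lam), triangles)
```

With the signs recovered, the cubed values can be summed exactly as in the eigenvalue-based estimate.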

Frieze, Kannan, Vempala
(1) Pick column i with probability proportional to its squared length
(2) Use the sampled matrix to obtain a good low rank approximation to the original one

• Idea: Sample c columns, obtain Ã and find Ãk instead of the optimal Ak. Recover signs from left and right singular vectors. Use EigenTriangle!
• Results: c=100, k=6 for Flickr (400k, 2M): 95.6% accuracy
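Step (1) can be sketched as follows, assuming NumPy. The scaling 1/sqrt(c·p_i) is the usual choice for length-squared column sampling and is an assumption here, since the slides do not spell out the scaling; the function name and the toy matrix are hypothetical. For a 0/1 adjacency matrix the squared length of column i is the degree of vertex i.

```python
import numpy as np

def sample_columns(A, c, seed=None):
    """Sample c columns with probability proportional to squared length, then rescale."""
    rng = np.random.default_rng(seed)
    sq_len = (A ** 2).sum(axis=0)   # equals the degrees for a 0/1 adjacency matrix
    probs = sq_len / sq_len.sum()
    picked = rng.choice(A.shape[1], size=c, replace=True, p=probs)
    # Dividing column i by sqrt(c * p_i) makes C @ C.T an unbiased
    # estimate of A @ A.T.
    return A[:, picked] / np.sqrt(c * probs[picked])

# Hypothetical toy graph: K4 on {0,1,2,3} plus a pendant vertex 4 attached to 3.
A = np.zeros((5, 5))
for u, v in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1.0

C = sample_columns(A, c=3, seed=0)
print(C.shape, (C ** 2).sum(), (A ** 2).sum())
```

With this scaling every sampled column has the same squared length, so the Frobenius norm of the sampled matrix matches that of A exactly; a low rank approximation of the small matrix can then be computed with an SVD.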

• Success is based on empirical properties:
  ▪ Real-world networks typically satisfy the properties shown before, but not always.
• Very little is known about the spectrum; most of what we know concerns the top eigenvalues.
• Even less is known about the eigenvectors of real-world networks.


• Approximate a given graph G with a sparse graph H, such that H is close to G in a certain notion.
• Examples:
  ▪ Cut preserving: Benczur-Karger
  ▪ Spectral Sparsifier: Spielman-Teng

What about Triangle Sparsifiers?

• Speedup: e.g., 1/p² if we use any standard iterator method
• Setting p optimally using “median boosting trick” (Jerrum, Valiant, Vazirani ‘86)
• Sampling in expected sublinear time O(pm)
• Can justify even O(n) speedups in graphs with sufficiently many triangles.
• Practice: huge speedups, high accuracy
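A minimal sketch of sparsify-then-count, assuming the sparsifier keeps each edge independently with probability p and the count on the sampled graph is rescaled by 1/p³ (a triangle survives only if all three of its edges do). The function name is hypothetical, the edge list is assumed to contain each undirected edge once, and this simple loop takes O(m) time to sample rather than the expected O(pm) mentioned above.

```python
import random

def sparsified_triangle_estimate(edges, p, seed=None):
    """Keep each edge with probability p, count triangles, rescale by 1/p^3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]

    adj = {}
    for u, v in kept:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Edge iterator on the sparsified graph: common neighbors per kept edge.
    total = sum(len(adj[u] & adj[v]) for u, v in kept)
    sampled_triangles = total // 3  # each triangle is seen from its 3 edges
    return sampled_triangles / p ** 3

# Hypothetical toy graph with two triangles: (0, 1, 2) and (1, 2, 3).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
print(sparsified_triangle_estimate(edges, p=1.0))
print(sparsified_triangle_estimate(edges, p=0.5, seed=1))
```

With p = 1 nothing is dropped and the exact count is returned; smaller p trades accuracy for a smaller graph to run the iterator on.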

McSherry, Achlioptas: Sparsify matrix A appropriately, then compute faster a low rank approximation which is “good” in terms of any reasonable norm (e.g., Frobenius, 2-norm).

CET et al. [ASONAM ‘09]: Speeds up spectra computations while not affecting triangle estimates.
MACH: Fast Randomized Tensor Decompositions (CET, SDM’10): Theoretical guarantees on HOSVD decompositions for dense tensors; works great in practice for Tucker decompositions too.

Theorem
If … then with probability 1-1/n^(3-d) the sampled graph has a triangle count that ε-approximates the true number of triangles, for any 0<d<3.

Every graph on n vertices with max. degree Δ(G)=k is (k+1)-colorable with all color classes differing in size by at most 1.

[Figure: color classes numbered up to k+1]

• Create an auxiliary graph where each triangle is a vertex and two vertices are connected iff the corresponding triangles share an edge. Observe: Δ(G)=O(n)
• Invoke the Hajnal-Szemerédi theorem and apply a Chernoff bound to each chromatic class. Finally, take a union bound. Q.E.D.
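The auxiliary graph from the proof sketch can be built explicitly on a small example. The code below is an illustration only; the function name is hypothetical, triangles are found by brute force, and vertices are assumed to be comparable. The degree bound follows because a triangle has 3 edges and each edge lies in at most n-3 other triangles, so the maximum degree is at most 3(n-3).

```python
from itertools import combinations

def triangle_auxiliary_graph(adj):
    """Vertices are triangles; two are adjacent iff the triangles share an edge."""
    triangles = [t for t in combinations(sorted(adj), 3)
                 if t[1] in adj[t[0]] and t[2] in adj[t[0]] and t[2] in adj[t[1]]]

    # Group triangles by the edges they contain.
    by_edge = {}
    for t in triangles:
        for e in combinations(t, 2):
            by_edge.setdefault(e, []).append(t)

    # Connect every pair of triangles that sit on a common edge.
    aux = {t: set() for t in triangles}
    for group in by_edge.values():
        for s, t in combinations(group, 2):
            aux[s].add(t)
            aux[t].add(s)
    return aux

# Hypothetical toy graph: the complete graph K5 (10 triangles).
n = 5
k5 = {v: {u for u in range(n) if u != v} for v in range(n)}
aux = triangle_auxiliary_graph(k5)
max_degree = max(len(nbrs) for nbrs in aux.values())
print(len(aux), max_degree, 3 * (n - 3))
```

On K5 the bound is tight: every triangle shares an edge with exactly 6 others.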

K, M, Peng, CET Int. Math. ‘11 

Given a graph G with n vertices and m edges, which graph maximizes the edges in the line graph L(G)?


Orkut (3.1M, 117M)
LiveJournal (5.4M, 48M)
YouTube (1.2M, 3M)
Flickr (1.9M, 15.6M)
Web-EDU (9.9M, 46.3M)

Social networks abundant in triangles!

[Figure: running time in secs (axis 0 to 250) of Exact, Triple Sampling, and Hybrid on Orkut, Flickr, Livejournal, Wiki-2006, and Wiki-2007. Accuracy ~99%]



p was set to 0.1. More sophisticated techniques for setting p exist (CET, Kolountzakis, Miller) using a doubling procedure. From our results, there is no clear winner, but the hybrid algorithm achieves both high accuracy and speed. Our code, even our exact algorithm, outperforms the code of the fastest approximate counting competitors, hence we compared different versions of our own code!


• Real-world graphs thought of as “planar graphs”
  ▪ Many problems can be solved more efficiently than in the general case.
• Spectral algorithm designed based on empirical special spectral properties
• Triangle Sparsifiers (fast with strong theoretical guarantees)

• “Interplay”: Combinatorial-Spectral approach
• MACH for HOSVD
• Degree-based partitioning is a very practical “trick”
• State-of-the-art results for sampling-based and semi-streaming triangle counting algorithms

• Triangles in Kronecker graphs [CET ICDM’08]
• Triangle Power Laws [CET ICDM’08]
• Random projections and counting triangles [Kolountzakis, Miller, Peng, CET ‘11]
• Semistreaming model with low space usage and only 3 passes over the graph stream [Kolountzakis, Miller, Peng, CET ‘11]
• MapReduce implementation [CET et al., KDD’09]
• High quality code with optimized cache properties

Remove edge (1,2)

Remove any edge with sufficiently large weight w
Spielman-Srivastava and Benczur-Karger sparsifiers also don’t work!

THANK YOU! QUESTIONS


621,963,073


Hybrid vs. Naïve Sampling: improves accuracy, increases running time

Best method for our applications: best running time, high accuracy