Charalampos (Babis) E. Tsourakakis ctsourak@math.cmu.edu
Canadian Mathematical Society 12th December ‘11 CMS '11
Mihail N. Kolountzakis (Math, University of Crete)
Gary L. Miller (SCS, CMU)
Rasmus Pagh (SCS, University Copenhagen)
PART I: Triangle counting
• Motivation and Related Work
• Algorithms, Results and Discussion
PART II: Vertex Similarity
• Motivation and Related Work
• Our Approach, a few Results and Discussion
Friends of friends tend to become friends themselves!
[Figure: three vertices A, B, C.]
[Wasserman Faust ’94]
(left to right) Paul Erdős, Ronald Graham, Fan Chung Graham
http://fellows-exp.com/
[Friggeri et al., 2011]
721 million users 69 billion links
Subjective ratings given to communities by real persons show that triangles are the key quantity that determines the rating.
Uncovering the Hidden Thematic Structure of the Web [Eckmann-Moses, PNAS 2001]
Key Idea: Connected regions of high curvature (i.e., dense in triangles) indicate a common topic!
Triangles used for Web Spam Detection [Becchetti et al. KDD ’08]
Key Idea: Triangle distribution among spam hosts is significantly different from non-spam hosts!
Triangles used for assessing content quality in Social Networks [Welser, Gleave, Fisher, Smith, Journal of Social Structure 2007]
Key Claim: The number of triangles in the self-centered social network of a user is a good indicator of the role of that user in the community!
[Watts, Strogatz ’98]
Signed triangles appear in structural balance theory
Triangle closing models are also used to model the microscopic evolution of social networks [Leskovec et al., KDD ’08]
Numerous other applications, including:
• Motif Detection / Frequent Subgraph Mining
• Community Detection [Berry et al. ’09]
• Outlier Detection and Link Recommendation
and many more.
Fast triangle counting algorithms are necessary.
Alon, Yuster, Zwick: asymptotically the fastest algorithm, but not practical for large graphs.
In practice, one of the iterator algorithms is preferred.
• Node Iterator (count the edges among the neighbors of each vertex)
• Edge Iterator (count the common neighbors of the endpoints of each edge)
Both run asymptotically in O(mn) time.
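The two iterator algorithms can be sketched as follows. This is a minimal illustration, not the authors' code; the helper names (`build_adjacency`, `node_iterator`, `edge_iterator`) and the toy edge list are assumptions made for the example, and vertices are assumed to be comparable (e.g., integers).

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops, they cannot be part of a triangle
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def node_iterator(adj):
    """Count the edges among the neighbors of each vertex."""
    total = 0
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                total += 1
    return total // 3  # each triangle is found once per vertex


def edge_iterator(adj):
    """Count the common neighbors of the endpoints of each edge."""
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:  # visit each undirected edge once
                total += len(adj[u] & adj[v])
    return total // 3  # each triangle is found once per edge


# Toy graph (hypothetical): a 4-clique on {0, 1, 2, 3} plus a pendant vertex 4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
adj = build_adjacency(edges)
print(node_iterator(adj), edge_iterator(adj))
```

Both functions return the same count; they differ only in which object (vertex or edge) drives the loop.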
Remarks
The idea of partitioning the vertices into “large” and “small” degree and treating each class appropriately appears in Alon, Yuster, Zwick.
For more work, see references in our paper(s):
▪ Itai, Rodeh (STOC ‘77)
▪ Papadimitriou, Yannakakis (IPL ‘81)
……
r independent samples of three distinct vertices
Then the following holds:
with probability at least 1-δ
Works for dense graphs, e.g., T3 of order n^2 log n.
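A minimal sketch of triple sampling, assuming the estimate is the fraction of sampled triples that form a triangle, rescaled by the number of vertex triples C(n, 3). The function name, the `seed` parameter, and the toy graphs are assumptions made for illustration; the required number of samples r is not addressed here.

```python
import random
from math import comb


def triple_sampling_estimate(adj, r, seed=0):
    """Estimate the triangle count from r independent samples of three distinct vertices.

    adj maps each vertex to the set of its neighbors (undirected graph, n >= 3).
    """
    rng = random.Random(seed)
    vertices = list(adj)
    n = len(vertices)
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(vertices, 3)  # three distinct vertices
        if b in adj[a] and c in adj[a] and c in adj[b]:
            hits += 1
    # Fraction of sampled triples that are triangles, scaled to all C(n, 3) triples.
    return hits / r * comb(n, 3)


# Toy graphs (hypothetical): a complete graph and an empty graph on 6 vertices.
complete6 = {v: set(range(6)) - {v} for v in range(6)}
empty6 = {v: set() for v in range(6)}
print(triple_sampling_estimate(complete6, r=100))
```

On sparse graphs almost every sampled triple misses, which is why the approach is suited to graphs with many triangles.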
(Bar-Yossef, Kumar, Sivakumar ‘02) require n^2/polylog n edges.
More follow-up work:
(Jowhari, Ghodsi ‘05)
(Buriol, Frahling, Leonardi, Marchetti-Spaccamela, Sohler ‘06)
(Becchetti, Boldi, Castillo, Gionis ‘08)
Approximate a given graph G with a sparse graph H, such that H is close to G under a certain notion.
Examples:
• Cut-preserving sparsifier: Benczúr-Karger
• Spectral sparsifier: Spielman-Teng
t: number of triangles.
T: number of triangles in the sparsified graph, essentially our estimate.
Δ: maximum number of triangles an edge is contained in; Δ = O(n).
tmax: maximum number of triangles a vertex is contained in; tmax = O(n^2).
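A minimal sketch of how T relates to t, assuming the sparsified graph keeps each edge independently with probability p. Under that assumption each triangle survives with probability p^3, so T / p^3 is an unbiased estimate of t. The function names and the toy graph are hypothetical, and the edge list is assumed to contain each undirected edge once, with no self-loops.

```python
import random
from itertools import combinations


def count_triangles(edges):
    """Exact count via the edge iterator: common neighbors of each edge's endpoints."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = sum(len(adj[u] & adj[v]) for u, v in edges)
    return seen // 3  # every triangle is seen once per edge


def sparsified_estimate(edges, p, seed=0):
    """Keep each edge with probability p (0 < p <= 1), count T, and rescale by 1/p^3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    T = count_triangles(kept)
    return T / p**3


# Toy graph (hypothetical): the complete graph on 5 vertices, which has 10 triangles.
k5 = list(combinations(range(5), 2))
print(count_triangles(k5), sparsified_estimate(k5, p=0.5))
```

With p = 1 nothing is removed and the estimate equals the exact count; smaller p makes counting cheaper but the estimate noisier.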
How to choose p?
• Mildness, pick p=1
• Concentration
Kim, Vu
Among graphs G with n vertices and m edges, which graph maximizes the number of edges in the line graph L(G)?
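For intuition: two edges of G are adjacent in L(G) exactly when they share an endpoint, so for a simple graph the number of edges of L(G) is the sum over vertices v of C(deg(v), 2). The sketch below compares two toy graphs with the same number of edges; the function name and the examples are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def line_graph_edge_count(edges):
    """Number of edges in L(G) for a simple graph G: sum over vertices of C(deg, 2)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(comb(d, 2) for d in deg.values())


# Two toy graphs (hypothetical) with 4 edges each.
star = [(0, i) for i in range(1, 5)]   # all edges share vertex 0
path = [(i, i + 1) for i in range(4)]  # edges laid end to end
print(line_graph_edge_count(star), line_graph_edge_count(path))
```

Concentrating the degree on few vertices, as in the star, yields more pairs of edges sharing an endpoint than spreading it out.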
Datasets (vertices, edges):
• Orkut (3.1M, 117M)
• LiveJournal (5.4M, 48M)
• YouTube (1.2M, 3M)
• Flickr (1.9M, 15.6M)
• Web-EDU (9.9M, 46.3M)
Social networks abundant in triangles!
[Figure: running time in secs (axis from 0 to 250) of Exact, Triple Sampling, and Hybrid on Orkut, Flickr, Livejournal, Wiki-2006, Wiki-2007.]
p was set to 0.1. More sophisticated techniques for setting p exist, using a doubling procedure.
Sampling from a binomial can be done easily in (expected) sublinear time.
Our code, even our exact algorithm, outperforms the fastest approximate counting competitors' code, hence we compared different versions of our own code!
To the best of our knowledge, it is used at Twitter.
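One standard way to obtain the sublinear sampling time is to jump directly from one kept edge to the next using geometrically distributed gaps, instead of flipping a coin per edge. The sketch below is an assumption about how this could be done, not the authors' implementation; the function name and parameters are hypothetical.

```python
import math
import random


def sample_kept_indices(m, p, seed=0):
    """Return the indices in range(m) kept independently with probability p.

    Uses geometric skips, so the expected work is proportional to p * m
    (the expected number of kept indices) rather than to m.
    """
    if p >= 1.0:
        return list(range(m))
    if p <= 0.0:
        return []
    rng = random.Random(seed)
    log_q = math.log(1.0 - p)
    kept = []
    i = -1
    while True:
        u = 1.0 - rng.random()            # uniform in (0, 1]
        skip = int(math.log(u) / log_q)   # number of rejected indices before the next kept one
        i += skip + 1
        if i >= m:
            return kept
        kept.append(i)


# Example (hypothetical sizes): choose which of 1000 edges survive with p = 0.1.
kept = sample_kept_indices(1000, 0.1, seed=42)
print(len(kept))
```

The number of kept indices follows a Binomial(m, p) distribution, the same as with independent coin flips.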
Remove any weighted edge, w sufficiently large.
Remove edge (1,2)
Let N=1/p be the number of colors we use to color the vertices. Call an edge monochromatic if its endpoints receive the same color.
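A minimal sketch of the coloring idea, assuming vertices are colored independently and uniformly at random, only monochromatic edges are kept, and the triangle count T of the resulting graph is rescaled by N^2 = 1/p^2 (a triangle stays intact exactly when its three vertices share a color, which happens with probability p^2). The function names and the toy graph are hypothetical.

```python
import random
from itertools import combinations


def count_triangles(edges):
    """Exact count via the edge iterator (edge list without duplicates or self-loops)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def colorful_estimate(edges, num_colors, seed=0):
    """Color vertices with N = num_colors colors, keep monochromatic edges, rescale by N^2."""
    rng = random.Random(seed)
    color = {}
    for u, v in edges:
        for x in (u, v):
            if x not in color:
                color[x] = rng.randrange(num_colors)
    monochromatic = [(u, v) for u, v in edges if color[u] == color[v]]
    T = count_triangles(monochromatic)
    return T * num_colors**2


# Toy graph (hypothetical): the complete graph on 5 vertices, which has 10 triangles.
k5 = list(combinations(range(5), 2))
print(colorful_estimate(k5, num_colors=2))
```

With a single color every edge is monochromatic and the estimate equals the exact count.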
From these extreme cases we see that, to have any hope of concentration, p has to be at least ω(n)Δ/t and ω(n)t^(-1/2), respectively.
We pick p large enough to make Var(T) = o(E[T]^2). By Chebyshev's inequality, T is then concentrated around E[T].
Every graph on n vertices with maximum degree Δ(G) = k is (k+1)-colorable with all color classes differing in size by at most 1.
[Figure: color classes 1, 2, …, k+1.]
Create an auxiliary graph where each triangle is a vertex and two vertices are connected iff the corresponding triangles share a vertex.
Invoke the Hajnal-Szemerédi theorem and apply a Chernoff bound to each chromatic class. Finally, take a union bound. Q.E.D.
Pr(Xi=1 | rest are monochromatic) = p ≠ Pr(Xi=1) = p^2
We can adapt our proposed method to the semi-streaming model, with space usage [formula missing], so that it performs only 3 passes over the data.
MapReduce implementations.
PART I: Triangle counting
• Motivation and Related Work
• Algorithms, Results and Discussion
PART II: Vertex Similarity
• Motivation and Related Work
• Our Approach, a few Results and Discussion
Privacy attacks [Hay et al., VLDB]
Vertex Similarity & Link Recommendation
Viral Marketing
Since there can be pairs of vertices which are highly similar, recursive equations are better, e.g.:
S = φAS + ψI, where φ, ψ are given parameters.
[Leicht et al., Vertex similarity in networks]
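A minimal sketch of solving the recursive equation by fixed-point iteration, assuming A is the adjacency matrix and φ is small enough that φ times the largest eigenvalue of A is below 1. In that regime the iteration converges to the closed form S = ψ(I - φA)^(-1). The function name, the iteration count, and the toy graph are assumptions made for illustration.

```python
def similarity_matrix(adj_matrix, phi, psi, iterations=200):
    """Iterate S <- phi * A * S + psi * I, starting from S = I."""
    n = len(adj_matrix)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    S = [row[:] for row in identity]
    for _ in range(iterations):
        # Matrix product A * S, written out with plain lists.
        AS = [[sum(adj_matrix[i][k] * S[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
        S = [[phi * AS[i][j] + psi * identity[i][j] for j in range(n)]
             for i in range(n)]
    return S


# Toy graph (hypothetical): the path 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
S = similarity_matrix(A, phi=0.2, psi=1.0)
print(S[0][2])
```

Vertices 0 and 2 are not adjacent, yet they receive a positive similarity because they share the neighbor 1; this is the effect the recursion is meant to capture.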
We are interested in robust simplex fitting, since many real-world graph embeddings contain outliers.
“Hard” formulation
Robust formulation
Having a set of mixture coefficients for each vertex we can use any nearest neighbor structure to perform queries such as “find the k vertices most similar to v”. Several data structures exist, e.g., Mount & Arya.
Reconstructing more complex geometric structures, such as simplicial complexes?
THANK YOU!