Charalampos (Babis) E. Tsourakakis
Modern Data Mining Algorithms
Data Analysis Project, 20 Apr. 2010
Outline
Introduction
PART I: Graphs (Triangles, Diameter)
PART II: Tensors (2 Heads method, MACH)
Conclusion / Research Directions
Leonhard Euler (1707-1783): Seven Bridges of Königsberg
Eulerian Paths
Internet Map [lumeta.com]
Friendship Network [Moody ’01]
Food Web [Martinez ’91]
Protein Interactions [genomebiology.com]
Data modeled as matrices:
Market Basket Analysis: m customers x n products.
Documents-Terms: m documents x n words (e.g., the terms prison, freedom, dance).
[Figure: Intel Berkeley lab sensor measurements, value vs. time (min), for Temperature, Light, Humidity, and Voltage]
[Figure: the sensor time series arranged by location over time]
Data modeled as a tensor, i.e., a multidimensional matrix, of size T x (#sensors) x (#types of measurements).
Multi-dimensional time series can be modeled in this way.
Functional Magnetic Resonance Imaging (fMRI)
voxel x subjects x trials x task conditions x timeticks
Outline
Introduction
PART I: Graphs (Triangles, Diameter)
PART II: Tensors (2 Heads method, MACH)
Conclusion / Research Directions
Spam detection
Exponential random graphs
Clustering coefficients & transitivity ratio
Uncovering the hidden thematic structure of the web
Link recommendation: friends of friends tend to become friends themselves
Contributions
Spectral family
Triangle Sparsifiers
Randomized SVD
Theorem 1
Δ(G) = # triangles in graph G(V,E); λ_1, ..., λ_n = eigenvalues of the adjacency matrix A_G. Then Δ(G) = (1/6) Σ_i λ_i^3.
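A minimal sketch of Theorem 1 in Python (numpy and networkx assumed; the function name and the karate-club test graph are illustrative, not from the talk):

```python
import numpy as np
import networkx as nx

def count_triangles_spectral(G):
    """Delta(G) = (1/6) * sum of the cubed eigenvalues of the adjacency matrix."""
    A = nx.to_numpy_array(G)
    eigvals = np.linalg.eigvalsh(A)           # symmetric matrix -> real spectrum
    return round((eigvals ** 3).sum() / 6.0)

G = nx.karate_club_graph()
print(count_triangles_spectral(G))            # spectral count
print(sum(nx.triangles(G).values()) // 3)     # exact count, for comparison
```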
Theorem 2
Δ(i) = # triangles vertex i participates in; u_j = j-th eigenvector of A_G; u_{i,j} = i-th entry of u_j. Then Δ(i) = (1/2) Σ_j λ_j^3 u_{i,j}^2.
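A hedged sketch of the per-node formula (again numpy/networkx assumed; names are illustrative):

```python
import numpy as np
import networkx as nx

def local_triangles_spectral(G):
    """Delta(i) = (1/2) * sum_j lambda_j^3 * u_{i,j}^2 for every node i."""
    A = nx.to_numpy_array(G)
    eigvals, eigvecs = np.linalg.eigh(A)      # columns of eigvecs are the u_j
    return 0.5 * (eigvecs ** 2) @ (eigvals ** 3)

G = nx.karate_club_graph()
est, exact = local_triangles_spectral(G), nx.triangles(G)
print(round(est[0]), exact[0])                # compare node 0
```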
Political blogs
Airports
Very important for us because:
Few eigenvalues contribute a lot!
Cubes amplify this even more!
Lanczos converges fast due to large spectral gaps!
[Figure: spectrum of the Political Blogs network]
The bulk of the spectrum is almost symmetric around 0, so its sum of cubes almost cancels out: omit it and keep only the top 3 eigenvalues!
Datasets: social networks, a co-authorship network, information networks, web graphs, Internet graphs.

Nodes     Edges    Description
~75K      ~405K    Epinions network
~404K     ~2.1M    Flickr
~27K      ~341K    Arxiv Hep-Th
~1K       ~17K     Political blogs
~13K      ~148K    Reuters news
~3M       35M      Wikipedia 2006-Sep-05
~3.15M    ~37M     Wikipedia 2006-Nov-04
~13.5K    ~37.5K   AS Oregon
~23.5K    ~47.5K   CAIDA AS 2004 to 2008 (means over 151 timestamps)
[Figure: per-node scatter plot, triangles node i participates in according to our estimation vs. triangles node i actually participates in]
2-3 eigenvalues give almost ideal results!
Kronecker graphs are a model for generating graphs that mimic properties of real-world networks. The basic operation is the Kronecker product [Leskovec et al.].
Initiator graph: the triangle, with adjacency matrix A[0] = [0 1 1; 1 0 1; 1 1 0].
Repeat the Kronecker product k times to obtain the adjacency matrices A[1], A[2], ..., A[k].
Theorem [KroneckerTRC]
Let B = A[k] be the k-th Kronecker power of A and Δ(G_A), Δ(G_B) the total number of triangles in G_A, G_B. Then the following equality holds: Δ(G_B) = (6 Δ(G_A))^k / 6.
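A small numeric check of this identity, under the reading that A[k] denotes the k-th Kronecker power (a sketch, not the talk's code):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])              # initiator: a single triangle, Delta(A) = 1

def triangles(M):
    return round(np.trace(np.linalg.matrix_power(M, 3)) / 6)

k = 3
B = A.copy()
for _ in range(k - 1):                 # B becomes the k-th Kronecker power of A
    B = np.kron(B, A)

print(triangles(B))                    # direct count via trace(B^3)/6
print((6 * triangles(A)) ** k // 6)    # value predicted by the identity
```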
Observation 1: Eigendecomposition <-> SVD when the matrix is symmetric: the eigenvectors equal the left singular vectors and λ_i = σ_i sgn(u_i^T v_i), where λ_i, σ_i are the i-th eigenvalue and singular value and u_i, v_i the i-th left and right singular vectors, respectively.
Observation 2: We care about a low rank approximation of A.
Frieze, Kannan, Vempala:
(1) Pick column i with probability proportional to its squared length.
(2) Use the sampled matrix to obtain a good low rank approximation of the original one.
Idea: sample c columns to obtain Ã and find Ã_k instead of the optimal A_k. Recover the signs from the left and right singular vectors. Use EigenTriangle!
Results: for c=100, k=6 on Flickr, EigenTriangle gives 95.6% accuracy and the approximation 95.46%.
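A rough sketch of the column-sampling idea (numpy/networkx assumed; the Rayleigh-quotient sign recovery is a simple stand-in for the sgn(u_i^T v_i) rule, and the random test graph is illustrative):

```python
import numpy as np
import networkx as nx

def sampled_triangle_estimate(G, c=100, k=6, seed=0):
    rng = np.random.default_rng(seed)
    A = nx.to_numpy_array(G)
    probs = (A ** 2).sum(axis=0) / (A ** 2).sum()      # squared column lengths
    cols = rng.choice(A.shape[1], size=c, p=probs)     # sample c columns
    C = A[:, cols] / np.sqrt(c * probs[cols])          # rescale the sample
    U = np.linalg.svd(C, full_matrices=False)[0][:, :k]
    lam = np.array([u @ A @ u for u in U.T])           # signed top-k eigenvalue estimates
    return (lam ** 3).sum() / 6.0

G = nx.gnp_random_graph(500, 0.05, seed=1)
print(round(sampled_triangle_estimate(G)))
print(sum(nx.triangles(G).values()) // 3)
```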
Contributions
Spectral family
Triangle Sparsifiers
Randomized SVD
Approximate a given graph G with a sparse graph H, such that H is close to G under a certain notion of closeness.
Examples:
Cut preserving: Benczur-Karger
Spectral sparsifier: Spielman-Teng
What about Triangle Sparsifiers?
Toss a coin for each edge of G(V,E), where t = # triangles:
HEADS: edge (i,j) "survives" with probability p.
TAILS: edge (k,m) "dies".
This yields a sparsified graph G'(V,E') with T = # triangles. Now count the triangles in G' and let T/p^3 be the estimate of t.
Main theoretical result: under mild conditions on the triangle density (at least a nearly linear number of triangles), our estimate is strongly concentrated around the true number of triangles!
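A minimal sketch of the sparsify-then-count estimator (networkx assumed; the test graph, p, and seed are illustrative):

```python
import random
import networkx as nx

def sparsified_triangle_estimate(G, p=0.3, seed=0):
    rng = random.Random(seed)
    H = nx.Graph()
    H.add_nodes_from(G)
    H.add_edges_from(e for e in G.edges() if rng.random() < p)   # HEADS: edge survives
    T = sum(nx.triangles(H).values()) // 3                       # triangles in G'
    return T / p ** 3                                            # estimate of t

G = nx.powerlaw_cluster_graph(5000, 10, 0.3, seed=1)
print(round(sparsified_triangle_estimate(G)))
print(sum(nx.triangles(G).values()) // 3)
```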
1 day = 86400 seconds.
Expected speedup: 1/p^2.
Outline
Introduction
PART I: Graphs (Triangles, Diameter)
PART II: Tensors (2 Heads method, MACH)
Conclusion / Research Directions
Milgram 1967: the "small world experiment"
• Pick 300 people at random.
• Ask them to get a letter to a target person, a stockbroker in Boston, by passing it through friends.
How many steps does it take? Only 6! Typically the diameter of a real-world network is surprisingly small!
Does the same observation hold on the Yahoo Web Graph (2002), where #nodes=1.4B and #edges=6.83B?
Assume we have a multiset M = {x1, ..., xm} and we want to count the number n of distinct elements in M. How can we do this using a small amount of space?
Flajolet & G. Nigel Martin
Hash function h(x in U): [0, ..., 2^L - 1], with y = Σ_k bit(y,k) 2^k.
ρ(y) = minimum k s.t. bit(y,k) = 1, otherwise L.
Let's keep a BITMASK[0..L]. Hash every x in M and find ρ(h(x)). If BITMASK[ρ(h(x))] is 0, then flip it!
How will the bitmask look at the end? All 1s (111111...) for positions i << log(n), a mix (010110...) for i ≈ log(n), and all 0s (000000...) for i >> log(n).
The fringe region i ≈ log(n) gives us the information. Flajolet-Martin prove that for the random variable R = leftmost 0 in the bitmask (the smallest index i with BITMASK[i] = 0): E(R) = log2(0.77351 * n).
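A toy single-hash Flajolet-Martin sketch (hashlib-based hash and the constant 0.77351 from the slide; real implementations average many hash functions to reduce variance):

```python
import hashlib

L, PHI = 32, 0.77351

def h(x):
    """Hash x to an L-bit integer."""
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:4], "big")

def rho(y):
    """Index of the lowest set bit of y (L if y == 0)."""
    return (y & -y).bit_length() - 1 if y else L

def fm_estimate(stream):
    bitmask = 0
    for x in stream:
        bitmask |= 1 << rho(h(x))
    R = rho(~bitmask)               # smallest index whose bit was never set
    return 2 ** R / PHI             # E[R] ~ log2(PHI * n)  =>  n ~ 2^R / PHI

stream = [i % 10_000 for i in range(100_000)]   # 10,000 distinct elements
print(round(fm_estimate(stream)))
```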
For every h = 1, 2, ...: estimate the cardinality of the set N(h), i.e., the pairs of nodes reachable within h steps. When the cardinality stabilizes, output the number of steps needed to reach that cardinality as the diameter.
Scalability: O(diam(G)*m), m = #edges. Efficient access to the file (very important). Parallelizable (also very important).
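A sketch of the neighborhood-function loop; here N(h) is maintained exactly with set unions for a small graph, whereas at Yahoo-web scale the sets are replaced by Flajolet-Martin sketches and the loop is parallelized:

```python
import networkx as nx

def diameter_via_neighborhood_function(G):
    reach = {v: {v} for v in G}                    # nodes reachable within h steps
    n_prev, h = 0, 0
    while True:
        n_curr = sum(len(s) for s in reach.values())          # |N(h)|
        if h > 0 and n_curr == n_prev:                        # cardinality stabilized
            return h - 1
        n_prev, h = n_curr, h + 1
        reach = {v: s | {u for w in s for u in G[w]} for v, s in reach.items()}

G = nx.barabasi_albert_graph(1000, 5, seed=1)
print(diameter_via_neighborhood_function(G), nx.diameter(G))
```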
The diameter of the Yahoo Web Graph is surprisingly small (7~8).
Outline
Introduction
PART I: Graphs (Triangles, Diameter)
PART II: Tensors (2 Heads method, MACH)
Conclusion / Research Directions
[Figure: SVD of a document-to-term matrix; rows are CS and MD documents, columns are the terms data, graph, java, brain, lung]
Document-to-term matrix = (Documents to Document HCs) x (Strength of each concept) x (Terms to Term HCs).
Tucker is an SVD-like decomposition of a tensor: one projection matrix per mode and a core tensor giving the correlation among the projection matrices.
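A compact HOSVD-style Tucker sketch in numpy (one SVD per mode unfolding; a production implementation would use a tensor library and HOOI iterations for a better fit):

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_mult(T, M, mode):
    """Mode-`mode` product of tensor T with matrix M."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def hosvd(T, ranks):
    """Core G and factors U with T ~ G x_0 U[0] x_1 U[1] x_2 U[2]."""
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    G = T
    for m, Um in enumerate(U):
        G = mode_mult(G, Um.T, m)               # project each mode onto its subspace
    return G, U

T = np.random.default_rng(0).random((30, 20, 10))
G, U = hosvd(T, ranks=(5, 4, 3))
print(G.shape, [u.shape for u in U])            # (5, 4, 3) [(30, 5), (20, 4), (10, 3)]
```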
The 2 Heads method (Tucker-2 + wavelets). In: tensor D (location x modality x time). Out: D' = [G; U0, U1, U2].
1. Spatial compression: Tucker-2 decomposition on the location and modality modes, giving projection matrices U1 and U2.
2. Temporal compression: wavelet transform along time, with a fixed transform matrix U0; the core holds the wavelet coefficients.
3. Sparsify the core tensor G.
Reconstruction error: e^2 = 1 - ||G||^2/||D||^2.
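A simplified sketch of this pipeline (numpy only; an orthonormal Haar matrix plays the role of the fixed transform U0, and the shapes, ranks, and "keep" threshold are illustrative; the streaming aspects of the actual method are omitted):

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_mult(T, M, mode):
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def haar(n):
    """Orthonormal Haar wavelet matrix of size n x n (n a power of 2)."""
    if n == 1:
        return np.array([[1.0]])
    H = haar(n // 2)
    return np.vstack([np.kron(H, [1.0, 1.0]),
                      np.kron(np.eye(n // 2), [1.0, -1.0])]) / np.sqrt(2.0)

def two_heads(D, r1, r2, keep):
    U1 = np.linalg.svd(unfold(D, 0), full_matrices=False)[0][:, :r1]   # location factors
    U2 = np.linalg.svd(unfold(D, 1), full_matrices=False)[0][:, :r2]   # modality factors
    U0 = haar(D.shape[2])                                              # fixed transform matrix
    G = mode_mult(mode_mult(mode_mult(D, U1.T, 0), U2.T, 1), U0, 2)    # core = wavelet coeffs
    thr = np.sort(np.abs(G), axis=None)[-keep]
    G = np.where(np.abs(G) >= thr, G, 0.0)                             # sparsify the core
    err2 = 1 - (G ** 2).sum() / (D ** 2).sum()                         # e^2 = 1 - ||G||^2/||D||^2
    return G, (U0, U1, U2), err2

D = np.random.default_rng(0).random((54, 4, 1024))      # location x modality x time
G, factors, err2 = two_heads(D, r1=2, r2=2, keep=200)
print(G.shape, round(err2, 3))
```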
[Figure: Intel Berkeley lab measurements (Temperature, Light, Humidity, Voltage), value vs. time (min)]
In: sensor measurements D (location x modality x time).
Out: projection matrices U1 and U2, core G' (wavelet coefficients).
Mining guide: U1 and U2 reveal the patterns on location and modality, respectively; G' provides the patterns on time.
[Figure: the first two columns of U1 over sensor locations 1..54]
1st Hidden Concept: the dominant trend, e.g., daily periodicity.
2nd Hidden Concept: exceptions.
[Figure: the first two columns of U2 over the four modalities volt, temp, light, humid]
• The 1st HC indicates the main sensor modality correlations: temperature and light are positively correlated, while humidity is anti-correlated with the rest.
• The 2nd HC indicates an abnormal pattern which is due to battery outage for some sensors.
[Figure: scalograms derived from the core G']
• The 1st scalogram indicates daily periodicity.
• The 2nd scalogram gives an abnormal flat trend due to battery outage.
Outline
Introduction
PART I: Graphs (Triangles, Diameter)
PART II: Tensors (2 Heads method, MACH)
Conclusion / Research Directions
Most real-world processes result in sparse tensors. However, there exist important processes which result in dense tensors:

Physical process                                             Percentage of non-zero entries
Sensor network (sensor x measurement type x timeticks)       85%
Computer network (machine x measurement type x timeticks)    81%
Performing a Tucker decomposition on a dense tensor can be either very slow or impossible due to memory constraints. Can we trade a little bit of accuracy for efficiency?
MACH extends the work of Achlioptas-McSherry on fast low rank approximations to the multilinear setting.
Toss a coin for each non-zero entry with probability p.
If it "survives", reweigh it by 1/p. If not, make it zero!
Perform Tucker on the sparsified tensor!
For the theoretical results, see Tsourakakis, SDM 2010.
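A minimal sketch of the sparsification step (numpy assumed; the tensor shape and p are illustrative, and any Tucker routine, e.g., the HOSVD sketch above, can then be run on the result):

```python
import numpy as np

def mach_sparsify(T, p, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(T.shape) < p          # one coin toss per entry
    return np.where(mask, T / p, 0.0)       # survivors reweighed by 1/p

D = np.random.default_rng(1).random((100, 12, 1008))   # machines x measurements x timeticks
D_sparse = mach_sparsify(D, p=0.1)
print(round((D_sparse != 0).mean(), 3))                # roughly p of the entries survive
# ...then run any Tucker routine (e.g., the HOSVD sketch above) on D_sparse
```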
Intemon (Carnegie Mellon University self-monitoring system): tensor X, 100 machines x 12 types of measurement x 10080 timeticks.
Jimeng Sun showed in his thesis that Tucker decompositions can be used to monitor the system efficiently.
For p = 0.1 we obtain a Pearson correlation coefficient of 0.99 (ideal: ρ = 1).
Find the differences! [Figure: Exact vs. MACH components]
The qualitative analysis, which is important for our goals, remains the same!
Berkeley Lab: tensor of 54 sensors x 4 types of measurement x 5385 timeticks.
The qualitative analysis, which is important for our goals, remains the same!
The spatial principal mode is also preserved, and Pearson's correlation coefficient is again almost 1!
Remarks: 1) Daily periodicity is apparent. 2) Pearson's correlation coefficient with the exact component is 0.99.
Outline
Introduction
PART I: Graphs (Triangles, Diameter)
PART II: Tensors (2 Heads method, MACH)
Conclusion / Research Directions
More applications of probabilistic combinatorics in large scale graph mining.
Randomized algorithms work very well (e.g., sublinear time algorithms), but are typically hard to analyze.
Smallest p* for tensor sparsification for the (messy) HOOI algorithm.
Better sparsification (edge (1,2) is important; weighted graphs!).
Property testing: is a graph triangle free?
Does Boolean Matrix Multiplication have a truly subcubic algorithm?
Faloutsos
Drineas
Schwartz
Kang
Miller
Frieze
Kolountzakis
Koutis
Leskovec
Pick p = 1/... and keep doubling until concentration appears; the concentration then becomes stronger.
How to choose p? If the mildness condition holds, a single choice of p suffices; otherwise keep doubling p until concentration appears.
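A sketch of the doubling heuristic, under the assumption that "concentration" is declared when two consecutive estimates agree within a tolerance (tolerance, starting p, and the test graph are illustrative):

```python
import random
import networkx as nx

def triangle_estimate(G, p, seed):
    rng = random.Random(seed)
    H = nx.Graph()
    H.add_nodes_from(G)
    H.add_edges_from(e for e in G.edges() if rng.random() < p)
    return sum(nx.triangles(H).values()) / 3 / p ** 3

def doubling_estimate(G, p0=0.01, tol=0.05):
    p, prev = p0, None
    while p < 1:
        est = triangle_estimate(G, p, seed=round(1 / p))
        if prev is not None and abs(est - prev) <= tol * max(est, 1):
            return est                         # two consecutive estimates agree
        prev, p = est, min(2 * p, 1)
    return sum(nx.triangles(G).values()) / 3   # gave up sparsifying: exact count

G = nx.powerlaw_cluster_graph(3000, 10, 0.3, seed=1)
print(round(doubling_estimate(G)), sum(nx.triangles(G).values()) // 3)
```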
[Cartoon: the adaptive EigenTriangle loop]
"I want to compute the number of triangles, please!" Use Lanczos: compute the first eigenvalues and ask whether the cube of the k-th eigenvalue is much smaller than the sum of cubes of the eigenvalues found so far. NO: compute the next eigenvalue and iterate. YES: the algorithm terminates, and the estimated number of triangles is the sum of the cubes of the λi's divided by 6!
After some iterations... (hopefully few!)
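A sketch of this adaptive loop using scipy's Lanczos-based eigsh (recomputing the top-k eigenvalues from scratch each round, which a careful implementation would avoid; the tolerance is an illustrative choice):

```python
import numpy as np
import networkx as nx
from scipy.sparse.linalg import eigsh

def eigen_triangle(G, tol=1e-2, max_k=30):
    A = nx.adjacency_matrix(G).astype(float)
    for k in range(2, max_k + 1):
        lam = eigsh(A, k=k, which='LM', return_eigenvectors=False)  # top-k by magnitude
        cubes = lam ** 3
        newest = cubes[np.argmin(np.abs(lam))]     # least dominant eigenvalue so far
        if abs(newest) <= tol * abs(cubes.sum()):
            break                                  # its cube contributes very little
    return cubes.sum() / 6.0

G = nx.powerlaw_cluster_graph(3000, 10, 0.3, seed=1)
print(round(eigen_triangle(G)), sum(nx.triangles(G).values()) // 3)
```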
Remark: even though our theoretical results refer to HOSVD, MACH also works for HOOI.