Data Mining with MapReduce: Graph and Tensor Algorithms with Applications


Charalampos (Babis) E. Tsourakakis

Data Analysis Project, 20 Apr. 2010
Modern Data Mining Algorithms


• Introduction
• PART I: Graphs
  • Triangles
  • Diameter
• PART II: Tensors
  • 2 Heads method
  • MACH
• Conclusion/Research Directions


Leonhard Euler (1707-1783)
Seven Bridges of Königsberg
Eulerian Paths


Internet Map [lumeta.com]
Friendship Network [Moody '01]
Food Web [Martinez '91]
Protein Interactions [genomebiology.com]


Market Basket Analysis: an m customers x n products matrix.
Documents-Terms: an m documents x n words matrix (example terms: prison, freedom, dance).


[Figure: four sensor time series from the Intel Berkeley lab deployment (value vs. time in minutes): temperature, light, humidity, and voltage.]


Data with location and time modes is modeled as a tensor, i.e., a multidimensional matrix, of size T x (#sensors) x (#types of measurements).
Multi-dimensional time series can be modeled in this way.
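As a small illustration (sizes borrowed from the sensor examples in this deck, data random), such a data cube can be built directly with numpy:

```python
import numpy as np

# Hypothetical sizes: T timeticks, S sensors, M measurement types.
T, S, M = 10080, 54, 4

# D[t, s, m] holds the measurement of type m taken by sensor s at
# timetick t, so the full data cube is a T x S x M tensor.
rng = np.random.default_rng(0)
D = rng.standard_normal((T, S, M))

print(D.shape)      # (10080, 54, 4)
# A single sensor's single-modality trace is one fiber of the tensor:
trace = D[:, 0, 0]
print(trace.shape)  # (10080,)
```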


Functional Magnetic Resonance Imaging (fMRI)
voxels x subjects x trials x task conditions x timeticks


• Introduction
• PART I: Graphs
  • Triangles
  • Diameter
• PART II: Tensors
  • 2 Heads method
  • MACH
• Conclusion/Research Directions


• Spam Detection
• Exponential random graphs
• Clustering Coefficients & Transitivity Ratio
• Uncovering the Hidden Thematic Structure of the Web
• Link Recommendation: friends of friends tend to become friends themselves


Contributions
• Spectral family
• Triangle Sparsifiers
• Randomized SVD


Theorem 1 (EigenTriangle)
Δ(G) = (1/6) Σ_i λ_i^3,
where Δ(G) is the number of triangles in the graph G(V,E) and the λ_i are the eigenvalues of the adjacency matrix A_G.
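A quick numerical check of Theorem 1 on K4, the complete graph on 4 vertices, which contains exactly 4 triangles:

```python
import numpy as np

# Adjacency matrix of K4 (complete graph on 4 vertices): 4 triangles.
A = np.ones((4, 4)) - np.eye(4)

# Exact count via trace(A^3)/6: each triangle yields 6 closed 3-walks.
exact = int(round(np.trace(A @ A @ A) / 6))

# Theorem 1: the same count from the eigenvalues of A alone.
lams = np.linalg.eigvalsh(A)
spectral = int(round(np.sum(lams ** 3) / 6))

print(exact, spectral)  # 4 4
```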


Theorem 2
Δ(i) = (1/2) Σ_j λ_j^3 u_{i,j}^2,
where Δ(i) is the number of triangles vertex i participates in, u_j is the j-th eigenvector, and u_{i,j} is its i-th entry.
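Checking Theorem 2 on the same K4, where every vertex participates in exactly 3 triangles:

```python
import numpy as np

A = np.ones((4, 4)) - np.eye(4)   # K4: every vertex lies in 3 triangles
lams, U = np.linalg.eigh(A)       # columns of U are the eigenvectors u_j

# Theorem 2: Delta(i) = (1/2) * sum_j lambda_j^3 * u_{i,j}^2
delta = 0.5 * (U ** 2) @ (lams ** 3)
print(np.round(delta))  # [3. 3. 3. 3.]
```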


Political blogs

Airports


• Very important for us because:
  • Few eigenvalues contribute a lot!
  • Cubes amplify this even more!
  • Lanczos converges fast due to large spectral gaps!


• The spectrum is almost symmetric around 0! (Political Blogs)
• Keep only the top 3 eigenvalues; omit the rest!
• The sum of cubes of the omitted eigenvalues almost cancels out!
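A sketch of the resulting approximation: keep only the k largest-magnitude eigenvalues. The toy graph and k are illustrative, and eigenvalues are computed densely for simplicity; at scale one would run a few Lanczos iterations instead (e.g. scipy.sparse.linalg.eigsh):

```python
import numpy as np

# Approximate the triangle count from the top-k eigenvalues by magnitude.
def eigen_triangle(A, k):
    lams = np.linalg.eigvalsh(A)
    top = lams[np.argsort(-np.abs(lams))[:k]]
    return np.sum(top ** 3) / 6.0

# Toy graph: a K6 clique (20 triangles) plus a triangle-free 4-cycle.
A = np.zeros((10, 10))
A[:6, :6] = 1 - np.eye(6)
for u, v in [(6, 7), (7, 8), (8, 9), (9, 6)]:
    A[u, v] = A[v, u] = 1

exact = np.trace(A @ A @ A) / 6
print(exact, round(eigen_triangle(A, k=3), 2))  # 20.0 vs ~20.83
```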


Nodes     Edges     Description
~75K      ~405K     Epinions network
~404K     ~2.1M     Flickr
~27K      ~341K     Arxiv Hep-Th
~1K       ~17K      Political blogs
~13K      ~148K     Reuters news
~3M       ~35M      Wikipedia 2006-Sep-05
~3.15M    ~37M      Wikipedia 2006-Nov-04
~13.5K    ~37.5K    AS Oregon
~23.5K    ~47.5K    CAIDA AS 2004 to 2008 (means over 151 timestamps)

Categories: social networks, co-authorship networks, information networks, web graphs, Internet graphs.




[Scatter plot: triangles node i participates in according to our estimation vs. the true number of triangles node i participates in.]

2-3 eigenvalues give almost ideal results!


Kronecker graphs are a model for generating graphs that mimic properties of real-world networks; the basic operation is the Kronecker product [Leskovec et al.].

Initiator graph, adjacency matrix A[0]:
0 1 1
1 0 1
1 1 0

Repeat the Kronecker product k times to obtain the adjacency matrices A[1], A[2], ..., A[k].
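A minimal generator along these lines with numpy's np.kron (initiator and number of products are illustrative):

```python
import numpy as np

# Kronecker graph sketch: start from a small initiator adjacency matrix
# and repeatedly take the Kronecker product with it.
A0 = np.array([[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]])  # initiator: the triangle K3

A = A0
for _ in range(2):          # repeat k = 2 times: A[0] -> A[1] -> A[2]
    A = np.kron(A, A0)

print(A.shape)  # (27, 27)
```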


• Theorem [KroneckerTRC]
Let B = A[k] be the k-th Kronecker power of A (initiator A[0] = A, followed by k Kronecker products) and let Δ(G_A), Δ(G_B) be the total number of triangles in G_A, G_B. Since the eigenvalues of a Kronecker product are all pairwise products of the factors' eigenvalues, Theorem 1 gives the equality
Δ(G_B) = 6^k Δ(G_A)^(k+1).
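A quick numerical check of this identity for a single Kronecker product (k = 1), counting triangles via trace(A^3)/6:

```python
import numpy as np

def triangles(A):
    # trace(A^3) counts each triangle 6 times
    return int(round(np.trace(A @ A @ A) / 6))

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)  # K3: one triangle

B = np.kron(A, A)  # A[1], one Kronecker product
# Eigenvalues of a Kronecker product are the pairwise products, so
# sum(lam_B^3) = (sum(lam_A^3))^2, hence Delta(B) = 6^1 * Delta(A)^2.
print(triangles(A), triangles(B))  # 1 6
```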


• Observation 1: Eigendecomposition <-> SVD when the matrix is symmetric:
  • eigenvectors = left singular vectors
  • λ_i = σ_i sgn(u_i^T v_i), where λ_i, σ_i are the i-th eigenvalue and singular value respectively, and u_i, v_i the left and right singular vectors respectively.
• Observation 2: We care about a low-rank approximation of A.


• Frieze, Kannan, Vempala:
(1) Pick column i with probability proportional to its squared length.
(2) Use the sampled matrix to obtain a good low-rank approximation to the original one.

Idea: sample c columns to obtain Ã, and find Ã_k instead of the optimal A_k. Recover signs from the left and right singular vectors. Use EigenTriangle!
• Results: with c = 100, k = 6 on Flickr, EigenTriangle achieves 95.6% accuracy and the approximation 95.46%.
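A minimal sketch of step (1), length-squared column sampling. The matrix sizes and c are illustrative; the rescaling by 1/sqrt(c p_i) is the standard choice that makes the sampled sketch unbiased for A A^T:

```python
import numpy as np

# Sample c columns with probability proportional to squared length.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 500))
c = 100

col_norms = np.sum(A ** 2, axis=0)
probs = col_norms / col_norms.sum()
idx = rng.choice(A.shape[1], size=c, replace=True, p=probs)

# Rescale each sampled column so that C C^T approximates A A^T.
C = A[:, idx] / np.sqrt(c * probs[idx])
print(C.shape)  # (200, 100)
```

The SVD of the much thinner C then yields the approximate singular vectors used above.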


Contributions
• Spectral family
• Triangle Sparsifiers
• Randomized SVD


• Approximate a given graph G with a sparse graph H, such that H is close to G in a certain notion.
• Examples:
  • Cut-preserving sparsifiers (Benczúr-Karger)
  • Spectral sparsifiers (Spielman-Teng)
What about Triangle Sparsifiers?


G(V,E), t = #Δ
HEADS! Edge (i,j) "survives" with probability p.


TAILS! Edge (k,m) "dies".

Main theoretical result: under mild conditions on the triangle density (at least a nearly linear number of triangles), our estimate is strongly concentrated around the true number of triangles!

G'(V,E'), T = #Δ in G'. Now, count the triangles in G' and let T/p^3 be the estimate of t.
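A self-contained sketch of the sparsifier in plain Python (the toy clique input is illustrative):

```python
import random

def count_triangles(adj):
    # Each triangle is found once per edge (u < v) via common neighbors,
    # i.e. three times in total.
    t = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                t += len(adj[u] & adj[v])
    return t // 3

def sparsified_estimate(edges, p, seed=0):
    # HEADS (prob p): the edge survives; TAILS: the edge dies.
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    adj = {}
    for u, v in kept:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return count_triangles(adj) / p ** 3

# K30 has C(30,3) = 4060 triangles; estimate from a p = 0.5 sparsification.
edges = [(u, v) for u in range(30) for v in range(u + 1, 30)]
print(sparsified_estimate(edges, p=1.0))  # exact: 4060.0
print(sparsified_estimate(edges, p=0.5))  # close to 4060
```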




1 day = 86,400 seconds. Expected speedup: 1/p^2.


• Introduction
• PART I: Graphs
  • Triangles
  • Diameter
• PART II: Tensors
  • 2 Heads method
  • MACH
• Conclusion/Research Directions


• Milgram 1967: the "small world experiment"
• Pick 300 people at random.
• Ask them to get a letter to a stockbroker in Boston by passing it through friends. How many steps does it take?
Only 6! Typically the diameter of a real-world network is surprisingly small!


Does the same observation hold on the Yahoo Web Graph (2002), where #nodes=1.4B and #edges=6.83B?



• Assume we have a multiset M = {x_1, ..., x_m} and we want to count the number n of distinct elements of M. How can we do this using a small amount of space?

Flajolet & G. Nigel Martin


• Hash function h(x in U) -> [0, ..., 2^L - 1]
• Write y = Σ_k bit(y, k) 2^k
• ρ(y) = minimum k s.t. bit(y, k) = 1; otherwise L
• Keep a BITMASK[0..L]
• Hash every x in M and find ρ(h(x)). If BITMASK[ρ(h(x))] is 0, flip it to 1!
• How will the bitmask look at the end?
0000000000... (i >> log n)   010110... (i ~= log n)   1111111111111 (i << log n)


• How will the bitmask look at the end?
0000000000... (i >> log n)   010110... (i ~= log n)   1111111111111 (i << log n)
The fringe region (i ~= log n) gives us the information. Flajolet and Martin prove that for the random variable R = position of the leftmost 0 in the bitmask: E(R) = log_2(0.77351 n).
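A minimal single-sketch Flajolet-Martin counter along these lines (the hash choice and 32-bit truncation are illustrative; a single sketch has high variance, so in practice many sketches are averaged):

```python
import hashlib

def rho(y, L=32):
    # index of the least-significant 1-bit of y (L if y == 0)
    for k in range(L):
        if (y >> k) & 1:
            return k
    return L

def fm_estimate(stream, L=32):
    bitmask = 0
    for x in stream:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:4], 'big')
        bitmask |= 1 << rho(h, L)
    R = 0                      # R = position of the leftmost 0
    while (bitmask >> R) & 1:
        R += 1
    return 2 ** R / 0.77351    # invert E(R) = log2(0.77351 * n)

stream = [i % 1000 for i in range(20_000)]  # 1000 distinct values
print(fm_estimate(stream))                  # rough estimate of 1000
```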


• For every h = 1, 2, ...:
  • Estimate the cardinality of the set N(h), i.e., the pairs of nodes reachable within h steps.
  • When the cardinality stabilizes, output the number of steps needed to reach that cardinality as the diameter.
• Scalability: O(diam(G) * m), m = #edges
• Efficient access to the file (very important)
• Parallelizable (also very important)
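At scale, |N(h)| is estimated with Flajolet-Martin sketches in parallel; as a small illustration of the loop and stopping rule only, here is the neighborhood function computed exactly by BFS on a toy path graph:

```python
from collections import deque

def neighborhood_function(adj, h_max=20):
    """N[h] = number of ordered node pairs (including self-pairs)
    within distance h, computed exactly by BFS from every node."""
    n = len(adj)
    N = [0] * (h_max + 1)
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for d in dist.values():
            for h in range(d, h_max + 1):
                N[h] += 1
    return N

# Path graph 0-1-2-3-4: diameter 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
N = neighborhood_function(adj)
diam = next(h for h in range(len(N)) if N[h] == N[-1])  # first h where N stabilizes
print(diam)  # 4
```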


• The diameter of the Yahoo Web Graph is surprisingly small (7~8).


• Introduction
• PART I: Graphs
  • Triangles
  • Diameter
• PART II: Tensors
  • 2 Heads method
  • MACH
• Conclusion/Research Directions


Documents-to-terms matrix = documents-to-document hidden concepts (e.g., CS, MD) x strength of each concept x term-to-term hidden concepts (e.g., data, graph, java, brain, lung).


Tucker is an SVD-like decomposition of a tensor: one projection matrix per mode and a core tensor giving the correlation among the projection matrices.
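A compact HOSVD-style sketch of Tucker (mode-by-mode SVD of the unfoldings; sizes and ranks are illustrative, and production code would use a tensor library such as tensorly):

```python
import numpy as np

def hosvd(X, ranks):
    """HOSVD sketch: per-mode projection matrices from the leading left
    singular vectors of each unfolding, plus a core tensor."""
    Us = []
    for mode, r in enumerate(ranks):
        unfolding = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        Us.append(U[:, :r])
    G = X
    for mode, U in enumerate(Us):
        # project mode `mode` of the core onto the column space of U
        G = np.moveaxis(np.tensordot(U.T, np.moveaxis(G, mode, 0), axes=1), 0, mode)
    return G, Us

X = np.random.default_rng(0).standard_normal((10, 8, 6))
G, (U0, U1, U2) = hosvd(X, ranks=(4, 3, 2))
print(G.shape)  # (4, 3, 2)
```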


The 2 Heads method (Tucker-2 + wavelet transform):
1. In: D. Out: D' = [G; U0, U1, U2]. Spatial compression: a Tucker-2 decomposition along the location and modality modes (projection matrices U1, U2).
2. Temporal compression: a wavelet transform along time, with a fixed transform matrix U0 (the wavelet coefficients form the core).
3. Sparsify the core tensor G.
Approximation error: e^2 = 1 - ||G||^2 / ||D||^2.


[Figure: the four Intel Berkeley lab sensor time series D (temperature, light, humidity, voltage; value vs. time in minutes).]

• In: sensor measurements D (location x modality x time)
• Out:
  • Projection matrices U1 and U2
  • Core G' (wavelet coefficients)
• Mining guide:
  • U1 and U2 reveal the patterns on location and modality, respectively
  • G' provides the patterns on time


U1 (rows 1..54, one per location):
• 1st Hidden Concept: the dominant trend, e.g. daily periodicity.
• 2nd Hidden Concept: exceptions.


Modality hidden concepts (volt, temp, light, humid):
• 1st HC indicates the main sensor modality correlations:
  ▪ Temperature and light are positively correlated, while humidity is anti-correlated with the rest.
• 2nd HC indicates an abnormal pattern, which is due to battery outage for some sensors.


• 1st scalogram indicates daily periodicity.
• 2nd scalogram gives an abnormal flat trend due to battery outage.


• Introduction
• PART I: Graphs
  • Triangles
  • Diameter
• PART II: Tensors
  • 2 Heads method
  • MACH
• Conclusion/Research Directions


• Most real-world processes result in sparse tensors. However, there exist important processes which result in dense tensors:

Physical Process                                           Non-zero entries
Sensor network (sensor x measurement type x timeticks)     85%
Computer network (machine x measurement type x timeticks)  81%


• Due to memory constraints, performing a Tucker decomposition on a dense tensor can be either very slow or impossible.
• Can we trade a little bit of accuracy for efficiency?


MACH extends the work of Achlioptas and McSherry on fast low-rank approximations to the multilinear setting.


• Toss a coin for each non-zero entry with probability p.
  • If it "survives", reweigh it by 1/p.
  • If not, make it zero!
• Perform Tucker on the sparsified tensor!
• For the theoretical results, see Tsourakakis, SDM 2010.
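The sparsification step as a numpy sketch (sizes illustrative; the random tensor is dense, so every entry counts as non-zero here):

```python
import numpy as np

# MACH sketch: keep each non-zero entry with probability p, reweigh
# survivors by 1/p, then run Tucker on the sparsified tensor.
def mach_sparsify(X, p, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < p
    return np.where(mask, X / p, 0.0)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 12, 50))
Xs = mach_sparsify(X, p=0.1)

frac = np.count_nonzero(Xs) / X.size
print(round(frac, 2))  # about 0.1 of the entries survive
```

The reweighing by 1/p makes the sparsified tensor an unbiased estimate of the original, entry by entry.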


• InteMon (Carnegie Mellon University self-monitoring system)
• Tensor X: 100 machines x 12 types of measurement x 10080 timeticks
• Jimeng Sun showed in his thesis that Tucker decompositions can be used to monitor the system efficiently.


For p = 0.1, Pearson's correlation coefficient with the exact result is 0.99 (ideal: ρ = 1).


Find the differences! (Exact vs. MACH)

The qualitative analysis, which is important for our goals, remains the same!


• Berkeley Lab
• Tensor: 54 sensors x 4 types of measurement x 5385 timeticks


The qualitative analysis, which is important for our goals, remains the same!


The spatial principal mode is also preserved, and Pearson's correlation coefficient is again almost 1!


Remarks: 1) Daily periodicity is apparent. 2) Pearson's correlation coefficient with the exact component is 0.99.


• Introduction
• PART I: Graphs
  • Triangles
  • Diameter
• PART II: Tensors
  • 2 Heads method
  • MACH
• Conclusion/Research Directions


• More applications of probabilistic combinatorics in large-scale graph mining.
• Randomized algorithms work very well (e.g., sublinear-time algorithms), but are typically hard to analyze.
• The smallest p* for tensor sparsification under the (messy) HOOI algorithm.


• Better sparsification (edge (1,2) is important; weighted graphs!)
• Property testing: is a graph triangle-free?
• Does Boolean Matrix Multiplication have a truly subcubic algorithm?

Triangle Sparsifiers, 3/16/2010


Faloutsos, Drineas, Schwartz, Kang, Miller, Frieze, Kolountzakis, Koutis, Leskovec




Pick a small p and keep doubling until concentration appears; with further doubling, the concentration becomes stronger.


How to choose p? Check the mildness condition (enough triangles); if it fails, pick p = 1. Otherwise, choose p so that concentration holds.


I want to compute the number of triangles, please! Use Lanczos: compute the eigenvalues one at a time, and after each iteration check whether the cube of the current eigenvalue λ_i is significantly smaller than the sum of cubes computed so far. If YES, the algorithm terminates, and the estimated number of triangles is that sum of cubes divided by 6!

After some iterations... (hopefully few!)


Remark: even though our theoretical results refer to HOSVD, MACH works for HOOI as well.

