On Application-Aware Data Extraction for Big Data in Social Networks Ming-Syan Chen Research Center for Information Tech. Innovation, Academia Sinica EE Department, National Taiwan Univ.
Fast Growth of Social Network Activities • Example social networks: – Twitter – Facebook – Flickr – MSN – Wikipedia – Amazon.com
• Such a network – Huge in size! – Cannot easily be analyzed M.-S. Chen
The Amount of Information is Huge! • Twitter – 150+ million members – 50 million tweets per day
From twitter.com
• Facebook – 800+ million users
• Amazon Co-purchasing Network – half a million product nodes – several million recommendation links
• Web Pages – Yahoo!: over one billion Web pages
Example of Big Data and Social Network Volume: thousands of people! Velocity: fast accumulation!! Variety: eating different foods!!!
Example of Big Data and Social Network For some gossip in this occasion, Veracity is an issue and the information Value could be low. Mr. Lin won the lottery!
Mrs. Chang just did a face lift!
Information Extraction for Big Data in Social Networks • Extracting important information from large social network graphs – To allow data analysts to mine the information in large social networks, to enable scalable storage and querying, and to facilitate the development of real-world applications
Outline • Graph reduction – Summarization, sampling, and extraction
• Information Extraction on Social Network Graphs – Capturing key parameters (parameter extraction) – Guide query (information extraction) – Decomposing SN graphs (structure extraction)
Graph Reduction Graph summarization (going through all data) e.g., NTU has 32K students; 20% are sushi lovers, 25% prefer steak; also 15% are artists, 20% are engineers, etc.
Graph sampling (going through a subset) Getting a small representative set of NTU students (which preferably fits the statistics)
Graph extraction Application/goal-oriented data extraction, e.g., picking only good eaters for a feast contest.
Graph Extraction 執簡御繁 To handle complicated things with simple skills.
Application/goal-oriented data extraction Three levels of information extraction from SN graphs
• Parameter extraction (e.g., company stat.) – Fast calculation of closeness centrality
• Information extraction (e.g., company biz.) – Guide query
• Structure extraction (e.g., company org.) – Decomposing SN graphs
[Figure: analogy of the three levels, illustrated with a weapon – parameter extraction, information extraction (regarding capability), and structure extraction.]
Outline • Graph reduction • Information Extraction on Social Network Graphs – Capturing key parameters (parameter extraction) – Guide query (information extraction) – Decomposing SN graphs (structure extraction)
Closeness Centrality • Several interesting quantities arise in SN graphs, including closeness centrality, network diameter, and degree distribution. • Closeness centrality of node v, Cc(v): the inverse of the average shortest-path distance from v to any other node in a network. – If Cc(v) is large, v is near the center, as it requires only a few hops to reach the others.
Response to Dynamic Changes • Edge insertions and deletions are frequent in a social network – It is desirable to quickly update the closeness centrality of every node in response to an edge insertion/deletion.
• Example use: pick a number of people (the nodes with high CCs) who can maximize advertisement effectiveness.
Example of Closeness Centrality Cc(v): the inverse of the average shortest-path distance from v to all other nodes:

Cc(v) = (|V| − 1) / Σ_{u∈V} |p(v,u)|

For an unweighted and undirected graph G with 14 nodes and 18 edges:

Cc(v) = (14 − 1) / (1·4 + 2·2 + 3·1 + 4·1 + 5·2 + 6·2 + 7·1) = 13/44
Cc(w) = (14 − 1) / (1·3 + 2·4 + 3·4 + 4·2) = 13/31

Thus, node w is closer to all other nodes than node v is.
Calculating Closeness Centrality • One can calculate the closeness centralities of all vertices by solving the All-Pairs Shortest Paths (APSP) problem. – O(n(m+n)) based on the breadth-first search (BFS) method for an undirected graph, where n and m are the numbers of nodes and edges in the graph. – In a dynamic graph, re-solving the APSP problem after each edge insertion or deletion is not efficient.
• Note that only some pairs of shortest paths will be affected by certain edge changes. – Identify them (unstable node pairs) for fast calculation of CC
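The baseline (full recomputation) is simple to sketch. Below is a minimal, hedged version assuming a connected, unweighted graph stored as an adjacency dict; all names are hypothetical, not CENDY's actual implementation:

```python
from collections import deque

def closeness(adj, v):
    """Closeness centrality of v in an unweighted graph given as an
    adjacency dict {node: set(neighbors)}: (n - 1) / sum of BFS distances."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    total = sum(dist.values())
    return (len(adj) - 1) / total if total else 0.0
```

Running this once per node is exactly the O(n(m+n)) APSP-by-BFS cost noted above, which motivates the incremental approach.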
Example For example, with the addition of edge (a,b): Unchanged shortest paths ◦ p(b,v), p(c,t), p(r,h), etc.
Changed shortest paths ◦ Before edge insertion: p(a,b)={a,d,w,b}, p(a,c)={a,d,w,r,c}, p(u,v)={u,l,o,d,w,r,s,v}, etc.
◦ After edge insertion (we then call these nodes unstable): p(a,b)={a,b}, p(a,c)={a,b,c}, p(u,v)={u,x,a,b,c,v}, etc. (a): the original unweighted and undirected graph G. (b): G’ = G ∪ e(a,b).
Illustration of Unstable Node Pairs • To find V’u, the u-unstable node set, whose shortest paths to u changed after the edge addition. • If we perform BFS at node u in G and G’ to obtain Gu and G’u, we find that only the shortest paths p(u,b), p(u,c), p(u,h), p(u,v) and p(u,t) changed.
– unstable node pairs: (u,b), (u,c), (u,h), (u,v) and (u,t). – V’u = {b, c, h, v, t}
(Main Theorem) After the addition of edge (a,b), every unstable node pair {v,u} (i.e., whose shortest path changed) satisfies v ∈ V’a and u ∈ V’b.
Only these shortest paths can change after the edge addition (and need to be re-calculated)
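An unstable set such as V’a can be obtained by running BFS from a before and after the insertion and keeping the nodes whose distance changed. A hedged sketch (adjacency-dict representation and names are assumptions):

```python
from collections import deque

def bfs_dist(adj, src):
    """BFS distances from src in an unweighted graph (adjacency dict)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def unstable_set(adj_old, adj_new, a):
    """Nodes whose shortest-path distance to a changed after an edge update."""
    d_old, d_new = bfs_dist(adj_old, a), bfs_dist(adj_new, a)
    return {v for v in adj_old if d_old.get(v) != d_new.get(v)}
```

By the main theorem, only pairs drawn from unstable_set(..., a) × unstable_set(..., b) need their paths re-calculated.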
Concurrent Calculation of CC in SN • In parallel, perform BFS at nodes a and b in G and G’ to obtain Ga, G’a, Gb, and G’b, and hence V’a = {b,c,h,v,t} and V’b = {a,x,l,u}.
• Then, in parallel, perform BFS starting at each node of V’b (i.e., a, x, l, and u).
• Inform the nodes in these unstable pairs to re-calculate their shortest paths to others and their CC.
Experiments • To evaluate CENDY, we conducted experiments on six real unit-weighted graph datasets of different types. • The case of edge deletion can be done similarly (in light of a companion theorem proposed)
Experiments Evaluation on Edge Insertion From this table, we can see that the closeness centralities of all vertices and the APL can be updated with only a few BFS processes. e.g., DBLP contains 460,413 nodes. The naïve way requires performing 460K BFS processes to update closeness centrality and APL; CENDY requires only 4K BFS processes to finish the task.
Remark • In response to fast changes in an SN, CENDY is devised to efficiently update the closeness centrality of each node in the social network. • New algorithms are called for to efficiently calculate other key parameters in fast-changing social networks.
Outline • Graph reduction • Information Extraction on Social Network Graphs – Capturing key parameters (parameter extraction) – Guide query (information extraction) – Decomposing SN graphs (structure extraction)
Motivation of Guide Query Several works on information finding in social networks • Expert finding [Deng’08][Lappas’09] – To find the experts based on some given requirement
• Gateway finding [Koren’06][Wang’10] – To find the gateways between the source group and the target group • Active Friending [Wu’13] – To explore social networks to improve friend finding • Guide query [Lin’13] – To find a querier’s informative friends for gathering information
[Deng’08] ICDM 2008. [Lappas’09] KDD 2009. [Koren’06] KDD 2006. [Wang’10] KDD 2010. [Wu’13] KDD 2013. [Lin’13] WAIM 2013.
Motivation of Guide Query (Cont’d) • With expert finding, the answer is a list of experts ranked by their expertise. • With the guide query, the answer is a list of the querier’s informative friends ranked by their ability to gather information from experts – Exploring social relationships – Taking the probabilities of getting help into consideration
Guide Query: Graph Extraction based on Your Friends This friend is also who I should ask since she can collect information from her friends.
These two friends are who I should ask for information.
I want to know information about Company A or B.
[Figure: example social graph – the querier’s friends and friends-of-friends are labeled with the company attributes (A, B, C, D, E) they hold or can reach.]
Guide Query • Guide query [Lin’13] – For a user initiating the query, the answer is the user’s neighbors that are informative about user-assigned attributes. – An informative neighbor should either have the attributes itself or know some other friends that have the attributes.
[Lin’13] Y.-C. Lin, P. S. Yu, M.-S. Chen, “Guide Query in Social Networks,” WAIM 2013.
Problem Definition Given a query node q and a set of keywords W = {w1, w2, …, w|W|}, the guide query is to find the top-k informative neighbors of q considering W.
Example: q = N0, W = {A, B}.
[Figure: example graph – N0’s neighbors N1–N4 are the candidates; nodes carrying a queried attribute (e.g., {A}, {B}, {A,B}) are the targets.]
Problem (Cont’d) In the model, an edge is labeled with the probability that a node successfully spreads the request to the linked node. We rank the candidates based on how informative they are, which is evaluated by the proposed InfScore and DivScore.
[Figure: the example graph annotated with edge probabilities (values such as 0.2–0.8).]
InfScore InfScore: the informative level of a candidate node (i.e., its ability to spread the request to targets), modeled by the expected number of targets the candidate is able to spread the request to.
[Figure: the example graph with every edge probability set to 0.5.]
InfScore InfRatio is defined as the probability that a specific candidate successfully spreads the request to a certain target.
e.g., the InfRatio from N1 to N13 is 0.25 (the product of the 0.5 edge probabilities along the path).
InfScore (intensity) The InfScore is the weighted sum of InfRatios:
InfScore(N1) = 0.5 (N11) + 0.5 (N12) + 0.25 × 2 (N13) = 1.5
InfScore(N4) = 1.0 (N4) + 0.5 (N41) = 1.5
N : InfScore
N1 : 1.5
N2 : 0.5
N3 : 1.5
N4 : 1.5
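On one reading of the example (all edge probabilities 0.5; target N13 holds both queried keywords and is therefore weighted 2), the InfScore arithmetic can be sketched as follows. Here `paths` maps each reachable target to the edge probabilities along its path, a hypothetical representation rather than the paper's actual algorithm:

```python
def inf_score(paths, weights):
    """InfScore of a candidate: weighted sum of InfRatios, where each
    InfRatio is the product of edge success probabilities along the
    path from the candidate to one target."""
    score = 0.0
    for target, edge_probs in paths.items():
        ratio = 1.0
        for p in edge_probs:
            ratio *= p  # probability the request survives every hop
        score += weights.get(target, 1) * ratio
    return score
```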
DivScore (Diversity) The DivScore is an entropy-like measure of the diversity of the possibly accessible target nodes. For each candidate, the target vector XT is formed by normalizing the per-target InfScore contributions; each item is thus a probability over the different targets. The DivScore is the Shannon entropy of that distribution:

DivScore = − Σ_t x_t · log2(x_t)

Example: the distribution of N3 is [0.5/1.5, 0.5/1.5, 0.25/1.5, 0.25/1.5] = [1/3, 1/3, 1/6, 1/6], so
DivScore(N3) = [−(1/3)·log2(1/3)]·2 + [−(1/6)·log2(1/6)]·2 ≈ 1.918
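The entropy computation above is a few lines; a minimal sketch (function name is hypothetical):

```python
import math

def div_score(contributions):
    """DivScore: Shannon entropy (in bits) of a candidate's normalized
    per-target InfScore contributions."""
    total = sum(contributions)
    probs = [c / total for c in contributions if c > 0]
    return -sum(p * math.log2(p) for p in probs)
```

A candidate reaching a single target gets DivScore 0, matching N2 in the table below.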
N : DivScore
N1 : 1.585
N2 : 0.000
N3 : 1.918
N4 : 0.918
Experimental Setup • DBLP dataset [DBLP] – Co-authorship network – Edge probability • Based on the WC (weighted cascade) model • p(Ni -> Nj) = 1 / d(Nj) • d(Nj) is the in-degree of Nj
– Node attribute • Conference names of an author’s publications
[DBLP] http://www.informatik.uni-trier.de/~ley/db/ [Chen’10] W. Chen, et al., “Scalable Influence Maximization for Prevalent Viral Marketing in Large-Scale Social Networks,” KDD 2010.
Experimental Results Suppose Ming-Syan Chen wants to discuss with people who have published papers on KDD, SDM, CIKM, ICDM, PKDD, which coauthors should he first connect to? (i.e., Either coauthors who have these conf. papers or coauthors who coauthored with people who have these conf. papers.) Query input: • q = ‘Ming-Syan Chen’ • k = 10 • W = [KDD, SDM, CIKM, ICDM, PKDD]
Remark • The key notion is to guide the query to the right candidates in the social network. – For each candidate, a combination of the expertise and the social relationship with the person initiating the query is considered
• Just like group formation (KDD 2012) and this expert finding problem (WAIM 2013), more applications/tools can be enhanced when social relationships (SR) are considered
Outline • Graph reduction • Information Extraction on Social Network Graphs – Capturing key parameters (parameter extraction) – Guide query (information extraction) – Decomposing SN graphs (structure extraction)
Diffusion Analysis in Social Networks • Diffusion of information can be used to model the interaction among nodes in a network, e.g., – Viruses spread over the Internet. – Diseases spread in the community. – Rumors/news spread among humans.
Example Diffusion • Information diffusion can happen in social networks, such as Facebook and Twitter.
[Figure: an underlying network over nodes n1–n9 with a path of infection; the numbers 0–3 are infection times.]
The Network is Hidden • In some situations, the underlying network is not known (due to cost or privacy issues). • The network inference problem (NIP) is studied to discover the underlying network – to infer the network from what happened.
[Figure: the same nodes n1–n9 with infection times but no edges; the edges must be inferred.]
Network Inference Problem • Assume there is an underlying information network. • NIP is to infer the information network given a set of cascades. • A cascade t^s = [t^s_1, …, t^s_N] is the time record of information s spreading over the network (N is #nodes), i.e., node n_i gets s (is infected) at time t^s_i.
• If a node i is never infected with s, set t^s_i = ∞.
• Ex: t^s = [∞, ∞, 2, ∞, 0, 1]
[Figure: a 6-node example (n1–n6) – n5 is infected at time 0, n6 at time 1, and n3 at time 2.]
Clustering Cascades • Traditionally, NIP assumes there is one underlying network, which may not always be true in reality – e.g., Sports news, political news, and entertainment news are likely to spread in different ways
• Hence, we would like to cluster cascades so that the cascades in each cluster spread in the same pattern – An SN graph is hence decomposed into application-specific ones
Example Cascades
[Figure: six cascades over nodes n1–n6 with infection times – a (Lakers news), b (49ers news), c (Redskins news), d (Heat news), e (Jets news), f (Celtics news).]
To Model the Inference Network • Modeling method: – If two nodes are always infected within a short time of each other, the weight should be large.

w_ij = (1 / |{s : t^s_i < t^s_j}|) · Σ_{s : t^s_i < t^s_j} 1 / (t^s_j − t^s_i)

– Consider w_12 as an example:
{s : t^s_1 < t^s_2} = {b, c, e}
w_12 = (1/3) · (1/(∞ − 0) + 1/(1 − 0) + 1/(2 − 0)) = (1/3)(0 + 1 + 1/2) = 1/2
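The weight formula translates directly into code. A sketch, assuming each cascade is a tuple of infection times indexed by node (with math.inf for never-infected nodes); the function name is hypothetical:

```python
import math

def edge_weight(cascades, i, j):
    """w_ij: over cascades where node i is infected before node j,
    the average of 1/(t_j - t_i); an infinite t_j contributes 0."""
    terms = []
    for t in cascades:
        if t[i] < t[j]:
            gap = t[j] - t[i]
            terms.append(0.0 if math.isinf(gap) else 1.0 / gap)
    return sum(terms) / len(terms) if terms else 0.0
```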
Example Inference Network
[Figure: the inferred weighted network over n1–n6; edge weights include values such as 0.17, 0.25, 0.5, and 0.67.]
To Cluster Cascades by K-Means • Transform each cascade t into an N-dim indicator vector based on whether nodes are infected or not. • Ex:
– t_a = [∞, ∞, ∞, ∞, 0, 1] → [0, 0, 0, 0, 1, 1]
– t_b = [0, ∞, ∞, 1, ∞, ∞] → [1, 0, 0, 1, 0, 0]
– t_c = [0, 1, 2, ∞, ∞, ∞] → [1, 1, 1, 0, 0, 0]
• Run K-means to get the clustering result: (a, d, f) and (b, c, e).
Graph Decomposition • By considering cascades {a, d, f} and cascades {b, c, e} independently (based on which nodes are infected), the original SN graph is decomposed in accordance with the information carried.
[Figure: two decomposed weighted networks over n1–n6 – one for cascades {a, d, f} (NBA) and one for cascades {b, c, e} (NFL).]
Remark • Traditionally, NIP results in a dense and complex network from which it is difficult to extract knowledge. • By properly clustering cascades, we obtain a few concise networks that carry clearer information – These resulting networks match the corresponding cascades better than a single dense network does.
Conclusion • Information extraction is an application/goal-oriented process to capture the key ingredients (parameters, information, structure, etc.) in a huge SN • The procedure of information extraction can be integrated into related processes for better efficiency in practice
Thank you!
Graph Summarization Condense the original graph to a more compact form – Lossless and lossy methods – Required to examine the entire network
[Figure: graph G with nodes 1–10 summarized into supernodes Sa={2,3}, Sb={1,9}, Sc={7,8,10}, Sd={4,5,6}, with edge corrections − {5,10} and − {6,10}.]
A revised example from S. Navlakha et al., “Graph Summarization with Bounded Error,” SIGMOD 2008.
Graph Sampling • Graph Sampling – Selecting a subset of the original data – Characteristics of the original graph are preserved – Only a proportion of nodes in the network are visited
[Figure: a network before and after sampling; plotted by NodeXL, an Excel template created by the NodeXL team at Microsoft Research.]
A Running Example of CENDY Originally, we have the closeness centralities of all nodes and the average path length (APL) of the graph – an unweighted and undirected graph G with 14 nodes and 18 edges.

Cc(x) = (14 − 1) / (1·3 + 2·2 + 3·1 + 4·2 + 5·2 + 6·2 + 7·1) = 13/47

Cc(·) = 13/D, with distance sums D per node:
a:40, b:35, c:37, d:33, h:46, l:47, o:40, r:33, s:40, t:56, u:57, v:44, w:31, x:47

L_G = (40 + 35 + 37 + … + 47) / (14 × (14 − 1)) = 586/182 ≈ 3.22
Example (Cont’d) For the insertion of the edge e(a,b). • We perform BFS at node a in G and G’ to obtain Ga and G’a, and then have V’a={b,c,h,v,t}.
Example (Cont’d) • Also, we perform BFS at node b in G and G’ to obtain Gb and G’b, and then have V’b={a,x,l,u}.
Example (Cont’d) • Then, in light of the main theorem, we re-calculate the paths between V’a and V’b. • For example, for node x ∈ V’b, we calculate:
(1): ||p(x,t)| − |p’(x,t)|| = 7 − (1+1+3) = 2
(2): ||p(x,h)| − |p’(x,h)|| = 6 − 4 = 2
(3): ||p(x,v)| − |p’(x,v)|| = 6 − 4 = 2
(4): ||p(x,c)| − |p’(x,c)|| = 5 − 3 = 2
(5): ||p(x,b)| − |p’(x,b)|| = 4 − 2 = 2
• and then update its new closeness centrality:

Cc(x) = 13 / (47 − (1) − (2) − (3) − (4) − (5)) = 13 / (47 − 2·5) = 13/37
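The per-node update amounts to subtracting each shortest-path distance reduction from the node's old distance sum. A small sketch of that bookkeeping (helper name is hypothetical):

```python
def updated_closeness(n_nodes, old_dist_sum, reductions):
    """New closeness after an edge insertion: subtract each shortest-path
    distance reduction from the node's total distance to all others."""
    return (n_nodes - 1) / (old_dist_sum - sum(reductions))

# Node x in the example: 14 nodes, old distance sum 47,
# and five shortest paths each shortened by 2.
```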
Example (Cont’d) • Finally, we update the closeness centralities of the referenced nodes and recalculate the APL.

Updated distance sums D per node (Cc(·) = 13/D):
a:30, b:28, c:30, d:33, h:39, l:42, o:40, r:33, s:40, t:49, u:47, v:37, w:31, x:37

L_G’ = (30 + 28 + 30 + … + 37) / (14 × (14 − 1)) = 516/182 ≈ 2.84
Example Scenario N0 is initiating a query to find a job in company A or company B. Which friends should N0 ask for information?
[Figure: the example graph – N0’s friends and their friends, labeled with company attributes.]
New Contributions • Compared with M. Gomez-Rodriguez, J. Leskovec, and A. Krause, “Inferring Networks of Diffusion and Influence,” KDD 2010, our work is unique in that: 1. We assume there can be many underlying networks (rather than only one). 2. We model and learn a weighted graph (rather than an unweighted one).