Fast visualization of relevant portions of large dynamic networks Przemyslaw A. Grabowicz Luca Maria Aiello Filippo Menczer
Networks
Networks are everywhere • Information & knowledge networks • e.g., Wikipedia, WWW, IMDb
• Social networks • e.g., Facebook, Twitter
• Commodity networks • e.g., Internet, transportation networks
• Biological networks • e.g., neural network, gene expression networks
Networks are large Examples:
• Wikipedia ~ 10^6 articles
• Online social networks ~ 10^9 users
• Transportation networks ~ 10^4 airports worldwide
• Neural networks ~ 10^11 neurons
Internet
Networks are dynamic Examples: • Wikipedia – half million new articles per year • Online social networks – temporal user interactions • Transportation networks – people traveling • Neural networks – traveling action potentials
The Egyptian Revolution on Twitter Visualization by André Panisson (http://youtu.be/2guKJfvq4uI)
How do we describe networks? • Structural properties
o Clustering coefficient
o Modularity
o Assortativity coefficient
o … mostly designed for static networks
• Visualizations o Graph layouts o Network filtering o … mostly designed for static networks
Network animation Outline I. Existing software for network visualization II. Filtering of large dynamic graphs III. Experimental datasets IV. Source-code release and summary
I. Existing software for network visualization
Network visualization tools
• can handle dynamic graphs • interactive • static filtering • slower than other tools (written in Java, GUI-based) The Egyptian Revolution on Twitter
Network visualization tools
• can easily plot large networks • fast (written in C++, with interfaces in Python and R) Clusters and activity of Twitter users
Network visualization tools
• can easily plot large networks • fast (C++ core of BGL, with an interface in Python)
Price network
Network visualization tools D3.JS • a tool for data visualization, not just networks • highly interactive • easy to embed in webpages • slow (a JavaScript library)
Songs similar to “Poker Face” of Lady Gaga, based on Last.fm
Character co-occurrence in Les Misérables
Visualizing large dynamic networks? (with these or other tools) Challenges: • Hard to distinguish important nodes and edges • Hard to follow the evolution of nodes and edges • Computationally expensive
A technique for filtering large dynamic graphs is needed.
II. Filtering of large dynamic graphs o Our solution o Problem formulation o Which nodes are important to keep? o Key concepts of the algorithm o The algorithm o Computational complexity o Output and visualization
Our solution Filtering
1. Processes a chronological stream of interactions between the nodes of a large network 2. Dynamically filters the most relevant parts of the network, emphasizing old nodes that show fresh activity
Animation
3. Produces differential updates representing the network evolution 4. Can feed these updates directly to visualization tools, potentially to any of the aforementioned tools
t1, n1, n2 t2, n1, n3 t3, n1, n2, n3, n4 t4, n3, n5 t5, n1, …, nm
Problem formulation Imagine that we have a live stream of interactions between nodes that we want to visualize. Stream of interactions:
t1, n1, n2
t2, n1, n3
t3, n1, n2, n3, n4
t4, n3, n5
t5, n1, …, nm
Stream of interactions:
t1, n1, n2
t2, n1, n3
t3, n1, n2, n3, n4
t4, n3, n5
t5, n1, …, nm
(…) Millions of such interactions
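A minimal sketch of reading such a stream, assuming each record is a comma-separated line with the timestamp first and the node identifiers after it (the exact input format is not given on the slides). The helper names `parse_interaction` and `pairwise` are hypothetical. The second helper expands an interaction among m nodes into node pairs, which is how a group interaction can be counted as pairwise interactions.

```python
from itertools import combinations


def parse_interaction(line):
    """Split 't, n1, n2, ...' into a timestamp and a list of node ids."""
    fields = [field.strip() for field in line.split(",")]
    return fields[0], fields[1:]


def pairwise(nodes):
    """Expand one interaction among m nodes into its m*(m-1)/2 node pairs."""
    return list(combinations(nodes, 2))


# Sample records copied from the slide; timestamps are kept as opaque strings.
stream = [
    "t1, n1, n2",
    "t2, n1, n3",
    "t3, n1, n2, n3, n4",
    "t4, n3, n5",
]

for line in stream:
    timestamp, nodes = parse_interaction(line)
    print(timestamp, pairwise(nodes))
```

Because each record is handled independently, the same loop works on a live stream read line by line.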
Filtering Stream of interactions:
t1, n1, n2
t2, n1, n3
t3, n1, n2, n3, n4
t4, n3, n5
t5, n1, …, nm
We filter the network of interactions on-the-fly.
Filtering We aim to pick the most important nodes and visualize them. Which are the most important nodes? 1.  With highest degree 2.  Exhibiting highest activity/node strength 3.  Most central
Filtering Key factors that we address here: • The importance of nodes changes in time o We update the score of nodes on the fly
• It builds up in time due to repeated activity o We remember the past score and increase it whenever nodes show new activity
• Sometimes it diminishes due to inactivity o We gradually decrease the score to forget the oldest activity
Filtering To sum up:
• We introduce scores (changing in time)
o Nodes: S_i(t)
o Edges: s_ij(t)
• Increasing due to activity (upon each arrival of an interaction)
o Nodes: by Δ_i
o Edges: by δ_ij
• Decreasing due to forgetting (every time period)
o Multiplied by Cforget
Filtering To sum up:
• We introduce scores (changing in time)
o Nodes: S_i(0) = 0
o Edges: s_ij(0) = 0
• Increasing due to activity (upon each arrival of an interaction)
o Nodes: Δ_i = 1
o Edges: δ_ij = 1/(m − 1), where m is the number of nodes in the interaction
• Decreasing due to forgetting (every time period)
o Cforget = 0.9
We keep the nodes and edges with the highest scores
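The scoring rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released C++ implementation: the class name `Scores` and its methods are hypothetical, forgetting is triggered by an explicit call instead of a timer, and ties between equal scores are broken alphabetically. The increments (1 per node, 1/(m − 1) per edge) and the multiplier 0.9 are the values from the slide.

```python
from collections import defaultdict
from itertools import combinations


class Scores:
    """Node scores S_i and edge scores s_ij, all starting at 0."""

    def __init__(self, c_forget=0.9):
        self.c_forget = c_forget
        self.node = defaultdict(float)
        self.edge = defaultdict(float)

    def add_interaction(self, nodes):
        """Raise the scores of all nodes and edges in one interaction."""
        members = sorted(set(nodes))
        m = len(members)
        if m < 2:
            return  # assumption: a lone node is not an interaction
        for i in members:
            self.node[i] += 1.0                   # node increment: 1
        for i, j in combinations(members, 2):
            self.edge[(i, j)] += 1.0 / (m - 1)    # edge increment: 1/(m - 1)

    def forget(self):
        """Multiply every score by the forgetting multiplier."""
        for key in self.node:
            self.node[key] *= self.c_forget
        for key in self.edge:
            self.edge[key] *= self.c_forget

    def top_nodes(self, k):
        """Return the k highest-scoring nodes (ties broken by name)."""
        return sorted(self.node, key=lambda n: (-self.node[n], n))[:k]


scores = Scores()
for interaction in (["n1", "n2"], ["n1", "n3"], ["n1", "n2", "n3", "n4"]):
    scores.add_interaction(interaction)
scores.forget()
print(scores.top_nodes(2))
```

With the 1/(m − 1) increment, each node's incident edges gain a total of 1 per interaction, matching the node increment regardless of how many nodes take part.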
Degree/strength filtering?
Activity stream (input)
Visualized network (output)
A filtering buffer
Activity stream (input)
Buffered network (memory)
Visualized network (output)
A filtering buffer
Stage 1 (filtering)
Stage 2 (update-generation)
Why a buffer? • Remembers the scores of the network • Computationally inexpensive in comparison with the full network • Smooths the animations
The algorithm
Buffered network: Nb ≈ 10^4
Visualized network: Nv ≈ 10^1 to 10^2
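The two network sizes suggest a two-level selection: a larger buffer kept in memory and a small subset of it that is drawn. The sketch below is only one plausible reading, since the slides do not spell out the eviction rule. It assumes the lowest-scoring nodes are dropped when the buffer overflows and that the visualized nodes are simply the best-scoring buffered ones. The function names are hypothetical, and the tiny sizes are for illustration only.

```python
def trim_buffer(node_score, n_buffer):
    """Stage 1 sketch: keep only the n_buffer best-scoring nodes in memory."""
    if len(node_score) <= n_buffer:
        return dict(node_score)
    keep = sorted(node_score, key=lambda n: (-node_score[n], n))[:n_buffer]
    return {n: node_score[n] for n in keep}


def select_visualized(node_score, n_visualized):
    """Stage 2 sketch: choose the n_visualized best-scoring buffered nodes."""
    ranked = sorted(node_score, key=lambda n: (-node_score[n], n))
    return set(ranked[:n_visualized])


# Illustrative scores; real runs would use Nb around 10^4 and Nv around 10^1 to 10^2.
all_scores = {"a": 5.0, "b": 3.0, "c": 0.5, "d": 2.0, "e": 0.1}
buffered = trim_buffer(all_scores, n_buffer=4)
visualized = select_visualized(buffered, n_visualized=2)
print(sorted(buffered), sorted(visualized))
```

Keeping the buffer much larger than the visualized set lets a node accumulate score for a while before it appears on screen, which is one way a buffer can smooth the animation.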
Computational complexity The algorithm is fast: • Stage 1 • Stage 2 Where:
• E – the total number of pairwise interactions read
• Nb – the number of buffered nodes
• Nv – the number of visualized nodes
• F – the number of frames produced
Output of the filtering algorithm The output of the filtering step is formatted as JSON files with differential updates to the visualized network.
• an: Add node
• cn: Change node
• dn: Delete node
• ae: Add edge
• ce: Change edge
• de: Delete edge
The event types follow the JSON format of the Gephi Streaming API.
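As an illustration of differential updates, the sketch below compares two snapshots of the visualized network and emits one JSON object per change, using the six event types listed above. The function `diff_events`, the edge identifier scheme ("source-target"), and the attribute names `size` and `weight` are assumptions made for this example; the attributes actually written by the released filtering code may differ.

```python
import json


def diff_events(old_nodes, new_nodes, old_edges, new_edges):
    """Return JSON event strings turning the old snapshot into the new one.

    Nodes map id -> score; edges map (source, target) -> score.
    """
    events = []
    # Delete edges before nodes so no edge outlives its endpoints.
    for s, t in sorted(old_edges.keys() - new_edges.keys()):
        events.append({"de": {f"{s}-{t}": {}}})
    for n in sorted(old_nodes.keys() - new_nodes.keys()):
        events.append({"dn": {n: {}}})
    for n in sorted(new_nodes):
        if n not in old_nodes:
            events.append({"an": {n: {"label": n, "size": new_nodes[n]}}})
        elif new_nodes[n] != old_nodes[n]:
            events.append({"cn": {n: {"size": new_nodes[n]}}})
    for s, t in sorted(new_edges):
        edge_id = f"{s}-{t}"
        weight = new_edges[(s, t)]
        if (s, t) not in old_edges:
            events.append({"ae": {edge_id: {"source": s, "target": t,
                                            "directed": False,
                                            "weight": weight}}})
        elif weight != old_edges[(s, t)]:
            events.append({"ce": {edge_id: {"weight": weight}}})
    return [json.dumps(event) for event in events]


updates = diff_events(
    old_nodes={"n1": 1.0, "n2": 1.0},
    new_nodes={"n1": 2.0, "n3": 1.0},
    old_edges={("n1", "n2"): 1.0},
    new_edges={("n1", "n3"): 1.0},
)
for line in updates:
    print(line)
```

Sending only the changes keeps each frame's update small even when the underlying stream is large.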
Output of the filtering algorithm The output of the filtering step is formatted as JSON files with differential updates to the visualized network. One can feed it directly to: • Our video-generating module o uses igraph for graph plotting and mencoder for video encoding
• Other tools visualizing dynamic networks o Gephi Streaming API o more platforms?
Our video-generating module What does it do?
1. Parses JSON differential updates
2. Creates/updates a network using igraph
3. Plots the network using pycairo
o the Fruchterman-Reingold layout
o frame stabilization
o extra effects: node-popping and node-soaking animations
4. Encodes a video from the plotted frames using mencoder
Let's see how it works!
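Steps 1 and 2 can be sketched without any plotting library. The example below keeps the network in two plain dictionaries instead of an igraph graph, so it shows only how differential updates are consumed, not how the released module stores or draws the network. The function `apply_event` and the attribute names in the sample events are hypothetical.

```python
import json


def apply_event(nodes, edges, line):
    """Apply one JSON differential update to in-memory node and edge dicts."""
    event = json.loads(line)
    for kind, payload in event.items():
        for key, attrs in payload.items():
            if kind == "an":
                nodes[key] = dict(attrs)
            elif kind == "cn":
                nodes.setdefault(key, {}).update(attrs)
            elif kind == "dn":
                nodes.pop(key, None)
                # Assumption: edges touching a deleted node are dropped too.
                attached = [e for e, a in edges.items()
                            if key in (a.get("source"), a.get("target"))]
                for edge_id in attached:
                    del edges[edge_id]
            elif kind == "ae":
                edges[key] = dict(attrs)
            elif kind == "ce":
                edges.setdefault(key, {}).update(attrs)
            elif kind == "de":
                edges.pop(key, None)


nodes, edges = {}, {}
updates = [
    '{"an": {"n1": {"label": "n1", "size": 1.0}}}',
    '{"an": {"n2": {"label": "n2", "size": 1.0}}}',
    '{"ae": {"n1-n2": {"source": "n1", "target": "n2", "weight": 1.0}}}',
    '{"cn": {"n1": {"size": 2.0}}}',
    '{"dn": {"n2": {}}}',
]
for line in updates:
    apply_event(nodes, edges, line)
print(nodes, edges)
```

After each batch of updates the current state can be handed to a plotting routine to render one frame.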
III. Experimental datasets o Characteristics o Animations o Parameters of the filtering algorithm
Experimental datasets 1. The announcement of Bin Laden’s death on Twitter (2011) Nodes: @users and #hashtags Edges: co-appearances in tweets related to Bin Laden’s death
2. Plot keywords from movies (1912-2018)
Nodes: keywords Edges: co-appearances of keywords in the descriptions of movies
3. Words co-appearing in US patents (1976-2010) Nodes: words appearing in the titles of patents (no stopwords) Edges: co-appearances in the titles
Experimental datasets Characteristics: • Periods of time from 2 hours to 106 years • From tens of thousands to half a million nodes • We visualize at most a few hundred of the most important nodes
1. The announcement of Bin Laden’s death on Twitter (2011) Nodes: @users and #hashtags Edges: co-appearances in tweets related to Bin Laden’s death
2.  Plot keywords from movies (1912-2018)
Nodes: keywords Edges: co-appearances in the user-generated descriptions of movies
3.  Words co-appearing in US patents (1976-2010) Nodes: words appearing in the titles of patents (no stopwords) Edges: co-appearances in the titles
Algorithm’s parameters
• Tcontr – time contraction, i.e., how much shorter the visualization is than the real evolution of the network
• Nb – the number of buffered nodes
• Nv – the number of visualized nodes
• smin – the minimal score of edges that will be visualized
• Cforget – the forgetting multiplier
• Fforget – the number of frames passing between consecutive forgetting events
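A small worked example of how these parameters interact, under assumptions not stated on the slides: Tcontr is read as the ratio of real duration to video duration, the frame rate `fps` is an extra hypothetical parameter, and all numeric values except Cforget = 0.9 are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FilterParams:
    t_contr: float      # Tcontr: real duration divided by video duration (assumed)
    n_buffer: int       # Nb: number of buffered nodes
    n_visualized: int   # Nv: number of visualized nodes
    s_min: float        # smin: minimal score of visualized edges
    c_forget: float     # Cforget: forgetting multiplier
    f_forget: int       # Fforget: frames between forgetting events

    def video_seconds(self, real_seconds):
        """Length of the animation for a given span of real time."""
        return real_seconds / self.t_contr

    def real_seconds_per_frame(self, fps):
        """Real time covered by a single frame at the given frame rate."""
        return self.t_contr / fps

    def decay_after_frames(self, frames):
        """Factor applied to an untouched score after this many frames."""
        return self.c_forget ** (frames // self.f_forget)


# Hypothetical settings for a 2-hour stream compressed 60 times.
params = FilterParams(t_contr=60.0, n_buffer=10_000, n_visualized=100,
                      s_min=1.0, c_forget=0.9, f_forget=10)
print(params.video_seconds(2 * 3600))      # animation length in seconds
print(params.real_seconds_per_frame(25))   # real seconds shown per frame
print(params.decay_after_frames(100))      # score factor after 100 idle frames
```

Under these settings a node that stays silent for 100 frames keeps about a third of its score, so Cforget and Fforget together set how quickly old activity fades from view.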
IV. Source-code release and summary
Open-source
On GitHub: • the filtering algorithm (C++) • the video-generating tool (Python) • preprocessed datasets • documentation
More resources
A whitepaper is available on arXiv
More videos: http://www.youtube.com/user/truthyatindiana/videos
Summary • A filtering algorithm for large streams of interactions producing differential network updates o Computationally inexpensive
• A tool that generates network animations from the network updates o Feel free to build your own animation tools that make use of the differential network updates!
• Everything is open source, and a whitepaper has been released
Thanks for listening! Thanks to the organizers of the data challenge!