Community Detection in Directed Weighted Function-call Networks
Zhengxu Zhao1, Yang Guo*2, Weihua Zhao3
1,3 Shijiazhuang Tiedao University, Shijiazhuang, Hebei, China
2 School of Mechanical Engineering, Shandong University, Jinan, Shandong, China
1 zhaozx@stdu.edu.cn; 2 guoyang1013@126.com; 3 zhaoweihua9@hotmail.com
Abstract
Complex networks are regarded as a key approach to representing complex systems. We have focused on the static analysis of software systems using function-call graphs and found empirically that they exhibit small-world and scale-free features similar to those identified in sociological and technological systems. Another crucial step in studying the structure of networks is to identify communities. Most existing approaches to this problem simply ignore edge direction and apply methods designed for community detection in undirected networks. In this paper, we consider the problem of finding communities in directed weighted networks and develop a new community detection algorithm for such networks. In contrast to existing detection algorithms, our method regards communities as groups of links rather than nodes. The algorithm is tested on the function-call graph of an artificial system, and the experiment shows that it successfully finds communities in this directed weighted test network.
Keywords
Community Detection Algorithms; Function-call Graph; Link Community; Directional Information
Introduction
Complex systems are found in sociology, economics, biology and many other fields. A common way to understand a complex system is to decompose it into its components, so how to represent such systems is a fundamental problem. An effective way is through graphs or networks. As the basis for understanding the behavior of numerous large complex systems, networks have become a promising research area in many fields. A network is simply a set of nodes or vertices connected in pairs by edges. The concept derives from graph theory, which dates back to Euler's solution of the Königsberg bridge puzzle in 1736 [Euler (1736)]. Since then, networks have come to be regarded as an important interdisciplinary approach to representing complex systems. Many
complex systems, such as large software systems, can be represented as networks, where the basic components of a system are nodes and the links between them represent their mutual interactions [Newman (2003)]. Many real networks display strong inhomogeneity, revealing a high level of order and organization. A few vertices with large degree coexist with many vertices with low degree, and the density of edges within groups is higher than between them. Girvan and Newman [Girvan et al. (2002)] first named this feature “community structure”. A community is defined as a group of nodes which may share common properties or play similar roles within the network. Mathematically, each node’s degree inside its community should not be smaller than its degree toward any other community. Community detection algorithms have been proposed along several lines, which can be divided into graph partitioning methods from computer science and hierarchical clustering methods from sociology. Representative algorithms include the Kernighan-Lin algorithm [Kernighan et al. (1970)], spectral analysis [Pothen et al. (1990)], the Girvan-Newman algorithm [Girvan et al. (2002)], the edge clustering coefficient method [Radicchi et al. (2004)], and so forth. Each algorithm has its own advantages and disadvantages in terms of computation speed or the prior knowledge it requires.

International Journal of Automation and Control Engineering, Vol. 4, No. 1—April 2015 2325-7407/15/01 009-5 © 2015 DEStech Publications, Inc. doi:10.12783/ijace.2015.0401.03

Related Work
There is a substantial body of work on detecting community structure. The general purpose of these works is to find significant divisions into groups by analyzing the structural properties of the whole network. The Kernighan-Lin algorithm [Kernighan et al. (1970)], proposed by Kernighan and Lin in 1970 and the earliest of these methods, is a heuristic algorithm for the graph partitioning problem. It optimizes a benefit function Q, which represents the difference between the number of edges inside the modules and the number lying between them. The Girvan-Newman algorithm [Girvan et al. (2002)] is a hierarchical method for detecting communities in complex systems. It uses edge betweenness to find community boundaries and detects communities by progressively removing edges from the original network. Two years later, Newman [Newman (2004)] proposed a fast method based on modularity optimization in order to deal with large networks. He defined a quality function called modularity Q to test whether a particular division is significant. This method is a “greedy” optimization based on the iterative agglomeration of small communities. Radicchi et al. [Radicchi et al. (2004)] first devised a way to implement a quantitative definition of community in a generic divisive algorithm and then introduced a divisive algorithm based on local quantities. This algorithm has a lower computational cost while keeping the same level of reliability. Fortunato et al. [Fortunato et al. (2004)] developed an algorithm based on information centrality that iteratively finds and removes the edge with the highest information centrality. Clauset et al. [Clauset et al. (2004)] developed an agglomerative algorithm that greedily optimizes modularity: it repeatedly joins the two communities whose amalgamation produces the largest increase in modularity Q. Wu and Huberman [Wu et al. (2004)] proposed a method based on the notion of voltage drops across the network, which detects communities within networks of arbitrary size in time that scales linearly with network size. Reichardt and Bornholdt [Reichardt et al. (2004)] presented a fast community detection algorithm based on a modified q-state Potts model which needs no prior knowledge of the number of communities.
They found that the communities in a network coincide with the domains of equal spin value in the minima of a modified Potts spin glass Hamiltonian. Blondel et al. [Blondel et al. (2008)] proposed a heuristic method that is also based on modularity optimization. This method is easy to implement and fast.

Directed Weighted Link Community Detection Algorithm (DWLC)
The algorithms presented in the previous section, which mainly detect communities as groups of nodes in unweighted networks, have been widely applied to finding communities. However, most real networks are essentially weighted, and their edge weights carry specific meanings. Moreover, weighted networks with the same topological structure can exhibit different community structures under different weight distributions. On the other hand, real networks are not always composed of separate communities; they are often characterized by well-defined statistics of overlapping and nested communities, which means that nodes can belong to multiple communities simultaneously. It is therefore necessary to design a new algorithm that detects overlapping communities and also handles weighted networks. Evans et al. [Evans et al. (2009)] used a partition of the links of a network, obtained as a node partition of the line graph of the original network, to detect overlapping communities, and showed that the quality of a link partition can be evaluated by the modularity of its corresponding line graph. The next year, Ahn et al. [Ahn et al. (2010)] observed that link communities naturally incorporate overlap while revealing hierarchical organization, and therefore redefined communities as groups of links rather than nodes. They used hierarchical clustering with a similarity between links to build a dendrogram in which each leaf is a link from the original network and branches represent link communities. The similarity between two links $e_{ik}$ and $e_{jk}$ is calculated with the Jaccard similarity coefficient:

$$S(e_{ik}, e_{jk}) = \frac{|n_+(i) \cap n_+(j)|}{|n_+(i) \cup n_+(j)|} \tag{1}$$
where $n_+(i)$ is the set consisting of node $i$ and its neighbors. To obtain the most relevant communities, it is necessary to determine the best level at which to cut the link dendrogram. Ahn et al. therefore also introduced a function called partition density, which is based on the link density inside communities. Suppose that a network is divided into $C$ communities, and that $P_c$ is one part of this partition, with $n_c$ nodes and $m_c$ links. The partition density $D_c$ of community $c$ is defined as:

$$D_c = \begin{cases} \dfrac{m_c - (n_c - 1)}{n_c (n_c - 1)/2 - (n_c - 1)}, & n_c \neq 2 \\ 0, & n_c = 2 \end{cases} \tag{2}$$
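To make the normalization in Eq. (2) concrete, consider three hypothetical communities that are not taken from the paper: a triangle, a four-node tree, and a four-node community with five links. Substituting their node and link counts into Eq. (2) gives:

```latex
\begin{align*}
\text{triangle } (n_c = 3,\ m_c = 3):\quad & D_c = \frac{3 - 2}{3 \cdot 2 / 2 - 2} = 1 \\
\text{tree } (n_c = 4,\ m_c = 3):\quad & D_c = \frac{3 - 3}{4 \cdot 3 / 2 - 3} = 0 \\
\text{five links } (n_c = 4,\ m_c = 5):\quad & D_c = \frac{5 - 3}{4 \cdot 3 / 2 - 3} = \frac{2}{3}
\end{align*}
```

In other words, $D_c$ is 0 when the community has only the $n_c - 1$ links needed to keep it connected and reaches 1 when every pair of its nodes is linked.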
The partition density $D$ of the whole network with $M$ links is the average of $D_c$, weighted by the fraction of links present in each community:

$$D = \frac{1}{M} \sum_c m_c D_c = \frac{2}{M} \sum_c m_c \frac{m_c - (n_c - 1)}{(n_c - 1)(n_c - 2)} \tag{3}$$
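The link similarity of Eq. (1) and the partition density of Eqs. (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of Ahn et al.: the function names and the toy graph (two triangles sharing one node) are assumptions made here, and the graph is treated as undirected and unweighted.

```python
from collections import defaultdict


def inclusive_neighbors(edges):
    """Return n_+(i) for every node: the node itself plus its neighbors."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].update((u, v))
        nbrs[v].update((u, v))
    return nbrs


def link_similarity(nbrs, i, j):
    """Eq. (1): Jaccard coefficient for links e_ik and e_jk sharing node k."""
    return len(nbrs[i] & nbrs[j]) / len(nbrs[i] | nbrs[j])


def partition_density(communities):
    """Eq. (3): `communities` is a list of link lists, one list per community."""
    total_links = sum(len(links) for links in communities)
    acc = 0.0
    for links in communities:
        m_c = len(links)
        n_c = len({node for link in links for node in link})
        if n_c > 2:  # Eq. (2) sets D_c = 0 when n_c == 2
            acc += m_c * (m_c - (n_c - 1)) / ((n_c - 2) * (n_c - 1))
    return 2.0 * acc / total_links


# Hypothetical toy graph: two triangles sharing node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
nbrs = inclusive_neighbors(edges)
same_triangle = link_similarity(nbrs, 0, 1)  # links e_02 and e_12
across = link_similarity(nbrs, 1, 3)         # links e_12 and e_32
two_triangles = partition_density([edges[:3], edges[3:]])
one_block = partition_density([edges])
```

On this toy graph, links inside one triangle are more similar than links from different triangles, and splitting the links into the two triangles yields a higher partition density than keeping all six links in one community.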
In this section, we introduce a new algorithm, abbreviated DWLC (directed weighted link community detection algorithm), which extends the link community detection algorithm to directed weighted networks. DWLC considers the direction and the weights of links together and groups links with high concentration into a community. To extract the directional information of links in directed networks, we use the method introduced by Kim et al. [Kim et al. (2009)], which can be described as follows. Consider a pair of nodes $i$ and $j$, where node $i$ has low in-degree and high out-degree while node $j$ is just the opposite. A directed link pointing from node $i$ to node $j$ is then the more likely one. If a directed link running from node $j$ to node $i$ is found instead, it is surprising, and this abnormal link plays a more important role in community detection. Rather than simply ignoring link direction, DWLC gives higher weight to such abnormal links and then calculates the link concentration used for merging communities. Consider a network $N$ with $M$ links, where $A = [a_{ij}]$ is the connection matrix of $N$, $k_i$ is the degree of node $i$, and $k_i^{in}$ and $k_i^{out}$ are the in-degree and out-degree of node $i$, respectively. The link concentration $C(e_{ik}, e_{jk})$ is defined as:

$$C(e_{ik}, e_{jk}) = L(e_{ij}) \cdot S(e_{ik}, e_{jk}) \tag{4}$$
In this equation, $L(e_{ij})$ describes the directional information of the link pointing from node $i$ to node $j$, while $S(e_{ik}, e_{jk})$ represents the weighted similarity of links $e_{ik}$ and $e_{jk}$:

$$L(e_{ij}) = a_{ij}(1 - p_{ij}) + a_{ji}(1 - p_{ji}), \qquad S(e_{ik}, e_{jk}) = \frac{\mathbf{a}_i \cdot \mathbf{a}_j}{|\mathbf{a}_i|^2 + |\mathbf{a}_j|^2 - \mathbf{a}_i \cdot \mathbf{a}_j} \tag{5}$$
where $p_{ij}$ is the probability of the abnormal link and $\mathbf{a}_i$ is the vector of weights of the links between node $i$ and all the neighbor nodes shared by nodes $i$ and $j$. The probability $p_{ij}$ is defined as:

$$p_{ij} = \frac{k_j^{out} k_i^{in} / 2M}{k_j^{out} k_i^{in} / 2M + k_i^{out} k_j^{in} / 2M} = \frac{k_j^{out} k_i^{in}}{k_j^{out} k_i^{in} + k_i^{out} k_j^{in}} \tag{6}$$
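Equations (4) to (6) can be evaluated directly once the degrees and weights are known. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the degrees and weights are made-up numbers, and the fallback values returned when a denominator is zero are choices made here, since the text does not cover those cases. The indices of `a`, `k_in` and `k_out` are used exactly as they appear in the equations; which link orientation `a[i][j]` encodes is left to the caller.

```python
def abnormal_link_probability(k_in, k_out, i, j):
    """Eq. (6): p_ij from in- and out-degrees; the 2M factors cancel."""
    num = k_out[j] * k_in[i]
    den = num + k_out[i] * k_in[j]
    # Assumption: return 0.5 when both products vanish (not covered in the text).
    return num / den if den else 0.5


def directional_weight(a, k_in, k_out, i, j):
    """Eq. (5): L(e_ij) = a_ij (1 - p_ij) + a_ji (1 - p_ji)."""
    p_ij = abnormal_link_probability(k_in, k_out, i, j)
    p_ji = abnormal_link_probability(k_in, k_out, j, i)
    return a[i][j] * (1.0 - p_ij) + a[j][i] * (1.0 - p_ji)


def weighted_similarity(a_i, a_j):
    """Eq. (5): S = a_i . a_j / (|a_i|^2 + |a_j|^2 - a_i . a_j)."""
    dot = sum(x * y for x, y in zip(a_i, a_j))
    den = sum(x * x for x in a_i) + sum(y * y for y in a_j) - dot
    # Assumption: two all-zero weight vectors are given similarity 0.
    return dot / den if den else 0.0


def link_concentration(a, k_in, k_out, i, j, a_i, a_j):
    """Eq. (4): C(e_ik, e_jk) = L(e_ij) * S(e_ik, e_jk)."""
    return directional_weight(a, k_in, k_out, i, j) * weighted_similarity(a_i, a_j)


# Hypothetical numbers: node 0 has low in-degree and high out-degree,
# node 1 the opposite, as in the example described in the text.
k_in, k_out = [1, 4], [4, 1]
a = [[0, 3], [0, 0]]                     # illustrative weights a_01 = 3, a_10 = 0
p_01 = abnormal_link_probability(k_in, k_out, 0, 1)
p_10 = abnormal_link_probability(k_in, k_out, 1, 0)
s = weighted_similarity([3, 1], [1, 2])  # weights to two shared neighbors
c = link_concentration(a, k_in, k_out, 0, 1, [3, 1], [1, 2])
```

With these numbers $p_{01} = 1/17$ and $p_{10} = 16/17$, so the two probabilities of a node pair always sum to 1, and $S$ reduces to the Tanimoto coefficient of the two weight vectors.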
Thus we can transform the original directed weighted network into a new undirected weighted one that retains the directional information, and apply existing well-developed algorithms to this new network. The core idea of the DWLC algorithm is to use hierarchical clustering to merge links into the same community according to their link concentration, so that communities in directed weighted networks are detected without losing directional information.
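The merging step can be illustrated with a small single-linkage sketch. The text does not specify the linkage rule or the level at which the dendrogram is cut, so the code below assumes single linkage with a fixed threshold, which amounts to merging every pair of adjacent links whose concentration reaches that threshold. The function `cluster_links`, the toy link list and the made-up `toy_concentration` function are all hypothetical stand-ins; in DWLC the concentration would come from Eq. (4).

```python
from itertools import combinations


def cluster_links(links, concentration, threshold):
    """Group links by single linkage: merge pairs with concentration >= threshold."""
    parent = list(range(len(links)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p, q in combinations(range(len(links)), 2):
        shares_node = bool(set(links[p]) & set(links[q]))
        if shares_node and concentration(links[p], links[q]) >= threshold:
            parent[find(p)] = find(q)

    groups = {}
    for idx, link in enumerate(links):
        groups.setdefault(find(idx), []).append(link)
    return list(groups.values())


# Hypothetical toy input: two triangles sharing node 2, with a made-up
# concentration that is high inside a triangle and low across triangles.
links = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
left = {0, 1, 2}


def toy_concentration(e1, e2):
    same_side = (set(e1) <= left) == (set(e2) <= left)
    return 0.9 if same_side else 0.1


communities = cluster_links(links, toy_concentration, threshold=0.5)
node_sets = [{n for link in group for n in link} for group in communities]
overlap = set.intersection(*node_sets)
```

Because communities are sets of links, a node that touches links from both groups (node 2 here) belongs to both communities, which is how a link-based method produces overlapping node communities.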
FIGURE 1. DIRECTED FUNCTION-CALL GRAPH FOR LUA 1.0
Experiment Results and Discussions
In order to test the effectiveness of the DWLC algorithm, we selected a directed weighted network reflecting the function-call relationships of a software system. Using the function-call graph reconstruction algorithm we proposed earlier [Guo et al. (2012), Zhao et al. (2013)], we established a directed function-call network with 122 nodes and 314 edges, in which the weight of an edge is the number of calls between the corresponding functions. Figure 1 shows the topology of the function-call network for Lua 1.0. We applied the GN algorithm [Girvan et al. (2002)], the spin glass algorithm [Reichardt et al. (2004)] and the DWLC algorithm to detect communities in this network; the results are shown in Figure 2 and Figure 3.

FIGURE 2. COMMUNITIES DETECTED BY GN AND SPIN GLASS ALGORITHM

FIGURE 3. COMMUNITIES DENDROGRAM OF LUA’S FUNCTION CALL NETWORK

Comparing the results of the three algorithms, the number of detected communities differs considerably: the GN algorithm divides the network into 26 communities, the spin glass algorithm into 9, and the DWLC algorithm into 15. In detail, 15 of the communities detected by the GN algorithm contain only one node; the sizes of the communities detected by the spin glass algorithm are very stable; and the DWLC algorithm assigns some nodes to multiple communities simultaneously. Comparing the communities themselves, those detected by the GN algorithm are also found by the DWLC algorithm, except for the assignment of 5 nodes; likewise, the assignment of most nodes by the spin glass algorithm is consistent with that by the DWLC algorithm, with only 5 nodes placed in different communities.

Conclusions
In this paper, we have proposed a community detection algorithm for directed weighted networks that takes directional information into account. The algorithm was tested on the function-call network of Lua 1.0 and compared with existing well-developed algorithms; the results confirm that our method works well and show that the DWLC algorithm can also detect overlapping communities in networks.

REFERENCES
Ahn Y. Y., Bagrow J. P., Lehmann S., “Link communities reveal multiscale complexity in networks”, Nature, vol.466, n.7307, pp.761-764, 2010.
Blondel V. D., Guillaume J.-L., Lambiotte R., Lefebvre E., “Fast unfolding of communities in large networks”, Journal of Statistical Mechanics: Theory and Experiment, vol.2008, n.10, pp.10008-10019, 2008.
Clauset A., Newman M. E. J., Moore C., “Finding community structure in very large networks”, Physical Review E, vol.70, n.6, pp.066111-1-066111-6, 2004.
Euler L., “Solutio problematis ad geometriam situs pertinentis (in Latin)”, Commentarii Academiae Scientiarum Imperialis Petropolitanae, vol.8, pp.128-140, 1736.
Evans T. S., Lambiotte R., “Line graphs, link partitions, and overlapping communities”, Physical Review E, vol.80, n.1, 016105, 2009.
Fortunato S., Latora V., Marchiori M., “Method to find community structures based on information centrality”, Physical Review E, vol.70, n.5, pp.056104-1-056104-13, 2004.
Girvan M., Newman M. E. J., “Community structure in social and biological networks”, PNAS, vol.99, n.12, pp.7821-7826, 2002.
Guo Y., Zhao Z., Zhou Y., “Complexity analysis with function-call graph on Windows software”, International Review on Computers and Software, vol.7, n.3, pp.1149-1153, 2012.
Kernighan B. W., Lin S., “An efficient heuristic procedure for partitioning graphs”, Bell System Technical Journal, vol.49, pp.291-307, 1970.
Kim Y., Son S. W., Jeong H., “Community identification in directed networks”, Complex Sciences, Springer Berlin Heidelberg, pp.2050-2053, 2009.
Newman M. E. J., “The structure and function of complex networks”, SIAM Review, vol.45, n.2, pp.167-256, 2003.
Newman M. E. J., “Fast algorithm for detecting community structure in networks”, Physical Review E, vol.69, n.6, pp.066133-1-066133-5, 2004.
Pothen A., Simon H. D., Liou K.-P., “Partitioning sparse matrices with eigenvectors of graphs”, SIAM Journal on Matrix Analysis and Applications, vol.11, n.3, pp.430-452, 1990.
Radicchi F., Castellano C., Cecconi F., Loreto V., Parisi D., “Defining and identifying communities in networks”, PNAS, vol.101, n.9, pp.2658-2663, 2004.
Reichardt J., Bornholdt S., “Detecting fuzzy community structures in complex networks with a Potts model”, Physical Review Letters, vol.93, n.21, pp.218701-1-218701-4, 2004.
Wu F., Huberman B. A., “Finding communities in linear time: a physics approach”, The European Physical Journal B, vol.38, n.2, pp.331-338, 2004.
Zhao Z., Guo Y., “Scale-Free Model in Software Engineering: A New Design Method”, Geo-Informatics in Resource Management and Sustainable Ecosystem, Springer Berlin Heidelberg, pp.346-353, 2013.
Dr. Zhao obtained his BSc, MSc and PhD in computing science and technology. He has been Professor and Chair in Applied Computing at the University of Derby since 1995 and holds a DSc from Derby for his research work. He is currently a Professor in the Faculty of Information Science and Technology at Shijiazhuang Tiedao University, P R China. His research interests include virtual reality systems, scientific visualization, complex networks and information organization.

Yang Guo (corresponding author) is a PhD student in the School of Mechanical Engineering at Shandong University, China. His research interests mainly include the structure and dynamics of complex systems and virtual reality systems.

Weihua Zhao is an associate professor in information organization and archive management at Shijiazhuang Tiedao University, P R China. She can be contacted by email at zhaoweihua9@hotmail.com.