In this guide, you will learn how to perform the community detection method, label propagation, in statistical software R, using a practical example to illustrate this process. Readers are provided with links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. This example assumes that you have the data file stored in the working directory being used by R.
Label propagation takes a consensus approach for the detection of communities (i.e., clusters of nodes). As an over-simplified example, imagine each node in the network as a voter who votes for which community she wants to be in. The voters also want to be with their friends, and hence, each voter decides her vote based on what the majority of her friends vote for. In the end, the group of nodes who reach consensus form a community.
This example introduces the community detection method, label propagation, with a dataset about a university karate club. Specifically, we examine the community structure of the social network between the karate club members.
The data were collected and studied by Zachary in his paper “An Information Flow Model for Conflict and Fission in Small Groups” published in 1977. It was collected during a period of three years from 1970 to 1972. The karate club had 34 members, and hence, the network consists of 34 nodes with each representing a member, and an edge exists between two nodes if they interacted outside the club. There are 78 undirected, unweighted edges in total.
What makes this network an ideal example for community detection is an incident occurred during the study. The administrator of the club had a conflict with the instructor during the study, and the club split into two in the end. About half of the members formed a new club with the instructor, and another half stayed with the administrator and found a new instructor. Therefore, there were obviously two factions in the network, led by the instructor and the administrator, respectively, long before the split, which provides a benchmark to test any community detection method.
R is a free open-source software and computing platform well suited for statistical analysis. R does not operate with pull-down menus. Rather, you must submit lines of code that execute functions and operations built into R. It is best to save your code in a simple text file that R users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with R, we suggest you start with the introduction manual located at http://cran.r-project.org/doc/manuals/r-release/R-intro.html.
For this example, we must first load the node table and the edge table into R. Using the network files provided, the code looks like this (assuming the data file is already saved in your working directory):
Now, the node table and edge table are read in as dataframes. To perform any analysis, we need to turn them into a network object. There are two packages in R commonly used for network analysis: igraph and statnet. Statnet is useful in statistical modeling of networks and will be introduced in SAGE dataset example on Exponential Random Graph Models. In this example, we use igraph which is good at computations on networks.
We need to load the igraph package in order to use it. If you don’t have igraph installed, you will get an error. Run the following code to install it first:
Once it is installed successfully or if already installed, you can load it like this:
Next, we can turn the node and edge tables into a network object by the following command:
G = graph_from_data_frame(d=edges, vertices=nodes, directed=F)
Any column after the first one in the node table will be used as attributes for the nodes, and any column after the second in the edge table will be used as attributes for the edges. The attribute we need is the faction ID each member actually joined after the split, which is called (no surprise) “faction.”
Finally, the following command performs the label propagation method and returns the partition.
community = cluster_label_prop(G)
As the community detection results are available, we can plot it first and inspect it visually.
plot(G, vertex.color = membership(community))
Note that the color of the nodes is set by their community membership.
Lastly, we check the community ID to which each member is assigned and the faction each member actually joined after the split:
For each command above, R will return its results immediately. Here, we summarize the main results.
Figure 1 shows the karate club network with color-coded communities. Label propagation finds two communities, denoted by the colors blue and orange. The resulting community ID for each node might not be exactly the same as the faction ID, since what is called Community 1 could be Faction 2 – the numerics have no meaning here. What we are interested in is whether a community contains the same nodes as a faction does. After comparing this partition to the actual split of the club members, we find that label propagation correctly identifies the actual split of the club members except for one node (#10).
Download this sample dataset to see whether you can replicate these results. Actually, the link between Nodes 34 and 23 is ambiguously reported in Zachary’s original paper. Try repeating the process after removing this link and check how the partition result changes. (Hint: Use the minus operator (-) to delete an edge from the network. E.g., G = G - edge(1|2) will delete the edge between Nodes 1 and 2.)