How-to Guide for R
Introduction

In this guide, you will learn how to perform the community detection method, infomap, in statistical software R, using a practical example to illustrate this process. Readers are provided with links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. This example assumes that you have the data file stored in the working directory being used by R.

Contents
  • Infomap
  • An Example in R: Factions in the Karate Club
    • 2.1 The R Procedure
    • 2.2 Exploring the R Output
  • Your Turn
1 Infomap

Infomap is one of the family of methods that addresses the problem of partitioning the nodes into clusters, which is an active research area called community detection. Infomap was first developed by Rosvall and Bergstrom and has been extended to temporal networks, multiplex networks, and so on.

Infomap is an information-theoretic approach for the detection of communities (i.e., clusters of nodes). Roughly speaking, a group of nodes among which information flows quickly and easily are identified as a community, and the links between communities capture the avenues of information flow between those groups.

2 An Example in R: Factions in the Karate Club

This example introduces the community detection method, infomap, with a dataset about a university karate club. Specifically, we examine the community structure of the social network between the karate club members.

The data were collected and studied by Zachary in his paper “An Information Flow Model for Conflict and Fission in Small Groups” published in 1977. It was collected during a period of 3 years from 1970 to 1972. The karate club had 34 members, and hence, the network consists of 34 nodes with each representing a member, and an edge exists between two nodes if they interacted outside the club. There are 78 undirected, unweighted edges in total.

What makes this network an ideal example for community detection is an incident occurred during the study. The administrator of the club had a conflict with the instructor during the study, and the club split into two in the end. About half of the members formed a new club with the instructor, and another half stayed with the administrator and found a new instructor. Therefore, there were obviously two factions in the network, led by the instructor and the administrator, respectively, long before the split, which provides a benchmark to test any community detection method.

2.1 The R Procedure

R is a free open-source software and computing platform well suited for statistical analysis. R does not operate with pull-down menus. Rather, you must submit lines of code that execute functions and operations built into R. It is best to save your code in a simple text file that R users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with R, we suggest you start with the introduction manual located at http://cran.r-project.org/doc/manuals/r-release/R-intro.html.

For this example, we must first load the node table and the edge table into R. Using the network files provided, the code looks like this (assuming the data file is already saved in your working directory):

  • nodes = read.csv('dataset-karate-1977-subset1-nodes.csv')
  • edges = read.csv('dataset-karate-1977-subset1-edges.csv')

Now, the node table and edge table are read in as dataframes. To perform any analysis, we need to turn them into a network object. There are two packages in R commonly used for network analysis: igraph and statnet. The statnet is useful in statistical modeling of networks and will be introduced in SAGE dataset example on Exponential Random Graph Models. In this example, we use the igraph which is good at computations on networks.

We need to load the igraph package in order to use it. If you don’t have the igraph installed, you will get an error. Run the following code to install it first:

install.packages(’igraph’)

Once it is installed successfully or if already installed, you can load it like this:

library(’igraph’)

Next, we can turn the node and edge tables into a network object by the following command:

G = graph_from_data_frame(d=edges, vertices=nodes, directed=F)

Any column after the first one in the node table will be used as attributes for the nodes, and any column after the second in the edge table will be used as attributes for the edges. The attribute we need is the faction ID each member actually joined after the split, which is called (no surprise) “faction.”

Finally, the following command performs the infomap method and returns the “best” partition.

community = cluster_infomap(G)

As the community detection results are available, we can plot it first and inspect it visually.

plot(G, vertex.color = membership(community))

Note that the color of the nodes is set by their community membership.

Lastly, we check the community ID to which each member is assigned, and the faction each member actually joined after the split:

  • membership(community)
  • vertex_attr(G,’faction’)
2.2 Exploring the R Output

For each command above, R will return its results immediately. Here, we summarize the main results.

Figure 1 shows the karate club network with color-coded communities. Infomap finds three communities, denoted by the colors blue, orange, and green. The resulting community ID for each node might not be exactly the same as the faction ID, since what is called Community 1 could be Faction 2 – the numerics have no meaning here. What we are interested in is whether a community contains the same nodes as a faction does. After comparing this partition to the actual split of the club, we find that infomap correctly identifies the actual split of the club members except for one node (#10), but it further divides one faction into two communities (the blue and the green communities).

Figure 1: The Karate Club Network With Color-Coded Communities Identified by the Infomap Method.
Figure

3 Your Turn

Download this sample dataset to see whether you can replicate these results. Actually, the link between Nodes 34 and 23 is ambiguously reported in Zachary’s original paper. Try repeating the process after removing this link and check how the partition result changes. (Hint: Use the minus operator (-) to delete an edge from the network. E.g., G = G - edge(1|2) will delete the edge between Nodes 1 and 2.)