How-to Guide for R
Introduction

In this guide, you will learn how to produce community detection with edge betweenness in statistical software R, using a practical example to illustrate this process. Readers are provided with links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. This example assumes that you have the data file stored in the working directory being used by R.

Contents
  • Community Detection With Edge Betweenness
  • An Example in R: Factions in the Karate Club
    • 2.1 The R Procedure
    • 2.2 Exploring the R Output
  • Your Turn
1 Community Detection With Edge Betweenness

Edge betweenness is a centrality measure for edges in a network. It can also be used to partition the nodes in a network into communities because edges with high betweenness are usually bridges between densely connected clusters of nodes. Hence, we iteratively find and remove the edge with largest betweenness to divide a network into a hierarchy of nested communities.

2 An Example in R: Factions in the Karate Club

This example introduces community detection with edge betweenness with a dataset about a university karate club. Specifically, we examine the community structure of the social network between the karate club members.

The data were collected and studied by Zachary in his paper “An Information Flow Model for Conflict and Fission in Small Groups” published in 1977. It was collected during a period of three years from 1970 to 1972. The karate club had 34 members, and hence, the network consists of 34 nodes with each representing a member, and an edge exists between two nodes if they interacted outside the club. There are 78 undirected, unweighted edges in total.

What makes this network an ideal example for community detection is an incident occurred during the study. The administrator of the club had a conflict with the instructor during the study, and the club split into two in the end. About half of the members formed a new club with the instructor, and another half stayed with the administrator and found a new instructor. Therefore, there were obviously two factions in the network, led by the instructor and the administrator, respectively, long before the split, which provides a benchmark to test any community detection method.

2.1 The R Procedure

R is a free open-source software and computing platform well suited for statistical analysis. R does not operate with pull-down menus. Rather, you must submit lines of code that execute functions and operations built into R. It is best to save your code in a simple text file that R users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with R, we suggest you start with the introduction manual located at http://cran.r-project.org/doc/manuals/r-release/R-intro.html.

For this example, we must first load the node table and the edge table into R. Using the network files provided, the code looks like this (assuming the data file is already saved in your working directory):

  • nodes = read.csv('dataset-karate-1977-subset1-nodes.csv')
  • edges = read.csv('dataset-karate-1977-subset1-edges.csv')

Now, the node table and edge table are read in as dataframes. To perform any analysis, we need to turn them into a network object. There are two packages in R commonly used for network analysis: igraph and statnet. The statnet is useful in statistical modeling of networks and will be introduced in this SAGE Research Methods Dataset on Exponential Random Graph Models. In this example, we use the igraph which is good at computations on networks.

We need to load the igraph package in order to use it. If you don’t have the igraph installed, you will get an error. Run the following code to install it first:

install.packages(’igraph’)

Once it is installed successfully or if already installed, you can load it like this:

library(’igraph’)

Next, we can turn the node and edge tables into a network object by the following command:

G = graph_from_data_frame(d=edges, vertices=nodes, directed=F)

Any column after the first one in the node table will be used as attributes for the nodes, and any column after the second in the edge table will be used as attributes for the edges. The attribute we need is the faction ID each member actually joined after the split, which is called (no surprise) “faction.”

Finally, the following command calculates a partition with edge betweenness.

community = cluster_edge_betweenness(G)

As the community detection results are available, we can plot it first and inspect it visually.

plot(G, vertex.color = membership(community))

Note that the color of the nodes is set by their community membership.

Lastly, we check the community ID to which each member is assigned and the faction each member actually joined after the split:

  • membership(community)
  • vertex_attr(G,’faction’)
2.2 Exploring the R Output

For each command above, R will return its results immediately. Here, we summarize the main results.

Figure 1 shows the karate club network with color-coded communities. Using the edge betweenness method, we find five communities. The resulting community ID for each node might not be exactly the same as the faction ID, since what is called Community 1 could be Faction 2 – the numerics have no meaning here. What we are interested in is whether a community contains the same nodes as a faction does. After comparing this partition to the actual split of the club, we find that the partition correctly identifies the actual split of the club members except for one node (#3), but it further divides one faction into two communities and the other into three communities.

Figure 1: The Karate Club Network With Color-Coded Communities Identified by Edge Betweenness.
Figure

3 Your Turn

Download this sample dataset to see whether you can replicate these results. Actually, the link between Nodes 34 and 23 is ambiguously reported in Zachary’s original paper. Try repeating the process after removing this link and check how the partition result changes. (Hint: Use the minus operator (-) to delete an edge from the network. E.g., G = G - edge(1|2) delete the edge between nodes 1 and 2.)