How-to Guide for R

Introduction

In this guide, you will learn how to simulate networks from the configuration model in the statistical software R, using a practical example to illustrate the process. Links to the example dataset are provided, and you are encouraged to replicate the example. An additional practice exercise is suggested at the end of this guide. The example assumes that the data files are stored in the working directory being used by R.

Contents

- 1 Configuration Model
- 2 An Example in R: Non-Random Florentine Family Network
  - 2.1 The R Procedure
  - 2.2 Exploring the R Output
- 3 Your Turn

1 Configuration Model

The configuration model is a mathematical model for generating random networks with a prescribed degree sequence. In other words, it samples from a uniform distribution over the space of networks with the same degree sequence. Because it preserves the degree distribution observed in a real network, it is one of the most commonly used null models in network analysis.
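One common way to realize this sampling is "stub matching": give each node as many half-edges (stubs) as its degree, shuffle the stubs, and pair them up. The base-R sketch below illustrates the idea on a toy degree sequence. The helper name `stub_matching` and the toy values are assumptions made for illustration, not part of igraph or of the example dataset.

```r
# Minimal stub-matching sketch of the configuration model (base R only).
# `stub_matching` is a hypothetical helper, not an igraph function.
stub_matching <- function(deg) {
  stopifnot(sum(deg) %% 2 == 0)               # total degree must be even
  stubs <- rep(seq_along(deg), times = deg)   # node i appears deg[i] times
  stubs <- stubs[sample.int(length(stubs))]   # shuffle the stubs by index
  matrix(stubs, ncol = 2, byrow = TRUE)       # pair consecutive stubs into edges
}

set.seed(1)                 # arbitrary seed, only for reproducibility
deg <- c(3, 2, 2, 1)        # toy degree sequence (an assumption)
edge_list <- stub_matching(deg)

# Each node appears in the edge list exactly as often as its degree.
# A self-loop uses two stubs of the same node, so the resulting
# multigraph preserves the degree sequence exactly.
tabulate(as.vector(edge_list), nbins = length(deg))
```

The result may contain self-loops and repeated edges, which is why simulated networks are often simplified before they are compared with an observed simple network.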

2 An Example in R: Non-Random Florentine Family Network

This example introduces the configuration random network model and compares it with a network of Renaissance Florentine families around 1430. Specifically, we compare the diameter and the clustering coefficient of the Florentine family network with those of the networks generated from the configuration model. The families are nodes, and marriage ties between the families are edges in the network.

This example uses a subset of data from the Florentine Families dataset collected by Padgett (1994) and made publicly available by UCINET (https://sites.google.com/site/ucinetsoftware/datasets/padgettflorentinefamilies). The network is undirected and binary since marriage ties are mutual and dichotomous. It includes 16 nodes and 20 edges.

2.1 The R Procedure

R is free, open-source software and a computing platform well suited to statistical analysis. R does not operate with pull-down menus. Rather, you must submit lines of code that execute functions and operations built into R. It is best to save your code in a simple text file, which R users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with R, we suggest you start with the introduction manual located at http://cran.r-project.org/doc/manuals/r-release/R-intro.html.

For this example, we must first load the node table and the edge table into R. Using the network files provided, the code looks like this (assuming the data files are already saved in your working directory):

nodes <- read.csv('dataset-florentine-1994-subset1-nodes.csv')
edges <- read.csv('dataset-florentine-1994-subset1-edges.csv')

The node table and edge table are now read in as data frames. To perform any analysis, we need to turn them into a network object. Two R packages are commonly used for network analysis: igraph and statnet. Statnet is useful for the statistical modeling of networks and is introduced in the SAGE Research Methods Dataset on Exponential Random Graph Models. In this example, we use igraph, which is well suited to computations on networks.

We need to load the igraph package in order to use it. If igraph is not installed, loading it will produce an error, so install it first by running:

install.packages('igraph')

Once it is installed, you can load it like this:

library('igraph')

Next, we can turn the node and edge tables into a network object by the following command:

G <- graph_from_data_frame(d = edges, vertices = nodes, directed = FALSE)

Any column after the first one in the node table will be used as an attribute of the nodes, and any column after the second in the edge table will be used as an attribute of the edges. Here, we want to manually specify the name of each node using the "label" column in the node table. This can be done with the following code:

V(G)$name <- as.character(nodes$label)

You can set other attributes for the nodes similarly. The benefit of naming the nodes in this example is that we can refer to them by name directly (instead of by ID) in further analysis in igraph.
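To see how columns become attributes, the following sketch builds a tiny network from two toy tables. All column names and values here (`id`, `label`, `wealth`, `year`) are illustrative assumptions and are not taken from the Florentine data files.

```r
library(igraph)

# Toy tables; every column name and value is an illustrative assumption.
toy_nodes <- data.frame(id = 1:3,
                        label = c("A", "B", "C"),
                        wealth = c(10, 20, 30))
toy_edges <- data.frame(from = c(1, 2), to = c(2, 3), year = c(1420, 1425))

toy_G <- graph_from_data_frame(d = toy_edges, vertices = toy_nodes,
                               directed = FALSE)

vertex_attr_names(toy_G)  # "name" (from the first column), "label", "wealth"
edge_attr_names(toy_G)    # "year" (the column after the first two)

# Replace the ID-based names with the labels, as done for G above
V(toy_G)$name <- as.character(toy_nodes$label)
degree(toy_G)["B"]        # nodes can now be looked up by name
```

The same pattern applies to any additional column in your own node or edge tables.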

The diameter and the clustering coefficient of this network can be calculated with:

diameter(G)
transitivity(G, type = 'global')
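To make the two measures concrete, here is a small sketch on a toy graph that can be checked by hand. The toy graph is an assumption chosen for illustration: a triangle on nodes 1, 2 and 3 with a fourth node attached to node 3.

```r
library(igraph)

# Toy graph (an assumption): triangle 1-2-3 plus node 4 attached to node 3
toy <- make_graph(c(1, 2,  2, 3,  1, 3,  3, 4), directed = FALSE)

# Diameter: the longest shortest path, here from node 1 (or 2) to node 4
diameter(toy)

# Global clustering: 3 * (number of triangles) / (number of connected triples).
# There is 1 triangle and 5 connected triples (1 centered on node 1,
# 1 on node 2, 3 on node 3), so the ratio is 3 * 1 / 5.
transitivity(toy, type = "global")
```

Working through a case this small is a quick way to confirm which definition of clustering a function uses before applying it to real data.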

We also calculate the degree sequence of this network and save it in the vector d0:

d0 <- degree(G)

Now, we simulate random networks from the configuration model with the same degree sequence as the Florentine family network. We simulate 1,000 such random networks and, for each of them, simply discard any self-loops and multi-edges. We store the degree sequence of each random network in a row of the matrix d, and we calculate the diameter and the clustering coefficient of each random network and save the results in the two vectors r and c, respectively.

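d <- array(dim = c(1000, vcount(G)))  # one row of node degrees per simulated network
c <- numeric(1000)                    # clustering coefficients (c() is still callable as a function)
r <- numeric(1000)                    # diameters
for (i in 1:1000) {
  random_network <- sample_degseq(d0)          # draw a network with degree sequence d0
  random_network <- simplify(random_network)   # remove self-loops and multi-edges
  d[i, ] <- degree(random_network)
  r[i] <- diameter(random_network)
  c[i] <- transitivity(random_network, type = 'global')
}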
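To see what the simplification step does to a single draw, the sketch below samples one network and counts what is removed. It assumes the observed degree sequence as listed in Table 1 (the order of nodes in your own G may differ), relies on the default sampling method of `sample_degseq`, and uses an arbitrary seed.

```r
library(igraph)

set.seed(42)  # arbitrary seed (an assumption), only for reproducibility

# Observed degree sequence of the Florentine network, as listed in Table 1
d0 <- c(1, 3, 2, 3, 3, 1, 4, 4, 1, 6, 1, 3, 0, 3, 2, 3)

raw   <- sample_degseq(d0)   # may contain self-loops and multi-edges
clean <- simplify(raw)       # drops self-loops and collapses multi-edges

sum(which_loop(raw))         # number of self-loops in the raw draw
sum(which_multiple(raw))     # number of redundant parallel edges in the raw draw
ecount(raw) - ecount(clean)  # edges lost to simplification
all(degree(clean) <= d0)     # degrees can shrink but never grow
```

Because simplification only deletes edges, a simulated network can have fewer than 20 edges, which is worth keeping in mind when interpreting the simulated degrees.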

To check how the configuration model preserves the degree sequence of the observed network, we calculate the mean degree of each node across all the simulated networks and the associated standard deviation.

colMeans(d)
apply(d, 2, sd)

Finally, we calculate the mean and the standard deviation of the diameters of the random networks,

mean(r)
sd(r)

and the mean and the standard deviation of their clustering coefficients.

mean(c)
sd(c)

2.2 Exploring the R Output

For each command above, R returns its result immediately. We summarize the results below.

The observed degree sequence and the mean degree of each node from the simulated networks are shown in Table 1. The numbers in parentheses are the associated standard deviations. The mean degree of each node from the simulations is always smaller than or equal to the observed degree because we only delete, and never add, edges when removing loops and multi-edges in the random networks. Nevertheless, the degree sequence from the configuration model is very close to that of the Florentine family network. The discrepancy will normally be negligible for larger networks.

Table 1: Degree of Each Node in the Florentine Family Network and the Mean Degree of Each Node Over 1,000 Simulations From the Configuration Model.

Family | Observed | Simulated |
---|---|---|
Acciaiuoli | 1 | 1.000 (0) |
Albizzi | 3 | 2.651 (0.622) |
Barbadori | 2 | 1.882 (0.400) |
Bischeri | 3 | 2.665 (0.607) |
Castellani | 3 | 2.654 (0.639) |
Ginori | 1 | 1.000 (0) |
Guadagni | 4 | 3.375 (0.809) |
Strozzi | 4 | 3.367 (0.801) |
Lamberteschi | 1 | 1.000 (0) |
Medici | 6 | 4.594 (1.077) |
Pazzi | 1 | 1.000 (0) |
Peruzzi | 3 | 2.650 (0.640) |
Pucci | 0 | 0.000 (0) |
Ridolfi | 3 | 2.657 (0.621) |
Salviati | 2 | 1.895 (0.374) |
Tornabuoni | 3 | 2.662 (0.614) |

Note: The numbers in parentheses are the associated standard deviations.

The mean diameter of the random networks is 5.933 with a standard deviation of 1.06, while the mean clustering coefficient is 0.121 with a standard deviation of 0.087. The diameter of the Florentine family network (5) is slightly smaller than (by 15.7%), but comparable to, the mean diameter of the random networks, whereas its clustering coefficient (0.19) is much higher (by 58%) than the mean clustering coefficient of the random networks. This discrepancy suggests that the Florentine family marriage network is not random relative to the configuration model. In fact, many real social networks share these properties, a diameter comparable to that of a random network and a high clustering coefficient, and this pattern is called the "small world" phenomenon, first formalized by Watts and Strogatz.
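The percentage comparisons above can be reproduced with a one-line relative difference. The sketch below uses a hypothetical helper, `rel_diff`, and the rounded values reported in the text, so its results may differ slightly from the percentages quoted, which were presumably computed from unrounded output.

```r
# Relative difference of an observed value from a simulated mean.
# `rel_diff` is a hypothetical helper; inputs are the rounded values in the text.
rel_diff <- function(observed, simulated_mean) {
  (observed - simulated_mean) / simulated_mean
}

rel_diameter   <- rel_diff(5, 5.933)     # negative: observed diameter is smaller
rel_clustering <- rel_diff(0.19, 0.121)  # positive: observed clustering is larger

# Express both as percentages
round(100 * c(diameter = rel_diameter, clustering = rel_clustering), 1)
```

With your own simulation output, you could pass `diameter(G)` and `mean(r)`, or `transitivity(G, type = 'global')` and `mean(c)`, to the same helper.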

3 Your Turn

Download the sample data and see whether you can replicate these results. Then repeat the process, simulating another random network from the configuration model, and compare it with the Florentine family marriage network.