In this guide, you will learn how to compute PageRank centrality in the statistical software R, using a practical example to illustrate the process. Links to the example dataset are provided, and you are encouraged to replicate the example. An additional practice example is suggested at the end of this guide. This example assumes that you have the data file stored in the working directory being used by R.
PageRank is a centrality measure for nodes in a network. In other words, it ranks nodes based on their positions in a network. The method assumes a recursive definition of centrality or importance: Nodes pointed to by many important nodes are themselves important. As PageRank was initially used to rank websites based on the hyperlinks between them, it was defined for directed networks; however, it generalizes to undirected and even weighted networks naturally through a random-walk formulation.
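To make the random-walk formulation concrete, here is a minimal base R sketch of PageRank by power iteration. It is an illustration only, not the igraph implementation: the four-node network, the function name pagerank_power, and the convention that a node without edges jumps to a uniformly chosen node are all assumptions made for this example.

```r
# Toy undirected network (hypothetical, not the Florentine data):
# edges A-B, A-C, A-D, B-C
A <- matrix(0, 4, 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
toy_edges <- rbind(c("A", "B"), c("A", "C"), c("A", "D"), c("B", "C"))
for (i in seq_len(nrow(toy_edges))) {
  A[toy_edges[i, 1], toy_edges[i, 2]] <- 1
  A[toy_edges[i, 2], toy_edges[i, 1]] <- 1  # undirected: mirror each tie
}

# Hypothetical helper: PageRank scores via repeated random-walk steps
pagerank_power <- function(A, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(A)
  deg <- rowSums(A)
  # Row-stochastic transition matrix: each row is divided by the node's degree
  P <- A / ifelse(deg > 0, deg, 1)
  # Assumption: a node with no edges jumps to a uniformly chosen node
  P[deg == 0, ] <- 1 / n
  r <- rep(1 / n, n)
  for (k in seq_len(max_iter)) {
    # Follow an edge with probability `damping`, teleport otherwise
    r_new <- damping * as.vector(r %*% P) + (1 - damping) / n
    if (sum(abs(r_new - r)) < tol) break
    r <- r_new
  }
  setNames(r_new, rownames(A))
}

toy_scores <- pagerank_power(A)
round(toy_scores, 3)
```

The scores sum to 1, and the best-connected node (A in this toy network) receives the largest share.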
This example introduces the PageRank centrality measure with a network of Renaissance Florentine families around 1430. Specifically, we examine the PageRank centrality of the Florentine families in their marriage network. The families are nodes, and marriage ties between the families are edges in the network.
This example uses a subset of data from the Florentine Families dataset collected by Padgett (1994) and made publicly available by UCINET (https://sites.google.com/site/ucinetsoftware/datasets/padgettflorentinefamilies). The network is undirected since marriage ties are mutual. It includes 16 nodes and 20 edges.
R is free, open-source software and a computing platform well suited for statistical analysis. It does not operate with pull-down menus. Rather, you must submit lines of code that execute functions and operations built into R. It is best to save your code in a simple text file that R users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with R, we suggest you start with the introduction manual located at http://cran.r-project.org/doc/manuals/r-release/R-intro.html.
For this example, we must first load the node table and the edge table into R. Using the network files provided, the code looks like this (assuming the data file is already saved in your working directory):
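A minimal sketch of this loading step, assuming the two tables are stored as CSV files. The file names nodes.csv and edges.csv and the columns id, from, and to are hypothetical stand-ins (only the label column is named in this guide), and the tiny tables are written to a temporary folder purely so that the snippet runs on its own. With the real files, you would point read.csv at the file names in your working directory instead.

```r
# Hypothetical stand-ins for the provided files (not the Florentine data)
tmp <- tempdir()
node_file <- file.path(tmp, "nodes.csv")  # hypothetical file name
edge_file <- file.path(tmp, "edges.csv")  # hypothetical file name
write.csv(data.frame(id = 1:3,
                     label = c("Family A", "Family B", "Family C")),
          node_file, row.names = FALSE)
write.csv(data.frame(from = c(1, 2), to = c(3, 3)),
          edge_file, row.names = FALSE)

# Read both tables as data frames, keeping text columns as character
nodes <- read.csv(node_file, stringsAsFactors = FALSE)
edges <- read.csv(edge_file, stringsAsFactors = FALSE)

# Inspect the structure before building the network
str(nodes)
str(edges)
```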
Now, the node table and edge table are read in as data frames. To perform any analysis, we need to turn them into a network object. There are two packages in R commonly used for network analysis: igraph and statnet. Statnet is useful for statistical modeling of networks and is introduced in the SAGE Research Methods Dataset on Exponential Random Graph Models. In this example, we use igraph, which is well suited to computations on networks.
We need to load the igraph package in order to use it. If you don’t have igraph installed, you will get an error. Run the following code to install it first:
install.packages("igraph")
Once it is installed successfully, or if it is already installed, you can load it like this:
library(igraph)
Next, we can turn the node and edge tables into a network object by the following command:
G <- graph_from_data_frame(d = edges, vertices = nodes, directed = FALSE)
Any column after the first one in the node table will be used as an attribute for the nodes, and any column after the second in the edge table will be used as an attribute for the edges. Here, we want to manually specify the name of each node using the “label” column in the node table. This can be done with the following code:
V(G)$name <- as.character(nodes$label)
You can set other attributes for the nodes similarly. The benefit of naming the nodes in this example is that we can refer to them by name directly (instead of by ID) in further analysis in igraph.
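The construction and naming steps can be tried end to end on a small made-up network. In the sketch below, the four-family tables and the column names id, from, and to are hypothetical; only the “label” column and the igraph calls come from this guide.

```r
library(igraph)

# Hypothetical toy tables (not the Florentine data)
nodes <- data.frame(id = 1:4,
                    label = c("Family A", "Family B", "Family C", "Family D"))
edges <- data.frame(from = c(1, 1, 1, 2),
                    to   = c(2, 3, 4, 3))

# Build an undirected network and name the nodes by their labels
G <- graph_from_data_frame(d = edges, vertices = nodes, directed = FALSE)
V(G)$name <- as.character(nodes$label)

# Quick checks: node count, edge count, and directedness
vcount(G)
ecount(G)
is_directed(G)

# Because the nodes are named, they can be referred to by name
neighbors(G, "Family A")$name
```

Checking the node and edge counts right after construction is a cheap way to confirm that the tables were read as intended; for the marriage network in this guide, the counts should be 16 and 20.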
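The PageRank centrality of each node in this network can be calculated by:
page_rank(G, damping = 0.85)
The damping argument is the probability that the random walk follows an edge, so this call uses a teleporting probability of 1 - 0.85 = 0.15. A damping factor of 0.85 is also the default in igraph.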
For each command above, R will return its results immediately. Here, we focus on the PageRank scores.
page_rank(G, damping = 0.85)
will return a list with three components. What we need is the first one, named vector, which holds the PageRank score of each node and is shown in Table 1. We can see that the Medici family is the most important one, with a PageRank centrality of 0.144, followed by Guadagni and Strozzi.
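As a sketch of how the scores can be pulled out of the returned list and ranked, the snippet below uses a hypothetical four-node network built with graph_from_literal rather than the Florentine data. The variable names pr, scores, and ranking are illustrative.

```r
library(igraph)

# Hypothetical toy network (not the Florentine data): A-B, A-C, A-D, B-C
G <- graph_from_literal(A - B, A - C, A - D, B - C)

pr <- page_rank(G, damping = 0.85)
names(pr)  # the components of the returned list

# The first component is a named numeric vector of scores
scores <- pr$vector

# Sort from most to least central and round for display
ranking <- sort(scores, decreasing = TRUE)
round(ranking, 3)
```

Applied to the marriage network, the same three lines (extract, sort, round) produce a ranking in the form of Table 1.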
Table 1: PageRank Centrality of Each Node.
Download this sample data to see whether you can replicate these results. Then repeat the process using a different teleporting probability, and check how the ranking changes.
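One possible way to set up this comparison is sketched below on a hypothetical four-node network; substitute the network object built from the sample data. The damping values 0.5, 0.85, and 0.95 are arbitrary choices for illustration, and the teleporting probability is 1 minus the damping value.

```r
library(igraph)

# Hypothetical toy network (not the Florentine data): A-B, A-C, A-D, B-C
G <- graph_from_literal(A - B, A - C, A - D, B - C)

# Arbitrary damping values; teleporting probability = 1 - damping
dampings <- c(0.5, 0.85, 0.95)

# One column of PageRank scores per damping value, one row per node
score_table <- sapply(dampings, function(d) page_rank(G, damping = d)$vector)
colnames(score_table) <- paste0("damping=", dampings)
round(score_table, 3)

# Rank positions within each column (1 = most central; ties are averaged)
apply(-score_table, 2, rank)
```

A larger teleporting probability pulls all scores toward the uniform value 1/n, so the gaps between nodes shrink even when the ordering stays the same.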