In this guide, you will learn how to compute PageRank centrality in Python, using a practical example to illustrate the process. Readers are provided with links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. This example assumes that the data file is stored in the working directory used by Python.
PageRank is a centrality measure for nodes in a network. In other words, it ranks nodes based on their positions in the network. The method assumes a recursive definition of centrality or importance: Nodes pointed to by many important nodes are themselves important. As PageRank was initially used to rank websites based on the hyperlinks between them, it was defined for directed networks; however, it generalizes to undirected and even weighted networks naturally through a random-walk formulation.
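The recursive definition can be made concrete with the random-walk view: a walker follows a random edge with probability equal to the damping factor and otherwise jumps to a node chosen uniformly at random. The following is a minimal pure-Python sketch of that idea, not the implementation igraph uses; the function name `pagerank_power_iteration`, the toy network, and the damping value of 0.85 are illustrative assumptions.

```python
def pagerank_power_iteration(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate PageRank by repeatedly redistributing scores along edges.

    adjacency maps each node to the list of nodes it points to; for an
    undirected network, list each edge in both directions.
    """
    nodes = list(adjacency)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Every node receives the teleportation share (1 - damping) / n.
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            neighbors = adjacency[node]
            if neighbors:
                # A node passes its score in equal parts to its neighbors.
                share = damping * scores[node] / len(neighbors)
                for neighbor in neighbors:
                    new_scores[neighbor] += share
            else:
                # A node without outgoing edges spreads its score uniformly.
                for other in nodes:
                    new_scores[other] += damping * scores[node] / n
        change = sum(abs(new_scores[v] - scores[v]) for v in nodes)
        scores = new_scores
        if change < tol:
            break
    return scores


# Hypothetical undirected toy network: A is tied to B, C and D; B is tied to C.
toy = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B"],
    "D": ["A"],
}
toy_scores = pagerank_power_iteration(toy)
print(toy_scores)
```

In this toy network, node A has the most ties and ends up with the highest score, while D, which is tied only to A, ends up with the lowest.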
This example introduces the PageRank centrality measure with a network of Renaissance Florentine families around 1430. Specifically, we examine the PageRank centrality of the Florentine families in their marriage network. The families are nodes, and marriage ties between the families are edges in the network.
This example uses a subset of data from the Florentine Families dataset collected by Padgett (1994) and made publicly available by UCINET (https://sites.google.com/site/ucinetsoftware/datasets/padgettflorentinefamilies). The network is undirected since marriage ties are mutual. It includes 16 nodes and 20 edges.
Python is an open-source programming language. Python does not operate with pull-down menus. Rather, you must submit lines of code that execute functions and operations built into Python. It is best to save your code in a simple text file, which Python users generally refer to as a script file. We provide a script file with this example that executes all of the operations described here. If you are not familiar with Python, we suggest you start with the introduction manual located at https://wiki.python.org/moin/BeginnersGuide. While most computer systems come with a standard Python installation, we recommend installing the distribution made by Anaconda (https://www.anaconda.com/download/), as it contains many commonly used packages. This software guide uses this distribution and will prompt you to install any package used here that is not included in the Anaconda distribution.
For this example, we need the package “igraph” for network analysis. For the installation of this package, please see its official website (https://igraph.org/python/). The package can be loaded as:
import igraph as ig
We also need the package “pandas” for data processing. The installation instructions can be found at its website (https://pandas.pydata.org/pandas-docs/stable/install.html). If you are using the Anaconda distribution of Python, then the package should be installed already. With the package installed, we can load it as:
import pandas as pd
To begin the analysis, we must first load the data into Python. This can be done with the following code (assuming the data file is already saved in your working directory):
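A minimal sketch of this loading step, assuming the node and edge tables are comma-separated files. The "ID" and "label" columns follow the node table described in this guide; the edge column names "from" and "to", the file name in the comment, and the placeholder rows are assumptions made for illustration. Inline text stands in for the files so that the sketch runs on its own.

```python
import io

import pandas as pd

# Inline stand-ins for the two data files. The rows are placeholders, and the
# edge column names "from" and "to" are assumed, not taken from the dataset.
nodes_csv = io.StringIO("ID,label\n1,Family A\n2,Family B\n3,Family C\n")
edges_csv = io.StringIO("from,to\n1,3\n2,3\n")

# With real files, pass a path instead, e.g. pd.read_csv("nodes.csv")
# (hypothetical file name).
nodes = pd.read_csv(nodes_csv)
edges = pd.read_csv(edges_csv)

print(nodes.head())
print(edges.head())
```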
Now, the node table and edge table are read in as dataframes. To perform any analysis, we need to turn them into a network object in igraph:
The first line of the code above constructs an empty network. The second line adds nodes to the network. The nodes come from the “ID” column of the node table. Because Python indices start from 0 while the IDs in the node table start from 1, we subtract 1 from each node ID. This is necessary because when we add edges (the third line of the code), the nodes are referred to by their numerical IDs, and igraph treats these IDs as the indices of the nodes. For the same reason, we subtract 1 from the IDs in the edge table so that they also start from 0. Note that if the node IDs are strings and the edge table refers to nodes by those strings, we do not need to worry about zero-based indexing, since igraph does not treat strings as node indices.
We can add attributes to the nodes as follows:
G.vs['label'] = nodes['label']
Here, we specify the label of each node using the “label” column in the node table. You can set other attributes for the nodes and edges similarly.
The PageRank centrality of each node in this network can be calculated with the “pagerank” method of the network object. Here, we calculate the PageRank scores and print them as follows:
For each command, Python will return its output immediately. Here, we focus on the PageRank scores.
The PageRank scores for each node are shown in Table 1. We can see that the Medici family is the most important one with a PageRank centrality of 0.144, followed by Guadagni and Strozzi.
Table 1: PageRank Centrality of Each Node.
Download this sample dataset to see whether you can replicate these results. Then repeat the process using a different teleportation probability and check how the ranking changes.