How-to Guide for Python

Introduction

In this guide, you will learn how to estimate Getis–Ord Gi statistics using Python. This will require some familiarity with the Representing Spatial Relationships dataset, since the Getis–Ord Gi statistic is a statistic about spatial relationships. We will use the Getis–Ord Gi statistic to identify clusters of similar observations. These clusters, sometimes called “hotspots,” areas where observations tend to be larger on average, or “coldspots,” where observations tend to be smaller on average. Together, this will let us identify local and regional clusters in price. In addition to estimating and understanding Gi, you will learn a few intermediate-level techniques for mapping and plotting in Python.

Contents

- The Concept of the Getis–Ord Statistic
- An Example in Python: Airbnb Prices in Berlin
- 2.1 Reading in Data
- 2.2 Choosing the Scale of Analysis
- 2.3 Computing Getis–Ord Gi Statistics
- 2.4 Visualizing Clusters at Varying Scales

- Your Turn

1 The Concept of the Getis–Ord Statistic

Spatial autocorrelation is a common property of spatial data that describes how near observations are related to one another. Local spatial autocorrelation aims to measure how specific localities, the small areas around each site, vary with respect to the map as a whole. Thus, local autocorrelation measures provide an indication of where geographical clusters are parts of the map that are self-similar and above or below the average of the rest of the map.

While the local Moran’s Ii statistic is able to detect clusters (as well as outliers), the Getis–Ord Gi statistic is explicitly designed to identify clusters. It works by examining whether the sum of observations in the area around site i is larger (or smaller) than average. When the sum of observations in the area around i is larger than average, the areas is a “hotspot,” a cluster of positive higher-than-typical values. Likewise, when the sum is smaller than average, the area is a “coldspot.” The statistic is given formally at site i, using data on other sites xj and a spatial weighting function wij (d). The spatial weighting function is usually a binary distance band weight, meaning that wij = 1 when the distance between sites i and j (dij) is smaller than a threshold d. Otherwise, when sites i and j are further than d apart, wij = 0. With this understanding, the Getis–Ord statistic is defined as:

$${G}_{i}(d)=\frac{{\displaystyle {\sum}_{j}{w}_{ij}(d){x}_{j}}}{{\displaystyle {\sum}_{j}{x}_{j}}}$$

Because the Gi statistic has no specific, well-defined ranges (and, indeed, its expected value is data-dependent), the Gi statistic is usually normalized after computation, and this normalized version of Zi is analyzed directly. When Zi is positive, it indicates a hotspot, and when Zi is negative, it indicates a coldspot.

We will show how to estimate these below.

2 An Example in Python: Airbnb Prices in Berlin

This example uses data for nightly Airbnb prices scraped in June of 2017. In addition to these listings, we will use the official boundaries of districts in Berlin as well, to illustrate the conceptual structure of the statistics.

Working with spatial data in Python requires that we first import a few additional software packages. These are: - geopandas, which provides spatial dataframes; - matplotlib.pyplot, which provides a basic plotting interface in Python; - numpy, which provides many efficient matrix mathematics and numerical routines; - seaborn, which provides more advanced plotting functionality in Python; - pysal which provides the spatial analytical functions used in this notebook; - contextily, which provides basemaps for the images shown. This is optional but makes the maps easier to understand; - mpl_toolkits.axes_grid1.make_axes_locatable, which ensures that color bars are correctly sized in large maps.

2.1 Reading in Data

First, we will read in our data. The Berlin Airbnb data are stored in a spatial data format called GeoJSON, which is a spatial extension of JSON, the JavaScript Object Notation Format. To read most kinds of spatial data in Python (including GeoJSON files), the geopandas package provides a single function. We’ll also make sure to convert the data into the right coordinate reference system, using the to_crs() function.

Then, just for completeness, we will also read in the data pertaining to the district-level aggregate information and do the same thing.

In addition, we’ll prepare an extract of the data, focusing on districts in central Berlin. In the listing data, this is all neighborhoods in the Mitte Borough.

Below, we download the basemap for all of the maps we will make for Berlin proper. This provides a single image that will be able to sit behind the data, improving the quality of our maps:

Now, we’ll just get a sense of the data with a simple map. To visualize the data, we can make a map using the following steps:

Now, we will also extract the Airbnbs that fall within a neighborhood in Central Berlin. We can do this using the query methods of the pandas dataframe. Pandas’s query method allows for us to look into variables outside of the dataframe if they are prefixed with @. So, for us to look into the set of neighborhoods in mitte.neighborhood, which are the neighborhoods in our Central Berlin boundary data:

We can use the following query string directly:

Now, we’ll grab a higher resolution basemap image for mapping within Central Berlin. We’ll do this in the same manner as before but increase the zoom level:

2.2 Choosing the Scale of Analysis

It is in the choice of distance banding that the Getis–Ord Gi statistic tends to differ somewhat from the concepts involved in the Moran’s I statistics. Gi is explicitly developed in relation to the concept of distance band neighborhoods; that observations within a given distance d should all be considered neighboring. This idea derives largely from the geostatistical concept of a variogram, which is discussed in a separate dataset. For now, it suffices to say that the distance-band weight considers all observations within a specific threshold as near one another, and any observation outside of this threshold is not near. Here, we’ll use a threshold of 500 feet to start off. Note that this is a radius around each point, meaning that the full “buffer” around each listing is 1,000 feet wide.

Unfortunately, our Mitte Airbnb listings have a peculiar spatial structure; there are many listings that are somewhat remote. Indeed, from the warning text, it should be clear that there are 8 disconnected observations. This means that there are eight Airbnbs with no other listing within 500 feet. This poses a unique problem for the Getis–Ord statistic, and its values are strongly unreliable if there are no observations within that distance. Thus, to be sure that every observation is at least connected to some other observation, we will attach_islands back to their nearest observation, even if that observation is more than 500 feet away.

The remaining warnings about components of the weights matrix being disconnected are alright for the Gi statistic. This means that, when viewed in total, the graph connecting observations does not allow all sites to reach any other site. Indeed, there are five cliques, or separate connected areas, where observations are within 500 feet of some other observation in the clique, but not within that distance for another part of the clique. This is fine, so long as every observation has neighbors, which we have ensured through our attach_islands call.

2.3 Computing Getis–Ord Gi Statistics

Like in the Moran’s I dataset, we will analyze the log nightly price of Airbnbs rather than the nightly price directly. Again, this is to account for the fact that humans tend to reason about prices in a relative fashion rather than an absolute fashion. Using log prices, we account for the fact that larger prices require comparatively larger changes in raw dollars/euros than smaller prices; what is relevant from the consumer standpoint is the percentage discount or premium, which the log price models linearly.

With this, we can calculate the local Gi statistic using the esda package in Python, a component of the Python spatial analysis library. Because the Gi statistic is a local statistic, we will have to test N hypotheses about whether or not site i is a cluster. This means, in the same fashion as that discussed in the Local Moran’s Ii dataset, the number of permutations required to obtain statistical significance will depend on N. Here, we will use the Bonferroni correction, which is conservative about which observations are “clusters.” Under this correction (and a nominal 5% significance level), we require around 100,000 simulations to determine significance.

To estimate the statistics, we use the G_Local class in esda.getisord:

The getisord object we have just estimated has quite a few properties. For example, the raw Gi statistics are stored:

As discussed, the Gi statistics do not necessarily have a zero expected value nor are they necessarily standardized to have unit variance.

To move the statistics to a more meaningful center and scale, we typically re-scale the Gi statistics after they are estimated. The mean and variance of Gi will depend on the problem (and, in the case of the variance, the site) and are estimated each time the G_Local class is fit:

Fortunately, this standardization is also done at estimation:

Given, this, we can visualize these standardized Getis–Ord scores:

Thus we see that, in general, the extreme Getis–Ord statistics tend to be the significant ones.

Further, we know that the negative Gi statistics are “coldspots” and the positive statistics are “hotspots.”

Mapping these values on top of the locations of each of the listings, we can see the spatial pattern in clusters (assuming the d = 500 foot cutoff), in the map below.

2.4 Visualizing Clusters at Varying Scales

Now that we have seen the Getis–Ord statistic at a 500-foot scale, it is useful to examine how clusters change or merge when considering clustering at smaller or larger spatial scales. This can be done in a similar fashion to what we have already done. Further, since we already know that there are disconnected observations when d = 500, we know that there can only be more disconnected observations when d < 500. So, we’ll be sure to connect islands when this is the case.

Since we are interested in estimating the Gi statistics on the same data, but different threshold d, we use a similar function call but use the new distance band weight:

Repeating this procedure for radii of 1,000 and 2,000 feet:

Now, armed with four different scales of Getis–Ord statistics, we can make four side-by-side maps:

Note that in the hyperlocal version, when d = 100, only a single site in the Regierungsviertel is considered a “hotspot” for rent. Thus, inspecting the maps from left to right, we see that that area becomes an increasingly more prominent hotspot as we move from very small-scale clusters to nearly region-level clusters in the price structure in Airbnbs.

3 Your Turn

Now that you have seen the code to run and make maps of the Getis–Ord Gi statistic, select a different borough of Berlin and run a similar analysis. In particular, changing the Mitte borough to the Friedrichshain-Kreuzberg or Pankow borough is particularly enlightening. So, for a new borough of Berlin,

- Make a histogram showing the normalized Gi statistics.
- Make a cluster map with d = 500 feet and describe the patterns you see.
- Make cluster maps with d = 100, 1,000, and 2,000 feet. Describe how those patterns vary as well.

For more information and topics covered in this guide, consult the prerequisites.