How-to Guide for Python

In this guide, you will learn how to estimate and interpret join count statistics using Python. This will require some familiarity with the Representing Spatial Relationships dataset, since join count analysis involves spatial relationships. We will use join count statistics to identify whether there is clustering or dispersion in a categorical feature. For instance, we will determine whether “short-stay” residential districts tend to cluster or whether Airbnbs with cats tend to cluster relative to Airbnbs with dogs. Together, this will let us identify local and regional clusters in price. In addition to estimating and understanding join count statistics, you will learn a few intermediate-level techniques for mapping and plotting in Python.

  • 1 The Concept of Join Count Analysis
  • 2 An Example in Python: Airbnb Prices in Berlin
    • 2.1 Reading in Data
    • 2.2 Choosing Spatial Weights Matrices
    • 2.3 Understanding the Join Count Statistic
    • 2.4 Analyzing Data Using Join Counts
  • 3 Your Turn

1 The Concept of Join Count Analysis

Spatial autocorrelation is a common property of spatial data that describes how nearby observations are related to one another. Often, it is easiest to think of spatial autocorrelation in terms of correlation itself: spatial autocorrelation expresses how a feature at each site in a map is correlated with itself at nearby sites. Since measures of correlation can be related directly to measures of spatial autocorrelation, understanding spatial autocorrelation can be easier when examining continuous data.

However, features come in many different kinds. Categorical data reflect the group, or category, of a site. For example, many Airbnbs are listed by owners with cats, others are listed by owners with dogs, and still others are listed by owners with no pets. This provides a three-category feature: a listing can have cats, have dogs, or have no pets. Categorical data are incredibly common in applied circumstances but pose a unique challenge for the analysis of spatial autocorrelation: the association between categories cannot be measured in the same fashion as association in continuous data. Thus, a distinct measure of spatial autocorrelation is needed for categorical data.

Join count analysis is a generic collection of methods designed to analyze and describe the spatial structure of categorical data. In its simplest form, join count analysis is able to express the spatial autocorrelation between sites’ categories. The “join” in join count refers to the pairs of observations that are “joined” together because they are geographically close. For a two-category (dichotomous) feature, join count analysis examines the number of times sites that are near one another fall into different categories. Altogether, the number of cross-category joins gives a good indication of whether the categories cluster away from one another or whether they tend to be thoroughly interspersed. This cross-category join concept can be used for categorical data with more than two categories, too. But, because some categories may cluster while others disperse, it is more common to examine pairs of associations in multi-category data.
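To make the idea of counting cross-category joins concrete, here is a minimal sketch in plain Python. The eight sites, their labels, and the helper count_cross_joins are hypothetical and are not part of the Berlin data; each site is assumed to be joined only to the next site along a line.

```python
def count_cross_joins(labels, joins):
    """Count joins whose two endpoints fall into different categories."""
    return sum(1 for i, j in joins if labels[i] != labels[j])


# Eight hypothetical sites along a street; each site is joined to the next one.
joins = [(i, i + 1) for i in range(7)]

clustered = ["cat"] * 4 + ["dog"] * 4   # categories kept apart
interspersed = ["cat", "dog"] * 4       # categories alternate

clustered_cross = count_cross_joins(clustered, joins)
interspersed_cross = count_cross_joins(interspersed, joins)
print(clustered_cross, interspersed_cross)
```

The clustered arrangement produces a single cross-category join, while the alternating arrangement produces seven, the maximum possible for this chain of sites.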

Thus, below, we will discuss how join count analysis works and examine the structure of short-stay neighborhoods in Berlin.

2 An Example in Python: Airbnb Prices in Berlin

This example uses data for nightly Airbnb prices scraped in June of 2017. In addition to these listings, we will use the official boundaries of districts in Berlin to illustrate the conceptual structure of the statistics.

Working with spatial data in Python requires that we first import a few additional software packages. These are:

  • geopandas, which provides spatial data frames;
  • matplotlib.pyplot, which provides a basic plotting interface in Python;
  • numpy, which provides many efficient matrix mathematics and numerical routines;
  • seaborn, which provides more advanced plotting functionality in Python;
  • libpysal, which provides the spatial analytical functions used in this notebook;
  • contextily, which provides base maps for the images shown. This is optional but makes the maps easier to understand.

2.1 Reading in Data

First, we will read in our data. The Berlin Airbnb data are stored in a spatial data format called GeoJSON, which is a spatial extension of JSON, the JavaScript Object Notation format. To read most kinds of spatial data in Python (including GeoJSON files), the geopandas package provides a single function, read_file(). We’ll also make sure to convert the data into the right coordinate reference system, using the to_crs() method.

Then, we will also read in the data pertaining to the district-level aggregate information and do the same thing.
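To see what a GeoJSON file contains before handing it to geopandas, the following sketch parses a small FeatureCollection using only Python's standard json module. The two listings, their coordinates, and their properties are made up for illustration and do not come from the Berlin data.

```python
import json

# A hypothetical two-listing FeatureCollection; all values are invented.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [13.40, 52.52]},
     "properties": {"listing": "A", "price": 50}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [13.45, 52.50]},
     "properties": {"listing": "B", "price": 75}}
  ]
}
"""

collection = json.loads(geojson_text)

# Each feature pairs a geometry (here, a longitude/latitude point)
# with a dictionary of ordinary attributes.
points = [tuple(f["geometry"]["coordinates"]) for f in collection["features"]]
prices = [f["properties"]["price"] for f in collection["features"]]
print(points, prices)
```

A spatial data frame holds the same information in tabular form: one row per feature, one column per property, and a geometry column.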

Below, we download the basemap for all of the maps we will make for Berlin proper. This provides a single image that will be able to sit behind the data, improving the quality of our maps:

Now, we’ll just get a sense of the data with a simple map. To visualize the data, we can make a map using the following steps:

2.2 Choosing Spatial Weights Matrices

For join count analysis, linkages between geographically near observations are the “joins” counted for the analysis. Thus, the way these joins are constructed will affect how the statistic works. This is a frequent concern in the analysis of spatial autocorrelation and reflects the fact that different representations of spatial data may reflect entirely different conceptualizations of geography itself. Thus, it is entirely consistent for there to be spatial autocorrelation in data using one geographical representation but no autocorrelation (or even reversed autocorrelation) in another representation. Here, to keep the analysis simple, we will use a common understanding of a join as a link between touching observations: if two districts share a boundary, we consider them joined. This is a “Queen contiguity” representation of geographical proximity, which is discussed at length in the Representing Spatial Relationships dataset.
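As a rough illustration of Queen contiguity, the sketch below builds the joins for a regular 3 × 3 lattice of cells, where two cells are joined if they share an edge or a corner. This is a simplification: the guide builds joins from the actual district polygons, whereas the lattice and the helper queen_neighbors here are hypothetical stand-ins.

```python
def queen_neighbors(n_rows, n_cols):
    """Map each cell id to the ids of cells sharing an edge or a corner with it."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            cell = r * n_cols + c
            neighbors[cell] = [
                rr * n_cols + cc
                for rr in range(max(r - 1, 0), min(r + 2, n_rows))
                for cc in range(max(c - 1, 0), min(c + 2, n_cols))
                if (rr, cc) != (r, c)
            ]
    return neighbors


neighbors = queen_neighbors(3, 3)

# List each undirected join once, as (smaller id, larger id).
joins = sorted(
    {(min(i, j), max(i, j)) for i, nbrs in neighbors.items() for j in nbrs}
)
print(neighbors[0], neighbors[4], len(joins))
```

On this lattice, a corner cell has three neighbors, the center cell has eight, and there are 20 joins in total. A different definition of proximity would produce a different set of joins and therefore a different statistic.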

2.3 Understanding the Join Count Statistic

The most basic join count analysis examines how a feature with two categories is clustered in a given map. Usually, in the case of two-category join count analysis, we simply refer to categories by two colors: one category is called the “black” category and the other is the “white” category. When two sites, i and j, are near one another and are in different categories, the join (i, j) is considered a “cross-class” or black-white join. When sites i and j are in the same category, the join is considered a “same-class” join. Thus, in the whole map, joins may be either cross-class or same-class; the total number of same-class joins is the sum of the black-black and white-white joins. However, the cross-class joins are a single statistic that provides an indication of how thoroughly mixed the map is.

Out of J joins, let there be nc cross-class joins and ns same-class joins, such that J = nc + ns. Conceptually, when nc < ns, observations tend to be surrounded by same-color observations. When nc > ns, observations tend to be surrounded by differently colored observations. Since it is marginally simpler to compute nc than ns, we focus on nc. Further, for multi-class generalizations of this binary join count statistic, the covariation of cross-class joins becomes important, and the cross-class join counts remain the central unit of analysis. Regardless, the numbers for nc and ns cannot be analyzed directly: they depend directly on the structure of connectivity between the sites.
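The decomposition J = nc + ns can be checked by hand on a tiny map. The sketch below assumes a hypothetical 2 × 3 grid of sites joined along shared edges, with 1 standing for “black” and 0 for “white”; the labels and the helper join_counts are invented for illustration.

```python
def join_counts(labels, joins):
    """Tally black-black (bb), white-white (ww) and black-white (bw) joins."""
    counts = {"bb": 0, "ww": 0, "bw": 0}
    for i, j in joins:
        if labels[i] != labels[j]:
            counts["bw"] += 1
        elif labels[i] == 1:
            counts["bb"] += 1
        else:
            counts["ww"] += 1
    return counts


# Sites are numbered row by row:   0 1 2
#                                  3 4 5
joins = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
labels = [1, 1, 0, 1, 0, 0]  # 1 = black, 0 = white

counts = join_counts(labels, joins)
n_cross = counts["bw"]                # nc in the text
n_same = counts["bb"] + counts["ww"]  # ns in the text
J = len(joins)
print(counts, n_cross, n_same, J)
```

Here there are two black-black joins, two white-white joins, and three black-white joins, so nc = 3, ns = 4, and J = 7.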

Thus, we tend to use map permutation methods, akin to those used in the assessment of Moran’s I, to identify when nc is much larger or much smaller than would be expected at random. In the case of binary join count statistics, it becomes simple to visualize these map randomizations, given the fact that the two categories provide stark visual contrasts:
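The permutation logic itself can be sketched in a few lines of plain Python. The example below is an assumption-laden illustration, not the routine used by any particular package: it shuffles hypothetical labels over a chain of ten sites and uses one common convention for the pseudo-p-value, (number of simulations with at least as many cross-class joins as observed + 1) / (number of permutations + 1). The function names, the seed, and the data are all invented.

```python
import random


def cross_joins(labels, joins):
    """Count joins whose endpoints have different labels."""
    return sum(1 for i, j in joins if labels[i] != labels[j])


def permutation_test(labels, joins, permutations=999, seed=0):
    """Compare the observed cross-class count with counts from shuffled maps."""
    rng = random.Random(seed)
    observed = cross_joins(labels, joins)
    shuffled = list(labels)
    simulated = []
    for _ in range(permutations):
        rng.shuffle(shuffled)  # reassign the same labels to random sites
        simulated.append(cross_joins(shuffled, joins))
    as_large = sum(1 for s in simulated if s >= observed)
    p_sim = (as_large + 1) / (permutations + 1)
    return observed, simulated, p_sim


# Ten hypothetical sites in a chain, with perfectly alternating colors.
joins = [(i, i + 1) for i in range(9)]
labels = [1, 0] * 5

observed, simulated, p_sim = permutation_test(labels, joins)
print(observed, p_sim)
```

Because the alternating map has the largest possible number of cross-class joins for this chain, very few shuffled maps match it, and the pseudo-p-value is small. Counting the simulations with at most as many cross-class joins instead would test for clustering rather than dispersion.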

2.4 Analyzing Data Using Join Counts

Now, we will examine the join count statistics using the esda package in Python. esda stands for “Exploratory Spatial Data Analysis,” a family of methods that typically do not involve explicit models of the outcome under study. Here, as discussed above, join count analysis focuses on the number of nearby observations that fall into different categories. The esda package provides a method for estimating join count statistics quickly and efficiently, using the Join_Counts class. First, though, we must define the joins that link observations. Here, we’ll use the Queen contiguity graph between residential districts in Berlin, as discussed in the Representing Spatial Relationships dataset.

Then, estimating join counts uses the Join_Counts class in esda. Here, we’ll save the results to an object named joins_counted.

The joins_counted object has a few relevant properties, like the total number of joins J:

The total number of black-white or black-black joins:

And the pseudo-p-value of observing random maps with as many bw or bb joins as we actually observed:

A simpler, more direct way to understand this pseudo-p-value is to examine the set of simulations directly. For example, the number of black-white joins found in the data (167) is plotted in orange in front of the whole distribution of black-white joins seen in 1,000 random maps:

From this histogram, we see that a majority of the simulations had as many black-white joins as the observed map, or more. Thus, the simulated maps tended to be more thoroughly mixed than the observed map, but about 30% of the simulations are less thoroughly mixed. This suggests the observed map is well within the expected range of black-white joins. In terms of the actual meaning of the data, this suggests that short-stay districts are not significantly more clustered or dispersed than would be expected from a random shuffle of the map itself.

3 Your Turn

Now that you’ve done a join count analysis for residential districts in Berlin, do the same for listings where the rental has either a cat or a dog. This will identify whether listings with cats tend to cluster or disperse, relative to dog listings. To extract which listings have cats or dogs, we first:

And then, for the black/white category, we then must construct a variable for a listing in the

With this new dataframe containing only Airbnb listings with either cats or dogs,

  • Conduct a join count analysis with a 5-nearest neighbor weight and interpret the join count statistics like we did above.
  • Recall that the K-nearest neighbor weights are asymmetric. This means that the join count statistic is considering a “directed” graph, instead of undirected joins between observations. Try using .symmetrize() on the weights object and running another 5-nearest symmetric neighbor analysis. Interpret it like we did above.
  • Finally, do a 10-nearest symmetric neighbor analysis and examine the results. Does the number of dog-cat joins become more surprising as k increases?
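To see why K-nearest neighbor joins are asymmetric and what symmetrizing does, consider the following plain-Python sketch. The four points and both helper functions are hypothetical, and the symmetrize helper reflects one plausible reading of symmetrization (adding the reverse of every directed link); it is not the weights object's own implementation.

```python
import math


def knn_neighbors(points, k):
    """For each point, list the indices of its k nearest other points."""
    neighbors = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors[i] = others[:k]
    return neighbors


def symmetrize(neighbors):
    """Add the reverse of every directed link so each join runs both ways."""
    symmetric = {i: set(js) for i, js in neighbors.items()}
    for i, js in neighbors.items():
        for j in js:
            symmetric[j].add(i)
    return {i: sorted(js) for i, js in symmetric.items()}


# Four hypothetical listings along a line, with uneven spacing.
points = [(0, 0), (1, 0), (3, 0), (10, 0)]

directed = knn_neighbors(points, k=1)
undirected = symmetrize(directed)
print(directed)
print(undirected)
```

With k = 1, point 2 lists point 1 as its nearest neighbor, but point 1 lists point 0, so the link from 2 to 1 is not reciprocated. After symmetrizing, every link runs in both directions, and some observations end up with more than k neighbors.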