Archive

Finding area of intersection with GeoPandas

A friend recently asked me if there were an easy way to determine which Chicago ward someone lives in given their ZIP Code. The sign-in sheets at an event she helped plan solicited event-goers' ZIP Codes. Later on she decided she wanted to follow up with attendees about an upcoming aldermanic candidate forum, but only with those attendees living in a ward where one of the speaking candidates was running. I answered her question with disappointing but expected news: because Chicago ward's are so gerrymandered, it's not easy to guess which ward someone lives in based on their ZIP Code. A single ZIP Code could have 4 or more wards twisting through it.

The question, however, got me thinking: could I estimate the rough probability that someone lives in a particular ward given their ZIP Code? One way to estimate the probability is by finding the proportion of the area of intersection of the ward and ZIP Code to the total area of the ZIP Code.

I very briefly introduced GeoPandas in the previous post. In this post I'll show how to use GeoPandas to find the area of intersection of two geospatial datasets by applying the overlay function. At the end of the post, I'll reflect on how and why to avoid such an estimate in the first place.

Load the data

The Chicago Data Portal provides geospatial data for Chicago ward and Chicago ZIP Code boundaries. I load the two data sets into separate GeoPandas GeoDataFrames. Notice the relative regularity of the ZIP Code boundaries when compared to the salamander-shaped boundaries of Chicago's wards. Some have compared the shape of the city's 2nd ward to the silhouette of a lobster—can you find it?

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import geopandas as gpd

zips = gpd.read_file("data/Boundaries - ZIP Codes.geojson")
zips.plot(figsize=(10, 10), edgecolor='black')
Out[2]:
<matplotlib.axes._subplots.AxesSubplot at 0xb73a5c0>
In [3]:
wards = gpd.read_file("data/Boundaries - Wards (2015-).geojson")
wards.plot(figsize=(12, 12), edgecolor='black')
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0xec0ffd0>

A Note on ZIP Codes

U.S. ZIP Codes are not polygonal regions, but are linear features that correspond to mail delivery routes. In some cases, a ZIP Code corresponds to a group of addresses at a single location, such as a high-rise apartment building. Any map showing "ZIP Code boundaries" is really showing an approximation of where the ZIP Codes are located. Depending on the method used to draw the boundaries, some ZIP Codes might overlap, and some might not be contiguous, among other possible pathologies. When we're finding the "area" of a ZIP Code, we're really finding the area of its boundary approximation. You can read more about ZIP Codes here.

Calculate the area of each feature

Calling the area attribute on a GeoDataFrame will return a Series containing the area of each geometry in the GeoDataFrame. It is important to keep in mind that the geometries in GeoDataFrames are a collection of coordinates in arbitrary space, so the area attribute will return areas in a unit related to the coordinate system being used. In order to obtain area values that are meaningful to us, we need to make sure to set the proper coordinate reference system (CRS). We can see the current CRS of both datasets by calling the crs attribute.

In [4]:
print wards.crs
print zips.crs
{'init': u'epsg:4326'}
{'init': u'epsg:4326'}

This CRS is the World Geodetic System (WGS84) latitude-longitude projection. Here it is being referred to by its equivalent EPSG code. Let's inspect the geometry columns directly.

In [5]:
print wards.geometry.head()
print zips.geometry.head()
0    (POLYGON ((-87.69623470134458 41.8575549523838...
1    (POLYGON ((-87.66288923669032 41.7988380986824...
2    (POLYGON ((-87.69817510963803 41.8172944075599...
3    (POLYGON ((-87.65524133440029 41.8088331618279...
4    (POLYGON ((-87.66420403810295 42.0212615805274...
Name: geometry, dtype: object
0    (POLYGON ((-87.67762151065281 41.9177578010629...
1    (POLYGON ((-87.72683253163021 41.9226462671259...
2    (POLYGON ((-87.78500237831095 41.9091478547167...
3    (POLYGON ((-87.6670686895295 41.88885188496992...
4    (POLYGON ((-87.70655631674127 41.8955534069940...
Name: geometry, dtype: object

Yep, those coordinates look like latitudes and longitudes.

I want to calculate the area of each boundary in square-miles for intelligibility (some of you may prefer square-km). To do so, I need to convert the coordinates of each GeoDataFrame to a system that uses feet. For geospatial data in and around Chicago, I usually use the EPSG 3435 (Illinois StatePlane East Feet) coordinate system. These coordinates are in U.S. survey feet. Calling the area attribute will return a Series containing the area of each feature in square-feet, which I can then convert to square-miles.

In [6]:
def calculate_sqmi(gdf):
    return gdf.to_crs({'init': 'epsg:3435'}).area / (5280 ** 2)

zips["zip_area_sqmi"] = calculate_sqmi(zips)
wards["ward_area_sqmi"] = calculate_sqmi(wards)

A quick inspection reveals that the calculated square-milages are of an order of magnitude that make sense.

In [7]:
print zips.zip_area_sqmi.head()
print wards.ward_area_sqmi.head()
0    3.804102
1    4.572574
2    1.616629
3    2.541532
4    3.552558
Name: zip_area_sqmi, dtype: float64
0    4.164389
1    3.699769
2    2.352656
3    4.924614
4    1.783942
Name: ward_area_sqmi, dtype: float64

Find the intersection of the two datasets

To find the intersection of the geometries, I use the overlay function with how="intersection". The overlay function allows us to perform set-operations on our dataset, creating new features based on where the geometries do or don't overlap. Read more about the overlay function here. Once done, I need to make sure the resulting intersection uses the same coordinates as our zip and ward boundaries.

In [8]:
intersection = gpd.overlay(zips, wards, how="intersection", use_sindex=True)
intersection.crs = zips.crs

Let's examine the result of finding the intersection of the two datasets, focusing attention on how the wards intersect with ZIP Code 60615, a ZIP Code which lies partially in the Hyde Park neighborhood.

In [9]:
intersection[intersection.zip == "60615"]
Out[9]:
objectid shape_len zip shape_area zip_area_sqmi ward shape_area_2 shape_leng ward_area_sqmi geometry
254 8 38321.3137692 60615 66565454.8469 2.387707 5 115196005.523 102271.665896 4.132088 POLYGON ((-87.58331558982701 41.80582001190835...
255 8 38321.3137692 60615 66565454.8469 2.387707 5 115196005.523 102271.665896 4.132088 POLYGON ((-87.60621549732983 41.79496536146332...
256 8 38321.3137692 60615 66565454.8469 2.387707 20 137290356.97 90105.1956185 4.924614 POLYGON ((-87.60621549732331 41.79496536162259...
257 8 38321.3137692 60615 66565454.8469 2.387707 4 126006901.096 119468.403755 4.519876 POLYGON ((-87.60621151884421 41.79506262138544...
313 8 38321.3137692 60615 66565454.8469 2.387707 3 123341548.953 81530.2751668 4.424269 POLYGON ((-87.62569132923429 41.79430028132465...
314 8 38321.3137692 60615 66565454.8469 2.387707 20 137290356.97 90105.1956185 4.924614 POLYGON ((-87.62540170895157 41.794310159517, ...
316 8 38321.3137692 60615 66565454.8469 2.387707 20 137290356.97 90105.1956185 4.924614 POLYGON ((-87.62464551253092 41.79432553026912...
318 8 38321.3137692 60615 66565454.8469 2.387707 20 137290356.97 90105.1956185 4.924614 POLYGON ((-87.62245797694853 41.79437071439551...
320 8 38321.3137692 60615 66565454.8469 2.387707 20 137290356.97 90105.1956185 4.924614 POLYGON ((-87.62164747846325 41.79439179416078...
324 8 38321.3137692 60615 66565454.8469 2.387707 20 137290356.97 90105.1956185 4.924614 POLYGON ((-87.61597293853798 41.7944871274405,...

Dissolve fragments into single boundary

Notice that the 5th and 20th wards have multiple sites of intersection with 60615. Ideally, I want a single feature for each ward-ZIP Code intersection to simplify the calculations later. I can merge the fragments together by using the dissolve function and passing sum as the method of aggregation. Think of dissolve as an analogue to the pandas groupby function that allows the aggregation of geometries.

Note: zip_area_sqmi and ward_area_sqmi give the full area of each ZIP Code and ward, not the area of intersection.

In [10]:
intersection2 = intersection[["zip", "ward", "zip_area_sqmi", "ward_area_sqmi", "geometry"]]
intersection2 = intersection2.dissolve(by=["zip", "ward", "zip_area_sqmi", "ward_area_sqmi"], aggfunc='sum').reset_index()
In [11]:
intersection2[intersection2.zip == "60615"]
Out[11]:
zip ward zip_area_sqmi ward_area_sqmi geometry
51 60615 20 2.387707 4.924614 (POLYGON ((-87.62540170895157 41.794310159517,...
52 60615 3 2.387707 4.424269 POLYGON ((-87.62569132923429 41.79430028132465...
53 60615 4 2.387707 4.519876 POLYGON ((-87.60621151884421 41.79506262138544...
54 60615 5 2.387707 4.132088 (POLYGON ((-87.60621549732983 41.7949653614633...

Calculate area and proportion of intersections

Now the GeoDataFrame has one feature for each ward-ZIP Code intersection. I calculate the area of each intersection and the proportion of the intersection with respect to the full ZIP Code (pct_int_wrt_zip) and the full ward (pct_int_wrt_ward).

Note: to answer my original question, I only need to calculate pct_int_wrt_zip.

In [12]:
intersection2["intersection_area_sqmi"] = calculate_sqmi(intersection2)
intersection2["pct_int_wrt_zip"] = intersection2.intersection_area_sqmi / intersection2.zip_area_sqmi
intersection2["pct_int_wrt_ward"] = intersection2.intersection_area_sqmi / intersection2.ward_area_sqmi

Inspect the intersections

I'll create a couple quick plots to show how 60615 intersects with Chicago's ward and to show how one particular ward (Ward 20) intersects with 60615.

In [13]:
ax = intersection2[intersection2.zip == "60615"].plot(figsize=(10,10), cmap='Paired', edgecolor='black')
wards[wards.ward.isin(intersection2[intersection2.zip == "60615"].ward)].plot(ax=ax, facecolor='none', edgecolor='black')
ax.set_title('How ZIP Code 60615 intersects with Chicago Wards')
Out[13]:
Text(0.5,1,'How ZIP Code 60615 intersects with Chicago Wards')
In [14]:
ax = intersection2[intersection2.ward == "20"].plot(figsize=(10,10), cmap='Paired', edgecolor='black')
zips[zips.zip == "60615"].plot(ax=ax, facecolor='none', edgecolor='black')
ax.set_title('How Chicago Ward 20 intersects with ZIP Code 60615')
Out[14]:
Text(0.5,1,'How Chicago Ward 20 intersects with ZIP Code 60615')

The results and a lesson about data collection

In [15]:
intersection2[intersection2.zip == "60615"]
Out[15]:
zip ward zip_area_sqmi ward_area_sqmi geometry intersection_area_sqmi pct_int_wrt_zip pct_int_wrt_ward
51 60615 20 2.387707 4.924614 (POLYGON ((-87.62540170895157 41.794310159517,... 0.123309 0.051643 0.025039
52 60615 3 2.387707 4.424269 POLYGON ((-87.62569132923429 41.79430028132465... 0.601311 0.251836 0.135912
53 60615 4 2.387707 4.519876 POLYGON ((-87.60621151884421 41.79506262138544... 0.987530 0.413589 0.218486
54 60615 5 2.387707 4.132088 (POLYGON ((-87.60621549732983 41.7949653614633... 0.675554 0.282930 0.163490

By this method of estimation, given that someone lives in ZIP Code 60615, there is a 25.2% chance that person also lives in the 3rd ward, a 41.4% chance they live in the 4th ward, a 28.3% chance they live in the 5th ward, and a 5.2% they live in the 20th ward.

How reasonable is the estimate? As discussed at the start of the post, ZIP Codes aren't actually polygonal regions. So the area of a ZIP Code boundary is the area of the approximate region that ZIP Code covers. Second, regardless of the types of boundaries used (e.g. say instead we use congressional districts and city neighborhoods), the estimation doesn't account for how people are distributed across the boundaries. The estimation doesn't account for green space or industrial spaces where housed people do not live, nor does it account for clusters of high-density housing. For example, almost all of the intersection of 60615 with the 20th Ward contains Washington Park. The actual probability that someone lives in the 20th ward given that they live in 60615 is essentially 0, not 5.2%.

We could check the accuracy of the estimate for a single ZIP Code by randomly sampling addresses in that ZIP Code and recording which ward they belong to. Such a check is not worth the trouble for this exercise. I think the estimate is useful for getting a rough sense of how the the population is divided up, but I don't think it has much real world utility.

This exercise does, however, highlight something important: collecting better data to start (e.g. a full address) helps to avoid making unreliable estimates later. Collecting more detailed data isn't always possible—with collecting addresses there are privacy and security concerns, or we want to avoid being invasive. And sometimes (like in this case) we don't always know what data will be useful later on. Hopefully though this post makes you think about future data collection projects more carefully.