A friend recently asked me if there were an easy way to determine which Chicago ward someone lives in given their ZIP Code. The sign-in sheets at an event she helped plan solicited event-goers' ZIP Codes. Later on she decided she wanted to follow up with attendees about an upcoming aldermanic candidate forum, but only with those attendees living in a ward where one of the speaking candidates was running. I answered her question with disappointing but expected news: because Chicago ward's are so gerrymandered, it's not easy to guess which ward someone lives in based on their ZIP Code. A single ZIP Code could have 4 or more wards twisting through it.
The question, however, got me thinking: could I estimate the rough probability that someone lives in a particular ward given their ZIP Code? One way to estimate the probability is by finding the proportion of the area of intersection of the ward and ZIP Code to the total area of the ZIP Code.
I very briefly introduced GeoPandas in the previous post. In this post I'll show how to use GeoPandas to find the area of intersection of two geospatial datasets by applying the overlay
function. At the end of the post, I'll reflect on how and why to avoid such an estimate in the first place.
Load the data¶
The Chicago Data Portal provides geospatial data for Chicago ward and Chicago ZIP Code boundaries. I load the two data sets into separate GeoPandas GeoDataFrames
. Notice the relative regularity of the ZIP Code boundaries when compared to the salamander-shaped boundaries of Chicago's wards. Some have compared the shape of the city's 2nd ward to the silhouette of a lobster—can you find it?
%matplotlib inline
import matplotlib.pyplot as plt
import geopandas as gpd
zips = gpd.read_file("data/Boundaries - ZIP Codes.geojson")
zips.plot(figsize=(10, 10), edgecolor='black')
wards = gpd.read_file("data/Boundaries - Wards (2015-).geojson")
wards.plot(figsize=(12, 12), edgecolor='black')
A Note on ZIP Codes¶
U.S. ZIP Codes are not polygonal regions, but are linear features that correspond to mail delivery routes. In some cases, a ZIP Code corresponds to a group of addresses at a single location, such as a high-rise apartment building. Any map showing "ZIP Code boundaries" is really showing an approximation of where the ZIP Codes are located. Depending on the method used to draw the boundaries, some ZIP Codes might overlap, and some might not be contiguous, among other possible pathologies. When we're finding the "area" of a ZIP Code, we're really finding the area of its boundary approximation. You can read more about ZIP Codes here.
Calculate the area of each feature¶
Calling the area
attribute on a GeoDataFrame
will return a Series
containing the area of each geometry in the GeoDataFrame
. It is important to keep in mind that the geometries in GeoDataFrames
are a collection of coordinates in arbitrary space, so the area
attribute will return areas in a unit related to the coordinate system being used. In order to obtain area values that are meaningful to us, we need to make sure to set the proper coordinate reference system (CRS). We can see the current CRS of both datasets by calling the crs
attribute.
print wards.crs
print zips.crs
This CRS is the World Geodetic System (WGS84) latitude-longitude projection. Here it is being referred to by its equivalent EPSG code. Let's inspect the geometry columns directly.
print wards.geometry.head()
print zips.geometry.head()
Yep, those coordinates look like latitudes and longitudes.
I want to calculate the area of each boundary in square-miles for intelligibility (some of you may prefer square-km). To do so, I need to convert the coordinates of each GeoDataFrame
to a system that uses feet. For geospatial data in and around Chicago, I usually use the EPSG 3435 (Illinois StatePlane East Feet) coordinate system. These coordinates are in U.S. survey feet. Calling the area
attribute will return a Series
containing the area of each feature in square-feet, which I can then convert to square-miles.
def calculate_sqmi(gdf):
return gdf.to_crs({'init': 'epsg:3435'}).area / (5280 ** 2)
zips["zip_area_sqmi"] = calculate_sqmi(zips)
wards["ward_area_sqmi"] = calculate_sqmi(wards)
A quick inspection reveals that the calculated square-milages are of an order of magnitude that make sense.
print zips.zip_area_sqmi.head()
print wards.ward_area_sqmi.head()
Find the intersection of the two datasets¶
To find the intersection of the geometries, I use the overlay
function with how="intersection"
. The overlay
function allows us to perform set-operations on our dataset, creating new features based on where the geometries do or don't overlap. Read more about the overlay function here. Once done, I need to make sure the resulting intersection uses the same coordinates as our zip and ward boundaries.
intersection = gpd.overlay(zips, wards, how="intersection", use_sindex=True)
intersection.crs = zips.crs
Let's examine the result of finding the intersection of the two datasets, focusing attention on how the wards intersect with ZIP Code 60615, a ZIP Code which lies partially in the Hyde Park neighborhood.
intersection[intersection.zip == "60615"]
Dissolve fragments into single boundary¶
Notice that the 5th and 20th wards have multiple sites of intersection with 60615. Ideally, I want a single feature for each ward-ZIP Code intersection to simplify the calculations later. I can merge the fragments together by using the dissolve
function and passing sum
as the method of aggregation. Think of dissolve
as an analogue to the pandas groupby
function that allows the aggregation of geometries.
Note: zip_area_sqmi
and ward_area_sqmi
give the full area of each ZIP Code and ward, not the area of intersection.
intersection2 = intersection[["zip", "ward", "zip_area_sqmi", "ward_area_sqmi", "geometry"]]
intersection2 = intersection2.dissolve(by=["zip", "ward", "zip_area_sqmi", "ward_area_sqmi"], aggfunc='sum').reset_index()
intersection2[intersection2.zip == "60615"]
Calculate area and proportion of intersections¶
Now the GeoDataFrame
has one feature for each ward-ZIP Code intersection. I calculate the area of each intersection and the proportion of the intersection with respect to the full ZIP Code (pct_int_wrt_zip
) and the full ward (pct_int_wrt_ward
).
Note: to answer my original question, I only need to calculate pct_int_wrt_zip
.
intersection2["intersection_area_sqmi"] = calculate_sqmi(intersection2)
intersection2["pct_int_wrt_zip"] = intersection2.intersection_area_sqmi / intersection2.zip_area_sqmi
intersection2["pct_int_wrt_ward"] = intersection2.intersection_area_sqmi / intersection2.ward_area_sqmi
Inspect the intersections¶
I'll create a couple quick plots to show how 60615 intersects with Chicago's ward and to show how one particular ward (Ward 20) intersects with 60615.
ax = intersection2[intersection2.zip == "60615"].plot(figsize=(10,10), cmap='Paired', edgecolor='black')
wards[wards.ward.isin(intersection2[intersection2.zip == "60615"].ward)].plot(ax=ax, facecolor='none', edgecolor='black')
ax.set_title('How ZIP Code 60615 intersects with Chicago Wards')
ax = intersection2[intersection2.ward == "20"].plot(figsize=(10,10), cmap='Paired', edgecolor='black')
zips[zips.zip == "60615"].plot(ax=ax, facecolor='none', edgecolor='black')
ax.set_title('How Chicago Ward 20 intersects with ZIP Code 60615')
The results and a lesson about data collection¶
intersection2[intersection2.zip == "60615"]
By this method of estimation, given that someone lives in ZIP Code 60615, there is a 25.2% chance that person also lives in the 3rd ward, a 41.4% chance they live in the 4th ward, a 28.3% chance they live in the 5th ward, and a 5.2% they live in the 20th ward.
How reasonable is the estimate? As discussed at the start of the post, ZIP Codes aren't actually polygonal regions. So the area of a ZIP Code boundary is the area of the approximate region that ZIP Code covers. Second, regardless of the types of boundaries used (e.g. say instead we use congressional districts and city neighborhoods), the estimation doesn't account for how people are distributed across the boundaries. The estimation doesn't account for green space or industrial spaces where housed people do not live, nor does it account for clusters of high-density housing. For example, almost all of the intersection of 60615 with the 20th Ward contains Washington Park. The actual probability that someone lives in the 20th ward given that they live in 60615 is essentially 0, not 5.2%.
We could check the accuracy of the estimate for a single ZIP Code by randomly sampling addresses in that ZIP Code and recording which ward they belong to. Such a check is not worth the trouble for this exercise. I think the estimate is useful for getting a rough sense of how the the population is divided up, but I don't think it has much real world utility.
This exercise does, however, highlight something important: collecting better data to start (e.g. a full address) helps to avoid making unreliable estimates later. Collecting more detailed data isn't always possible—with collecting addresses there are privacy and security concerns, or we want to avoid being invasive. And sometimes (like in this case) we don't always know what data will be useful later on. Hopefully though this post makes you think about future data collection projects more carefully.