The New York City Parks Department maintains an interactive Street Tree Map that details every tree growing under NYC Parks jurisdiction as identified by a team of volunteers in 2015. The map is both impressive and thorough and even allows users to create an account where they can favorite trees and record their stewardship activities. Unfortunately, the city of Chicago does not maintain a similar map or publicly available dataset. On a smaller scale, the University of Chicago in Hyde Park published an online database from a tree inventory conducted on their campus in Autumn 2015. The tree inventory is published as a searchable and filterable map. UChicago's map is not as nice as NYC's Street Tree Map: among other reasons, it's slow, cumbersome to navigate, and not convenient for conducting data analysis. With a little work, however, the data can be scraped for our own perusal.
In this blog post, I'm going to walk through how I scraped and cleaned the data, as well as briefly explore the dataset and brainstorm future avenues for data visualization. My intention is to eventually create a couple interactive visualizations with D3 using the data that I'll share in a future post.
Gathering the data¶
The UChicago tree data needs to be gathered in two stages. First, there is the data that populates the map with markers for each tree. With a little poking around in our web browser's console, we can find the source of the data populating the map: an XML file that includes the lat/lon coordinates and unique identifying information about each tree, such as a treeid and a featureid.
Next, if we click a tree marker on the map, we can follow a link to a page with more information about that particular tree, including its species, age class, diameter, canopy radius, and its monetary value. We can scrape this data by making an HTTP request to each one of these pages. The URL of each tree's informational page is built using the various ids found in the XML file as parameters.
The plan is to initialize a pandas DataFrame with the XML data, where each row corresponds to a different tree, and then to iterate over the rows of the DataFrame. For each row, we can make an HTTP request to the informational page for the corresponding tree and populate new columns in the DataFrame with the data scraped from those pages.
In Chrome, a link to the XML file can be obtained by opening DevTools and searching under the network panel.
import xml.etree.ElementTree as ET
import requests
xml_url = "https://arborscope.com/includes/generateMarkers.cfm?commonName=x&genus=x&species=x&treeid=&legendid=2&id=09C4C2&showMarkerIDs=on&jumpToTree=&inventoryID=09C4C2&noShow=on"
r = requests.get(xml_url)
root = ET.fromstring(r.content)
for child in root[:5]:
print(child.tag, child.attrib)
Initializing a DataFrame¶
First, initialize a DataFrame with the data contained in the XML file. Avoid building the DataFrame by starting with an empty DataFrame and concatenating one XML feature at a time. That is a slow process, as each call to pd.concat returns a new DataFrame with a copy of the data from the DataFrame in the previous step. Instead, we'll parse the XML data into a list of lists and create the DataFrame in one go.
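For reference, the pattern we're avoiding looks roughly like this (an illustrative sketch only, reusing the root element parsed above):
import pandas as pd
# Slow anti-pattern: each pd.concat copies every row accumulated so far
df_slow = pd.DataFrame(columns=["treeid", "featureid", "icon", "lat", "lon"])
for tree in root.findall("tree"):
    row = pd.DataFrame([[tree.get('treeid'), tree.get('featureid'), tree.get('icon'),
                         tree.get('lat'), tree.get('lng')]], columns=df_slow.columns)
    df_slow = pd.concat([df_slow, row], ignore_index=True)
The list-of-lists approach below builds the same DataFrame in a single pass.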
import pandas as pd
trees = root.findall("tree")
xml_data = [[tree.get('treeid'), tree.get('featureid'), tree.get('icon'), tree.get('lat'), tree.get('lng')] for tree in trees]
df = pd.DataFrame(xml_data, columns=["treeid", "featureid", "icon", "lat", "lon"])
df.head()
Scraping the data¶
We'll start by scraping the data for the tree in the 0th row of the DataFrame, and then attempt to replicate this process for all rows. If we navigate to the informational page of any tree on the map, we see the URL is formatted as follows:
https://arborscope.com/featureDetails.cfm?&tid={}&id={}&featureID={}&icon={}
Note that the id parameter corresponds to the id of the map (09C4C2) and that the tid parameter corresponds to our DataFrame's treeid column.
Thus, the URL for the tree with treeid 1425 in the 0th row is https://arborscope.com/featureDetails.cfm?&tid=1425&id=09C4C2&featureID=258354&icon=478ba3. Navigate to that page and open up DevTools. Notice that the information about the tree (its species, value, etc.) is stored in a table element with class property-table. Fortunately*, each piece of information is stored in its own table data (td) element. This makes scraping the data straightforward. Request the webpage and extract the appropriate table element using BeautifulSoup.
* When I initially scraped the tree data earlier this year, the information was in a much messier form. The tree information was in a paragraph element littered with an assortment of whitespace characters arranged in no discernible pattern, rather than being structured in a table element. Scraping the data required both Python and regex gymnastics to strategically remove the whitespace, so as to preserve the connection between each data field (e.g. "Scientific name") and its data (e.g. "Ginkgo biloba").
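For the curious, the cleanup looked roughly like the sketch below; the messy string here is a made-up stand-in for the old page text, not the actual markup:
import re
# Made-up stand-in for the old, whitespace-littered paragraph text
messy = "Scientific name :   Ginkgo   biloba  \n\n  Common name :  Ginkgo \n"
# Collapse each run of whitespace to a single space and split each line on its colon,
# keeping each field name paired with its value
pairs = [
    tuple(re.sub(r"\s+", " ", part).strip() for part in line.split(":"))
    for line in messy.splitlines()
    if ":" in line
]
print(pairs)  # [('Scientific name', 'Ginkgo biloba'), ('Common name', 'Ginkgo')]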
import requests
from bs4 import BeautifulSoup
feature_url = "https://arborscope.com/featureDetails.cfm"
MAP_ID = "09C4C2"
def build_payload(treeid, featureid, icon):
payload = {'tid': treeid, 'id': MAP_ID, 'featureID': featureid, 'icon': icon}
return payload
payload = build_payload(*df.loc[0, ["treeid", "featureid", "icon"]])
r = requests.get(feature_url, params=payload)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table", "property-table")
print(table)
More often than not, data exists in a messy form in the "real world". Much of the work of a data scientist involves capturing, cleaning, and processing such data. While an analytic and statistical skill set is important for a data scientist, it is not enough. Gaining fluency with a programming language is crucial for processing data efficiently. Oftentimes, cleaning data calls on multiple knowledge sets, like Python, HTML, and regex.
Luckily, we can clean up this data with a small amount of effort. Basically, we want to separate the data field names and the corresponding data into two separate lists and to strip the extraneous whitespace and colon (":") characters from the text. Having two lists (technically tuples in this case), one for the field names and one for the data, makes it easy to insert the data into our DataFrame.
table_rows = table.find_all("tr")
field, data = zip(*[[td.text.strip(' \n:') for td in row.find_all("td")] for row in table_rows])
field, data
We can apply the above procedure to scrape the data from each tree's information page. Let's combine all the pieces above into a single for-loop that iterates over the DataFrame and writes the scraped data back to it.
Note: there are over 3,700 trees in the database! This task could take several hours to complete.
def scrape_data(df):
    for i in df.index:
        payload = build_payload(*df.loc[i, ["treeid", "featureid", "icon"]])
        r = requests.get(feature_url, params=payload)
        soup = BeautifulSoup(r.content, "lxml")
        table = soup.find("table", "property-table")
        table_rows = table.find_all("tr")
        fields, data = zip(*[[td.text.strip(' \n:') for td in row.find_all("td")] for row in table_rows])
        # Add a column for any field we haven't seen before
        for field in fields:
            if field not in df.columns:
                df[field] = ""
        df.loc[i, list(fields)] = list(data)
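Since the full scrape makes thousands of requests and can run for hours, it may be worth being a little defensive. The variant below (a sketch, not what I originally ran) reuses build_payload and feature_url from above, pauses between requests, and checkpoints the DataFrame periodically; the uchicago_trees_partial.csv filename is just a placeholder:
import time

def scrape_data_gently(df, pause=0.5, checkpoint_every=250):
    # Same scraping loop as scrape_data, but with a polite pause between requests
    # and a periodic checkpoint so hours of work aren't lost to a single failure.
    for n, i in enumerate(df.index, start=1):
        payload = build_payload(*df.loc[i, ["treeid", "featureid", "icon"]])
        soup = BeautifulSoup(requests.get(feature_url, params=payload).content, "lxml")
        rows = soup.find("table", "property-table").find_all("tr")
        fields, data = zip(*[[td.text.strip(' \n:') for td in row.find_all("td")] for row in rows])
        for field in fields:
            if field not in df.columns:
                df[field] = ""
        df.loc[i, list(fields)] = list(data)
        if n % checkpoint_every == 0:
            df.to_csv("uchicago_trees_partial.csv", index=False)  # periodic save
        time.sleep(pause)  # be polite to the server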
Save the data once finished so you don't lose it. Because I scraped this data several months ago, some of the data on the website now might be slightly different. Namely, the field "Location Information" with value "View" was not present before. Fortunately this information is not important. You can find the data I originally scraped here.
# scrape_data(df)
# df.to_csv("uchicago_trees.csv", index=False)
df = pd.read_csv("../data/uchicago_trees.csv")
df.fillna("", inplace=True)
Cleaning the data¶
Once all of the data is scraped, there are a couple things that need to be cleaned up to make analysis easier.
First, merge redundant columns and tidy up the column names.
df["scientific_name"] = df["Scientific name"] + df[" Scientific name"]
df.drop(["Scientific name", " Scientific name"], axis=1, inplace=True)
df.rename(columns={"Common name": "common_name", "Height class": "height_class", "Diameter at breast height": "diameters", "Age class": "age_class", "Canopy radius": "canopy_radius", "Tree Asset Value": "value", "Additional taxonomy": "additional_taxonomy", "Tree information": "tree_info"}, inplace=True)
The genus is the first part of a plant's scientific name. Let's create a column just for the trees' genera.
df["genus"] = [sciname.split(" ")[0] for sciname in df.scientific_name]
df.head()
Let's convert canopy_radius and value to numerical data types. We'll first need to remove the units from the strings.
df["canopy_radius"] = df.canopy_radius.str.replace(" ft.", "")
df["canopy_radius"] = df.canopy_radius.astype(int)
df["value"] = df.value.replace({"\$": "", ",": ""}, regex=True)
df["value"] = df.value.astype(float)
The diameters column provides the diameter of each tree measured at breast height. If the tree is multi-stemmed, the diameter of each stem is listed, separated by a comma. By convention, the size of a multi-stemmed tree is a composite measurement of all of its stems. According to the City of Portland, Oregon's Parks and Recreation Department: "For multi-stemmed trees, the size is determined by measuring all the trunks, and then adding the total diameter of the largest trunk to one-half the diameter of each additional trunk." We'll add a column called dbh to store this composite measurement.
# https://www.portlandoregon.gov/trees/article/424017
def calculate_dbh(diameter_str):
diameters = [float(diameter) for diameter in diameter_str.split(", ")]
diameters.sort(reverse=True)
return diameters[0] + (sum(diameters[1:]) / 2)
df["dbh"] = [calculate_dbh(diameter_str) for diameter_str in df.diameters.str.replace(" in.", "")]
Exploring the data¶
Now the data is in good enough form to start answering questions about it and to begin discovering patterns. Some initial questions I have are:
- What tree genera and species are most common on UChicago's campus?
- What is the relationship between the value of a tree and its diameter?
- Are trees of the same species planted near one another?
As we explore the data, we should think about ways to highlight patterns and communicate insights through visualizations.
Let's try to gain insight into the first question.
trees_count = df.groupby('genus').count().sort_values('treeid', ascending=False).treeid
trees_count
top_5_trees_pct = trees_count[:5] / df.shape[0]
print(top_5_trees_pct)
print()
print("The top 5 most common tree genera make up {:.2f}% of trees on campus.".format(top_5_trees_pct.sum() * 100))
The most common tree types are maple (Acer), honey locust (Gleditsia), oak (Quercus), hawthorn (Crataegus), and elm (Ulmus).
maples_count = df[df.genus == "Acer"].groupby('scientific_name').count().sort_values('treeid', ascending=False).treeid
print(maples_count)
print()
print(maples_count / df[df.genus == "Acer"].shape[0])
Of the maple trees, over one-third are Norway maples (Acer platanoides).
I was honestly surprised to learn that honey locust are so common. They are thorny and seem dangerous to have in an urban environment. But now that I think about it, there are a bunch of honey locust on my block. A quick Google search informs me that there is a thornless variety of honey locust that is popular in urban spaces. It is so popular that some cities "discourage planting it to prevent monoculture".
A little more digging led me to a tree diversity guidelines document published in 2007 by the Chicago Bureau of Forestry. According to the document, 15% of the city's street trees are honey locust, which is roughly the same proportion of honey locusts as UChicago. The document also discourages planting honey locusts, except in areas with tough conditions, such as parking islands.
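As a rough check of that comparison against our data (using the trees_count series computed above):
# Share of campus trees that are honey locust (Gleditsia)
print("{:.1%} of campus trees are honey locust".format(trees_count["Gleditsia"] / df.shape[0]))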
Now, let's see if there is any noticeable relationship between the value of a tree and its diameter at breast height.
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(8, 8))
plt.scatter(df.dbh, df.value, s=2)
plt.title("DBH (inches) vs Value ($usd)")
This was slightly unexpected! Based on the smoothness of each of the branches in the plot, the value of a tree must be related to its DBH according to a formula. It seems that each branch in the scatterplot corresponds to a different set of genera or species:
f, axs = plt.subplots(1, 4, figsize=(16, 4))
axs[0].scatter(df.dbh, df.value, c=["red" if g == "Acer" else "blue" for g in df.genus], s=2)
axs[1].scatter(df.dbh, df.value, c=["red" if g == "Gleditsia" else "blue" for g in df.genus], s=2)
axs[2].scatter(df.dbh, df.value, c=["red" if g == "Quercus" else "blue" for g in df.genus], s=2)
axs[3].scatter(df.dbh, df.value, c=["red" if g == "Crataegus" else "blue" for g in df.genus], s=2)
f.suptitle("DBH (inches) vs Value ($usd)")
for ax, title in zip(axs, ["Acer", "Gleditsia", "Quercus", "Crataegus"]):
ax.set_title(title)
"Total value of trees on UChicago's campus: ${:,.2f}".format(df.value.sum())
A fun project might be to try to derive or create a model of the tree value formulas from the scatterplot.
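As a starting point, one could fit a simple curve to a single branch. The sketch below fits a quadratic to the honey locust (Gleditsia) points with numpy's polyfit; the quadratic form is just an assumption, not the formula the inventory actually uses:
import numpy as np

# Assumed quadratic fit to one branch of the scatterplot (Gleditsia only);
# the true appraisal formula behind the values is unknown to me.
gleditsia = df[df.genus == "Gleditsia"]
coeffs = np.polyfit(gleditsia.dbh, gleditsia.value, deg=2)
print("value ≈ {:.2f}*dbh^2 + {:.2f}*dbh + {:.2f}".format(*coeffs))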
Thinking about the next question, it seems hard to quantify whether trees of the same type are planted "together" or not. Some trees of the same species might be planted in a row, like on a street, while others might sit in a tight cluster in a courtyard. This is a question that is probably best answered visually. To see where different types of trees are planted on campus, we can plot their locations using their latitude and longitude. We can make a rough plot using the geopandas library. Geopandas combines pandas and shapely (a geometric manipulation library), enabling you to perform geospatial analyses without GIS software. I won't discuss how to use geopandas or shapely here; if you're unfamiliar with them, I'll write about them more in a future post.
from shapely.geometry import Point
df["lat"] = df.lat.astype(float)
df["lon"] = df.lon.astype(float)
geometry = [Point(xy) for xy in zip(df.lon, df.lat)]
import geopandas as gpd
gdf = gpd.GeoDataFrame(df, geometry=geometry)
# Ginkgo trees
f, ax = plt.subplots(1, figsize=(15, 15))
gdf[gdf.genus != "Ginkgo"].plot(ax=ax, markersize=1)
gdf[gdf.genus == "Ginkgo"].plot(ax=ax, markersize=1)
plt.show()
Ellis Avenue has a reputation for smelling bad—or at least had a reputation when I was in school. This was partially due to the smell of sewage from construction, but also from the many Ginkgo trees lining the street. The seeds of the female Ginkgo contain butyric acid, which gives off an odor like rotten cheese or butter. The block of Ellis between 57th and 58th Streets stands out on the map: both sides of the street are lined with Ginkgo. In fact, it appears that in several locations on campus, Ginkgo trees are planted in rows along streets or walkways.
# Oak trees
f, ax = plt.subplots(1, figsize=(15, 15))
gdf[gdf.genus != "Quercus"].plot(ax=ax, markersize=1)
gdf[gdf.genus == "Quercus"].plot(ax=ax, markersize=1)
plt.show()
To my surprise, oak trees are not evenly dispersed across campus. The main quad (centered at -87.6, 41.79) has a high density of oak trees, and there are several walkways and streets with tightly packed groups of oaks.
Future Visualizations¶
A couple ideas I have for visualizations to make in the future are:
- An interactive map with different colored or shaped markers indicating the location of common tree types. The goal would be to demonstrate how different tree types are dispersed across campus.
- A pictograph with icons for different species' leaf shapes showing the breakdown of trees on campus.
- If possible, a graph showing the distribution of tree ages or planting years. I imagine there are formulas for estimating a tree's age from its DBH and species (a rough sketch of one such approach is below).
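One rough approach is the growth-factor method, in which a tree's age is estimated as its DBH multiplied by a species-specific growth factor. The sketch below is purely illustrative: the growth factor value is a placeholder assumption, not a vetted figure for any particular species.
def estimate_age(dbh_inches, growth_factor):
    # Rough estimate: age (years) ≈ DBH (inches) * species-specific growth factor.
    # The factor should be looked up per species; 4.0 below is only a placeholder.
    return dbh_inches * growth_factor

estimate_age(15, growth_factor=4.0)  # ~60 years with the placeholder factor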