The New York City Parks Department maintains an interactive Street Tree Map that details every tree growing under NYC Parks jurisdiction as identified by a team of volunteers in 2015. The map is both impressive and thorough and even allows users to create an account where they can favorite trees and record their stewardship activities. Unfortunately, the city of Chicago does not maintain a similar map or publicly available dataset. On a smaller scale, the University of Chicago in Hyde Park published an online database from a tree inventory conducted on their campus in Autumn 2015. The tree inventory is published as a searchable and filterable map. UChicago's map is not as nice as NYC's Street Tree Map: among other reasons, it's slow, cumbersome to navigate, and not convenient for conducting data analysis. With a little work, however, the data can be scraped for our own perusal.
In this blog post, I'm going to walk through how I scraped and cleaned the data, as well as briefly explore the dataset and brainstorm future avenues for data visualization. My intention is to eventually create a couple interactive visualizations with D3 using the data that I'll share in a future post.
Gathering the data¶
The UChicago tree data needs to be gathered in two stages. First, there is the data that populates the map with markers for each tree. With a little poking around in our web browser's console, we can find the source of the data populating the map: an XML file that includes the lat/lon coordinates and unique identifying information about each tree, such as a treeid and a featureid.
Next, if we click a tree marker on the map, we can follow a link to a page with more information about that particular tree, including its species, age class, diameter, canopy radius, and its monetary value. We can scrape this data by making an HTTP request to each one of these pages. The URL of each tree's informational page is built using the various ids found in the XML file as parameters.
The plan is to initialize a pandas DataFrame with the XML data, where each row corresponds to a different tree, and then to iterate over the rows of the DataFrame. For each row, we can make an HTTP request to the informational page for the corresponding tree and populate new columns in the DataFrame with the data scraped from those pages.
In Chrome, a link to the XML file can be obtained by opening DevTools and searching under the network panel.
import xml.etree.ElementTree as ET
import requests
xml_url = "https://arborscope.com/includes/generateMarkers.cfm?commonName=x&genus=x&species=x&treeid=&legendid=2&id=09C4C2&showMarkerIDs=on&jumpToTree=&inventoryID=09C4C2&noShow=on"
r = requests.get(xml_url)
root = ET.fromstring(r.content)
for child in root[:5]:
print(child.tag, child.attrib)
Initializing a DataFrame¶
First, initialize a DataFrame with the data contained in the XML file. Avoid building the DataFrame by starting with an empty DataFrame and concatenating one XML feature at a time. That is a slow process, as each call to pd.concat returns a new DataFrame with a copy of the data from the DataFrame in the previous step. Instead, we'll parse the XML data into a list of lists and create the DataFrame in one go.
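For reference, the pattern we're avoiding looks roughly like this (an illustrative sketch only, reusing the root element parsed above):
import pandas as pd
# Slow anti-pattern: each pd.concat copies every row accumulated so far
df_slow = pd.DataFrame(columns=["treeid", "featureid", "icon", "lat", "lon"])
for tree in root.findall("tree"):
    row = pd.DataFrame([[tree.get('treeid'), tree.get('featureid'), tree.get('icon'),
                         tree.get('lat'), tree.get('lng')]], columns=df_slow.columns)
    df_slow = pd.concat([df_slow, row], ignore_index=True)
The list-of-lists approach below builds the same DataFrame in a single pass.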
import pandas as pd
trees = root.findall("tree")
xml_data = [[tree.get('treeid'), tree.get('featureid'), tree.get('icon'), tree.get('lat'), tree.get('lng')] for tree in trees]
df = pd.DataFrame(xml_data, columns=["treeid", "featureid", "icon", "lat", "lon"])
df.head()
Scraping the data¶
We'll start by scraping the data for the tree in the 0th row of the DataFrame, and then attempt to replicate this process for all rows. If we navigate to the informational page of any tree on the map, we see the URL is formatted as follows:
https://arborscope.com/featureDetails.cfm?&tid={}&id={}&featureID={}&icon={}
Note that the id parameter corresponds to the id of the map (09C4C2) and that the tid parameter corresponds to our DataFrame's treeid column.
Thus, the URL for the tree with treeid 1425 in the 0th row is https://arborscope.com/featureDetails.cfm?&tid=1425&id=09C4C2&featureID=258354&icon=478ba3. Navigate to that page and open up DevTools. Notice that the information about the tree (its species, value, etc.) is stored in a table element with class property-table. Fortunately*, each piece of information is stored in its own table data (td) element. This makes scraping the data straightforward. Request the webpage and extract the appropriate table element using BeautifulSoup.
* When I initially scraped the tree data earlier this year, the information was in a much messier form. The tree information was in a paragraph element littered with an assortment of whitespace characters arranged in no discernible pattern, rather than being structured in a table element. Scraping the data required both Python and regex gymnastics to strategically remove the whitespace, so as to preserve the connection between each data field (e.g. "Scientific name") and its data (e.g. "Ginkgo biloba").
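For the curious, the cleanup looked roughly like the sketch below; the messy string here is a made-up stand-in for the old page text, not the actual markup:
import re
# Made-up stand-in for the old, whitespace-littered paragraph text
messy = "Scientific name :   Ginkgo   biloba  \n\n  Common name :  Ginkgo \n"
# Collapse each run of whitespace to a single space and split each line on its colon,
# keeping each field name paired with its value
pairs = [
    tuple(re.sub(r"\s+", " ", part).strip() for part in line.split(":"))
    for line in messy.splitlines()
    if ":" in line
]
print(pairs)  # [('Scientific name', 'Ginkgo biloba'), ('Common name', 'Ginkgo')]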
import requests
from bs4 import BeautifulSoup
feature_url = "https://arborscope.com/featureDetails.cfm"
MAP_ID = "09C4C2"
def build_payload(treeid, featureid, icon):
payload = {'tid': treeid, 'id': MAP_ID, 'featureID': featureid, 'icon': icon}
return payload
payload = build_payload(*df.loc[0, ["treeid", "featureid", "icon"]])
r = requests.get(feature_url, params=payload)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table", "property-table")
print(table)
More often than not, data exists in a messy form in the "real world". Much of the work of a data scientist involves capturing, cleaning, and processing such data. While an analytic and statistical skill set is important for a data scientist, it is not enough. Gaining fluency with a programming language is crucial for processing data efficiently. Oftentimes, cleaning data calls on multiple knowledge sets, like Python, HTML, and regex.
Luckily, we can clean up this data with a small amount of effort. Basically, we want to separate the data field names and the corresponding data into two separate lists and to strip the extraneous whitespace and colon (":") characters from the text. Having two lists (technically tuples in this case), one for the field names and one for the data, makes it easy to insert the data into our DataFrame.
table_rows = table.find_all("tr")
field, data = zip(*[[td.text.strip(' \n:') for td in row.find_all("td")] for row in table_rows])
field, data
We can apply the above procedure to scrape the data from each tree's information page. Let's combine all the pieces above into a single for-loop that iterates over the DataFrame and writes the scraped data back to it.
Note: there are over 3,700 trees in the database! This task could take several hours to complete.
def scrape_data(df):
    for i in df.index:
        payload = build_payload(*df.loc[i, ["treeid", "featureid", "icon"]])
        r = requests.get(feature_url, params=payload)
        soup = BeautifulSoup(r.content, "lxml")
        table = soup.find("table", "property-table")
        table_rows = table.find_all("tr")
        fields, data = zip(*[[td.text.strip(' \n:') for td in row.find_all("td")] for row in table_rows])
        # Add a column for any field we haven't seen before
        for field in fields:
            if field not in df.columns:
                df[field] = ""
        df.loc[i, list(fields)] = list(data)
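Since the full scrape makes thousands of requests and can run for hours, it may be worth being a little defensive. The variant below (a sketch, not what I originally ran) reuses build_payload and feature_url from above, pauses between requests, and checkpoints the DataFrame periodically; the uchicago_trees_partial.csv filename is just a placeholder:
import time

def scrape_data_gently(df, pause=0.5, checkpoint_every=250):
    # Same scraping loop as scrape_data, but with a polite pause between requests
    # and a periodic checkpoint so hours of work aren't lost to a single failure.
    for n, i in enumerate(df.index, start=1):
        payload = build_payload(*df.loc[i, ["treeid", "featureid", "icon"]])
        soup = BeautifulSoup(requests.get(feature_url, params=payload).content, "lxml")
        rows = soup.find("table", "property-table").find_all("tr")
        fields, data = zip(*[[td.text.strip(' \n:') for td in row.find_all("td")] for row in rows])
        for field in fields:
            if field not in df.columns:
                df[field] = ""
        df.loc[i, list(fields)] = list(data)
        if n % checkpoint_every == 0:
            df.to_csv("uchicago_trees_partial.csv", index=False)  # periodic save
        time.sleep(pause)  # be polite to the server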
Save the data once finished so you don't lose it. Because I scraped this data several months ago, some of the data on the website now might be slightly different. Namely, the field "Location Information" with value "View" was not present before. Fortunately this information is not important. You can find the data I originally scraped here.
# scrape_data(df)
# df.to_csv("uchicago_trees.csv", index=False)
df = pd.read_csv("../data/uchicago_trees.csv")
df.fillna("", inplace=True)
Cleaning the data¶
Once all of the data is scraped, there are a couple things that need to be cleaned up to make analysis easier.
First, merge redundant columns and tidy up the column names.
df["scientific_name"] = df["Scientific name"] + df[" Scientific name"]
df.drop(["Scientific name", " Scientific name"], axis=1, inplace=True)
df.rename(columns={"Common name": "common_name", "Height class": "height_class", "Diameter at breast height": "diameters", "Age class": "age_class", "Canopy radius": "canopy_radius", "Tree Asset Value": "value", "Additional taxonomy": "additional_taxonomy", "Tree information": "tree_info"}, inplace=True)
The genus is the first part of a plant's scientific name. Let's create a column just for the trees' genera.
df["genus"] = [sciname.split(" ")[0] for sciname in df.scientific_name]
df.head()
Let's convert canopy_radius and value to numerical data types. We'll first need to remove the units from the strings.
df["canopy_radius"] = df.canopy_radius.str.replace(" ft.", "")
df["canopy_radius"] = df.canopy_radius.astype(int)
df["value"] = df.value.replace({"\$": "", ",": ""}, regex=True)
df["value"] = df.value.astype(float)
The diameters column provides the diameter of each tree measured at breast height. If the tree is multi-stemmed, the diameter of each stem is listed, separated by a comma. By convention, the size of a multi-stemmed tree is a composite measurement of all of its stems. According to the City of Portland, Oregon's Parks and Recreation Department: "For multi-stemmed trees, the size is determined by measuring all the trunks, and then adding the total diameter of the largest trunk to one-half the diameter of each additional trunk." We'll add a column called dbh to store this composite measurement.
# https://www.portlandoregon.gov/trees/article/424017
def calculate_dbh(diameter_str):
diameters = [float(diameter) for diameter in diameter_str.split(", ")]
diameters.sort(reverse=True)
return diameters[0] + (sum(diameters[1:]) / 2)
df["dbh"] = [calculate_dbh(diameter_str) for diameter_str in df.diameters.str.replace(" in.", "")]
Exploring the data¶
Now the data is in good enough form to start answering questions about it and to begin discovering patterns. Some initial questions I have are:
- What tree genera and species are most common on UChicago's campus?
- What is the relationship between the value of a tree and its diameter?
- Are trees of the same species planted near one another?
As we explore the data, we should think about ways to highlight patterns and communicate insights through visualizations.
Let's try to gain insight into the first question.
trees_count = df.groupby('genus').count().sort_values('treeid', ascending=False).treeid
trees_count
top_5_trees_pct = trees_count[:5] / df.shape[0]
print(top_5_trees_pct)
print()
print("The top 5 most common tree genera make up {:.2f}% of trees on campus.".format(top_5_trees_pct.sum() * 100))
The most common tree types are maple (Acer), honey locust (Gleditsia), oak (Quercus), hawthorn (Crataegus), and elm (Ulmus).
maples_count = df[df.genus == "Acer"].groupby('scientific_name').count().sort_values('treeid', ascending=False).treeid
print(maples_count)
print()
print(maples_count / df[df.genus == "Acer"].shape[0])
Of the maple trees, over one-third are Norway maples (Acer platanoides).
I was honestly surprised to learn that honey locust are so common. They are thorny and seem dangerous to have in an urban environment. But now that I think about it, there are a bunch of honey locust on my block. A quick Google search informs me that there is a thornless variety of honey locust that is popular in urban spaces. It is so popular that some cities "discourage planting it to prevent monoculture".
A little more digging led me to a tree diversity guidelines document published in 2007 by the Chicago Bureau of Forestry. According to the document, 15% of the city's street trees are honey locust, which is roughly the same proportion of honey locusts as UChicago. The document also discourages planting honey locusts, except in areas with tough conditions, such as parking islands.
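As a rough check of that comparison against our data (using the trees_count series computed above):
# Share of campus trees that are honey locust (Gleditsia)
print("{:.1%} of campus trees are honey locust".format(trees_count["Gleditsia"] / df.shape[0]))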
Now, let's see if there is any noticeable relationship between the value of a tree and its diameter at breast height.
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(8, 8))
plt.scatter(df.dbh, df.value, s=2)
plt.title("DBH (inches) vs Value ($usd)")
This was slightly unexpected! Based on the smoothness of each of the branches in the plot, the value of a tree must be related to its DBH according to a formula. It seems that each branch in the scatterplot corresponds to a different set of genera or species:
f, axs = plt.subplots(1, 4, figsize=(16, 4))
axs[0].scatter(df.dbh, df.value, c=["red" if g == "Acer" else "blue" for g in df.genus], s=2)
axs[1].scatter(df.dbh, df.value, c=["red" if g == "Gleditsia" else "blue" for g in df.genus], s=2)
axs[2].scatter(df.dbh, df.value, c=["red" if g == "Quercus" else "blue" for g in df.genus], s=2)
axs[3].scatter(df.dbh, df.value, c=["red" if g == "Crataegus" else "blue" for g in df.genus], s=2)
f.suptitle("DBH (inches) vs Value ($usd)")
for ax, title in zip(axs, ["Acer", "Gleditsia", "Quercus", "Crataegus"]):
ax.set_title(title)
"Total value of trees on UChicago's campus: ${:,.2f}".format(df.value.sum())
A fun project might be to try to derive or create a model of the tree value formulas from the scatterplot.
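As a starting point, one could fit a simple curve to a single branch. The sketch below fits a quadratic to the honey locust (Gleditsia) points with numpy's polyfit; the quadratic form is just an assumption, not the formula the inventory actually uses:
import numpy as np

# Assumed quadratic fit to one branch of the scatterplot (Gleditsia only);
# the true appraisal formula behind the values is unknown to me.
gleditsia = df[df.genus == "Gleditsia"]
coeffs = np.polyfit(gleditsia.dbh, gleditsia.value, deg=2)
print("value ≈ {:.2f}*dbh^2 + {:.2f}*dbh + {:.2f}".format(*coeffs))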
Thinking about the next question, it seems hard to quantify whether trees of the same type are planted "together" or not. Some trees of the same species might be planted in a row, like on a street, while others might sit in a tight cluster in a courtyard. This is a question that is probably best answered visually. To see where different types of trees are planted on campus, we can plot their locations using their latitude and longitude. We can make a rough plot using the geopandas library. Geopandas combines pandas and shapely (a geometric manipulation library), enabling you to perform geospatial analyses without GIS software. I won't discuss how to use geopandas or shapely here; if you're unfamiliar with them, I'll write about them more in a future post.
from shapely.geometry import Point
df["lat"] = df.lat.astype(float)
df["lon"] = df.lon.astype(float)
geometry = [Point(xy) for xy in zip(df.lon, df.lat)]
import geopandas as gpd
gdf = gpd.GeoDataFrame(df, geometry=geometry)
# Ginkgo trees
f, ax = plt.subplots(1, figsize=(15, 15))
gdf[gdf.genus != "Ginkgo"].plot(ax=ax, markersize=1)
gdf[gdf.genus == "Ginkgo"].plot(ax=ax, markersize=1)
plt.show()
Ellis Avenue has a reputation for smelling bad—or at least had a reputation when I was in school. This was partially due to the smell of sewage from construction, but also from the many Ginkgo trees lining the street. The seeds of the female Ginkgo contain butyric acid, which gives off an odor like rotten cheese or butter. The block of Ellis between 57th and 58th Streets stands out on the map: both sides of the street are lined with Ginkgo. In fact, it appears that in several locations on campus, Ginkgo trees are planted in rows along streets or walkways.
# Oak trees
f, ax = plt.subplots(1, figsize=(15, 15))
gdf[gdf.genus != "Quercus"].plot(ax=ax, markersize=1)
gdf[gdf.genus == "Quercus"].plot(ax=ax, markersize=1)
plt.show()
To my surprise, oak trees are not evenly dispersed across campus. The main quad (centered at -87.6, 41.79) has a high density of oak trees, and there are several walkways and streets with tightly packed groups of oaks.
Future Visualizations¶
A couple ideas I have for visualizations to make in the future are:
- An interactive map with different colored or shaped markers indicating the location of common tree types. The goal would be to demonstrate how different tree types are dispersed across campus.
- A pictograph with icons for different species' leaf shapes showing the breakdown of trees on campus.
- If possible, a graph showing the distribution of tree ages or planting years. I imagine there are formulas for estimating a tree's age from its DBH and species (a rough sketch of one such approach is below).
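One rough approach is the growth-factor method, in which a tree's age is estimated as its DBH multiplied by a species-specific growth factor. The sketch below is purely illustrative: the growth factor value is a placeholder assumption, not a vetted figure for any particular species.
def estimate_age(dbh_inches, growth_factor):
    # Rough estimate: age (years) ≈ DBH (inches) * species-specific growth factor.
    # The factor should be looked up per species; 4.0 below is only a placeholder.
    return dbh_inches * growth_factor

estimate_age(15, growth_factor=4.0)  # ~60 years with the placeholder factor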