Welcome back to GGPLOT and Grammar of Graphics in Python, part 2. If you missed part 1, you can find it here. Today we’re gonna play around with Python’s ggplot and see what it can do. This time I managed to get some more interesting data: data.gov has a CSV file charting deaths from 1999-2015 in all 50 U.S. states, broken down by cause of death.

The first part of our code involves grabbing the data:

#!/usr/bin/python3
import requests
import pandas as pd
import os

response = requests.get("https://data.cdc.gov/api/views/bi63-dtpu/rows.csv") # the csv file is returned by following that link; pasting the url into a web browser returns the same file
tfile = open('tempfile','wb')
tfile.write(response.content) # saves the content as a file
tfile.close()

data = pd.read_csv('tempfile') # pandas reads the csv file into a dataframe object using one of its standard functions
os.remove('tempfile') # if you want to look at the csv file directly, comment out this line
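
As an aside, pandas can read a CSV straight from a URL, so you can skip the requests call and the temporary file entirely if you prefer. A minimal alternative sketch, hitting the same endpoint as above:

# Alternative: let pandas fetch the CSV itself, no temporary file needed
data = pd.read_csv("https://data.cdc.gov/api/views/bi63-dtpu/rows.csv")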

data.columns will show you the columns we have to work with (listed here as ‘Year’, ‘113 Cause Name’, ‘Cause Name’, ‘State’, ‘Deaths’, ‘Age-adjusted Death Rate’). That’s a lot of potential variables to graph, so if we only want to look at a couple at a time, we’ll need to group things together using pandas’ groupby function.
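
If you want to peek at the frame before grouping anything, something along these lines works (the column names in the comment are the ones listed above):

print(data.columns.tolist()) # ['Year', '113 Cause Name', 'Cause Name', 'State', 'Deaths', 'Age-adjusted Death Rate']
print(data.head()) # the first few rows of the raw data

Now for the grouping itself: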

grouped_data = data.groupby(['Cause Name','Year']) # creates a groupby object centered around the two columns listed
total_deaths = grouped_data.sum() # returns a dataframe where the deaths (and death rates) have been added together for all rows sharing the same cause name and year (we're ignoring state for the moment)

total_deaths.reset_index(level=0,inplace=True) # 'Cause Name' and 'Year' have been merged into an index rather than treated as regular columns. That gets in the way of graphing, so these two lines turn the index levels back into regular columns
total_deaths.reset_index(level=0,inplace=True)
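
A note on those two reset_index calls: grouping on two columns gives total_deaths a two-level index, and each reset_index(level=0) peels one level off. An argument-free reset_index() does the same job in one step:

# Equivalent: turn every level of the MultiIndex back into ordinary columns at once
total_deaths = grouped_data.sum().reset_index()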

All right! Let’s graph the stuff!

from ggplot import *

p = ggplot(aes(x='Year',y='Deaths',color='Cause Name'), data=total_deaths)
p + geom_point() # at the interactive prompt, evaluating this expression renders the plot; '+' returns a new plot object and leaves p unchanged

We have a graph! It’s a nice scatter plot of the number of deaths per year, with each color reflecting a different cause. (You will notice that some colors are repeated; ggplot only has a finite number of default colors. You can address this here[1].) Now we can easily join the points with lines by typing

p + geom_point() + geom_line()

But man, it’s automatically scaling based on the largest values, which in this case come from “All Causes.” There are two other distinct trails, but I can’t quite tell which causes they are (I suspect one is cancer). Also notice that the legend gets scrambled when the lines are added; I don’t know why, but re-creating p lets you remove the lines again.

There are a few ways we might address this: one is playing with the scale (possibly logarithmic), another is filtering the data so “All Causes” does not appear, and the last is using the facet_wrap function described in the docs[2] to split the graph by cause. We’ll start with the first.

p + geom_point() + scale_y_log(base=10)

Adding that code likely generated this error for you:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/ggplot/scales/scale_log.py", line 25, in __radd__
    gg = deepcopy(gg)
  File "/usr/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/usr/lib/python3.5/copy.py", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/usr/lib/python3.5/copy.py", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.5/copy.py", line 166, in deepcopy
    y = copier(memo)
  File "/usr/local/lib/python3.5/dist-packages/matplotlib/transforms.py", line 127, in __copy__
    "TransformNode instances can not be copied. "
NotImplementedError: TransformNode instances can not be copied. Consider using frozen() instead.

Going to the file where the error was raised, I saw the same message:

    def __copy__(self, *args):
        raise NotImplementedError(
            "TransformNode instances can not be copied. "
            "Consider using frozen() instead.")
    __deepcopy__ = __copy__

Since it suggested a frozen() function, I went to see if that was there, and indeed it was.
    def frozen(self):
        """
        Returns a frozen copy of this transform node. The frozen copy
        will not update when its children change. Useful for storing
        a previously known state of a transform where
        ``copy.deepcopy()`` might normally be used.
        """
        return self

So I tried rewriting the copy code so it calls frozen() instead of raising an error: comment out the raise and return a frozen copy instead (beware to indent with eight spaces rather than a tab, or the Python interpreter will get mad at you):

    def __copy__(self, *args):
        # Original body commented out:
        # raise NotImplementedError(
        #     "TransformNode instances can not be copied. "
        #     "Consider using frozen() instead.")
        return self.frozen()
    __deepcopy__ = __copy__  # keep this assignment active so deepcopy goes through the new __copy__ too

Re-open the Python interpreter and re-run the plotting code. We finally get that logarithmic scale!!!
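
If you would rather not edit matplotlib’s installed source at all, a runtime monkey-patch along the same lines should also work. This is only a sketch of the same idea, to be run in your session before building the plot:

# Alternative: patch TransformNode at runtime instead of editing the file on disk
import matplotlib.transforms as mtransforms

def _copy_via_frozen(self, *args):
    return self.frozen() # hand back a frozen copy instead of raising NotImplementedError

mtransforms.TransformNode.__copy__ = _copy_via_frozen
mtransforms.TransformNode.__deepcopy__ = _copy_via_frozen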

The log scale actually helped: the points at the bottom have separated enough to be individually distinguishable. The only downside to a logarithmic graph is that changes are easy to underestimate visually; on a base-10 scale a 50% drop (say, from 1,000 to 500 deaths) moves a point only about 0.3 units, because the value stays within the same order of magnitude. Let’s try filtering the data.

p = ggplot(aes(x='Year',y='Deaths',color='Cause Name'), data=total_deaths[total_deaths['Cause Name'] != 'All Causes'])

pandas DataFrames let you filter with a boolean expression inside the brackets, as seen above and in this tutorial[3].
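
The expression inside the brackets is just a boolean Series, so you can build the mask separately if that reads more clearly. A toy illustration with made-up numbers (not the CDC data):

# Toy example of boolean indexing in pandas (made-up numbers)
import pandas as pd
df = pd.DataFrame({'Cause Name': ['Cancer', 'All Causes', 'Stroke'],
                   'Deaths': [500, 2000, 120]})
mask = df['Cause Name'] != 'All Causes' # one True/False per row
print(df[mask]) # keeps only the rows where the mask is True
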
Using p + geom_point() without the logarithmic scaling now yields this graph:

The two next-highest causes now define the scale of the graph, and as before, adding scale_y_log(base=10) helps.
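
Concretely, with the filtered data that is just:

p + geom_point() + scale_y_log(base=10)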

The final method we’ll explore is the “facet_wrap” function. For the moment we’ll continue to exclude “All Causes.”

p + geom_point() + facet_wrap('Cause Name')


Great! Now we can see the graphs independently, but oh noes, they are all still using the same scale as the original graph; facet_wrap shares axis scales across panels by default. Luckily we can fix that with this.

p + geom_point() + facet_wrap('Cause Name', scales = "free")


You know you have a good set of data when it raises interesting questions, such as: why was 2010 such a good year for causes like Diabetes, Heart Disease, Pneumonia, Kidney Disease, Stroke, and “Unintentional Injuries,” and why did they pick up again as 2015 approached? Was it a response to the recession, where people went “I’m too poor to get sick, let’s wait a few years and then have a stroke”? Somehow I doubt it.

At this point I recommend you try grouping by state and filtering to see which states are the most death-ridden for any given year. Try some of the different geometries, such as geom_bar, described in the docs[2], or check out a tutorial[4] on the “grammar of graphics” on which ggplot is based, and have a great week!
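
As a starting point for that exercise, a sketch along these lines should get you going. The column and cause names are the ones we have been using; 2015 is picked arbitrarily, and the State filter is only there in case the file carries a national aggregate row (it is harmless if it does not):

# Sketch: which states had the most deaths in a given year?
totals = data[(data['Cause Name'] == 'All Causes') & (data['Year'] == 2015)]
totals = totals[totals['State'] != 'United States'] # drop a national aggregate row, if one exists
print(totals.sort_values('Deaths', ascending=False)[['State', 'Deaths']].head(10))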

References

  1. Blog post on expanding ggplot’s color palette: https://novyden.blogspot.com/2013/09/how-to-expand-color-palette-with-ggplot.html
  2. ggplot docs for Python: http://ggplot.yhathq.com/how-it-works.html
  3. 10 Minutes to pandas: https://pandas.pydata.org/pandas-docs/stable/10min.html#csv
  4. Grammar of graphics tutorial (ggplot2): https://ramnathv.github.io/pycon2014-r/visualize/ggplot2.html