This project aimed to visualize and find interesting trends in the historical instance and geographical distribution of Influenza. tldr; a pretty graph:

png

Analyzing the Influenza Dataset

The influenza dataset was downloaded here from the Tycho Project along with the relevant metadata in data/influenza/US/*.

import sys
import pandas

file_loc = 'data/influenza/US/US.6142004.csv'
df = pandas.read_csv(
    file_loc, 
    low_memory=False, 
    parse_dates=['PeriodStartDate', 'PeriodEndDate'], 
    date_parser=lambda x: pandas.datetime.strptime(x, '%Y-%m-%d')
)

We notice a few things about the dataset:

There is only one PathogenName in the dataset (‘unidentified influenza virus’)
There is only one AgeRange in the dataset (‘0-130’)
There are no DiagnosisCertainty values (‘nan’, (‘NA)
The CountValue column is sometimes a “Cummulative” value
- See the dataset readme under ‘Separate cumulative from non-cumulative time interval series’
- The PartOfCumulativeCountSeries is not provided in this dataset

pandas.unique(df[['PathogenName', 'AgeRange', 'DiagnosisCertainty']].values.ravel('K'))

array(['unidentified influenza virus', '0-130', nan], dtype=object)

Sanitizing the Dataset

We need to sanitize the dataset to show instance counts in non-overlapping date ranges to make aggregate analysis more simple. To do so, we first see if rows in the dataset actually do contain overlapping date intervals.

Verifying Overlaps Exist

from src.lib.parse import overlaps, combine_rows

# these are the columns used to determine 
#  if two rows should be checked for overlapping date ranges
# *in order* (least to most specific)
match_on = [
    'Admin1ISO',
    'Admin2Name',
    'CityName',
    'Fatalities'
]

sort_by = ['PeriodStartDate']
found = overlaps(df, match_on, sort_by)
dupes = pandas.DataFrame(f[1][1] for f in found)
dupes.sort_values(by=['CountValue','CityName'], ascending=False)[[
    'CityName', 
    'PeriodStartDate', 
    'PeriodEndDate', 
    'CountValue'
]]

	CityName	PeriodStartDate	PeriodEndDate	CountValue
169694	LOS ANGELES	1937-05-23	1937-05-29	15
45322	CLEVELAND	1923-11-05	1923-11-11	8
99914	NEWARK	1923-11-05	1923-11-11	7
81621	SAVANNAH	1937-05-23	1937-05-29	4
55066	CHARLESTON	1937-05-23	1937-05-29	4
38670	PHILADELPHIA	1923-11-05	1923-11-11	3
178278	PORTLAND	1937-05-23	1937-05-29	2
70307	MINNEAPOLIS	1937-05-23	1937-05-29	2
175022	DALLAS	1937-05-23	1937-05-29	2
152524	BIRMINGHAM	1937-05-23	1937-05-29	2
53820	WASHINGTON	1937-05-23	1937-05-29	1
189084	TOPEKA	1937-05-23	1937-05-29	1
179730	SAN ANTONIO	1937-05-23	1937-05-29	1
92656	NASHVILLE	1937-05-23	1937-05-29	1
217007	MIAMI	1937-05-23	1937-05-29	1
170996	LOS ANGELES	1937-05-23	1937-05-29	1
162759	LITTLE ROCK	1937-05-23	1937-05-29	1
153725	BIRMINGHAM	1937-05-23	1937-05-29	1
158250	ATLANTA	1937-05-23	1937-05-29	1
141785	WINSTON-SALEM	1937-05-23	1937-05-29	0
106457	WILMINGTON	1937-05-23	1937-05-29	0
159797	WILMINGTON	1937-05-23	1937-05-29	0
166853	WICHITA	1937-05-23	1937-05-29	0
59741	WHEELING	1937-05-23	1937-05-29	0
108348	TAMPA	1937-05-23	1937-05-29	0
151371	TACOMA	1937-05-23	1937-05-29	0
181276	ST. JOSEPH	1937-05-23	1937-05-29	0
132556	SPOKANE	1937-05-23	1937-05-29	0
109846	SHREVEPORT	1937-05-23	1937-05-29	0
131340	SEATTLE	1937-05-23	1937-05-29	0
...	...	...	...	...
99128	MOBILE	1937-05-23	1937-05-29	0
172800	MISSOULA	1937-05-23	1937-05-29	0
216533	MINOT	1937-05-23	1937-05-29	0
217690	MIAMI	1937-05-23	1937-05-29	0
113133	MEMPHIS	1937-05-23	1937-05-29	0
123539	LYNCHBURG	1937-05-23	1937-05-29	0
89258	LOUISVILLE	1937-05-23	1937-05-29	0
187396	LEXINGTON	1937-05-23	1937-05-29	0
183608	LAWRENCE	1937-05-23	1937-05-29	0
204925	LAKE CHARLES	1937-05-23	1937-05-29	0
122291	KNOXVILLE	1937-05-23	1937-05-29	0
72164	KANSAS CITY	1937-05-23	1937-05-29	0
130258	HOUSTON	1937-05-23	1937-05-29	0
144754	HELENA	1937-05-23	1937-05-29	0
168618	GREENVILLE	1937-05-23	1937-05-29	0
182846	GREAT FALLS	1937-05-23	1937-05-29	0
94199	GALVESTON	1937-05-23	1937-05-29	0
135563	FREDERICK	1937-05-23	1937-05-29	0
127075	FORT WORTH	1937-05-23	1937-05-29	0
206745	FLORENCE	1937-05-23	1937-05-29	0
193250	FARGO	1937-05-23	1937-05-29	0
75637	DENVER	1937-05-23	1937-05-29	0
114894	CUMBERLAND	1937-05-23	1937-05-29	0
184782	COVINGTON	1937-05-23	1937-05-29	0
194271	COLORADO SPRINGS	1937-05-23	1937-05-29	0
56246	CHARLESTON	1937-05-23	1937-05-29	0
200854	CHARLESTON	1937-05-23	1937-05-29	0
119862	BRUNSWICK	1937-05-23	1937-05-29	0
204261	BILLINGS	1937-05-23	1937-05-29	0
156074	ALBUQUERQUE	1937-05-23	1937-05-29	0

72 rows × 4 columns

Overlapping Data Example

Picking rows in the overlapping dateranges as an example:

ex_1 = (df.CityName == 'NEWARK') &(df.PeriodStartDate >= '1923-11-01') &(df.PeriodStartDate <= '1923-11-15')
ex_2 = (df.CityName == 'LOS ANGELES') &(df.PeriodStartDate >= '1937-05-15') &(df.PeriodStartDate <= '1937-05-31')
ex_3 = (df.CityName == 'CLEVELAND') &(df.PeriodStartDate >= '1923-11-01') &(df.PeriodStartDate <= '1923-11-30')
df.loc[ex_1 | ex_2 | ex_3].sort_values(by=[
    'CityName',
    'Fatalities', 
    'PeriodStartDate'
])[['CityName', 'PeriodStartDate', 'PeriodEndDate', 'CountValue']]

	CityName	PeriodStartDate	PeriodEndDate	CountValue
45321	CLEVELAND	1923-11-04	1923-11-10	3
45322	CLEVELAND	1923-11-05	1923-11-11	8
45323	CLEVELAND	1923-11-18	1923-11-24	6
45324	CLEVELAND	1923-11-25	1923-12-01	4
46535	CLEVELAND	1923-11-04	1923-11-10	1
46536	CLEVELAND	1923-11-11	1923-11-17	2
169693	LOS ANGELES	1937-05-17	1937-05-23	8
169694	LOS ANGELES	1937-05-23	1937-05-29	15
169695	LOS ANGELES	1937-05-30	1937-06-05	5
170995	LOS ANGELES	1937-05-17	1937-05-23	0
170996	LOS ANGELES	1937-05-23	1937-05-29	1
170997	LOS ANGELES	1937-05-30	1937-06-05	2
99913	NEWARK	1923-11-04	1923-11-10	11
99914	NEWARK	1923-11-05	1923-11-11	7

Sanitizing Overlapping Rows

As described in the dataset readme:

Cumulative case count time series consist of overlapping case count intervals starting on the same date, but ending on different dates.
- For example, each interval in a cumulative count time series can start on January 1st, but end on January 7th, 14th, 21st, etc.
- It is common practice among public health agencies to report cases for cumulative time intervals.

We find:

Between "1923-11-04","1923-11-10" there were 11 reported non-fatal counts of Influenza in Birmingham.
Between "1923-11-05","1923-11-11" there were 7 reported non-fatal counts of Influenza in Birmingham.

We can only be sure that between 1923-11-04 - 1923-11-11 there were between [11,18] (inclusive) counts of influenza recorded in Newark.

More generally:

Given possibly overlapping date ranges [d1, d2], [d3, d4], ... and instance counts c1, c2, c3, ...
We can be sure in the date range
- [min(d1,d2,d3,...), max(d1,d2,d3,...)]
there are between
- [max(c1,c2,c3...), sum(c1,c2,c3,c4...)] disease instance counts.*
We decide to pick the worst case (sum ) of the range - though you could also pick the average or minimum.

to_drop = [combine_rows(prev, cur, df) for prev, cur in overlaps(df, match_on, sort_by)]
df.drop(to_drop, inplace=True)

Test and Verify Sanitizing Worked

found = overlaps(df, match_on, sort_by)
try:
    print(next(found))
except StopIteration:
    print("No overlaps found!")

No overlaps found!

Visualizing The Data

Fatal Instances of Influenza

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

sns.set(rc={'figure.figsize':(20,15)})

ax = sns.scatterplot(
    x='PeriodStartDate', 
    y='CountValue', 
    hue='Admin1ISO', 
    data=df[df.Fatalities == 1]
)

ax.set_yscale('symlog')
ax.set(xlim=(min(df.PeriodStartDate), max(df.PeriodStartDate) + pandas.DateOffset(years=1)))

plt.show()

png

Non-Fatal Instances of Influenza

%matplotlib inline

sns.set(rc={'figure.figsize':(25,15)})

ax = sns.scatterplot(
    x='PeriodStartDate', 
    y='CountValue', 
    hue='Admin1ISO', 
    data=df[df.Fatalities == 0]
)

ax.set_yscale('symlog')
ax.set(xlim=(min(df.PeriodStartDate), max(df.PeriodStartDate) + pandas.DateOffset(years=1)))

plt.show()

png

Combined Fatal and Non-Fatal Instances of Influenza

%matplotlib inline

sns.set(rc={'figure.figsize':(20,15)})

ax = sns.scatterplot(
    x='PeriodStartDate', 
    y='CountValue', 
    hue='Admin1ISO', 
    data=df
)

ax.set_yscale('symlog')
ax.set(xlim=(min(df.PeriodStartDate), max(df.PeriodStartDate) + pandas.DateOffset(years=1)))

plt.show()

png

Preserving Datapoint Magnitudes

The above scatter plots mask an important piece of information, namely two states with similiar count values will overlap and mask the magnitudes. We can overcome this by using a heatmap.

Fatal Instances of Influenza Heatmap

We start by pivoting on the state and year, and aggregating on the fatal CountValue’s. We will also order the states from east to west geographically.

states_east_to_west = [
    'US-ME', 'US-NH', 'US-VT', 'US-MA', 'US-RI', 'US-CT', 'US-NY', 'US-NJ', 'US-PA', 'US-DE', 'US-MD', 'US-DC',
    'US-WV', 'US-VA', 'US-NC', 'US-SC', 'US-GA', 'US-FL',
    'US-MI', 'US-IN', 'US-OH', 'US-KY', 'US-TN', 'US-AL', 'US-MS',
    'US-WI', 'US-IL', 
    'US-MN', 'US-IA', 'US-MO', 'US-AR', 'US-LA',
    'US-ND', 'US-SD', 'US-NE', 'US-KS', 'US-OK', 'US-TX',
    'US-MT', 'US-WY', 'US-CO', 'US-NM',
    'US-ID', 'US-NV', 'US-UT', 'US-AZ',
    'US-WA', 'US-OR', 'US-CA', 
    'US-HI', 'US-AK'


]

df_pv = df[df.Fatalities == 1].groupby(['Admin1ISO', df['PeriodStartDate'].dt.year])['CountValue'].sum()
df_pv = df_pv.reset_index().pivot('Admin1ISO','PeriodStartDate', 'CountValue').reindex(states_east_to_west, axis=0).fillna(0)

sns.heatmap(
    df_pv, 
    cmap=sns.diverging_palette(200, 275, s=99, l=70, as_cmap=True),
    center=800
)

<matplotlib.axes._subplots.AxesSubplot at 0x105c8fb00>

png

Non-Fatal Instances of Influenza Heatmap

df_pv = df[df.Fatalities == 0].groupby(['Admin1ISO', df['PeriodStartDate'].dt.year])['CountValue'].sum()
df_pv = df_pv.reset_index().pivot('Admin1ISO','PeriodStartDate', 'CountValue').reindex(states_east_to_west, axis=0).fillna(0)

sns.heatmap(
    df_pv, 
    cmap=sns.diverging_palette(200, 275, s=99, l=70, as_cmap=True),
    center=60000
)

<matplotlib.axes._subplots.AxesSubplot at 0x1174f7668>

png

Combined Fatal and Non-Fatal Instances of Influenza

df_pv = df[df.Fatalities == 0].groupby(['Admin1ISO', df['PeriodStartDate'].dt.year])['CountValue'].sum()
df_pv = df_pv.reset_index().pivot('Admin1ISO','PeriodStartDate', 'CountValue').reindex(states_east_to_west, axis=0).fillna(0)

sns.heatmap(
    df_pv, 
    cmap=sns.diverging_palette(200, 275, s=99, l=70, as_cmap=True),
    center=60000
)

<matplotlib.axes._subplots.AxesSubplot at 0x11152f6a0>

png

Code And Other Snippets

# %load src/lib/parse.py
def between(s, e, d):
    return s <= d <= e


def is_row_between(prev, cur, match_on):
    """ 
        returns if the date range of prev 
        has any overlap 
        with the date range cur
        and if they have matching match_on values
    """
    (_, r1), (_, r2) = prev, cur
    d1 = (r1.PeriodStartDate, r1.PeriodEndDate, r2.PeriodStartDate)
    d2 = (r1.PeriodStartDate, r1.PeriodEndDate, r2.PeriodEndDate)
    return all(r1[m] == r2[m] for m in match_on) and (between(*d1) or between(*d2))


def iter_with_prev(df):
    """ 
        return an iterator that gives back a tuple on next
        ((i, previous_row), (i+1, previous_row))
        first value is ((0, df[0]), (1, df[1]))
    """
    try:
        rows = df.iterrows()
        prev, curr = next(rows), next(rows)
        yield prev, curr
        prev = curr
        for row in rows:
            yield prev, row
            prev = row
    except StopIteration: # could yield back ((1, df[1]), None) ?
        pass


def combine_rows(prev, cur, df): #, r2_i, df
    """
        combines the two rows into r2, drops r_1
            date range is smallest date range that includes r1 & r2
            count is sum(counts) (the worst caste)
                this could be avg, min, max etc.
        returns the index of the previous row
    """
    (r1_i, r1), (r2_i, r2) = prev, cur
    dates = (r1.PeriodStartDate, r1.PeriodEndDate, r2.PeriodStartDate, r2.PeriodEndDate)
    counts = (r1.CountValue, r2.CountValue)
    df.at[r2_i, 'PeriodStartDate'] = min(*dates)
    df.at[r2_i, 'PeriodEndDate'] = max(*dates)
    df.at[r2_i, 'CountValue'] = sum(counts)
    return r1_i


def overlaps(df, match_on, sort_by):
    """
        yields the previous and current row
        given rows to match on, sort by and a data frame
    """
    for prev, curr in iter_with_prev(df.sort_values(by=match_on + sort_by)):
        if is_row_between(prev, curr, match_on):
            yield prev, curr

df_pv2 = df_pv.div(df_pv.max(axis=1), axis=0)

sns.heatmap(
    df_pv2,
    cmap = 'Spectral_r', 
    linewidths=1,
    square=True,
    cbar=False
)

<matplotlib.axes._subplots.AxesSubplot at 0x133215e48>

png

df_pv2 = df_pv.div(df_pv.max(axis=0), axis=1)

sns.heatmap(
    df_pv2,
    cmap = 'Spectral_r', 
    linewidths=1,
    square=True,
    cbar=False
)

<matplotlib.axes._subplots.AxesSubplot at 0x11473c198>

png