While looking for some details on t-tests, I ran across this post from The Minitab Blog, which investigated Punxsutawney Phil’s powers of prognostication. The authors of that post tested whether there was a statistically significant difference in the average monthly temperature in the months immediately following Phil’s predictions: years when he saw his shadow vs. years when he didn’t. Detecting an increase in mean temperature in years when Phil doesn’t see his shadow would support his prognostication claims.

The original blog post used average temperature data for Pennsylvania over a 30-year period, but it was a bit opaque about the data’s source and didn’t label the years. As such, I thought I’d perform a similar analysis with a clearer data lineage and a longer history. Since I don’t use Minitab, I performed my analysis in Python.

For this analysis, I obtained monthly average temperature data from NOAA’s Global Summary of the Month for Jefferson County PA, which is where the borough of Punxsutawney is located. Phil’s predictions were sourced from the Wikipedia entry for Punxsutawney Phil; the data provided there were originally sourced from Stormfax.com. The temperature data and Wikipedia page, both retrieved in Apr-2020, can be found in this repository.

Here, we set up some package imports that we will use throughout.

import os
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
from IPython.display import display, HTML

data_path = "./data"
temperature_data_path = os.path.join(data_path, "TempJeffersonCountyPA")

Punxsutawney Phil Data

Because I didn’t feel like copying down the data about Phil’s predictions by hand, we will use Beautiful Soup to extract them from the Wikipedia summary. To begin, we download the Wikipedia entry if we don’t already have it. The copy used here, downloaded on 04/17/2020, is included in the repository for reproducibility, since the page may have changed since then.

ghog_pred_html = os.path.join(data_path, "wiki_phil.html")

if not os.path.exists(ghog_pred_html):
    # download the entry and cache it locally
    url = "https://en.wikipedia.org/wiki/Punxsutawney_Phil"
    req_text = requests.get(url).text

    with open(ghog_pred_html, 'w') as f:
        f.write(req_text)

else:
    # read the cached copy
    with open(ghog_pred_html, 'r') as f:
        req_text = f.read()

Next we parse the Wikipedia entry. Here, we build the soup and find the table containing Phil’s predictions.

# build the soup
soup = BeautifulSoup(req_text, 'html.parser')

# find the only wikitable, which has the prognostications
table = soup.find('table', {'class' : 'wikitable'})

# we can throw away the table caption, since it just links
# back to the original page
table.find("caption").decompose()

Next we’ll render the table that we extracted. The table doesn’t come with its legend, so we’ll also render it. Some minor modifications of the original HTML were necessary in order to make this look nice in my notebook.

# find the legend
legend = soup.find('div', attrs={'style' : 'float: left;'})
legend = legend.find("tbody")

# the legend was aligning right, so force it to align left
for div in legend.find_all("div", attrs={'class' : 'legend'}):
    div['style'] = 'text-align:left;'
    
# Render the html
display(HTML(table.prettify()))
display(HTML(legend.prettify()))
(Rendered Wikipedia table: Phil’s prediction for each year, 1887–2020, color-coded according to the legend below.)

"Long winter" (103)
"Early spring" (20)
"War clouds have blacked out parts of the shadow." (1)
No appearance (1)
No record (9)

Next we parse the table, using each cell’s style to determine whether Phil saw his shadow in that year. We will only retain the “Long winter” (shadow) and “Early spring” (no shadow) records, and treat all other records as missing.

# get all of the entries in the table, cell contents are years 
# and style denotes if a shadow was seen or not
tds = table.find_all('td', style=True)
years  = [int(td.text) for td in tds]
styles = [td['style'] for td in tds]

# use the color coding to identify whether a shadow was seen
shadow = []
for s in styles:
    if "azure" in s:
        shadow.append(True)   # azure = SHADOW
    elif "gold" in s:
        shadow.append(False)  # gold = NO SHADOW
    else:
        shadow.append(np.nan) # sometimes no records, once war clouds
        
# Create a dataframe of the records
df_phil = pd.DataFrame({"year" : years, "shadow" : shadow})
df_phil = df_phil[~df_phil["shadow"].isnull()]
df_phil.head()
year shadow
0 1887 True
1 1888 True
3 1890 False
11 1898 True
13 1900 True
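
As a quick sanity check, the class counts in our parsed dataframe should match the legend above: 103 “long winter” (shadow) years and 20 “early spring” (no shadow) years.

# sanity check: should match the legend (103 shadow, 20 no shadow)
df_phil['shadow'].value_counts()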

Temperature Data

NOAA’s site only allows you to export a certain number of records at a time, so I had to split my export into two separate requests. Weather data are (partially) available for Jefferson County from 1892 onward, so I requested all records from June-1892 through March-2020. The dataset I received contained monthly temperature records from multiple stations in Jefferson County. For this analysis we will primarily be interested in the average temperature, “TAVG.” We’ll drop the other columns and rename the ones we keep.

temperature_data_files = sorted(os.listdir(temperature_data_path))
temperature_data_file_paths = [os.path.join(temperature_data_path, f) for f in temperature_data_files]
df_temperature = pd.concat((pd.read_csv(f) for f in temperature_data_file_paths))

print("Data file names")
print("---------------")
for f in temperature_data_files: 
    print(f)

# Build the data frame. Only retain the station, date, and average temperature columns
df_temperature.columns = [c.lower() for c in df_temperature.columns]
df_temperature['date'] = pd.to_datetime(df_temperature['date'])
df_temperature = df_temperature[['station', 'date', 'tavg']]
df_temperature = df_temperature.rename(columns = {'tavg' : 't_avg'})
df_temperature = df_temperature.dropna()
df_temperature.head()
Data file names
---------------
jefferson_county_pa_temp_189206_195005.csv
jefferson_county_pa_temp_195006_202003.csv
station date t_avg
0 USW00014741 1926-01-01 22.0
1 USW00014741 1926-02-01 23.6
2 USW00014741 1926-03-01 28.3
3 USW00014741 1926-04-01 40.0
4 USW00014741 1926-05-01 53.9
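
Multiple stations report in any given month, and the coverage varies over the record. Before pooling, it’s worth a quick look at how many stations contribute to each monthly value:

# number of stations reporting in each month
stations_per_month = df_temperature.groupby('date')['station'].nunique()
print(stations_per_month.describe())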

Since we’re not especially interested in the individual stations’ average monthly temperature, we’ll average across stations to get a representative value for Jefferson County. Some additional columns are added for convenience later on.

df_monthly_temperature = df_temperature.groupby('date')[['t_avg']].mean().reset_index()
df_monthly_temperature['month'] = pd.DatetimeIndex(df_monthly_temperature['date']).month
df_monthly_temperature['year']  = pd.DatetimeIndex(df_monthly_temperature['date']).year
df_monthly_temperature = df_monthly_temperature[['date', 'year', 'month', 't_avg']]
df_monthly_temperature.head()
date year month t_avg
0 1896-09-01 1896 9 70.3
1 1896-10-01 1896 10 52.1
2 1896-11-01 1896 11 47.1
3 1897-01-01 1897 1 32.8
4 1897-02-01 1897 2 43.6

Here we provide a scatter plot of the average temperature for each station and compare it to the mean across stations, which is plotted as a solid line.

fig, ax = plt.subplots(figsize = (10,5))
df_monthly_temperature.plot(x = 'date', y = 't_avg', ax = ax, c='k', alpha=0.25)
sns.scatterplot(x='date', y='t_avg', hue='station', data=df_temperature, ax=ax)
plt.show()

(Figure: monthly average temperature for each station, with the cross-station mean plotted as a solid line.)

Next we create a dataframe of the average temperature for February and March of each year. Some years were missing records for either February or March, so we’ll exclude those years from our analysis. The mean for the two months is then computed as the simple average of each month’s mean. In doing this, we are ignoring the fact that February has fewer days than March. A more careful analysis would not neglect this fact, but this should be good enough for our purposes. 🙂

df_feb_mar_avg = (df_monthly_temperature[(df_monthly_temperature['month'] == 2) | (df_monthly_temperature['month'] == 3)]
               .groupby('year')[['t_avg']]
               .agg(['count', 'mean']))
df_feb_mar_avg.columns = ['_'.join(col).strip() for col in df_feb_mar_avg.columns.values]
df_feb_mar_avg = df_feb_mar_avg.reset_index()
df_feb_mar_avg = df_feb_mar_avg[df_feb_mar_avg['t_avg_count'] == 2]
df_feb_mar_avg = df_feb_mar_avg[['year', 't_avg_mean']]
df_feb_mar_avg = df_feb_mar_avg.rename(columns = {'t_avg_mean' : 't_avg'})
df_feb_mar_avg.head()
year t_avg
0 1897 47.05
3 1900 29.10
4 1901 24.45
6 1911 32.95
7 1912 24.45
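
For reference, a day-weighted version of the February–March mean would look something like the sketch below. It weights each month by its number of days (pandas’ days_in_month accessor handles leap years); since February’s weight differs from one half by only a couple of percent, the effect on the averages is small.

# day-weighted Feb-Mar mean: weight each month's mean by its number of days
df_fm = df_monthly_temperature[df_monthly_temperature['month'].isin([2, 3])].copy()
df_fm['days'] = df_fm['date'].dt.days_in_month  # 28 or 29 for Feb, 31 for Mar

# keep years with both months, then take the weighted average
df_weighted = (df_fm.groupby('year')
                    .filter(lambda g: len(g) == 2)
                    .groupby('year')
                    .apply(lambda g: np.average(g['t_avg'], weights=g['days'])))
df_weighted.head()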

2-Sample Mean t-test

Finally we’ve got all of the data that we will need for the t-test. Let’s merge the two dataframes together and throw away any years in which we are missing either a temperature measurement or a prediction from Phil.

df_combined = pd.merge(df_phil, df_feb_mar_avg, on = "year", how = "outer")
df_combined = df_combined.dropna()
df_combined.head()
year shadow t_avg
4 1900 True 29.10
5 1901 True 24.45
15 1911 True 32.95
16 1912 True 24.45
17 1913 True 33.85

Let’s take a quick look at the mean temperature for Feb-March in years when Phil saw his shadow, as compared to the mean in years when he didn’t see his shadow.

df_combined.groupby('shadow').agg({'t_avg' : ['count', 'mean']})
t_avg
count mean
shadow
False 16 30.596875
True 89 30.176966

The means are pretty similar, so I wouldn’t expect our t-test to report a significant difference. Next we’ll convert the temperatures to vectors for the t-test, plot the distribution of the average temperatures in each prediction class, and then run the t-test. We’ll use Welch’s t-test, which does not assume the two samples have the same variance; SciPy can also run the classic test that assumes equal variances.

avg_temp_shadow    = df_combined[df_combined['shadow'] == True ]['t_avg'].to_numpy()
avg_temp_no_shadow = df_combined[df_combined['shadow'] == False]['t_avg'].to_numpy()

# kernel density estimates of the Feb-Mar averages for each class
# (distplot is deprecated in recent seaborn, so we use kdeplot)
sns.kdeplot(avg_temp_shadow,    label="Shadow")
sns.kdeplot(avg_temp_no_shadow, label="No Shadow")
plt.legend()
plt.show()

(Figure: kernel density estimates of the Feb–Mar average temperature in shadow vs. no-shadow years.)

# Welch's t-test: equal_var=False allows unequal variances
ttest_rslt = ttest_ind(avg_temp_shadow, avg_temp_no_shadow,
                       equal_var=False)

print("T-Test Result")
print("-------------")
print("t-value: {}".format(ttest_rslt.statistic))
print("p-value: {}".format(ttest_rslt.pvalue))
T-Test Result
-------------
t-value: -0.4104841358882534
p-value: 0.6859415624804476
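
For the curious, Welch’s t-statistic is just the difference in sample means divided by the combined standard error of the two samples. A minimal reproduction with NumPy:

# Welch's t-statistic by hand: difference in means over the combined
# standard error, using unbiased (ddof=1) sample variances
m_s, m_n = avg_temp_shadow.mean(), avg_temp_no_shadow.mean()
v_s, v_n = avg_temp_shadow.var(ddof=1), avg_temp_no_shadow.var(ddof=1)
n_s, n_n = len(avg_temp_shadow), len(avg_temp_no_shadow)

t_manual = (m_s - m_n) / np.sqrt(v_s / n_s + v_n / n_n)
print("manual t-value: {}".format(t_manual))  # should match ttest_rslt.statistic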

The test shows that we fail to reject the null hypothesis that the two means are equal. As such, we cannot conclude that the means are different, but we also cannot conclude that the means are equal. That being said, the sample means are nearly identical, which reduces my confidence in Phil’s prognostications a little bit. 🙂
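
As an aside, if we wanted positive evidence that the means are practically equal, the usual tool is an equivalence test (two one-sided tests, or TOST) rather than an ordinary t-test. A rough sketch, assuming SciPy >= 1.6 (for the alternative keyword) and an arbitrary 2 degree F equivalence margin:

# TOST sketch: is the difference in means within +/- 2 degrees F?
# (the margin is arbitrary here; pick one that is practically meaningful)
margin = 2.0

# H1: mean difference > -margin
p_lower = ttest_ind(avg_temp_shadow + margin, avg_temp_no_shadow,
                    equal_var=False, alternative='greater').pvalue
# H1: mean difference < +margin
p_upper = ttest_ind(avg_temp_shadow - margin, avg_temp_no_shadow,
                    equal_var=False, alternative='less').pvalue

# equivalence is supported only if both one-sided tests reject
print("TOST p-value: {}".format(max(p_lower, p_upper)))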