How long is a cold spell in Boston?
We’ve had some wacky weather recently. In the space of a week, the temperature went from a high of about 75°F to a low around 15°F. This got me to thinking about what constitutes “normal” weather here in the Boston area, and in particular, how common it is to have a string of consecutive days in which the high temperature stays below freezing. While this was an interesting question in itself, it also seemed like a great opportunity to learn a little about Pandas, the Python data analysis framework.
The first step was finding an appropriate dataset. NOAA provides a daily summaries dataset that includes daily high and low temperature; for Boston, this data extends back to about 1936.
The next step was figuring out how to solve the problem. To be explicit, the question I’m trying to answer is:
For any given winter, what was the longest consecutive string of days in which the temperature stayed below freezing?
There are several parts to this problem.
Reading the data⌗
We can read the data using Pandas’ read_csv method:
df = pandas.read_csv('boston.csv')
This assumes of course that we have previously imported the Pandas library:
import pandas
Now we have a dataframe in df, but it’s using a positional index (i.e., the first item is at index 0, the second at 1, etc.), whereas we want it to use a date-based index. The data has a DATE column that we can turn into an appropriate index like this:
df['DATE'] = pandas.to_datetime(df['DATE'])
df.set_index(df['DATE'], inplace=True)
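As an aside, read_csv can do the date parsing and indexing in a single step via its parse_dates and index_col arguments; the following is an equivalent sketch that I haven’t tested against the actual NOAA file:

# Parse DATE while reading and use it as the index in one shot.
# Note: with index_col='DATE', the date lives only in the index, so later code
# that refers to df['DATE'] would need to use df.index instead.
df = pandas.read_csv('boston.csv', parse_dates=['DATE'], index_col='DATE')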
Which winter?⌗
I need to be able to group the data by “winter”. For example, dates from December 21, 2018 through March 20, 2019 would all be associated with “winter 2018”. It would be easy to group the data by year using Pandas’ groupby method:
df.groupby(df['DATE'].dt.year)...
But what’s the equivalent for grouping by winter? My first attempt was a naive iterative solution:
def get_winter_start(val):
    # Map a date to the year in which that date's winter started
    # (0 for dates that fall outside the winter window).
    if (val.month == 10 and val.day >= 20) or val.month > 10:
        winter = val.year
    elif (val.month == 3 and val.day <= 20) or val.month < 3:
        winter = val.year - 1
    else:
        winter = 0
    return winter

df['winter_start'] = df['DATE'].apply(get_winter_start)
This works, but it’s not particularly graceful and doesn’t take advantage of any of the vector operations supported by Pandas. I eventually came up with a different solution. First, create a boolean series that indicates whether a given date is in winter or not:
df['winter'] = (
    ((df['DATE'].dt.month == 12) & (df['DATE'].dt.day >= 20)) |
    (df['DATE'].dt.month < 3) |
    ((df['DATE'].dt.month == 3) & (df['DATE'].dt.day <= 20))
)
Next, use this boolean series to create a new dataframe that contains only dates in winter. Given this new data, the winter year is the current year for the month of December, or (the current year - 1) for months in January, February, and March:
winter = df[df['winter']].copy()
winter['winter_start'] = (
    winter['DATE'].dt.year - (winter['DATE'].dt.month <= 3))
This seems to do the job. You’ll note that in the above expression I’m subtracting a boolean from an integer, which is in fact perfectly legal; I talk about that in more detail later in this article.
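In the meantime, here’s the idea on two made-up dates: month <= 3 is True (i.e., 1) for January through March and False (i.e., 0) otherwise, so the subtraction pulls early-year dates back into the previous year’s winter:

dates = pandas.to_datetime(pandas.Series(['2018-12-25', '2019-01-15']))
dates.dt.year - (dates.dt.month <= 3)
# 0    2018
# 1    2018
# both dates land in winter 2018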
Finding a sequence of consecutive days: an iterative solution⌗
To find the longest sequence of days below freezing, I again started with an iterative solution:
def max_dbf(val):
    acc = []
    cur = 0
    for i, row in val.iterrows():
        if row['TMAX'] <= 32:
            # another below-freezing day; extend the current run
            cur += 1
        else:
            # the run is broken; record its length and reset
            if cur:
                acc.append(cur)
            cur = 0
    if cur:
        acc.append(cur)
    return max(acc, default=0)  # default=0 guards against a winter with no freezing days
Which I applied using Pandas’ apply method:
res = winter.groupby('winter_start').apply(max_dbf)
This time it’s not just ugly, but it’s also noticeably slow. I started doing some research to figure out how to make it faster.
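“Noticeably slow” is subjective, of course; to put a number on it, a rough timing harness along these lines (reusing the winter dataframe and max_dbf from above) does the job:

import time

start = time.perf_counter()
res = winter.groupby('winter_start').apply(max_dbf)
print(f'iterative version: {time.perf_counter() - start:.2f} seconds')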
Finding a sequence of consecutive days: a Pandas solution⌗
In an answer to this question on Stack Overflow, user DSM suggests that given a series, you can find the longest sequence of consecutive items matching a condition by first creating a boolean series y that is True (or 1) for items that match the condition (and False or 0 otherwise), and then running:
result = y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
Using that suggestion, I rewrote the max_dbf method to look like this:
def max_dbf(val):
    y = val['TMAX'] <= 32
    res = y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
    return max(res)
…and you know what, it works! But what exactly is going on there? There’s a reasonable explanation in the answer, but I’m new enough to Pandas that I wanted to work it out for myself.
Setting the stage⌗
In order to explore the operation of this expression, let’s start with some sample data. This is the value of TMAX for the month of January, 2018:
data = [
['2018-01-01', 13],
['2018-01-02', 19],
['2018-01-03', 29],
['2018-01-04', 30],
['2018-01-05', 24],
['2018-01-06', 12],
['2018-01-07', 17],
['2018-01-08', 35],
['2018-01-09', 43],
['2018-01-10', 36],
['2018-01-11', 51],
['2018-01-12', 60],
['2018-01-13', 61],
['2018-01-14', 23],
['2018-01-15', 21],
['2018-01-16', 33],
['2018-01-17', 34],
['2018-01-18', 32],
['2018-01-19', 34],
['2018-01-20', 47],
['2018-01-21', 49],
['2018-01-22', 39],
['2018-01-23', 55],
['2018-01-24', 42],
['2018-01-25', 30],
['2018-01-26', 34],
['2018-01-27', 53],
['2018-01-28', 52],
['2018-01-29', 43],
['2018-01-30', 31],
['2018-01-31', 30],
]
data_unzipped = list(zip(*data))
df = pandas.DataFrame({'DATE': data_unzipped[0], 'TMAX': data_unzipped[1]})
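As an aside, the zip step isn’t strictly necessary; passing the list of rows directly with explicit column names builds the same frame:

# Equivalent construction without the intermediate unzip step
df = pandas.DataFrame(data, columns=['DATE', 'TMAX'])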
Our initial dataframe looks like this:
 | DATE | TMAX |
---|---|---|
0 | 2018-01-01 | 13 |
1 | 2018-01-02 | 19 |
2 | 2018-01-03 | 29 |
3 | 2018-01-04 | 30 |
4 | 2018-01-05 | 24 |
5 | 2018-01-06 | 12 |
6 | 2018-01-07 | 17 |
7 | 2018-01-08 | 35 |
8 | 2018-01-09 | 43 |
9 | 2018-01-10 | 36 |
10 | 2018-01-11 | 51 |
11 | 2018-01-12 | 60 |
12 | 2018-01-13 | 61 |
13 | 2018-01-14 | 23 |
14 | 2018-01-15 | 21 |
15 | 2018-01-16 | 33 |
16 | 2018-01-17 | 34 |
17 | 2018-01-18 | 32 |
18 | 2018-01-19 | 34 |
19 | 2018-01-20 | 47 |
20 | 2018-01-21 | 49 |
21 | 2018-01-22 | 39 |
22 | 2018-01-23 | 55 |
23 | 2018-01-24 | 42 |
24 | 2018-01-25 | 30 |
25 | 2018-01-26 | 34 |
26 | 2018-01-27 | 53 |
27 | 2018-01-28 | 52 |
28 | 2018-01-29 | 43 |
29 | 2018-01-30 | 31 |
30 | 2018-01-31 | 30 |
Step 1⌗
We first need to create a boolean series indicating whether or not the temperature is below freezing. We’ll put this into the dataframe as a new series named freezing:
df['freezing'] = df['TMAX'] <= 32
Our dataframe now looks like this:
 | DATE | TMAX | freezing |
---|---|---|---|
0 | 2018-01-01 | 13 | True |
1 | 2018-01-02 | 19 | True |
2 | 2018-01-03 | 29 | True |
3 | 2018-01-04 | 30 | True |
4 | 2018-01-05 | 24 | True |
5 | 2018-01-06 | 12 | True |
6 | 2018-01-07 | 17 | True |
7 | 2018-01-08 | 35 | False |
8 | 2018-01-09 | 43 | False |
9 | 2018-01-10 | 36 | False |
10 | 2018-01-11 | 51 | False |
11 | 2018-01-12 | 60 | False |
12 | 2018-01-13 | 61 | False |
13 | 2018-01-14 | 23 | True |
14 | 2018-01-15 | 21 | True |
15 | 2018-01-16 | 33 | False |
16 | 2018-01-17 | 34 | False |
17 | 2018-01-18 | 32 | True |
18 | 2018-01-19 | 34 | False |
19 | 2018-01-20 | 47 | False |
20 | 2018-01-21 | 49 | False |
21 | 2018-01-22 | 39 | False |
22 | 2018-01-23 | 55 | False |
23 | 2018-01-24 | 42 | False |
24 | 2018-01-25 | 30 | True |
25 | 2018-01-26 | 34 | False |
26 | 2018-01-27 | 53 | False |
27 | 2018-01-28 | 52 | False |
28 | 2018-01-29 | 43 | False |
29 | 2018-01-30 | 31 | True |
30 | 2018-01-31 | 30 | True |
Step 2⌗
Now we start looking at the various components in our expression of interest. In this step, we’re looking at the innermost piece of the expression, y != y.shift().
Instead of y, we’re operating on the result of the previous step, df['freezing']. We’ll place the result of this step into a new series named step2 in the dataframe:
df['step2'] = df['freezing'] != df['freezing'].shift()
This gives us the following:
 | DATE | TMAX | freezing | step2 |
---|---|---|---|---|
0 | 2018-01-01 | 13 | True | True |
1 | 2018-01-02 | 19 | True | False |
2 | 2018-01-03 | 29 | True | False |
3 | 2018-01-04 | 30 | True | False |
4 | 2018-01-05 | 24 | True | False |
5 | 2018-01-06 | 12 | True | False |
6 | 2018-01-07 | 17 | True | False |
7 | 2018-01-08 | 35 | False | True |
8 | 2018-01-09 | 43 | False | False |
9 | 2018-01-10 | 36 | False | False |
10 | 2018-01-11 | 51 | False | False |
11 | 2018-01-12 | 60 | False | False |
12 | 2018-01-13 | 61 | False | False |
13 | 2018-01-14 | 23 | True | True |
14 | 2018-01-15 | 21 | True | False |
15 | 2018-01-16 | 33 | False | True |
16 | 2018-01-17 | 34 | False | False |
17 | 2018-01-18 | 32 | True | True |
18 | 2018-01-19 | 34 | False | True |
19 | 2018-01-20 | 47 | False | False |
20 | 2018-01-21 | 49 | False | False |
21 | 2018-01-22 | 39 | False | False |
22 | 2018-01-23 | 55 | False | False |
23 | 2018-01-24 | 42 | False | False |
24 | 2018-01-25 | 30 | True | True |
25 | 2018-01-26 | 34 | False | True |
26 | 2018-01-27 | 53 | False | False |
27 | 2018-01-28 | 52 | False | False |
28 | 2018-01-29 | 43 | False | False |
29 | 2018-01-30 | 31 | True | True |
30 | 2018-01-31 | 30 | True | False |
Looking at the values of step2 in this table, we can see an interesting property: step2 is True only in cases where the value of df['freezing'] changes (and in the very first row, where shift() produces a NaN that never compares equal to anything).
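To see why that works, here’s the same comparison on a small hand-built toy series (the values are made up purely for illustration):

s = pandas.Series([True, True, False, False, True])
s.shift()       # NaN, True, True, False, False -- every value moves down one slot
s != s.shift()  # True, False, True, False, True -- True exactly where a new run starts

# The first element is compared against NaN, which is never equal to anything,
# so the first row always marks the start of a run.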
Step 3⌗
In this step, we apply the cumsum method (“cumulative sum”) to the result of step 2. We store the result in a new series step3 in the dataframe:
df['step3'] = df['step2'].cumsum()
The result looks like this:
 | DATE | TMAX | freezing | step2 | step3 |
---|---|---|---|---|---|
0 | 2018-01-01 | 13 | True | True | 1 |
1 | 2018-01-02 | 19 | True | False | 1 |
2 | 2018-01-03 | 29 | True | False | 1 |
3 | 2018-01-04 | 30 | True | False | 1 |
4 | 2018-01-05 | 24 | True | False | 1 |
5 | 2018-01-06 | 12 | True | False | 1 |
6 | 2018-01-07 | 17 | True | False | 1 |
7 | 2018-01-08 | 35 | False | True | 2 |
8 | 2018-01-09 | 43 | False | False | 2 |
9 | 2018-01-10 | 36 | False | False | 2 |
10 | 2018-01-11 | 51 | False | False | 2 |
11 | 2018-01-12 | 60 | False | False | 2 |
12 | 2018-01-13 | 61 | False | False | 2 |
13 | 2018-01-14 | 23 | True | True | 3 |
14 | 2018-01-15 | 21 | True | False | 3 |
15 | 2018-01-16 | 33 | False | True | 4 |
16 | 2018-01-17 | 34 | False | False | 4 |
17 | 2018-01-18 | 32 | True | True | 5 |
18 | 2018-01-19 | 34 | False | True | 6 |
19 | 2018-01-20 | 47 | False | False | 6 |
20 | 2018-01-21 | 49 | False | False | 6 |
21 | 2018-01-22 | 39 | False | False | 6 |
22 | 2018-01-23 | 55 | False | False | 6 |
23 | 2018-01-24 | 42 | False | False | 6 |
24 | 2018-01-25 | 30 | True | True | 7 |
25 | 2018-01-26 | 34 | False | True | 8 |
26 | 2018-01-27 | 53 | False | False | 8 |
27 | 2018-01-28 | 52 | False | False | 8 |
28 | 2018-01-29 | 43 | False | False | 8 |
29 | 2018-01-30 | 31 | True | True | 9 |
30 | 2018-01-31 | 30 | True | False | 9 |
We’re applying the cumsum method to a boolean series. By doing so, we’re taking advantage of the fact that in Python we can treat boolean values as integers: a True value evaluates to 1, and a False value to 0. What we get with this operation is effectively a “sequence id”: because step2 is only True when the value of freezing changes, the value calculated in this step only increments when we start a new sequence of values for which freezing has the same value.
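Continuing with the same toy series from step 2, the cumulative sum turns those change markers into run ids:

s = pandas.Series([True, True, False, False, True])
(s != s.shift()).cumsum()  # 1, 1, 2, 2, 3 -- one id per consecutive run of equal values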
Step 4⌗
In the previous step, we calculated what I called a “sequence id”. We can take advantage of this to group the data into consecutive stretches for which the temperature was either below freezing or not by using that value as an argument to Pandas’ groupby method, and then applying the cumcount method:
df['step4'] = df['freezing'].groupby(df['step3']).cumcount() + 1
The cumcount method simply numbers the items in each group, starting at 0 (there’s a standalone illustration of cumcount just after the table below). This gives us:
 | DATE | TMAX | freezing | step2 | step3 | step4 |
---|---|---|---|---|---|---|
0 | 2018-01-01 | 13 | True | True | 1 | 1 |
1 | 2018-01-02 | 19 | True | False | 1 | 2 |
2 | 2018-01-03 | 29 | True | False | 1 | 3 |
3 | 2018-01-04 | 30 | True | False | 1 | 4 |
4 | 2018-01-05 | 24 | True | False | 1 | 5 |
5 | 2018-01-06 | 12 | True | False | 1 | 6 |
6 | 2018-01-07 | 17 | True | False | 1 | 7 |
7 | 2018-01-08 | 35 | False | True | 2 | 1 |
8 | 2018-01-09 | 43 | False | False | 2 | 2 |
9 | 2018-01-10 | 36 | False | False | 2 | 3 |
10 | 2018-01-11 | 51 | False | False | 2 | 4 |
11 | 2018-01-12 | 60 | False | False | 2 | 5 |
12 | 2018-01-13 | 61 | False | False | 2 | 6 |
13 | 2018-01-14 | 23 | True | True | 3 | 1 |
14 | 2018-01-15 | 21 | True | False | 3 | 2 |
15 | 2018-01-16 | 33 | False | True | 4 | 1 |
16 | 2018-01-17 | 34 | False | False | 4 | 2 |
17 | 2018-01-18 | 32 | True | True | 5 | 1 |
18 | 2018-01-19 | 34 | False | True | 6 | 1 |
19 | 2018-01-20 | 47 | False | False | 6 | 2 |
20 | 2018-01-21 | 49 | False | False | 6 | 3 |
21 | 2018-01-22 | 39 | False | False | 6 | 4 |
22 | 2018-01-23 | 55 | False | False | 6 | 5 |
23 | 2018-01-24 | 42 | False | False | 6 | 6 |
24 | 2018-01-25 | 30 | True | True | 7 | 1 |
25 | 2018-01-26 | 34 | False | True | 8 | 1 |
26 | 2018-01-27 | 53 | False | False | 8 | 2 |
27 | 2018-01-28 | 52 | False | False | 8 | 3 |
28 | 2018-01-29 | 43 | False | False | 8 | 4 |
29 | 2018-01-30 | 31 | True | True | 9 | 1 |
30 | 2018-01-31 | 30 | True | False | 9 | 2 |
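As promised, here’s cumcount on its own, away from the weather data; the values and grouping key below are made up purely for illustration:

values = pandas.Series([10, 20, 30, 40, 50])
run_id = pandas.Series([1, 1, 1, 2, 2])
values.groupby(run_id).cumcount() + 1  # 1, 2, 3, 1, 2 -- position within each run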
Step 5⌗
Looking at the results of the previous step, we can see that simply asking for df['step4'].max() would give us the longest sequence of days for which the value of freezing remained constant. How do we limit that to only consider sequences in which freezing is True? We again take advantage of the fact that a boolean is just an integer, and we multiply the result of the previous step by the value of the freezing series:
df['step5'] = df['freezing'] * df['step4']
This will zero out all the values from the previous step in which freezing is False, because False * x is the same as 0 * x. This gives us our final result:
 | DATE | TMAX | freezing | step2 | step3 | step4 | step5 |
---|---|---|---|---|---|---|---|
0 | 2018-01-01 | 13 | True | True | 1 | 1 | 1 |
1 | 2018-01-02 | 19 | True | False | 1 | 2 | 2 |
2 | 2018-01-03 | 29 | True | False | 1 | 3 | 3 |
3 | 2018-01-04 | 30 | True | False | 1 | 4 | 4 |
4 | 2018-01-05 | 24 | True | False | 1 | 5 | 5 |
5 | 2018-01-06 | 12 | True | False | 1 | 6 | 6 |
6 | 2018-01-07 | 17 | True | False | 1 | 7 | 7 |
7 | 2018-01-08 | 35 | False | True | 2 | 1 | 0 |
8 | 2018-01-09 | 43 | False | False | 2 | 2 | 0 |
9 | 2018-01-10 | 36 | False | False | 2 | 3 | 0 |
10 | 2018-01-11 | 51 | False | False | 2 | 4 | 0 |
11 | 2018-01-12 | 60 | False | False | 2 | 5 | 0 |
12 | 2018-01-13 | 61 | False | False | 2 | 6 | 0 |
13 | 2018-01-14 | 23 | True | True | 3 | 1 | 1 |
14 | 2018-01-15 | 21 | True | False | 3 | 2 | 2 |
15 | 2018-01-16 | 33 | False | True | 4 | 1 | 0 |
16 | 2018-01-17 | 34 | False | False | 4 | 2 | 0 |
17 | 2018-01-18 | 32 | True | True | 5 | 1 | 1 |
18 | 2018-01-19 | 34 | False | True | 6 | 1 | 0 |
19 | 2018-01-20 | 47 | False | False | 6 | 2 | 0 |
20 | 2018-01-21 | 49 | False | False | 6 | 3 | 0 |
21 | 2018-01-22 | 39 | False | False | 6 | 4 | 0 |
22 | 2018-01-23 | 55 | False | False | 6 | 5 | 0 |
23 | 2018-01-24 | 42 | False | False | 6 | 6 | 0 |
24 | 2018-01-25 | 30 | True | True | 7 | 1 | 1 |
25 | 2018-01-26 | 34 | False | True | 8 | 1 | 0 |
26 | 2018-01-27 | 53 | False | False | 8 | 2 | 0 |
27 | 2018-01-28 | 52 | False | False | 8 | 3 | 0 |
28 | 2018-01-29 | 43 | False | False | 8 | 4 | 0 |
29 | 2018-01-30 | 31 | True | True | 9 | 1 | 1 |
30 | 2018-01-31 | 30 | True | False | 9 | 2 | 2 |
Step 6⌗
Now the answer to our question is as simple as asking for the maximum value from the previous step:
max_consecutive_dbf = df['step5'].max()
And if everything worked as expected, we should find that the longest consecutive sequence of days on which the temperature stayed below freezing was 7 days, from 2018-01-01 through 2018-01-07:
assert max_consecutive_dbf == 7
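For what it’s worth, the whole trick can be packaged as a small helper that takes any boolean series and returns the length of its longest run of True values; this is nothing more than the expression from the Stack Overflow answer with names attached to the intermediate pieces (the assert at the end reuses the January 2018 dataframe we just built):

def longest_run(mask):
    # mask is a boolean Series; return the length of its longest run of True values
    run_id = (mask != mask.shift()).cumsum()        # one id per consecutive run
    run_len = mask.groupby(run_id).cumcount() + 1   # position within each run
    return int((mask * run_len).max())              # zero out the False runs, take the max

assert longest_run(df['freezing']) == 7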
Results⌗
If we look at the results for the past 20 years (charting the longest below-freezing stretch for each winter), we see the following:
For the data used in the above chart, the average longest below-freezing stretch per winter is 6.45 days (the average over the entire dataset is 6.33 days).
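I haven’t reproduced the plotting code here, but a sketch along these lines, using the res series computed earlier, would generate the chart and the averages; treat the matplotlib calls and the tail(20) slice for “the past 20 years” as assumptions about presentation rather than part of the analysis:

import matplotlib.pyplot as plt

res = winter.groupby('winter_start').apply(max_dbf)
last20 = res.tail(20)               # the 20 most recent winters

last20.plot.bar()                   # longest below-freezing stretch per winter
plt.ylabel('consecutive days below freezing')
plt.show()

print(last20.mean())                # average over the past 20 winters
print(res.mean())                   # average over the entire dataset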