We’ve had some wacky weather recently. In the space of a week, the temperature went from a high of about 75°F to a low around 15°F. This got me to thinking about what constitutes “normal” weather here in the Boston area, and in particular, how common it is to have a string of consecutive days in which the high temperature stays below freezing. While this was an interesting question in itself, it also seemed like a great opportunity to learn a little about Pandas, the Python data analysis framework.

The first step was finding an appropriate dataset. NOAA provides a daily summaries dataset that includes daily high and low temperature; for Boston, this data extends back to about 1936.

The next step was figuring out how to solve the problem. To be explicit, the question I’m trying to answer is:

For any given winter, what was the longest consecutive string of days in which the high temperature stayed below freezing?

There are several parts to this problem.

Reading the data

We can read the data using Pandas’ read_csv method:

df = pandas.read_csv('boston.csv')

This assumes of course that we have previously imported the Pandas library:

import pandas

Now we have a dataframe in df, but it’s using a positional index (i.e., the first item is at index 0, the second at 1, etc), whereas we want it to use a date-based index. The data has a DATE column that we can turn into an appropriate index like this:

df['DATE'] = pandas.to_datetime(df['DATE'])
df.set_index(df['DATE'], inplace=True)
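To see the difference a date-based index makes, here’s a quick sketch using a tiny made-up frame (the values here are illustrative, not from the NOAA file):

```python
import pandas

# A tiny illustrative frame with the same shape as the NOAA data
df = pandas.DataFrame({'DATE': ['2018-01-01', '2018-01-02'], 'TMAX': [13, 19]})
df['DATE'] = pandas.to_datetime(df['DATE'])
df.set_index(df['DATE'], inplace=True)

# Rows can now be looked up by date rather than by position
row = df.loc['2018-01-02']
```

With a positional index, the same lookup would have required knowing the row number in advance.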

Which winter?

I need to be able to group the data by “winter”. For example, dates from December 21, 2018 through March 20, 2019 would all be associated with “winter 2018”. It would be easy to group the data by year using Pandas’ groupby method:

df.groupby(df['DATE'].dt.year)...

But what’s the equivalent for grouping by winter? My first attempt was a naive iterative solution:

def get_winter_start(val):
    # Dates from December 20 onward belong to the winter of the
    # current year; dates through March 20 belong to the winter
    # of the previous year. Anything else isn't winter.
    if val.month == 12 and val.day >= 20:
        winter = val.year
    elif (val.month == 3 and val.day <= 20) or val.month < 3:
        winter = val.year - 1
    else:
        winter = 0

    return winter

df['winter_start'] = df['DATE'].apply(get_winter_start)

This works, but it’s not particularly graceful, and it doesn’t take advantage of any of the vector operations supported by Pandas. I eventually came up with a different solution. First, create a boolean series that indicates whether a given date is in winter or not:

df['winter'] = (
    ((df['DATE'].dt.month == 12) & (df['DATE'].dt.day >= 20)) |
    (df['DATE'].dt.month < 3) |
    ((df['DATE'].dt.month == 3) & (df['DATE'].dt.day <= 20))
)

Next, use this boolean series to create a new dataframe that contains only dates in winter. Given this new data, the winter year is the current year for dates in December, or (the current year - 1) for dates in January, February, and March:

winter = df[df['winter']].copy()
winter['winter_start'] = (
    winter['DATE'].dt.year - (winter['DATE'].dt.month <= 3))

This seems to do the job. You’ll note that in the above expression I’m subtracting a boolean from an integer; that’s in fact totally legal, and I talk about it in more detail later on in this article.
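To see the boolean-as-integer trick in isolation, here’s a standalone sketch with made-up values:

```python
import pandas

years = pandas.Series([2018, 2019, 2019])
months = pandas.Series([12, 1, 12])

# True behaves as 1 and False as 0, so subtracting the boolean
# series shifts January-March dates back to the previous year
winter_start = years - (months <= 3)
```

Here the January 2019 date ends up assigned to winter 2018, while both December dates keep their own year.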

Finding a sequence of consecutive days: an iterative solution

To find the longest sequence of days below freezing, I again started with an iterative solution:

def max_dbf(val):
    # Accumulate the lengths of all runs of consecutive
    # below-freezing days, then return the longest one.
    acc = []
    cur = 0

    for i, row in val.iterrows():
        if row['TMAX'] <= 32:
            cur += 1
        elif cur:
            acc.append(cur)
            cur = 0
    if cur:
        acc.append(cur)

    # A winter with no freezing days would leave acc empty
    return max(acc) if acc else 0

Which I applied using Pandas’ apply method:

res = winter.groupby('winter_start').apply(max_dbf)

This time the solution is not just ugly, it’s also noticeably slow. I started doing some research to figure out how to make it faster.

Finding a sequence of consecutive days: a Pandas solution

In an answer to this question on Stack Overflow, user DSM suggests that given a series, you can find the longest sequence of consecutive items matching a condition by first creating a boolean series y that is True (or 1) for items that match the condition (and False or 0 otherwise), and then running:

result = y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)

Using that suggestion, I rewrote the max_dbf function to look like this:

def max_dbf(val):
    y = val['TMAX'] <= 32
    res = y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
    return max(res)

…and you know what, it works! But what exactly is going on there? There’s a reasonable explanation in the answer, but I’m new enough to Pandas that I wanted to work it out for myself.

Setting the stage

In order to explore the operation of this expression, let’s start with some sample data. This is the value of TMAX for the month of January 2018:

data = [
  ['2018-01-01', 13],
  ['2018-01-02', 19],
  ['2018-01-03', 29],
  ['2018-01-04', 30],
  ['2018-01-05', 24],
  ['2018-01-06', 12],
  ['2018-01-07', 17],
  ['2018-01-08', 35],
  ['2018-01-09', 43],
  ['2018-01-10', 36],
  ['2018-01-11', 51],
  ['2018-01-12', 60],
  ['2018-01-13', 61],
  ['2018-01-14', 23],
  ['2018-01-15', 21],
  ['2018-01-16', 33],
  ['2018-01-17', 34],
  ['2018-01-18', 32],
  ['2018-01-19', 34],
  ['2018-01-20', 47],
  ['2018-01-21', 49],
  ['2018-01-22', 39],
  ['2018-01-23', 55],
  ['2018-01-24', 42],
  ['2018-01-25', 30],
  ['2018-01-26', 34],
  ['2018-01-27', 53],
  ['2018-01-28', 52],
  ['2018-01-29', 43],
  ['2018-01-30', 31],
  ['2018-01-31', 30],
  ]

data_unzipped = list(zip(*data))
df = pandas.DataFrame({'DATE': data_unzipped[0], 'TMAX': data_unzipped[1]})

Our initial dataframe looks like this:

    DATE        TMAX
0   2018-01-01  13
1   2018-01-02  19
2   2018-01-03  29
3   2018-01-04  30
4   2018-01-05  24
5   2018-01-06  12
6   2018-01-07  17
7   2018-01-08  35
8   2018-01-09  43
9   2018-01-10  36
10  2018-01-11  51
11  2018-01-12  60
12  2018-01-13  61
13  2018-01-14  23
14  2018-01-15  21
15  2018-01-16  33
16  2018-01-17  34
17  2018-01-18  32
18  2018-01-19  34
19  2018-01-20  47
20  2018-01-21  49
21  2018-01-22  39
22  2018-01-23  55
23  2018-01-24  42
24  2018-01-25  30
25  2018-01-26  34
26  2018-01-27  53
27  2018-01-28  52
28  2018-01-29  43
29  2018-01-30  31
30  2018-01-31  30

Step 1

We first need to create a boolean series indicating whether or not the temperature is below freezing. We’ll put this into the dataframe as series freezing:

df['freezing'] = df['TMAX'] <= 32

Our dataframe now looks like this:

    DATE        TMAX  freezing
0   2018-01-01  13    True
1   2018-01-02  19    True
2   2018-01-03  29    True
3   2018-01-04  30    True
4   2018-01-05  24    True
5   2018-01-06  12    True
6   2018-01-07  17    True
7   2018-01-08  35    False
8   2018-01-09  43    False
9   2018-01-10  36    False
10  2018-01-11  51    False
11  2018-01-12  60    False
12  2018-01-13  61    False
13  2018-01-14  23    True
14  2018-01-15  21    True
15  2018-01-16  33    False
16  2018-01-17  34    False
17  2018-01-18  32    True
18  2018-01-19  34    False
19  2018-01-20  47    False
20  2018-01-21  49    False
21  2018-01-22  39    False
22  2018-01-23  55    False
23  2018-01-24  42    False
24  2018-01-25  30    True
25  2018-01-26  34    False
26  2018-01-27  53    False
27  2018-01-28  52    False
28  2018-01-29  43    False
29  2018-01-30  31    True
30  2018-01-31  30    True

Step 2

Now we start looking at the various components in our expression of interest. In this step, we are looking at the innermost comparison, y != y.shift().

Instead of y, we’re operating on the result of the previous step, df['freezing']. We’ll place the result of this step into a new series named step2 in the dataframe:

df['step2'] = df['freezing'] != df['freezing'].shift()

This gives us the following:

    DATE        TMAX  freezing  step2
0   2018-01-01  13    True      True
1   2018-01-02  19    True      False
2   2018-01-03  29    True      False
3   2018-01-04  30    True      False
4   2018-01-05  24    True      False
5   2018-01-06  12    True      False
6   2018-01-07  17    True      False
7   2018-01-08  35    False     True
8   2018-01-09  43    False     False
9   2018-01-10  36    False     False
10  2018-01-11  51    False     False
11  2018-01-12  60    False     False
12  2018-01-13  61    False     False
13  2018-01-14  23    True      True
14  2018-01-15  21    True      False
15  2018-01-16  33    False     True
16  2018-01-17  34    False     False
17  2018-01-18  32    True      True
18  2018-01-19  34    False     True
19  2018-01-20  47    False     False
20  2018-01-21  49    False     False
21  2018-01-22  39    False     False
22  2018-01-23  55    False     False
23  2018-01-24  42    False     False
24  2018-01-25  30    True      True
25  2018-01-26  34    False     True
26  2018-01-27  53    False     False
27  2018-01-28  52    False     False
28  2018-01-29  43    False     False
29  2018-01-30  31    True      True
30  2018-01-31  30    True      False

Looking at the values of step2 in this table, we can see an interesting property: step2 is True only in cases where the value of df['freezing'] changes.
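The change-detection behavior of shift is easier to see on a tiny standalone series (values made up for illustration):

```python
import pandas

y = pandas.Series([True, True, False, False, True])

# shift() moves every value down one position, leaving NaN at the top;
# comparing a series against its shifted self flags exactly the
# positions where the value differs from the one before it
changed = y != y.shift()
```

The first element always compares as changed, since nothing precedes it.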

Step 3

In this step, we apply the cumsum method (“cumulative sum”) to the result of step 2. We store the result in a new series step3 in the dataframe:

df['step3'] = df['step2'].cumsum()

The result looks like this:

    DATE        TMAX  freezing  step2  step3
0   2018-01-01  13    True      True   1
1   2018-01-02  19    True      False  1
2   2018-01-03  29    True      False  1
3   2018-01-04  30    True      False  1
4   2018-01-05  24    True      False  1
5   2018-01-06  12    True      False  1
6   2018-01-07  17    True      False  1
7   2018-01-08  35    False     True   2
8   2018-01-09  43    False     False  2
9   2018-01-10  36    False     False  2
10  2018-01-11  51    False     False  2
11  2018-01-12  60    False     False  2
12  2018-01-13  61    False     False  2
13  2018-01-14  23    True      True   3
14  2018-01-15  21    True      False  3
15  2018-01-16  33    False     True   4
16  2018-01-17  34    False     False  4
17  2018-01-18  32    True      True   5
18  2018-01-19  34    False     True   6
19  2018-01-20  47    False     False  6
20  2018-01-21  49    False     False  6
21  2018-01-22  39    False     False  6
22  2018-01-23  55    False     False  6
23  2018-01-24  42    False     False  6
24  2018-01-25  30    True      True   7
25  2018-01-26  34    False     True   8
26  2018-01-27  53    False     False  8
27  2018-01-28  52    False     False  8
28  2018-01-29  43    False     False  8
29  2018-01-30  31    True      True   9
30  2018-01-31  30    True      False  9

We’re applying the cumsum method to a boolean series. By doing so, we’re taking advantage of the fact that in Python we can treat boolean values as integers: a True value evaluates to 1, and a False value to 0. What we get with this operation is effectively a “sequence id”: because step2 is only True when the value of freezing changes, the value calculated in this step only increments when we start a new sequence of values for which freezing has the same value.
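Here is the same idea on a small standalone series (continuing the illustrative values from earlier):

```python
import pandas

changed = pandas.Series([True, False, True, False, True])

# cumsum() treats True as 1 and False as 0, so the running total
# increments exactly where a new run begins, producing a "sequence id"
seq_id = changed.cumsum()
```

Every element belonging to the same run of values ends up with the same sequence id.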

Step 4

In the previous step, we calculated what I called a “sequence id”. We can take advantage of this to group the data into consecutive stretches for which the temperature was either below freezing or not by using the value as an argument to Pandas’ groupby method, and then applying the cumcount method:

df['step4'] = df['freezing'].groupby(df['step3']).cumcount() + 1

The cumcount method simply numbers the items in each group, starting at 0. This gives us:

    DATE        TMAX  freezing  step2  step3  step4
0   2018-01-01  13    True      True   1      1
1   2018-01-02  19    True      False  1      2
2   2018-01-03  29    True      False  1      3
3   2018-01-04  30    True      False  1      4
4   2018-01-05  24    True      False  1      5
5   2018-01-06  12    True      False  1      6
6   2018-01-07  17    True      False  1      7
7   2018-01-08  35    False     True   2      1
8   2018-01-09  43    False     False  2      2
9   2018-01-10  36    False     False  2      3
10  2018-01-11  51    False     False  2      4
11  2018-01-12  60    False     False  2      5
12  2018-01-13  61    False     False  2      6
13  2018-01-14  23    True      True   3      1
14  2018-01-15  21    True      False  3      2
15  2018-01-16  33    False     True   4      1
16  2018-01-17  34    False     False  4      2
17  2018-01-18  32    True      True   5      1
18  2018-01-19  34    False     True   6      1
19  2018-01-20  47    False     False  6      2
20  2018-01-21  49    False     False  6      3
21  2018-01-22  39    False     False  6      4
22  2018-01-23  55    False     False  6      5
23  2018-01-24  42    False     False  6      6
24  2018-01-25  30    True      True   7      1
25  2018-01-26  34    False     True   8      1
26  2018-01-27  53    False     False  8      2
27  2018-01-28  52    False     False  8      3
28  2018-01-29  43    False     False  8      4
29  2018-01-30  31    True      True   9      1
30  2018-01-31  30    True      False  9      2
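The cumcount behavior is easy to verify in isolation on a short made-up series:

```python
import pandas

s = pandas.Series([True, True, False, True])
seq_id = pandas.Series([1, 1, 2, 3])

# Grouping by the sequence id and applying cumcount() numbers the
# members of each run starting from 0; adding 1 gives each element
# its 1-based position within its run
run_pos = s.groupby(seq_id).cumcount() + 1
```

The two-element first run counts up to 2, and each new run restarts at 1.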

Step 5

Looking at the results of the previous step, we can see that simply asking for df['step4'].max() would give us the longest sequence of days for which the value of freezing remained constant. How do we limit that to only consider sequences in which freezing is True? We again take advantage of the fact that a boolean is just an integer, and multiply the result of the previous step by the value of the freezing series:

df['step5'] = df['freezing'] * df['step4']

This will zero out all the values from the previous step in which freezing is False, because False * x is the same as 0 * x. This gives us our final result:

    DATE        TMAX  freezing  step2  step3  step4  step5
0   2018-01-01  13    True      True   1      1      1
1   2018-01-02  19    True      False  1      2      2
2   2018-01-03  29    True      False  1      3      3
3   2018-01-04  30    True      False  1      4      4
4   2018-01-05  24    True      False  1      5      5
5   2018-01-06  12    True      False  1      6      6
6   2018-01-07  17    True      False  1      7      7
7   2018-01-08  35    False     True   2      1      0
8   2018-01-09  43    False     False  2      2      0
9   2018-01-10  36    False     False  2      3      0
10  2018-01-11  51    False     False  2      4      0
11  2018-01-12  60    False     False  2      5      0
12  2018-01-13  61    False     False  2      6      0
13  2018-01-14  23    True      True   3      1      1
14  2018-01-15  21    True      False  3      2      2
15  2018-01-16  33    False     True   4      1      0
16  2018-01-17  34    False     False  4      2      0
17  2018-01-18  32    True      True   5      1      1
18  2018-01-19  34    False     True   6      1      0
19  2018-01-20  47    False     False  6      2      0
20  2018-01-21  49    False     False  6      3      0
21  2018-01-22  39    False     False  6      4      0
22  2018-01-23  55    False     False  6      5      0
23  2018-01-24  42    False     False  6      6      0
24  2018-01-25  30    True      True   7      1      1
25  2018-01-26  34    False     True   8      1      0
26  2018-01-27  53    False     False  8      2      0
27  2018-01-28  52    False     False  8      3      0
28  2018-01-29  43    False     False  8      4      0
29  2018-01-30  31    True      True   9      1      1
30  2018-01-31  30    True      False  9      2      2
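The zeroing-out effect of the multiplication can be seen on a minimal example:

```python
import pandas

freezing = pandas.Series([True, True, False])
run_pos = pandas.Series([1, 2, 1])

# Multiplying by the boolean series zeroes out positions where
# freezing is False, because False * x is the same as 0 * x
result = freezing * run_pos
```

Only positions belonging to below-freezing runs keep their counts.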

Step 6

Now the answer to our question is as simple as asking for the maximum value from the previous step:

max_consecutive_dbf = df['step5'].max()

And if everything worked as expected, we should find that the longest consecutive sequence of days on which the temperature stayed below freezing was 7 days, from 2018-01-01 through 2018-01-07:

assert max_consecutive_dbf == 7
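Putting all the steps back together, the whole calculation collapses into one function (a sketch; max_days_below_freezing is a name I’m introducing here), checked against the first ten TMAX values from the January 2018 sample:

```python
import pandas

def max_days_below_freezing(tmax):
    # Boolean series: True on days at or below freezing
    y = tmax <= 32
    # Longest run of consecutive True values, via the
    # shift/cumsum/cumcount technique walked through above
    return (y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)).max()

# The first ten TMAX values from January 2018
sample = pandas.Series([13, 19, 29, 30, 24, 12, 17, 35, 43, 36])
longest = max_days_below_freezing(sample)
```

For this sample the first seven days form the longest below-freezing run, so longest comes out to 7.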

Results

If we look at the results for the past 20 years, we see the following:

For the data used in the above chart, the average stretch in which the temperature stays below freezing is 6.45 days (the average for the entire dataset is 6.33 days).