Research Notes

When should the seasons start?

This has absolutely nothing to do with my research, but I’ve never liked the way people say that the seasons start on the solstices and equinoxes, with (northern-hemisphere) summer running from June 21-ish through September 21-ish, then autumn on until December 21-ish, etc. I think it’s better to define the seasons this way: summer is June, July and August; autumn is September, October and November; winter is December, January and February; and spring is March, April and May.

My complaint is that the solstice/equinox system doesn’t let summer start until weeks into the heat of June, and doesn’t let summer end until well after things have started to cool off in September. It also allows plenty of snow to fall before winter begins in late December, and so on. I don’t think it makes any sense to say that December 19 is autumn, or that March 19 is winter. I’ve always thought the most defining aspect of summer is “the time when it’s hot” (which is mostly June, July and August), autumn is “the time when it’s cooling off and the leaves are changing” (September through November), winter is “the time when it’s cold and snowy” (that’s mostly December, January and February), and spring is “the time when it’s warming up and plants become green” (March through May). But the solstice/equinox system seems to think the most important aspects of the seasons are “the time when the days are long but getting shorter”, “the time when the days are short and getting shorter still”, “the time when the days are short but getting longer”, and “the time when the days are long and getting even longer”.

In short, I think the most defining quality of each season is the typical temperatures during that season (and everything that comes with that sort of temperature), rather than the length of the day and the sign of the change in that length. And I think those defining temperatures align better with three-month blocks than with solstice-to-equinox spans (a difference of 3 weeks, or about 25% of the length of a season).

And the data agree with me!

NOAA provides climate normals, which tell you, for a given location, the typical temperature on each day of the year. They’re generated by averaging each day’s temperature over the past thirty years. (The ones I used give a typical “average temperature” each day, as opposed to a typical “high” or “low”.) The climate normals generated using temperatures from 1981-2010 are available online. I grabbed that data to see how it aligns with my view of when the seasons should be.

(The data file is products/temperature/dly-tavg-normal.txt in the public FTP server. There are normals for 9,887 weather stations. I filtered out the normals labeled as “quasi-normal” or “provisional”, as well as any normal that didn’t see at least 20 degrees of variation over the year, since those locations don’t have strong, temperature-based seasons. Jupyter notebook here.)
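As a rough sketch of that filtering step (not the notebook's actual code), here is one way it could look once the normals are loaded. The function names, the dictionary layout, and the idea of flagging whole stations rather than individual values are all assumptions made for illustration; parsing the NOAA file itself isn't shown.

```python
def has_strong_seasons(daily_normals, min_range=20.0):
    """True if the warmest and coldest daily normals differ by at least min_range degrees."""
    return max(daily_normals) - min(daily_normals) >= min_range


def filter_stations(normals_by_station, flagged_stations=(), min_range=20.0):
    """Drop flagged stations (e.g. quasi-normal or provisional) and weakly seasonal ones.

    normals_by_station is a hypothetical layout: {station_id: [temperature for each day]}.
    """
    flagged = set(flagged_stations)
    return {
        station: temps
        for station, temps in normals_by_station.items()
        if station not in flagged and has_strong_seasons(temps, min_range)
    }


# Made-up stations with only four "days" each, just to exercise the filter
example = {
    "A": [30.0, 50.0, 75.0, 52.0],  # 45-degree swing: kept
    "B": [68.0, 72.0, 79.0, 74.0],  # 11-degree swing: dropped
    "C": [10.0, 40.0, 70.0, 35.0],  # big swing, but flagged: dropped
}
kept = filter_stations(example, flagged_stations=["C"])
```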

Here are a few plots to get an idea of what those normals can say. Remember that everything is for a “typical” year.

These numbers are “the average temperature on the average day”, not “the average high” or “the average low”. In other words, it’s an average over all hours of all days. Note the smooth trend with latitude in the east, and the strong effect of elevation in the west.
Parts of New Mexico and Arizona heat up very quickly during the summer, while Texas takes a very long time to reach its highest temperatures each summer.
Michigan stands out for having its coldest days very late each winter—the effect of the Great Lakes taking their time to cool down?

That’s neat, but here’s where it starts to get interesting. This is “temperature over the course of the average year” plotted for all 6,104 weather stations:

The solid white line in the middle marks the average temperature across all weather stations for each day of the year. The individual stations all have their oddities, but the average curve is remarkably smooth!

With that curve, it’s very easy to ask what the coldest and warmest day of the year is, averaged across the entire US. The answers are January 13 and July 23 (a few weeks after the winter solstice, and about a month after the summer solstice). If you agree with me that summer is “when it’s hot” and winter “when it’s cold”, it makes sense to say that January 13 and July 23 are “peak winter” and “peak summer”, and that they ought to be the center of their respective seasons (as opposed to “peak summer” falling only about a month after the solstice-based start of summer!).
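Here is a minimal sketch of that averaging-and-peak-finding step. It uses synthetic sinusoidal "stations" rather than the NOAA normals, and every name in it is made up for illustration, so the days it finds say nothing about the real data.

```python
import math
from datetime import date, timedelta


def mean_curve(station_curves):
    """Average several same-length daily curves into one curve."""
    n_days = len(station_curves[0])
    n_stations = len(station_curves)
    return [sum(curve[d] for curve in station_curves) / n_stations for d in range(n_days)]


def peak_days(curve):
    """Return (index of coldest day, index of warmest day), counting from 0 = Jan 1."""
    coldest = min(range(len(curve)), key=curve.__getitem__)
    warmest = max(range(len(curve)), key=curve.__getitem__)
    return coldest, warmest


def index_to_date(day_index, year=2001):
    """Convert a 0-based day index into a calendar date (2001 is an arbitrary non-leap year)."""
    return date(year, 1, 1) + timedelta(days=day_index)


def synthetic_station(mean, amplitude, warmest_index=200, n_days=365):
    """A fake station: a cosine that peaks at warmest_index."""
    return [
        mean + amplitude * math.cos(2 * math.pi * (d - warmest_index) / n_days)
        for d in range(n_days)
    ]


stations = [synthetic_station(50, 20), synthetic_station(60, 15), synthetic_station(40, 30)]
average = mean_curve(stations)
coldest, warmest = peak_days(average)
print("peak winter:", index_to_date(coldest), "peak summer:", index_to_date(warmest))
```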

If we go on to say that the midpoints of autumn and spring should be halfway between the dates of peak summer and winter, and that the boundaries between seasons should be halfway between the “peak” dates of those seasons, then here are what the seasons should be:

The shaded colors mark the seasons. The vertical lines mark peak summer and winter, the warmest and coldest days of the average year at the average US location (or rather, the average US weather station).

Summer should run from Jun 7 to Sep 2, centered on Jul 23.
Autumn should run from Sep 2 to Nov 29, centered on Oct 16.
Winter should run from Nov 29 to Mar 4, centered on Jan 13.
Spring should run from Mar 4 to Jun 7, centered on Apr 21.

There are two things to note. The first is that this is very close to what I’m saying the seasons should be: I think summer should be June, July and August since summer is “the time when it’s hot”, and based on when it’s hottest, the data say summer should be June 7 to September 2.

A second thing to notice is that, since “peak summer” and “peak winter” aren’t exactly half a year apart, there’s a little unevenness to these definitions. Summer is a little under three months long, winter is a little over three months, etc. I think that’s a good reason to just assign three months to each season—it’s really close to the temperature-based definition, but it evens things out a bit (and is also easier to remember!).
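The midpoint rule behind those dates can be sketched as follows. This is my reading of the rule as described in the text, not the notebook's code; the function names, the 365-day year, and the example peak days (day 20 and day 200) are assumptions chosen for illustration, not the values from the data.

```python
from datetime import date, timedelta

YEAR_LENGTH = 365  # assumption: ignore leap days


def circular_midpoint(start, end, year_length=YEAR_LENGTH):
    """Day of year halfway from start forward to end, wrapping past the end of the year."""
    span = (end - start) % year_length
    return (start + span / 2) % year_length


def season_boundaries(peak_winter, peak_summer):
    """Season start days (possibly fractional) from the two peak days of the year."""
    mid_spring = circular_midpoint(peak_winter, peak_summer)
    mid_autumn = circular_midpoint(peak_summer, peak_winter)
    return {
        "spring_start": circular_midpoint(peak_winter, mid_spring),
        "summer_start": circular_midpoint(mid_spring, peak_summer),
        "autumn_start": circular_midpoint(peak_summer, mid_autumn),
        "winter_start": circular_midpoint(mid_autumn, peak_winter),
    }


def day_to_date(day_of_year, year=2001):
    """Convert a 1-based day of year to a date, dropping any fractional part."""
    return date(year, 1, 1) + timedelta(days=int(day_of_year) - 1)


# Hypothetical peaks: winter on day 20, summer on day 200
boundaries = season_boundaries(20, 200)
for name, day in boundaries.items():
    print(name, day_to_date(day))
```

Because the two peaks in this example are 180 days apart in one direction and 185 in the other, the four seasons come out with slightly different lengths, which is the same kind of unevenness described above.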

The interesting thing is that NOAA agrees with all this! They label the solstice/equinox definitions as the “astronomical seasons”, since they’re based on the motions of the Earth around the Sun. (But I’m an astronomer and even I don’t like this!) NOAA describes the “meteorological seasons”, on the other hand, as 3-month blocks: March, April and May for spring, June, July and August for summer, and so on. Meteorologists and climatologists have the same idea that summer should be “when it’s hottest”, and for more consistent record-keeping they also want seasons of consistent length as well as consistent start dates (the solstices and equinoxes can move around by a day from year to year).
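To make the two definitions concrete, here is a small sketch that labels a date under each one. The meteorological mapping is the 3-month blocks described above; for the astronomical version I'm assuming nominal boundaries on the 21st of March, June, September and December (the "21-ish" dates), which is a simplification since the real dates shift slightly. Both functions are northern-hemisphere only, and the names are my own.

```python
from datetime import date

# Meteorological seasons: whole calendar months
METEOROLOGICAL = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Assumption: nominal solstice/equinox dates, all on the 21st
ASTRONOMICAL_STARTS = [
    ((3, 21), "spring"),
    ((6, 21), "summer"),
    ((9, 21), "autumn"),
    ((12, 21), "winter"),
]


def meteorological_season(d):
    """Season by the 3-month-block definition."""
    return METEOROLOGICAL[d.month]


def astronomical_season(d):
    """Season by the (nominal) solstice/equinox definition."""
    season = "winter"  # January 1 up to the March equinox
    for (month, day), name in ASTRONOMICAL_STARTS:
        if (d.month, d.day) >= (month, day):
            season = name
    return season


for d in (date(2018, 12, 19), date(2018, 3, 19)):
    print(d, astronomical_season(d), meteorological_season(d))
```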

To put this all together, let’s see how the different definitions stack up:

The shaded colors mark the seasons by three different definitions. The vertical lines mark peak summer and winter, the warmest and coldest days of the average year at the average US location (or rather, the average US weather station).

If you, like me, feel like temperatures best define the seasons, that middle band contains the seasons you want. It’s visually clear here how close that definition is to the more even and easier-to-remember 3-month blocks, and it’s also clear that the solstice/equinox definitions are offset from the annual temperature cycles.

Take it from me, an astronomer: don’t use the astronomical definitions of the seasons! Follow the lead of meteorologists and climatologists, the people who pay close attention to what’s happening outside as the seasons change.

Aside: Wikipedia will tell you all about how the traditional boundaries of the seasons vary between countries and cultures—see the opening discussion of autumn, for instance.


That’s all using US data, but what if the global picture is different? I couldn’t find climate normals for the whole world, but I did find the Global Historical Climatology Network Daily, which provides 30 GB of daily temperature records for stations across the globe, so I used that to compute my own climate normals.

(I used the contents of ghcnd_all.tar.gz from their FTP server. I used the TAVG data, which should be the same thing that went into the US climate normals. I threw out every station that didn’t have at least 10 years’ worth of data for each day of the year, though I didn’t require it be the same 10 years for each day. Then my normals are just the average of each day’s temperature records. I’m probably skipping some things that the pros do, but I’m no climatology pro.)
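A minimal sketch of that recipe for a single station is below. It is not the code I ran: the record layout (year, month, day, temperature), the function name, and the choice to ignore February 29 are all assumptions for illustration, and reading the GHCN-Daily files isn't shown.

```python
from collections import defaultdict
from datetime import date, timedelta

# The 365 calendar days of a non-leap year (assumption: Feb 29 is ignored)
CALENDAR_DAYS = [
    ((date(2001, 1, 1) + timedelta(days=i)).month, (date(2001, 1, 1) + timedelta(days=i)).day)
    for i in range(365)
]


def compute_normals(records, min_years=10):
    """Average each calendar day's temperatures over all available years.

    records: iterable of (year, month, day, temperature) for one station.
    Returns {(month, day): mean temperature}, or None if any calendar day
    has fewer than min_years years of data (not necessarily the same years).
    """
    temps_by_day = defaultdict(dict)
    for year, month, day, temp in records:
        temps_by_day[(month, day)][year] = temp  # one value per year per calendar day

    normals = {}
    for key in CALENDAR_DAYS:
        yearly = temps_by_day.get(key, {})
        if len(yearly) < min_years:
            return None  # station is thrown out
        normals[key] = sum(yearly.values()) / len(yearly)
    return normals


def fake_records(first_year, last_year):
    """Made-up data: every day in a given year has temperature (year - first_year)."""
    records = []
    for year in range(first_year, last_year + 1):
        d = date(year, 1, 1)
        while d.year == year:
            records.append((year, d.month, d.day, float(year - first_year)))
            d += timedelta(days=1)
    return records


full = compute_normals(fake_records(1990, 2001))    # 12 years of data
sparse = compute_normals(fake_records(1990, 1994))  # only 5 years
```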

This data set includes 115,082 weather stations around the world:

However, most stations don’t have a sufficient quantity of measurements of the daily-average temperature (as opposed to the daily high or low), so I’m limited to only 4,151 stations. They’re almost exclusively in the northern hemisphere, so that’s where I focus.

Now that I have a source of data, here are the same plots as for the US data set:

Summer should run from Jun 7 to Sep 5, centered on Jul 22.
Winter should run from Dec 4 to Mar 6, centered on Jan 19.
Spring should run from Mar 6 to Jun 7, centered on Apr 22.
Autumn should run from Sep 5 to Dec 4, centered on Oct 20.

The global data set doesn’t change much at all. Peak summer is a day earlier, and peak winter 6 days later, and all the other dates have shifted accordingly by a few days. The three-whole-months definition is globally applicable!

Outlining the location of data in Matplotlib

Right now I’m working on a project involving a set of stars, and I very often plot various quantities as a function of two important dimensions (specifically the stars’ surface gravity log g and effective temperature T, which produces something similar to an H-R diagram). Each star is at a fixed location in those two dimensions, so no matter which particular quantity I’m plotting, the landscape of the plot is the same—dwarf stars along the bottom, giant stars in the top-right, the Sun near the bottom-center, and black marking regions of parameter-space that don’t include any stars at all. Over time the outline of this data set—where stars are and where they aren’t—becomes very familiar.

One of many stellar quantities that can be plotted in these two dimensions.

At other times I’m plotting something else but on the same two axes (maybe a different data set, or a fitted function, or something else). Since I’m so familiar with the outline of my main data set, it’s helpful to know where that data lies relative to this other data I’m plotting. Sure, I could just use the tick labels on the axes, but drawing an outline of the main data set in this other plot is more effective and lets me get oriented more quickly, like so:

A different data set is plotted, with the location of the main data set outlined so the viewer (i.e. me) can quickly become oriented.

Here’s the code I’m using to produce this outline. The idea is to produce a 2D histogram of the data set and then make a contour plot of that histogram, drawing a single contour at the 0.5 level so that it separates regions that do have data points from those that don’t. Histogram values greater than 1 are clamped to 1 to ensure the contour line sits consistently at the boundary between histogram bins. (I’ve been using this function for quite a while and honestly can’t remember if I put it together myself or found it on Stack Overflow, so please send me a link if you know of an original source!)

import numpy as np
import matplotlib.pyplot as plt

def outline_data(x, y, **kwargs):
    """Draws an outline of a set of points.
    Accepts two one-dimensional arrays containing the
    x and y coordinates of the data set to be outlined.
    All kwargs are passed to plt.contour"""
    H, x_edge, y_edge = np.histogram2d(x, y, bins=100)
    # H needs to be transposed for plt.contour
    H = H.T
    # Clamp histogram values to 1
    H[H>1] = 1
    # Contour plotting wants the x & y arrays to match the
    # shape of the z array, so work out the middle of each bin
    x_edge = (x_edge[1:] + x_edge[:-1]) / 2
    y_edge = (y_edge[1:] + y_edge[:-1]) / 2
    XX, YY = np.meshgrid(x_edge, y_edge)
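    # Fill in some default plot args if not given
    if "alpha" not in kwargs:
        kwargs["alpha"] = 0.5
    if "colors" not in kwargs:
        kwargs["colors"] = "black"
    # Accept the singular "color" as an alias for "colors"; pop it so
    # that plt.contour is not handed a keyword it doesn't recognize
    if "color" in kwargs:
        kwargs["colors"] = kwargs.pop("color")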
    plt.contour(XX, YY, H, levels=[0.5], **kwargs)

Jupyter Notebooks in WordPress

I’ve tried a few different ways of displaying Jupyter notebooks in a WordPress post. Here they are (written more as notes to myself than a comprehensive, step-by-step guide):

One option is to save the notebook to HTML inside Jupyter, upload it in WordPress (as a media item), and display that page as an iframe using a WP plugin.

Another is to upload to Gist and use their embed script (which seems to work fine when copy/pasted).

Both the above options suffer from excessive horizontal padding which limits display space, in my experience.

A third option is to copy/paste sections and use a syntax-highlighting WP extension to make the code look nice, but this is tedious (and I haven’t found a highlighter that I think looks great).

The nicest-looking option, I think, though the most technically-involved, is to save the notebook as HTML in Jupyter (i.e., File > Download as > HTML), copy the body of the page (save for the outer few <div> tags), and paste it into a WP post. I copied some CSS from this page into the WP “Additional CSS” (with a fix for the .c1 class, which was missing a period in the CSS declaration), which makes the Jupyter HTML render correctly. This is super-hacky, but I haven’t noticed it affecting any WP elements. I tweaked the CSS to keep the In [1] labels from showing, so that the code boxes can be wider:

.input_prompt {
	display: none;
}

By default, the output areas are limited to 300px of height and scroll after that. To display an area in full height, delete the output_stdout class from the surrounding div.

Since Jupyter embeds images as data inside the HTML file, plots come along for the ride automatically (though the text you copy/paste is really long as a result—you can upload the images and update those <img> tags if you want).
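If you do want to pull the images out, something like the sketch below could do it. It assumes the images appear as `src="data:image/...;base64,..."` attributes, and the function name and file-naming pattern are made up; you would still need to upload the files to WordPress and point each `src` at the uploaded URL.

```python
import base64
import re

# Matches src attributes that hold a base64-encoded image
IMG_DATA_URI = re.compile(r'src="data:image/(png|jpeg|gif);base64,([^"]+)"')


def extract_embedded_images(html, name_pattern="plot_{}.{}"):
    """Replace embedded images with file names.

    Returns (new_html, images), where images maps each file name to the
    decoded image bytes, ready to be written to disk and uploaded.
    """
    images = {}

    def replace(match):
        extension, payload = match.group(1), match.group(2)
        filename = name_pattern.format(len(images) + 1, extension)
        images[filename] = base64.b64decode(payload)
        return 'src="{}"'.format(filename)

    new_html = IMG_DATA_URI.sub(replace, html)
    return new_html, images


# Tiny made-up example (the "image" is just placeholder bytes)
payload = base64.b64encode(b"fake image bytes").decode("ascii")
sample = '<div><img src="data:image/png;base64,' + payload + '"></div>'
new_html, images = extract_embedded_images(sample)
```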

The end result looks like this:

In [13]:
dist, _, flow = cv2.EMD(sig1, sig2, cv2.DIST_L2)

[[0. 1. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 2. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 2. 0.]]

(Footnote: the post I linked above links to a WP plugin that provides another possibility for embedding notebooks, but just using Gist’s embed script accomplishes the same thing in a similar way—I think that plugin is useful if you don’t want to lock in to Gist hosting.)

Earth Mover’s Distance in Python

I was exploring the Earth mover’s distance and did some head-scratching on the OpenCV v3 implementation in Python. Here’s some code to hopefully reduce head-scratching for others. (Fun fact: OpenCV’s Python bindings are automatically generated, so Python documentation isn’t guaranteed. While I found a little bit for the OpenCV 2 implementation, I couldn’t find any for the OpenCV 3 version.)

(View this post as a Jupyter notebook.)


Dates in matplotlib

The other day I needed to make a plot with dates as the x-axis. Matplotlib supports this, but the examples I was finding weren’t quite as complete as I would have liked. So here’s what I put together as an example.

First, imports. Make sure to get the matplotlib.dates module.

In [1]:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
from datetime import datetime

We’ll want the date values in the form of datetime objects.
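As a small sketch of getting there from text, `datetime.strptime` parses date strings into datetime objects. The date strings and the format here are made up for illustration.

```python
from datetime import datetime

# Hypothetical input: dates stored as YYYY-MM-DD strings
date_strings = ["2018-01-15", "2018-02-01", "2018-03-20"]

# Parse each string into a datetime object
dates = [datetime.strptime(s, "%Y-%m-%d") for s in date_strings]
```

A list like `dates` can then be passed as the x values of a plot.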
