Last week on Sunday 9 June, hundreds of airplanes experienced a failure mode with some variants of the Collins Aerospace GPS-4000S sensor, grounding many business jets and even causing some delays among regional airlines.
Following is the initial description of the problem from Collins, as I first saw it quoted in Aviation International News:
“The root cause is a software design error that misinterprets GPS time updates. A ‘leap second’ event occurs once every 2.5 years within the U.S. Government GPS satellite almanac update. Our GPS-4000S (P/N 822-2189-100) and GLU-2100 (P/N 822-2532-100) software’s timing calculations have reacted to this leap second by not tracking satellites upon power-up and subsequently failing. A regularly scheduled almanac update with this ‘leap second’ was distributed by the U.S. government on 0:00 GMT Sunday, June 9, 2019, and the failures began to occur after this event.”
This sounded very suspicious to me… but suspicious in a mathematically interesting way, hence the motivation for this post.
The suggestion, re-interpreted and re-propagated in subsequent aviation news articles and comment threads, seems to be that the gummint introduced a “leap second event” in the recent GPS almanac update, and the Collins receivers weren’t expecting it. The “almanac” is a periodic, low-throughput message broadcast by the satellites in the GPS constellation, that newly powered-on receivers can use to get an initial “rough idea” of where they are and what time it is.
It is true that one of the fields in the almanac is a count of the number of leap seconds added to UTC time since the GPS epoch, that is, since the GPS “clock started ticking” on 6 January 1980 at 00:00:00 UTC. So what’s a leap second? Briefly, the earth’s rotation is slowing down, and so to keep our UTC clocks reasonably consistent with another astronomical time reference known as UT1, we periodically “delay” the advance of UTC time by introducing a leap second, which is an extra 61st second of the last minute of a month, historically either at the end of June or the end of December.
There have been 18 leap seconds added to UTC since 1980… but leap seconds are scheduled months in advance, and it has already been announced that there will not be a leap second at the end of this month, and there certainly wasn’t a leap second added this past June 9th.
So what really happened last week? The remainder of this post is purely speculation; I have no affiliation with Collins Aerospace nor any of its competitors, so I don’t have any knowledge of the actual software whose design was reported to be in error. This is just guesswork from a mathematician with an interest in aviation.
GPS week number roll-over
The Global Positioning System has an interesting problem: it’s hard to figure out what time it is. That is, if your GPS receiver wants to learn not just its location, but also the current time, without relying on or comparing with any external clock, then that is somewhere between impossible and difficult, depending on how fancy we want to get. The only “timestamp” that is broadcast by the satellites is a week number– indicating the integer number of weeks elapsed since the GPS epoch on 6 January 1980– and a number of seconds elapsed within the current week.
The problem is that the message field for the week number is only 10 bits long, meaning that we can only encode week 0 through week 1023. After that, on “week 1024,” the odometer rolls over, so to speak, back to indicating week 0 again.
This has already happened twice: the first roll-over was between 21 and 22 August 1999, and the second was just a couple of months ago, between 6 and 7 April 2019. An old GPS receiver whose software had not been updated to account for these roll-overs might show the time transition from 21 August 1999 “back” to 6 January 1980, for example.
It’s worth noting that those roll-overs didn’t actually occur exactly at midnight… or at least, not at midnight UTC. The GPS “clock” does not include leap seconds, but instead ticks happily along, always exactly 60x60x24x7=604,800 seconds per week. So, for example, GPS time is currently “ahead” of UTC time by 18 seconds, corresponding to the 18 leap seconds that have contributed to the “slowing” of the advance of the UTC clock. The GPS week number most recently rolled over on 6 April, not at midnight, but at 23:59:42 UTC.
Using the leap second count
It turns out that we could modify our GPS receiver software to extend its ability to tell time beyond a single 1024-week cycle, by using the leap second count field together with the rolling week number. The idea is that by predicting the addition of more leap seconds at reasonably regular intervals in the future, we can use the week number to determine the time within a 1024-week cycle, and use the leap second count to determine which 1024-week cycle we are in.
This is not a new idea; there is a good description of the approach here, in the form of a patent application by Trimble, a popular manufacturer of GPS receivers. Given the current week number and the leap second count in the almanac, the suggested formula for the “absolute” number of weeks since GPS epoch is given by
where the intermediate value is essentially an estimate of the week number obtained by a linear fit against the historical rate of introduction of leap seconds.
This formula was proposed in 1996, and it would indeed have worked well past the 1999 roll-over… but although it was predicted to “provide a solution to the GPS rollover problem for about 173 years,” unfortunately it would really only have lasted for about 12 years, first yielding an incorrect time in week 1654 on 18 September 2011.
The problem is shown in the figure below, indicating the number of leap seconds that have been added over time since the GPS epoch in 1980, with the red bars indicating the “week zero” epoch and roll-overs up to this point:
Right after the first roll-over in 1999, the reasonably regular introduction of leap seconds stopped, and even once they started to become “regular” again, they were regular at a lesser rate (although still more frequently than the “2.5 years” suggested by the Collins report).
Could something like this be the cause of this past week’s sensor failures? It’s certainly possible: it’s a relatively simple programming exercise to search for different linear fit coefficients in the above formula– a variant of which might have been used on these Collins receivers– that
- Yield the correct absolute week number for a longer time than the above formula, including continuing past the second roll-over this past April; but
- Yield the incorrect absolute week number, for the first time, on 9 June (i.e., the start of week 2057).
Such coefficients aren’t hard to find; for example, the reader can verify that the following satisfies the above criteria:
which corresponds to an estimate of one additional leap second approximately every 2 years.
Edit: See Steve Allen’s comment (and my reply) for a description of what I think is probably a more likely root cause of the problem– still related to interpreting the combination of week number roll-over and leap second occurrences, but with a slightly different failure mode.