With 800 aircraft in the air, the air traffic controllers in the Palmdale Air Route Traffic Control Centre suddenly lost all radio contact in the middle of a busy Tuesday in September 2004. For three long hours, all the pilots in California’s airspace were left to their own devices. There were fortunately no accidents, but a handful of planes flew alarmingly close to one another. The reason was the number 4,294,967,295.
Written by Anders Norås, Chief Technology Officer at Itera
At first glance, this looks like a large number, but in other respects it appears fairly ordinary and unremarkable. It isn’t. The incident in California is only one of several caused by the nature of this number and sloppy programming.
As readers who can count in binary will know, 11111111111111111111111111111111 is the highest number that can be represented using an unsigned 32-bit integer. Readers not familiar with binary usually write this number as 4,294,967,295.
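For readers who would rather check than count, a minimal Python sketch confirms the figure. Python's own integers are not limited to 32 bits, so the width is spelled out explicitly here, and the variable names are purely illustrative.

```python
# Thirty-two 1s in a row, parsed as a base-2 number.
all_ones = int("1" * 32, 2)

# The same value written as a power of two: 2^32 - 1.
max_uint32 = 2**32 - 1

print(all_ones)                # 4294967295
print(all_ones == max_uint32)  # True
print(all_ones.bit_length())   # 32
```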
The programmers behind the air traffic control systems used in Palmdale had written an algorithm that kept track of time by counting down in milliseconds from 4,294,967,295. When you have counted down in milliseconds for around 49 days, 17 hours, 2 minutes and 47 seconds, you run out of time and get to 0 – a number which is the same in normal numbering and in binary. An unsigned 32-bit number cannot be negative, so when the count tries to go below zero it “overflows” (strictly speaking, it wraps around) and resets to thirty-two 1s. This was what happened that fateful autumn day in 2004, and several thousand people were in mortal danger because a programmer had chosen a slightly too small number.
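The arithmetic is easy to check. The sketch below is not the Palmdale code, which is not reproduced here; it only illustrates how long a millisecond countdown from 4,294,967,295 lasts and what an unsigned 32-bit subtraction does at zero. Python integers never overflow, so the 32-bit wrap-around is imitated with a mask, and both function names are hypothetical.

```python
UINT32_MAX = 0xFFFFFFFF  # 4,294,967,295


def countdown_duration(ticks_ms):
    """Split a millisecond count into whole days, hours, minutes, seconds."""
    seconds, _leftover_ms = divmod(ticks_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return days, hours, minutes, seconds


def decrement_uint32(value):
    """Subtract 1 the way an unsigned 32-bit register would (assumed behaviour)."""
    return (value - 1) & UINT32_MAX


print(countdown_duration(UINT32_MAX))  # (49, 17, 2, 47)
print(decrement_uint32(1))             # 0
print(decrement_uint32(0))             # 4294967295
```

The last line is the whole problem in miniature: one step past zero, the counter is back at its maximum.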
Boeing’s programmers had a different approach to time: instead of counting down, they counted up. Boeing’s software for its Dreamliner planes’ electrical power generation systems kept track of time by counting up by one 100 times a second until it had done so 2,147,483,647 times. The power generation systems would then stop providing power, and the program could not continue counting whether it wanted to or not.
2,147,483,647 is roughly half of 4,294,967,295. Both are 32-bit numbers, but the first is ‘signed’, which means that it can also be negative. With signed numbers, one bit is used to indicate whether the number is positive or negative, which leaves only 31 bits for the value itself.
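To see what that one bit does, the following sketch reads the same 32 bits first as an unsigned and then as a signed number. It assumes the two's-complement representation used by practically all current hardware; the source does not say which representation the aircraft software used, and the helper name as_signed32 is made up for this illustration.

```python
import struct

INT32_MAX = 2**31 - 1   # 2,147,483,647: the top bit is reserved for the sign
UINT32_MAX = 2**32 - 1  # 4,294,967,295: all 32 bits hold the value


def as_signed32(bits):
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    packed = struct.pack("<I", bits & UINT32_MAX)  # store as unsigned 32-bit
    return struct.unpack("<i", packed)[0]          # read back as signed 32-bit


print(as_signed32(INT32_MAX))      # 2147483647
print(as_signed32(INT32_MAX + 1))  # -2147483648
print(as_signed32(UINT32_MAX))     # -1
```

Adding one to the largest signed value does not give a slightly larger number. It gives the most negative number the type can hold.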
The combination of the number chosen and the algorithm led to an overflow error once the plane had been in operation for 248 days, 13 hours and 14 minutes. The error in turn led to the electrical power generation systems ceasing to produce electricity, and therefore to the plane losing electric power. The solution to the problem, even on a large machine such as a Dreamliner, was the same as in most other IT support cases – turn it off and on again.
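The 248-day figure follows directly from the numbers above. This is a back-of-the-envelope check rather than anything taken from Boeing's software; the constant names are illustrative.

```python
from datetime import timedelta

INT32_MAX = 2**31 - 1    # largest signed 32-bit value
TICKS_PER_SECOND = 100   # the counter goes up by one 100 times a second

# Whole seconds of uptime before the counter reaches its maximum.
whole_seconds = INT32_MAX // TICKS_PER_SECOND
uptime = timedelta(seconds=whole_seconds)

hours, remainder = divmod(uptime.seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(uptime.days, hours, minutes, seconds)  # 248 13 13 56
```

That is 248 days, 13 hours, 13 minutes and 56 seconds, which rounds to the 13 hours and 14 minutes quoted above.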
What these two fairly simple bugs can teach us is that it is important to understand the significance of the data types and structures you use. And, equally importantly: you need to test every scenario, not only the sunny versions of events that you expect.
PS. If you are due to fly, I can assure you that Boeing has corrected the error on its Dreamliners. They now use 64-bit numbers, meaning that they do not need to be restarted for around half a billion years, which should be long enough.
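For the curious, here is a rough estimate of how long a 64-bit counter lasts. The tick rates and the signed/unsigned choice are assumptions made for illustration, not details of Boeing's fix: an unsigned counter in milliseconds gives the half-a-billion-year figure, and a signed counter ticking 100 times a second lasts even longer.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, an assumed convention


def years_until_overflow(max_value, ticks_per_second):
    """Years a counter can run before it reaches max_value."""
    return max_value / ticks_per_second / SECONDS_PER_YEAR


# Unsigned 64-bit counter ticking once per millisecond.
unsigned_ms_years = years_until_overflow(2**64 - 1, 1000)
# Signed 64-bit counter ticking 100 times a second.
signed_centi_years = years_until_overflow(2**63 - 1, 100)

print(f"{unsigned_ms_years / 1e6:.0f} million years")   # about 585 million
print(f"{signed_centi_years / 1e9:.1f} billion years")  # about 2.9 billion
```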