Tuesday, August 9, 2016

Delta's System Outage

It's been all over the news that Delta Air Lines had a system outage yesterday, delaying or cancelling flights for thousands of passengers.
According to Delta's news page on the incident, the root cause was a power outage in Atlanta around 5 AM. Several hundred flights ended up being cancelled, and full system operation wasn't restored until 3:40 PM. Delta's CEO had to issue an apology to customers, and more flights on Tuesday were affected as Delta's operations reset after the outage.

My first reaction when I heard about this was "gee, glad I'm not flying anywhere today." Followed immediately by sympathy for the IT people who are having to deal with all of this. And shortly after, incredulity that an organization the size of Delta wasn't better prepared.

I never did any work directly for an airline, but during my IT career I worked with a lot of mission-critical systems for large corporations. The IT department of every good-sized organization has plans for disaster recovery and business continuity. Basically, these are meant to handle situations where something outside your control shuts down your computer systems. I'm sure Delta had those plans like everyone else, but they certainly weren't adequate to this challenge.

I don't have any direct knowledge of Delta's systems, but I'm pretty sure I know the gist of how this happened. I've seen it in several other industries. They've got a bunch of old systems that have been around for decades and were never modernized. Either modernization was judged too expensive (and never attempted), or it was attempted and failed. Over time, a bunch of newer systems have been added that integrate with the old stuff, relying heavily on networked communication. The newer infrastructure might be designed to be redundant and handle outages, but the older parts are not.

So when an outage hits, the old systems don't handle it well. They crash right in the middle of whatever they're doing, resulting in bad data writes and other unexpected states the system can't handle. The newer systems might keep running, but that can be just as bad - they start getting errors when trying to communicate with the older stuff, which can cascade into failures of their own. Error logs fill up, bad data gets written to databases, filesystems run out of space, etc. All of these things require a human to troubleshoot and fix manually, so simply turning everything back on after the outage doesn't work. It can take hours or even days to get everything back to normal.
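To make that concrete, here's a rough sketch of the kind of brittle integration I'm describing. It's purely hypothetical - the host name, port, and record-lookup protocol are all made up, not anything I actually know about Delta's systems - but it shows how a newer service that calls a legacy system with no timeout, no retry limit, and no fallback turns a legacy outage into a pile of its own errors:

import logging
import socket

log = logging.getLogger("checkin")

# Hypothetical legacy reservation host - invented for illustration.
LEGACY_HOST = ("legacy-res.example.com", 9100)

def lookup_booking(record_locator):
    # No timeout, no retry limit, no fallback. If the legacy side is down,
    # this either hangs or raises, and every caller fails along with it.
    with socket.create_connection(LEGACY_HOST) as conn:
        conn.sendall(record_locator.encode("ascii") + b"\n")
        return conn.recv(4096)

def handle_checkin(record_locator):
    try:
        return lookup_booking(record_locator)
    except OSError as exc:
        # Each failed check-in just adds another error to the log and leaves
        # a human to sort out whatever state the transaction was left in.
        log.error("legacy lookup failed for %s: %s", record_locator, exc)
        raise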

If handling system outages is a high enough priority in the organization, there are ways to deal with these situations. When new systems are put into place, they need to deal gracefully with outages in the older systems. Even the oldest systems can be put into some sort of redundant configuration - one place I worked ran two parallel AS/400 systems and switched back and forth between them every few months. We knew for sure that we could recover from one of them going out, and the users never knew it was happening. Most importantly, someone high up in the organization has to champion all of this, putting their foot down when someone tries to cut a corner and move forward without disaster recovery in place.
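Here's the same made-up lookup reworked along those lines. Again, this is only a sketch under my own assumptions (the endpoint, the timeouts, and the fallback are all invented), but it's the kind of defensive wrapper I mean: bound the retries, put a hard timeout on every attempt, and hand the caller a clear "legacy system is unavailable" answer instead of an unhandled error:

import logging
import socket
import time

log = logging.getLogger("checkin")

# Same hypothetical legacy endpoint as before - purely illustrative.
LEGACY_HOST = ("legacy-res.example.com", 9100)

def lookup_booking_safely(record_locator, attempts=3, timeout=2.0):
    # Bounded retries with a hard timeout on every attempt; returns None
    # instead of raising so the caller can degrade gracefully.
    for attempt in range(1, attempts + 1):
        try:
            with socket.create_connection(LEGACY_HOST, timeout=timeout) as conn:
                conn.sendall(record_locator.encode("ascii") + b"\n")
                return conn.recv(4096)
        except OSError as exc:
            log.warning("legacy lookup attempt %d of %d failed: %s",
                        attempt, attempts, exc)
            time.sleep(0.5 * attempt)  # brief backoff before trying again
    return None

def handle_checkin(record_locator):
    booking = lookup_booking_safely(record_locator)
    if booking is None:
        # The legacy system is out, but this service stays up and points the
        # agent at a manual fallback instead of erroring out.
        return "LEGACY_UNAVAILABLE"
    return booking

The point isn't this particular wrapper; it's that somebody decided ahead of time what the new system should do when the old one disappears, instead of finding out during an outage.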

I suspect Delta will be making some pretty major changes in their IT organization after this fiasco. It won't be easy or quick, though. Whether they stick with it and get it right...well, we'll know when the next power outage hits.