Cascading computer system failure at Delta

A cascading computer system failure knocked Delta Air Lines out of commission on August 8.

At least half of all Delta Air Lines flights Monday were delayed or canceled after a power outage knocked out the airline’s computer systems worldwide…

Delta representatives said the airline was investigating the cause of the meltdown. They declined to describe whether the airline’s information-technology system had enough built-in redundancies to recover quickly from a hiccup like a power outage…

Airlines depend on huge, overlapping and complicated systems to operate flights, schedule crews and run ticketing, boarding, airport kiosks, websites and mobile phone apps. Even brief outages can snarl traffic and cause long delays.

As the world becomes more automated, things may run more smoothly when everything is working, but recovery may get harder and harder when something goes wrong. Hopefully, major government, military and financial computer systems have “enough built-in redundancies.”

They do, according to an article in The Week. Delta most likely had backup systems in place; the problem was that they didn’t kick in correctly. Major financial companies pay more attention to their backups, testing them automatically rather than by hand, because they have even more at stake.

Delta, like most major airlines, likely had one or more back-up systems in place to take over in an emergency like this. Often a company has an extra system housed in its main data center identical to the main system, plus another one in a separate data center in case both local systems are taken out in a major event, like a fire. Some companies even have a third redundant system that is cloud-based or housed in a separate location.

“Some of these disruptions should not have occurred,” Hecht says. “Delta IT did something wrong that caused its redundancy structure to not function as needed. The problem was not the power failure itself; 99.9999 percent of power failures never cause service disruptions.” …

…most airlines use manual testing to verify their data protection, meaning a human being actually has to take time out of their day to test the system on a regular basis. Other industries, like banking and finance, rely on automatic systems to lower the risk of a full blackout. Automated systems can be pricey, and while Delta’s outage is probably costing the company a hefty sum (Southwest’s outage last month was expected to cost the airline up to $10 million), an hour-long outage in the banking sector would create far more mayhem and profit-loss, so finance companies are more likely to pay up for automated systems.
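To make the layered backups and the automated testing described above a little more concrete, here is a minimal Python sketch. Everything in it is made up for illustration (the `System` class, the tier names, the `failover_drill` function), and it says nothing about how Delta or any bank actually runs its systems. It only shows the idea: keep the copies in priority order, and have a program, rather than a person, regularly knock each one out and check that the next one takes over.

```python
from dataclasses import dataclass


@dataclass
class System:
    """One copy of a service. All names here are hypothetical."""
    name: str
    healthy: bool = True


def pick_active(systems):
    """Return the first healthy system in priority order, or None."""
    for system in systems:
        if system.healthy:
            return system
    return None


def failover_drill(systems):
    """Simulate losing each system in turn and record which one takes over.

    Failures accumulate, as in a cascading outage. The original health
    flags are restored afterwards, so the drill leaves no lasting change.
    """
    saved = [s.healthy for s in systems]
    takeovers = []
    try:
        for system in systems:
            system.healthy = False
            active = pick_active(systems)
            takeovers.append((system.name, active.name if active else None))
    finally:
        for system, flag in zip(systems, saved):
            system.healthy = flag
    return takeovers


# Assumed layout, loosely following the article: a main system, an identical
# local copy, a copy in a separate data center, and a cloud-based copy.
tiers = [
    System("primary"),
    System("local standby"),
    System("remote data center"),
    System("cloud copy"),
]
drill_result = failover_drill(tiers)
```

In this toy version, a drill entry whose second value is `None` means that nothing was left to take over. An automated test would raise an alarm if that happened any earlier than the very last tier.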

 
