On Dec. 21, 2022, just as peak holiday travel was getting underway, Southwest Airlines suffered a cascading series of scheduling failures, initially triggered by severe winter weather in the Denver area. The problems spread through the airline's network, and over the next 10 days the crisis stranded over 2 million passengers and caused losses of $750 million for the airline.
How did a localized weather system end up triggering such a widespread failure? Researchers at MIT have examined this widely reported failure as an example of how systems that work smoothly most of the time can suddenly break down and set off a domino effect of failures. They have now developed a computational system that combines sparse data about a rare failure event with much more extensive data on normal operations, working backward to pinpoint the root causes of the failure and, ideally, to find ways to adjust the systems to prevent such failures in the future.
The findings were presented at the International Conference on Learning Representations (ICLR), held in Singapore April 24-28, by MIT doctoral student Charles Dawson, professor of aeronautics and astronautics Chuchu Fan, and colleagues from Harvard University and the University of Michigan.
"The motivation behind this work is that it's really frustrating when we have to interact with these complicated systems, where it's really hard to understand what's going on behind the scenes that's creating these issues or failures that we're observing," says Dawson.
The new work builds on previous research from Fan's lab, where they studied hypothetical failure-prediction problems, she says, such as groups of robots working together on a task, or complex systems such as the power grid, looking for ways to predict how such systems may fail. "The goal of this project," Fan says, "was really to turn that into a diagnostic tool that we could use on real-world systems."
The idea was to provide a way that someone could "give us data from a time when this real-world system had an issue or a failure," Dawson says, "and we can try to diagnose the root causes, and provide a little bit of a look behind the curtain at this complexity."
The intent is for the methods they developed "to work for a pretty general class of cyber-physical problems," he says. These are problems in which "you have an automated decision-making component interacting with the messiness of the real world," he explains. There are available tools for testing software systems that operate on their own, but the complexity arises when that software has to interact with physical entities going about their activities in a real physical setting, whether it be the scheduling of aircraft, the movements of autonomous vehicles, the interactions of a team of robots, or the control of the inputs and outputs on an electric grid. In such systems, what often happens, he says, is that "the software might make a decision that looks OK at first, but then it has all these domino, knock-on effects that make things messier and much more uncertain."
One key difference, though, is that in systems like teams of robots, unlike the scheduling of airplanes, "we have access to a model in the robotics world," says Fan, who is a principal investigator in MIT's Laboratory for Information and Decision Systems (LIDS). "We do have some good understanding of the physics behind the robotics, and we do have ways of creating a model" that represents their activities with reasonable accuracy. But airline scheduling involves processes and systems that are proprietary business information, and so the researchers had to find ways to infer what was behind the decisions, using only the relatively sparse publicly available information, which essentially consisted of just the actual arrival and departure times of each plane.
"We have grabbed all this flight data, but there is this entire scheduling system behind it, and we don't know how the system is working," Fan says. And the amount of data relating to the actual failure is just several days' worth, compared to years of data on normal flight operations.
The impact of the weather events in Denver during the week of Southwest's scheduling crisis showed up clearly in the flight data, in the form of longer-than-normal turnaround times between landing and takeoff at the Denver airport. But the way that impact cascaded through the system was less obvious and required more analysis. The key turned out to involve the concept of reserve aircraft.
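To give a concrete sense of that first signal, here is a minimal sketch (not the authors' pipeline) of how longer-than-normal turnarounds could be spotted in public arrival and departure records; the column names and the 1.5x threshold are illustrative assumptions.

```python
# Minimal sketch, not the authors' code: estimate turnaround times from public
# flight records and flag airports running far above their normal baseline.
# Column names ("tail_number", "airport", "arrival_time", "departure_time")
# and the 1.5x threshold are hypothetical.
import pandas as pd

def turnaround_minutes(flights: pd.DataFrame) -> pd.Series:
    """Gap between an aircraft's arrival and its next departure, in minutes."""
    flights = flights.sort_values(["tail_number", "arrival_time"])
    next_departure = flights.groupby("tail_number")["departure_time"].shift(-1)
    return (next_departure - flights["arrival_time"]).dt.total_seconds() / 60.0

def flag_slow_airports(flights: pd.DataFrame, baseline: pd.Series) -> pd.Series:
    """Return airports whose median turnaround exceeds 1.5x their baseline."""
    flights = flights.assign(turnaround=turnaround_minutes(flights))
    medians = flights.groupby("airport")["turnaround"].median()
    return medians[medians > 1.5 * baseline.reindex(medians.index)]
```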
Airlines typically keep some planes in reserve at various airports, so that if a problem is found with one plane scheduled for a flight, another can be quickly substituted. Southwest flies only a single type of plane, so its aircraft are all interchangeable, making such substitutions easier. But while most airlines operate on a hub-and-spoke system, keeping most of their reserve aircraft at a few designated hub airports, Southwest does not use hubs, so its reserve planes are scattered throughout its network. And the way those planes were deployed turned out to play a major role in the unfolding crisis.
"The challenge is that there's no public data available in terms of where the aircraft are stationed throughout the Southwest network," Dawson says. "What we're able to find using our method is, by looking at the public data on arrivals, departures, and delays, we can use our method to back out what the hidden parameters of those aircraft reserves could have been, to explain the observations that we were seeing."
What they found was that the way the reserves were deployed was a "leading indicator" of the problems that cascaded into a nationwide crisis. Some parts of the network that were affected directly by the weather recovered quickly and got back on schedule. "But when we looked at other areas in the network, we saw that these reserves were just not available, and things just kept getting worse."
For example, the data showed that Denver's reserves were rapidly dwindling because of the weather delays, but then "it also allowed us to trace this failure from Denver to Las Vegas," he says. While there was no severe weather there, "our method was still showing us a steady decline in the number of aircraft that were able to serve flights out of Las Vegas."
He says that "what we found was that there were these circulations of aircraft within the Southwest network, where an aircraft might start the day in California and then fly to Denver, and then end the day in Las Vegas." When the storm hit, that cycle got interrupted. As a result, "this one storm in Denver breaks the cycle, and suddenly the reserves in Las Vegas, which is not affected by the weather, start to deteriorate."
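The dynamic Dawson describes can be reproduced with a toy simulation. The sketch below is purely illustrative (our own construction, with made-up fleet sizes and a three-city cycle, not Southwest's actual network): aircraft rotate through a California-Denver-Las Vegas loop, and grounding departures in Denver drains Las Vegas of aircraft even though its weather is fine.

```python
# Toy simulation of the broken circulation described above; the fleet sizes
# and three-city rotation are invented for illustration.
ROTATION = {"OAK": "DEN", "DEN": "LAS", "LAS": "OAK"}  # daily aircraft flow

def advance_one_day(planes: dict, denver_grounded: bool) -> dict:
    """Move each city's aircraft to its next stop; grounded planes stay put."""
    next_day = {city: 0 for city in planes}
    for city, count in planes.items():
        if city == "DEN" and denver_grounded:
            next_day["DEN"] += count      # storm: Denver departures cancelled
        else:
            next_day[ROTATION[city]] += count
    return next_day

planes = {"OAK": 5, "DEN": 5, "LAS": 5}
for day in (1, 2, 3):
    planes = advance_one_day(planes, denver_grounded=True)
    print(f"day {day}: {planes}")
# Las Vegas empties almost immediately, while stranded aircraft pile up in
# Denver, even though Las Vegas itself never sees bad weather.
```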
In the end, Southwest was forced to take a drastic measure to resolve the problem: They had to do a "hard reset" of their entire system, canceling all flights and flying empty aircraft around the country to rebalance their reserves.
Working with experts in air transportation systems, the researchers developed a model of how the scheduling system is supposed to work. Then, "what our method does is, we're essentially trying to run the model backwards." Looking at the observed outcomes, the model allows them to work back to see what kinds of initial conditions could have produced those outcomes.
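As a rough illustration of "running the model backwards": if the forward model is differentiable, gradient descent can search for hidden initial conditions whose simulated outcomes match the observed delays. The linear "shortfall leads to delay" model, the numbers, and the three-airport setup below are all invented for illustration; the researchers' actual model is more sophisticated.

```python
# Hedged sketch of inverting a toy forward model by gradient descent. All
# values here are made up; this is not the authors' scheduling model.
import torch

observed_delays = torch.tensor([95.0, 40.0, 12.0])   # minutes, per airport
disrupted_planes = torch.tensor([10.0, 4.0, 1.0])    # aircraft lost to weather
reserves = torch.ones(3, requires_grad=True)         # hidden initial condition

def forward_model(reserves: torch.Tensor) -> torch.Tensor:
    """Delay grows when disruptions exceed the reserves on hand (toy model)."""
    shortfall = torch.relu(disrupted_planes - torch.relu(reserves))
    return 10.0 * shortfall

optimizer = torch.optim.Adam([reserves], lr=0.05)
for _ in range(1000):
    optimizer.zero_grad()
    loss = torch.sum((forward_model(reserves) - observed_delays) ** 2)
    loss.backward()
    optimizer.step()

print("reserves most consistent with the observed delays:", reserves.detach())
```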
While the data on the actual failures were sparse, the extensive data on typical operations helped in teaching the computational model "what is feasible, what is possible, what's the realm of physical possibility here," Dawson says. "That gives us the domain knowledge to then say, in this extreme event, given the space of what's possible, what's the most likely explanation" for the failure.
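As a generic stand-in for this idea (not the researchers' implementation), one can fit a density model to plentiful normal-operations data and use it to score how plausible a candidate explanation is; the kernel density estimate and the two made-up features below are purely illustrative.

```python
# Generic stand-in, not the authors' method: fit a density model to normal-
# operations data, then judge which candidate explanations of a rare event
# stay within "the realm of physical possibility." All numbers are synthetic.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Synthetic normal operations: [turnaround minutes, reserve aircraft on hand]
normal_ops = rng.normal(loc=[45.0, 3.0], scale=[5.0, 1.0], size=(5000, 2))

density = KernelDensity(bandwidth=2.0).fit(normal_ops)

def log_plausibility(candidate) -> float:
    """Log-density of a candidate state under the normal-operations model."""
    return density.score_samples(np.atleast_2d(candidate))[0]

print(log_plausibility([46.0, 3.0]))    # a typical day scores high
print(log_plausibility([120.0, 0.0]))   # an extreme state scores far lower
```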
This could lead to a real-time monitoring system, he says, in which data on normal operations are constantly compared with current data to determine what the trend looks like. "Are we trending toward normal, or are we trending toward extreme events?" Seeing signs of impending issues could allow for preemptive measures, such as redeploying reserve aircraft in advance to areas where problems are anticipated.
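A monitor of the kind Dawson describes could be as simple as the sketch below, which watches a running anomaly score for a sustained drift away from normal; the window size and threshold are hypothetical choices, not values from the paper.

```python
# Hypothetical trend monitor: daily scores measure how far operations sit from
# the normal baseline (more negative = worse). Window and threshold are
# illustrative assumptions.
import numpy as np

def trend_alarm(daily_scores, window: int = 3, threshold: float = -2.0) -> str:
    recent = np.asarray(daily_scores[-window:])
    drifting = bool(np.all(np.diff(recent) < 0))   # steadily getting worse
    extreme = recent[-1] < threshold               # already far from normal
    return "preempt: redeploy reserves" if drifting and extreme else "normal"

print(trend_alarm([-0.2, -0.8, -1.6, -2.5]))   # -> "preempt: redeploy reserves"
```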
Work on developing such systems is ongoing in her lab, Fan says. In the meantime, the team has produced an open-source tool for analyzing system failures, called CalNF, which is available for anyone to use. Meanwhile, Dawson, who earned his doctorate last year, is working as a postdoc to apply the methods developed in this work to understanding failures in power networks.
The research team also included Max Li from the University of Michigan and Van Tran from Harvard University. The work was supported by NASA, the Air Force Office of Scientific Research, and the MIT-DSTA program.