As we wrap up the first month of 2023, one can't help but notice that software, or more specifically the failure of software to perform, has garnered a significant amount of attention so far this year. In just a few short weeks, software failures have rendered an airline incapable of transporting passengers, shut down the nation's entire air traffic control system, and as recently as last week halted trading on the New York Stock Exchange, all due to various levels of malfunction.
Seeing how the year is still young, could this be a preview of a common theme and more news headlines to come?
Obviously, we hope that's not the case. But it's hard to ignore that as software powers more and more of our economy and our lives, these types of failures are going to happen.
Chances are if you're reading this blog you're not maintaining software as complex as the Stock Exchange, an airline, or the FAA. However, you probably are responsible in one way or another for software that affects your business's ability to perform. While your systems are perhaps less sophisticated, the problems that led to these failures can exist at a business or organization of any scale, regardless of how complicated or simple the software in place is.
As we progress through this new year, it's a good time to study what caused these failures and put steps in place to avoid a repeat outcome in our own businesses. Before digging deeply into preventative measures that we can take, let's briefly summarize some of the issues that we've seen so far this year.
Southwest Airlines Debacle
It's highly unlikely that anyone is not familiar with this story by now, but for those who need a refresher, let me summarize what happened. Over the busy holiday travel period, much of the country was affected by a complicated weather situation. This put a strain on airlines during one of their peak travel seasons. While most airlines were able to manage their way through it, Southwest Airlines basically came to a standstill. From what we can tell, the issue wasn't the ability of airplanes to travel or passengers to show up for their flights, but that the software which manages Southwest's operations was not equipped to handle so many extraordinary variables at once.
In the days that followed, revelations came to light about the previous executive team's failure to invest in the software platform. Inefficiencies in crew scheduling, for example, put Southwest at a disadvantage compared to its competitors. The software was so inadequate and so poorly iterated that, for the first time, the term technical debt began to enter the mainstream. By some estimates, Southwest might have lost, or will lose, somewhere in the area of $1 billion.
FAA System Failure
No more than two weeks after the Southwest disaster, on January 11th, the nation's air traffic suffered its largest and most catastrophic disruption of service since 9/11. Flights from coast to coast were grounded and unable to depart due to a software malfunction.
As systems were brought back online and investigators began researching what happened, it became apparent that human error, and potentially bad practice by a contractor, led to the standstill of air traffic that frustrated an already frazzled public. Furthermore, attention was brought to the FAA itself, and to possible weaknesses that have developed due to underfunding and overworked employees.
Worst of all, this incident shattered confidence in an integral part of our nation's infrastructure, one that drives our economy and our ability to travel. Of all of the systems to suffer a failure, this is perhaps one of the worst and most visible, as it affected many people at one time and sidelined our nation's air travel, which is always top of mind for the media and audiences in general.
New York Stock Exchange
As I began writing this post last week, yet another software crisis occurred on the morning of January 24th. The New York Stock Exchange opened for trading and immediately suffered from an issue that involved some of its largest publicly traded companies. As the market opened, many stocks were shown to have abnormally large moves in their value. As you may or may not know, the Stock Exchange long ago put in place a series of circuit breakers, switches that are triggered when volatility is deemed too high to allow trading to safely continue. When this issue was picked up by the systems, trading was halted in certain securities.
It's too early as of the writing of this post to know what caused this issue, though fingers are pointing to human error yet again. Whatever the cause, it's highly concerning. There have been issues like this in the past with the Stock Exchange, and at this point, most people are probably used to the occasional glitch. However, such errors do not instill confidence in a public that at times thinks that machines and algorithms control the market, not people.
What’s the Real Damage?
In all these instances, the damage can be catastrophic. For Southwest Airlines, it meant the quick erosion of what was a loyal customer base. For the FAA, it leads to a further loss of confidence in both its ability to manage our air traffic control system and, to an extent, its focus on safety. For the New York Stock Exchange, anytime there's a hiccup that involves people's financial well-being, credibility is lost.
What About Your Company?
As I stated above, these examples all involve mission-critical systems. We expect these systems to be operational 100% of the time regardless of factors that may or may not have been predicted. However, as we study the root causes of all of these failures, we find that the problems were caused by decisions or missteps that are relatively simple and could happen to any company, regardless of the scale or complexity of the software.
The Southwest Airlines example was the result of bad decision-making and lack of investment. For many years whistleblowers inside the company were highlighting that the systems were ineffective and outdated. However, the previous management team did not deem it important enough to invest and instead kept building on top of a shoddy framework.
The FAA is an example of potential mismanagement of procedures and failure to adhere to best practices. It seems in that case that some contractors literally deleted the wrong files from a production environment. This should never be allowed to happen, regardless of the size of the organization, and especially when it concerns people's safety.
If I had told you two months ago that one of the nation's largest airlines would completely buckle due to outdated software, or that the nation's air traffic control system would ground flights nationwide because of accidentally deleted files, you would have told me that I was being at best an alarmist and at worst overly anxiety-ridden. Yet here we are. These things really happened. And sadly, we see these things happen to companies of various sizes in our day-to-day practice. The net effect of these failures is the same, regardless of the scale of the organization: a loss of consumer confidence, a decline in reputation, and the stress and anxiety of a fire to put out.
So what can you do to avoid a failure, or at least prepare for the best possible outcome should one occur? These are a few rules worth considering:
Rule 1: Realize You Must Invest. It never ceases to amaze me how little management wants to invest in software, despite how much they invest in other areas. One of my favorite stories is about a company that spent hundreds of thousands of dollars on office space despite the fact that most people were working from home. The same company would not invest in a comprehensive maintenance plan for its software. If you've built and deployed custom software to power your business, you must realize that there is also a need for ongoing improvements and iterative development of the platform. How much you need to invest depends on a variety of variables, which fall outside the scope of this post. But for now, accepting that an investment is required is a step in the right direction.
Rule 2: Software is Living and Breathing. Software, much like any other asset that your company owns, is not a once-and-done proposition. It needs to be consistently tested, challenged, and improved upon. Failure to improve upon your software can lead to trouble down the line. Because the world and technology are continuously changing, so too must your systems. And those changes must happen in a stable and thoughtful way. You can't rush progress and not expect to pay a price in the long term. As I mentioned earlier, the term technical debt entered the mainstream thanks to Southwest Airlines, which did not do a good job of properly iterating its systems, leading to instability and an inability to innovate.
Rule 3: Mitigate Risks Before Problems Arise. I wish I could stand here and tell you that software failures are completely avoidable, but the truth is they are not. Failures happen, especially these days when so much of our custom software relies on third parties for key functionality. Planning and implementing methods to deal with errors is time-consuming and adds to the overall scope and cost of a development project. However, proper planning can save many a headache later. As you initially develop and iterate your software, always consider possible points of failure, and what the system can do to survive in a worst-case scenario.
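To make Rule 3 a little more concrete, here is a minimal sketch in Python of one common way to survive a third-party outage: retry the call a few times, then fall back to a degraded but usable answer instead of failing outright. Everything named here (`call_with_fallback`, `fetch_live_rate`, `last_known_rate`, and the exchange-rate scenario itself) is hypothetical and exists only for illustration. It is not taken from any of the systems discussed above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Try `primary` up to `attempts` times, then degrade to `fallback`."""
    for attempt in range(1, attempts + 1):
        try:
            return primary()
        except Exception as exc:
            # Catching every exception keeps the sketch short; real code
            # should catch only the errors it knows how to recover from.
            print(f"attempt {attempt} failed: {exc}")
            if attempt < attempts:
                # Wait a little longer before each retry.
                time.sleep(delay_seconds * attempt)
    return fallback()


def fetch_live_rate() -> float:
    # Hypothetical third-party call; it always fails here to simulate an outage.
    raise ConnectionError("rate service unavailable")


def last_known_rate() -> float:
    # Hypothetical cached value the business can tolerate for a short time.
    return 1.08


rate = call_with_fallback(fetch_live_rate, last_known_rate)
print(f"using rate {rate}")
```

The design question worth asking for each dependency is what an acceptable fallback looks like. A slightly stale value may be fine for a dashboard and unacceptable for a payment, and that decision belongs in planning, not in the middle of an outage.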
It stands to reason that as businesses and industries become more and more reliant on software to automate and streamline their processes, failures will occur. As I said above, to an extent failures are unavoidable. However, failures that are a result of bad planning, lack of attention, or failure to adhere to best practices are completely avoidable. Sadly, regardless of the importance of a platform, the simplest mistakes are still causing catastrophic events.
Organizations need to make a point of ensuring that someone has their finger on the pulse of not only their software systems but also where the failure points and areas of risk lie. A complete understanding of vulnerabilities is the foundation of a strategy that focuses on the safety, scalability, and deliverability of your software, and on the reputation of your business.