
Software can break, sometimes for no good reason. How should you prepare and plan in advance of a software breakdown?


Planning for the Worst: How to Avoid Critical Software Breakdowns

8 Min | May 17, 2021

This post is inspired by an issue I'm dealing with right now. I found myself the victim of one of those things that just sort of happen: an API integration that failed. On this site, our contact forms generally go directly to HubSpot, where marketing automation takes over. For us, that's a pretty big issue, as so many of our new customers find us through organic channels. No leads means no new business. That isn't good, and when an error like this is encountered, a sinking feeling settles in that is discomforting, to say the least.

Now, you may be wondering why I'd admit to such an error. Shouldn't a web development agency be ready for this kinda thing?

Well, yes, we should, but I'll make two points to maintain my own credibility. First, unknown things happen to everyone. The cause of this particular issue is so odd that it's nearly unpredictable. And second, you do know the old line about the shoemaker's kid, right? In the day-to-day operation of the business, our own website often comes second to the work we do for others. In fact, this current site just went live a few months ago after a five-year run for the old design. It's true, though, and I'll admit it: this one slipped by all of us.

In reality, this type of thing happens often – systems that never fail, fail. And when that happens, we figure out why, implement fixes, and add redundancies to avoid the same issue in the future. Inevitably, the cycle repeats, something else significant breaks, and the same steps get retaken. It's impossible to prevent unknowns – if it were possible, they wouldn't be unknowns, right?

However, the goal of a comprehensive web development and maintenance strategy is to predict as many unknowns as we can, particularly in areas of urgent importance to your company and application. In this post, I will give some suggestions to help you identify critical areas and possible ways you can implement redundancies. But first, my own example…

The Issue: A Contact Form to Nowhere!

As far as I can tell, my contact form stopped working about two or three weeks ago. I don't get a large number of leads – we are a services organization, so we can go a week or two without a qualified lead, right? Well, it turns out that was the first misstep: simply assuming there was a lull in traffic or conversions. What I should have done a week or two ago was trust my hunch and investigate. We all get distracted, though, and sometimes we lose track of things. I finally did discover this error when I noticed that someone had booked time in my calendar – but I had no idea who they were (more on that in a minute). My process for lead generation involves a contact form inquiry, then a needs assessment – and if you complete both steps in a row, you get a link to my calendar. This person did all of those things and booked an appointment. When I looked at my calendar, I realized that this person had no history on file – there was no activity for me to look at.

That alone would generally be a cause for concern – I had seen this appointment over a week ago. But in this case, it didn't set off an alarm bell because the new lead had the same name as a person I already do business with. So, I had just assumed they already knew how to book an appointment when I saw the meeting get set – and I didn't pay it any attention.

At this point – it all sounds like a passage from Malcolm Gladwell's book Outliers, where he studied how culture and accumulated errors cause plane crashes. In particular, this passage is relevant:

Then the errors start — and it's not just one error. The typical accident involves seven consecutive human errors. One of the pilots does something wrong that by itself is not a problem. Then one of them makes another error on top of that, which combined with the first error still does not amount to catastrophe. But then they make a third error on top of that, and then another and another and another and another, and it is the combination of all those errors that leads to disaster.

The same goes for almost every catastrophic bug we see in the software business. In this case, an underlying issue knocked out our API connectivity. I believed it was just a seasonal slowdown – this isn't a busy time of the year for new leads to come in. One lead managed to go through, but it was the same name as another client. The chain reaction here is much like an air disaster (albeit with a much smaller consequence, of course) – small events and misfortunes turn into a significant issue.

What's the cost of this error? Well, analytics tell me about 15 leads came through, and 3 filled out needs assessments. I managed to salvage one. So – it's hard to put an economic value on it – but it isn't good.

With all of that said, what should I have done? And what will I do now?

Identify Lynchpins

This is obvious in my case, but if you are still in pre-crisis mode, perhaps you should take a look at where your areas of maximum value are. If you are in lead generation, study in advance all of the risk areas around your contact forms. If you are selling products, study everything around your checkout process. Find the one place that, if compromised, does the most harm to your business. Then, try to predict points of failure. You need to literally expect the worst – even if it's not probable. Trust me: anything can happen. Once you've completed this process, you can work through the steps below to alleviate future stressful incidents.

One last note – make sure everyone on your team is aware of these failure points. You want complete ownership by all levels of team members, contractors, and providers.

Build Redundancies

In our case, we luckily had some of the data backed up to local CMS systems, but we – I mean I – failed to check them regularly. Website owners tend to trust API connections and believe they will always work. The truth is, not really. These connections fail all the time. In susceptible environments, we'd introduce logging of requests or store the data locally. In our case, where our own site is often secondary to our clients and their needs, we failed to implement this comprehensively. It was there, but we needed a reminder that it was. As of now, this is my number one priority – making sure any form submitted on our site that relies on API transmission of data either stores the data locally in a database or stores the API calls with the data attached. And most of all – making sure it notifies us when something is there.
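To make that concrete, here's a rough sketch of what I mean – a TypeScript form handler that writes the lead to a local database first, then attempts the API hand-off, and alerts a human if the hand-off fails. The Express setup, the db and notifyTeam helpers, and the CRM endpoint are all placeholders for whatever your own stack uses, not the code actually running on our site:

```typescript
// A rough sketch of a form handler that never relies on the CRM API alone.
// The `db` and `notifyTeam` declarations are stand-ins for whatever
// persistence and alerting your stack already has.
import express from "express";

declare const db: {
  leads: {
    insert(lead: object): Promise<string>;
    markSynced(id: string): Promise<void>;
    markFailed(id: string, reason: string): Promise<void>;
  };
};
declare function notifyTeam(message: string): Promise<void>;

const app = express();
app.use(express.json());

app.post("/contact", async (req, res) => {
  const submission = { ...req.body, receivedAt: new Date().toISOString() };

  // 1. Persist locally first, so the lead survives any API outage.
  const localId = await db.leads.insert(submission);

  try {
    // 2. Then attempt the CRM hand-off (the endpoint here is illustrative only).
    const crmResponse = await fetch("https://api.example-crm.com/v1/leads", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(submission),
    });
    if (!crmResponse.ok) throw new Error(`CRM returned ${crmResponse.status}`);
    await db.leads.markSynced(localId);
  } catch (err) {
    // 3. The hand-off failed: flag the lead locally and tell a human right away.
    await db.leads.markFailed(localId, String(err));
    await notifyTeam(`Lead ${localId} was stored locally but NOT synced to the CRM: ${err}`);
  }

  // The visitor gets a success response either way – the lead itself is safe.
  res.status(200).json({ ok: true });
});

app.listen(3000);
```

The ordering is the whole point: the local write happens before the API call, so a dead integration can never cost you the lead itself – only the convenience of having it in the CRM.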

In your case, redundancies can mean any number of things. It could mean ensuring heightened system availability, perhaps by introducing database redundancy via clustering or load balancing. Or it could be identical to our issue – crafting a fallback that mirrors the primary system in case there are outages. Either way, it's essential to figure out how you can have a suitable backup in place for when the primary system malfunctions.

Tighten Up Deployment Practices

Most likely, my issue was caused by a software update on the server itself, which ultimately crashed DNS services and left our system unable to connect to the API. It couldn't find the API on the internet. Our suspicion is that the update was run by the hosting provider. Because there is sometimes a gap between system administration and development, the update led to an outage. This is something that can be avoided with some simple communication between different teams of people. What is a mundane update for a systems administrator can make software go haywire and leave people scrambling to diagnose what's going on. To fix this, it's essential to follow best practices for development, staging, and live environments. On top of that, you must also make sure any updates are documented and announced ahead of time. This is hard when dealing with providers and web hosts – but it must be thought of and planned for.
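One lightweight safeguard is a smoke test that runs after every server or hosting update. Here's a minimal sketch, assuming Node 18+ and with the hostname as a placeholder, that checks the two things that failed for us: whether the server can still resolve the API's hostname, and whether the API responds at all:

```typescript
// A minimal post-update smoke test. Assumes Node 18+ (built-in fetch);
// the hostname is a placeholder for whatever API your site depends on.
import { resolve4 } from "node:dns/promises";

const API_HOST = "api.example-crm.com";

async function smokeTest(): Promise<void> {
  // 1. Can the server still resolve the API's hostname? (Our outage was DNS.)
  const addresses = await resolve4(API_HOST);
  if (addresses.length === 0) {
    throw new Error(`DNS returned no records for ${API_HOST}`);
  }

  // 2. Can we actually reach it over HTTPS?
  const response = await fetch(`https://${API_HOST}/`, { method: "HEAD" });
  if (response.status >= 500) {
    throw new Error(`${API_HOST} answered with ${response.status}`);
  }

  console.log(`OK: ${API_HOST} resolves to ${addresses[0]} and responds (${response.status})`);
}

smokeTest().catch((err) => {
  console.error(`Smoke test failed: ${err}`);
  process.exit(1); // a non-zero exit lets cron or CI turn this into an alert
});
```

Wire something like this into a cron job or deployment pipeline so a failing exit code becomes an alert rather than a silent outage.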

Consider Manual and Automated Testing

For our clients, we perform a host of ongoing manual tests for their most critical applications. Usually, this involves a test script that our team runs through regularly, looking for errors or areas of possible improvement. While not the cheapest way to test, it works really well.

To avoid that time and/or expense, you could also consider automated testing to ensure that certain things are functioning as they should. This is a decent alternative, but you should also understand that it never works seamlessly. Plus, throwing more tech at something introduces some more possibilities for future error.
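As one illustration, here's a small synthetic check that submits a clearly labeled test lead through the public form endpoint and then verifies it actually arrived on the other side. The URLs and the lead-lookup endpoint are hypothetical – the point is the end-to-end shape of the test, not the specifics:

```typescript
// A sketch of a scheduled synthetic test: submit a labeled test lead through
// the public contact form, then confirm it landed downstream. Both URLs are
// placeholders; swap in your own form handler and whatever lookup you have.
const FORM_ENDPOINT = "https://www.example.com/contact";
const LEADS_LOOKUP = "https://www.example.com/api/leads";

async function syntheticLeadCheck(): Promise<void> {
  const marker = `synthetic-test-${Date.now()}`;
  const testEmail = `${marker}@example.com`;

  // 1. Submit a test lead exactly the way a visitor would.
  const submit = await fetch(FORM_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ name: "Synthetic Test", email: testEmail }),
  });
  if (!submit.ok) throw new Error(`Form endpoint returned ${submit.status}`);

  // 2. Give the integration a moment, then confirm the lead actually arrived.
  await new Promise((resolve) => setTimeout(resolve, 30_000));
  const lookup = await fetch(`${LEADS_LOOKUP}?email=${encodeURIComponent(testEmail)}`);
  const results = lookup.ok ? await lookup.json() : [];
  if (!Array.isArray(results) || results.length === 0) {
    throw new Error(`Test lead ${marker} never arrived downstream`);
  }

  console.log(`OK: test lead ${marker} made it end to end`);
}

// Run on a schedule (cron, CI, or a monitoring service) and alert on failure.
syntheticLeadCheck().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Run something like this on a daily schedule and alert on failure, and a broken form stops being something you discover weeks later.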

Either way, do have some ongoing cadence planned that provides for some assurances that your systems are functioning regularly. Trust, but verify that everything is working to spec.

Learn Each Time

Finally, as I said earlier, every time there is a failure, there is also a new learning experience. A chance to understand what happened, plan for it, and install monitoring or response strategies. Never walk away from an outage without having learned a lesson. Luckily, we aren't manufacturing airplanes, and we have the luxury of taking this approach. Typically, the lesson will be apparent in whatever fix you manage to utilize to cure the issue. Be humble, learn the lesson, and be wiser for it.

Wrapping Up

It really ruined my Sunday afternoon to realize that leads weren't coming in. That feeling sucks, to put it bluntly. But I also recognize that these things happen, and we move on and grow after each one. For myself, the lessons learned are just how vital these leads are, the need to keep secondary notifications running, and to never assume that we're just "in a slow period." It would be time well spent to think about how such an outage could affect you and to plan ahead. Hopefully, you'll avoid a bit of stress and preserve the calmness of your Sundays!

Focus on your business. Let our team focus on the upkeep. Learn more about our maintenance services.