Traps to avoid when dealing with time critical issues

Mikey

October 2, 2019

All developers and operations team members know the pressure and stress of finding and fixing a critical issue in their code or infrastructure.

When dealing with such issues, they often enter panic mode to try to get the issue resolved as soon as possible. While this can yield positive results, it can also lead to mistakes.

Two frequent traps we fall into when in panic mode are:

The quick fix - you cut some corners so that everything starts working again, promising to fix those tests or perform that extra step later

Making too many changes - you implement several fixes at once, then have to figure out what change fixed the issue

Why do we fall into these traps?

The essence of software engineering is to simplify the development of software by producing code that is easy to enhance, maintain and debug.

Today we make use of services, frameworks and modern programming languages. This minimizes the code we need to write ourselves, as there is often already a tested and verified tool that can do the task for us.

You can implement services, frameworks and programming languages without fully understanding their inner workings. Initially, this makes development quick, but when it comes to debugging an issue, you may lack the knowledge to diagnose the root cause.

For example, running your load-balanced applications on an MVC framework, behind an Apache server that routes the requests, makes deploying your application simple in theory.

The problem comes if you are trying to debug an issue where connections to your applications are hanging. Is the problem in the load balancer, Apache server or framework, and how do these components handle connections to the app?

Why is 'The quick fix' not the correct approach?

The quick fix is a classic technique for handling escalations or critical issues. The user of your application is happy the software is working, and your manager stops putting pressure on you.

Making the customer and your manager happy is a good thing, but in doing so, you are causing yourself future problems for the following reasons:

You don't find out what caused the issue

If you have an issue in an application and the fix is to restart or scale up the application, the chances of the issue occurring again are high, as the root cause still exists.

This approach could potentially become costly.

Continually scaling up to avoid issues requires more resources. Six months ago, you were running two copies of the application with no problems, but now you are paying to run eight copies.
Spending two hours a week stabilizing software, rather than a day investigating how to stop the issue occurring, adds up over time.
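
The arithmetic behind that second point can be sketched roughly as below. The 8-hour working day is an assumption that is not stated above, and the function name is purely illustrative.

```python
# Compare recurring stabilization work with a one-off investigation.
# Assumption (not from the article): one working day is 8 hours.
HOURS_PER_WEEK_STABILIZING = 2
INVESTIGATION_HOURS = 8


def weeks_until_break_even(weekly_hours: float, one_off_hours: float) -> float:
    """Weeks after which the recurring work has cost as much as the investigation."""
    return one_off_hours / weekly_hours


if __name__ == "__main__":
    weeks = weeks_until_break_even(HOURS_PER_WEEK_STABILIZING, INVESTIGATION_HOURS)
    hours_per_year = HOURS_PER_WEEK_STABILIZING * 52
    print(f"Break-even after {weeks:.0f} weeks")
    print(f"Stabilizing for a full year costs {hours_per_year} hours")
```

Under these assumptions, the investigation pays for itself within about a month, and every week after that is time saved.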

You have not fixed the issue itself, only its side effect

If you have an application that randomly hits an error retrieving information from your database, you could add checks to the code to protect against the issue. The user would see a friendly error page asking them to try again.

However, you still don't know what caused the database request to fail.
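
A guard of this kind might look something like the sketch below. `DatabaseError`, `fetch_user` and `render_profile` are hypothetical names standing in for whatever your driver and framework provide. The one addition beyond the friendly message is that the exception is logged, so some evidence survives for a later investigation.

```python
import logging

logger = logging.getLogger("app")

FRIENDLY_MESSAGE = "Something went wrong. Please try again."


class DatabaseError(Exception):
    """Stand-in for whatever error the real database driver raises."""


def render_profile(fetch_user, user_id):
    """Return page text for a user, falling back to a friendly error message.

    fetch_user is a hypothetical callable that queries the database.
    """
    try:
        user = fetch_user(user_id)
    except DatabaseError:
        # The guard hides the failure from the user, but the root cause is
        # still unknown. Logging the traceback at least keeps the evidence.
        logger.exception("Database request failed for user_id=%s", user_id)
        return FRIENDLY_MESSAGE
    return f"Profile for {user['name']}"
```

Note that this only handles the side effect: the user gets a nicer page, but nothing here explains why the request failed.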

You plan to investigate the issue later, but often that doesn't happen

Taking the database request example above, you often promise to look into the issue later. This could take the form of an engineering ticket, a note or just a mental reminder.

Software engineers always have lots of tasks to do, so it is easy for the investigation to slip down the priority list and be forgotten.

Why is 'Making too many changes' not the correct approach?

Modern applications rely on many components, such as databases, hosting providers and external services, that are not controlled by the code you develop.

You could have a bug in the code, but there could also be a misconfiguration in the components external to the application.

Isolating the component causing the issue can be time-consuming, which can lead to the temptation to change many components to fix all the suspected causes.

Where there are several suspected causes, each with its own fix or workaround, it seems logical to save time by deploying all the changes simultaneously.

Deploying these changes may resolve the issue, but you then have to consider the following:

How many changes were implemented?

There is no guarantee that the issue is resolved by the first set of changes. There may be several iterations of changes before the issue is resolved.

It is easy to lose track of precisely what changes were implemented, resulting in unknown changes being added to the application.

What changes had a positive impact on the application?

If a deploy had 10 changes and fixed the issue, there is no way to know which change or combination of changes fixed the issue. In this situation, you should remove the changes that had no impact, but you cannot do this without the risk of breaking the application again.

Because removing changes from the last deploy risks downtime, the changes are often left in place.
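
The alternative to a single deploy of 10 changes is to apply them one at a time and check after each. A minimal sketch of that idea follows, assuming each change can be expressed as a callable and that you have some health check to run. Both `apply_one_at_a_time` and the example changes are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def apply_one_at_a_time(
    changes: List[Tuple[str, Callable[[], None]]],
    is_healthy: Callable[[], bool],
) -> Tuple[List[str], Optional[str]]:
    """Apply changes in order, checking health after each one.

    Returns the names of the changes applied and the name of the change
    after which the health check first passed (None if it never passed).
    """
    applied: List[str] = []
    for name, apply in changes:
        apply()
        applied.append(name)
        if is_healthy():
            return applied, name
    return applied, None


if __name__ == "__main__":
    # Illustrative settings only; a real change would be a deploy or config edit.
    settings = {"pool_size": 5, "timeout_s": 1}
    changes = [
        ("raise timeout", lambda: settings.update(timeout_s=5)),
        ("raise pool size", lambda: settings.update(pool_size=20)),
        ("restart cache", lambda: None),
    ]
    applied, fix = apply_one_at_a_time(changes, lambda: settings["pool_size"] >= 20)
    print(f"Applied {applied}; healthy after '{fix}'")
```

This keeps an exact record of what was changed and stops as soon as the issue clears. It still cannot tell you whether the last change worked on its own or only in combination with the earlier ones, but the list of candidates is far shorter.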

What side-effects could the changes have introduced?

Initially, changes made to the application appear to have resolved the issue, but you must consider the long-term side effects of these changes.

For example, increasing the maximum concurrent connections to a server may resolve an issue with clients seeing random timeouts, but it will also increase the resource consumption of the server. This may result in further issues weeks or months later.
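
To get a rough feel for that extra consumption, here is a back-of-the-envelope sketch. It assumes memory use grows linearly with the number of connections, which is a simplification, and every figure in it is an illustrative assumption rather than a measurement.

```python
def worst_case_memory_mb(max_connections: int, mb_per_connection: float) -> float:
    """Memory needed if every allowed connection is in use at once."""
    return max_connections * mb_per_connection


if __name__ == "__main__":
    # Assumed figures: 20 MB per connection, limit raised from 150 to 400.
    before = worst_case_memory_mb(150, 20.0)
    after = worst_case_memory_mb(400, 20.0)
    print(f"Worst case grows from {before:.0f} MB to {after:.0f} MB")
```

If the server was sized for the old limit, the new worst case may only show up weeks later, when traffic finally reaches it.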

Tools to help avoid these traps

The key to overcoming the temptation of these traps is to reduce the time it takes to find and fix the issue; this way, there is no need to cut corners to make the application users happy.

It is possible to reduce the diagnostic time with logs, APM software and error monitoring. Using whatever tools are available to you is strongly recommended.
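
As one small illustration of how logs can shorten diagnosis, the sketch below records how long a named step takes, using only the Python standard library. `timed_step` is a hypothetical helper, not part of any APM product, and a real monitoring tool would give you far more than this.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("diagnostics")


@contextmanager
def timed_step(name, timings=None):
    """Log how long a named step took, even if the step raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if timings is not None:
            # Optionally collect durations so the caller can inspect them.
            timings[name] = elapsed
        logger.info("step=%s duration_ms=%.1f", name, elapsed * 1000)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with timed_step("database query"):
        time.sleep(0.05)  # placeholder for the real work
```

With timings like these around each component, a hanging connection can be narrowed down to one step instead of guessed at.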
