September 30, 2019
Every developer and operations engineer knows the pressure and stress of finding and fixing a critical issue in their code or infrastructure.
When dealing with issues, they often enter panic mode to try and get the issue resolved as soon as possible. While this can yield positive results, it can also lead to mistakes.
Two frequent traps we fall into when in panic mode are:
The quick fix - you cut some corners so that everything starts working again, promising to fix those tests or perform that extra step later
Making too many changes - you implement several fixes at once, then have to figure out which change fixed the issue
The essence of software engineering is to simplify the development of software by writing code that is easy to enhance, maintain and debug.
Today we make use of services, frameworks and modern programming languages. This minimizes the code we need to write ourselves, as there is often already a tested and verified tool that can do the task for us.
You can implement services, frameworks and programming languages without fully understanding their inner workings. Initially, this makes development quick, but when it comes to debugging an issue, you may lack the knowledge to diagnose the root cause.
For example, running your load-balanced application on an MVC framework behind an Apache server that routes requests makes deployment simple, in theory.
The problem comes if you are trying to debug an issue where connections to your applications are hanging. Is the problem in the load balancer, Apache server or framework, and how do these components handle connections to the app?
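One way to narrow this down inside the application itself is to time each stage of request handling and log the result. The sketch below is only an illustration: the stage names, the `timed_stage` helper and the `handle_request` function are hypothetical, and the sleeps stand in for real work. It covers the framework side only; the load balancer and Apache need their own access and error logs.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("request_timing")


@contextmanager
def timed_stage(stage, timings):
    """Record how long one stage of request handling takes, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the stage raises, so slow failures are recorded too.
        elapsed = time.perf_counter() - start
        timings[stage] = elapsed
        logger.info("stage=%s elapsed=%.3fs", stage, elapsed)


def handle_request(timings):
    # Hypothetical stages; replace the sleeps with the real work.
    with timed_stage("routing", timings):
        time.sleep(0.01)
    with timed_stage("controller", timings):
        time.sleep(0.05)
    return "ok"
```

If one stage dominates the timings, that is where to look first. If every stage is fast while clients still hang, the problem is more likely in front of the application.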
The quick fix is a classic technique for handling escalations or critical issues. The user of your application is happy the software is working, and your manager stops putting pressure on you.
Making the customer and your manager happy is a good thing, but in doing so, you are causing yourself future problems for the following reasons:
If you have an issue in an application and the fix is to restart the application or scale up the application, chances of the issue occurring again are high as the root cause still exists.
Repeated restarts, or running permanently scaled up, can become costly.
If you have an application that intermittently fails to retrieve information from your database, you could add checks to the code to protect against the issue. The user would see a friendly error page asking them to try again.
However, you still don't know what caused the database request to fail.
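A friendly error page and a record of the cause are not mutually exclusive. The following is a minimal sketch, assuming a hypothetical `fetch` callable that queries the database and a stand-in `DatabaseError` class in place of whatever exception your driver actually raises. It shows the user a friendly message while logging the full traceback for later investigation.

```python
import logging

logger = logging.getLogger("orders")

FRIENDLY_MESSAGE = "Something went wrong. Please try again."


class DatabaseError(Exception):
    """Stand-in for the exception type raised by your database driver."""


def load_order(fetch, order_id):
    """Return (order, error_message); `fetch` is a hypothetical query function."""
    try:
        return fetch(order_id), None
    except DatabaseError:
        # logger.exception logs the message together with the traceback,
        # so the cause is preserved even though the user never sees it.
        logger.exception("database request failed for order_id=%s", order_id)
        return None, FRIENDLY_MESSAGE
```

The user experience is the same as with the quick fix, but the logs now hold the evidence needed to find the root cause.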
Taking the database request example above, you often promise to look into the issue later. This could be in the form of an engineering ticket, a note or just a mental reminder.
As a software engineer, there are always lots of tasks to do, so it is easy for the priority of investigating the issue to slip and be forgotten.
With modern applications, there are many components such as databases, hosting providers or services the application uses that are not controlled by the code you develop.
You could have a bug in the code, but there could also be a misconfiguration in the components external to the application.
Isolating the component causing the issue can be time-consuming, which can lead to the temptation to change many components to fix all the suspected causes.
Where there are several suspected causes, each with its own fix or workaround, it seems logical to save time by deploying all the changes simultaneously.
Deploying these changes may resolve the issue, but you then have to consider the following:
There is no guarantee that the first set of changes resolves the issue. There may be several iterations of changes made until the issue is resolved.
It is easy to lose track of precisely what changes were implemented, resulting in unknown changes being added to the application.
If a deploy had 10 changes and fixed the issue, there is no way to know which change or combination of changes fixed the issue. In this situation, you should remove the changes that had no impact, but you cannot do this without the risk of breaking the application again.
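The alternative is to try each change on its own, ideally in a test environment where the issue can be reproduced. The sketch below assumes hypothetical `apply_change`, `revert_change` and `is_healthy` callables that you would supply. It only tests changes individually, so it will not detect a fix that needs a combination of changes.

```python
def find_fixing_changes(changes, apply_change, revert_change, is_healthy):
    """Apply each candidate change on its own and report which ones fix the issue."""
    fixing = []
    for change in changes:
        apply_change(change)
        try:
            if is_healthy():
                fixing.append(change)
        finally:
            # Always revert, so the next change is tested in isolation.
            revert_change(change)
    return fixing
```

This is slower than deploying everything at once, but at the end you know exactly which change to keep.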
As removing changes from the last deploy of the application risks downtime of the application, often the changes are left in the application.
Initially, changes made to the application appear to have resolved the issue, but you must consider the long-term side effects of these changes.
For example, increasing the maximum concurrent connections to a server may resolve an issue with clients seeing random timeouts, but it will also increase the resource consumption of the server. This may result in further issues weeks or months later.
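A rough worst-case estimate makes the trade-off visible before the change is deployed. The figures below are purely illustrative assumptions (the connection limits and the 20 MB per connection are not measurements from any real server), and the helper name is hypothetical.

```python
def estimated_connection_memory_mb(max_connections, memory_per_connection_mb):
    """Worst-case memory used if every allowed connection is open at once."""
    return max_connections * memory_per_connection_mb


# Assumed values for illustration only.
before = estimated_connection_memory_mb(150, 20)
after = estimated_connection_memory_mb(400, 20)
print(f"before: {before} MB, after: {after} MB")
```

If the second figure is larger than the memory available on the server, the change has swapped timeouts now for an out-of-memory problem later.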
The key to overcoming the temptation of these traps is to reduce the time it takes to find and fix the issue; this way, there is no need to cut corners to make the application users happy.
It is possible to reduce the diagnostic time with logs, APM software and error monitoring. Use whatever tools are available to help you.
Technical Support Engineer and Developer advocate for nerd.vision and other software projects. Background in object-oriented software development and data analysis. Personal hobbies include simulation gaming and watch collecting.