Fixing Deployment Pipeline Steps' Failures in a Timely Manner
by Areti Panou
One of the cornerstones of Continuous Delivery is to establish a reliable deployment pipeline. Each step of the pipeline aims to provide confidence and feedback on the quality of our changes in a timely manner. Removing any blockers from the pipeline should be the utmost priority for a team so that there is a constant flow of feedback for the changes.
But if keeping the pipeline running is so important, why do we sometimes ignore failures? Why do we put up with failing tests? What makes us turn a blind eye to broken steps? What stops us from taking action to fix the issues?
One might find the answer in Douglas Adams’ 1982 novel “Life, the Universe and Everything”, which is the third book in the five-volume “Hitchhiker's Guide to the Galaxy” series.
“The Somebody Else's Problem field... relies on people's natural predisposition not to see anything they don't want to, weren't expecting, or can't explain”.
The purpose of the field is to hide something in plain sight by simply making it somebody else’s problem. And deployment pipelines with failed steps that take too long to fix tend to be covered by this field. The failing steps are somebody else’s problem because we don’t want to see them, we don’t expect to see them, or we can’t explain them.
In this article, you will find examples of such unpleasant steps along with some ideas and suggestions on how to improve them.
Things you don’t want to see
Flaky steps
Quarantine, stabilise or remove
Flaky steps, which in most cases are some sort of automated check, are steps that fail randomly. They fail not because they have detected an issue with your application after you changed something, but just because they feel like it. Richard Bradshaw explains in detail how things might get flaky. But the fact remains: once a step gets the reputation of being flaky, even someone who knows that their change caused it to fail is unlikely to spend any time fixing it.
An approach to reduce the amount of these flaky steps is to take Martin Fowler's advice and quarantine them.
Place any non-deterministic test in a quarantined area. (But fix quarantined tests quickly.)
Once flaky steps are singled out, it is easier to evaluate them and answer the following questions:
If the purpose of a step is to anticipate a business risk, shouldn’t we, as a product team, invest to make it more stable? And if it isn’t, is it worth having it at all?
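To make the quarantine idea concrete, here is a minimal sketch in plain Python; the `quarantined` marker and the suite-splitting helper are illustrative assumptions, not the API of any particular test framework:

```python
# Sketch of a quarantine mechanism: mark non-deterministic tests so they
# run outside the main suite and no longer block the pipeline.
# The marker and runner are illustrative, not a real framework's API.

def quarantined(test_func):
    """Mark a flaky test so it runs in the quarantined area."""
    test_func.quarantined = True
    return test_func

def split_suite(tests):
    """Separate quarantined tests from the ones that gate the pipeline."""
    main = [t for t in tests if not getattr(t, "quarantined", False)]
    quarantine = [t for t in tests if getattr(t, "quarantined", False)]
    return main, quarantine

def test_checkout_total():
    assert 2 + 2 == 4  # deterministic: stays in the main suite

@quarantined
def test_flaky_payment_gateway():
    pass  # fails randomly; fix quickly, stabilise, or remove

main_suite, quarantine_suite = split_suite(
    [test_checkout_total, test_flaky_payment_gateway]
)
```

The main suite stays green and trustworthy, while the quarantined tests remain visible, which keeps the pressure on to fix or remove them quickly, as Fowler advises.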
Long running steps
Parallelise to save time
Another category of steps that tend to get ignored when they fail are the ones that run for a long time. Not the ones that give you just enough time to go get yourself a coffee, but the ones during which you could go home, cook a three-course meal and still not know whether they have finished. If your task is to fix such a step, it will take a very long time to find out whether you are done with it, making you reluctant to start with it in the first place.
For these steps, one could start thinking about alternative ways to address the same risks, but by slicing things differently. Can we break them down into smaller, faster pieces? Can we run them in parallel to save time?
Keeping in mind that each step of the pipeline is there to provide us with fast feedback, maybe it makes sense to start asking "smaller" questions so that we improve the running time of each step.
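As a minimal sketch of the idea, assuming the long step can be divided into independent slices, the shards below are stand-ins for real groups of checks:

```python
# Sketch: split one long-running step into shards and run them in parallel.
# The shard contents are illustrative stand-ins for real checks.
from concurrent.futures import ThreadPoolExecutor

def run_shard(shard):
    """Execute one slice of the long suite; passes only if every check passes."""
    return all(check() for check in shard)

def run_in_parallel(shards):
    """Run all shards concurrently; the step is green only if all shards are."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        results = list(pool.map(run_shard, shards))
    return all(results)

# Three small shards instead of one monolithic run:
shards = [
    [lambda: True, lambda: True],  # e.g. unit-level checks
    [lambda: True],                # e.g. API checks
    [lambda: True],                # e.g. UI smoke checks
]
all_green = run_in_parallel(shards)
```

The same risks are still covered, but the wall-clock time drops to roughly that of the slowest shard, and a failure now points to a much smaller area to investigate.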
Complicated steps
Simplify or replace
Finally, you won't want to work on failing steps that you know from previous experience will take you the rest of the day to fix or debug. The following example sums it up nicely.
A 30 minute code change took 2 weeks to get the acceptance tests working.
This is a quote from a presentation by Sarah Wells, who is Technical Director for Operations and Reliability at the Financial Times.
Having such cumbersome acceptance steps in your pipeline increases the chance that when people break them, they don’t even want to start working on them.
This is the solution that the Financial Times applied to resolve this issue.
Introduce synthetic monitoring. This replaced our acceptance tests.
By moving the step out of the pipeline and shifting it to after deployment to production, they improved a situation in their pipeline that nobody wanted to take care of. Granted, this approach doesn’t work for every product or business, but you can consider what the analogy would be in your case.
Having detailed error messages logged might be one way to reduce the complexity of fixing such failing steps. The aim is to implement steps that are intuitive and not time-consuming to fix. And in case they can’t be made simpler, maybe it is time to replace them altogether with something different.
Things you don’t expect to see
Unresponsive external services
Ping external services before starting the pipeline
There is always a certain number of external services used in a deployment pipeline. Once these systems are configured, we stop paying any attention to them and we consider them a staple in our process. And it is quite an unexpected situation when these services that you use to build or test things are not responding.
Surely, if you cannot connect to your Nexus repository, this is such a rare and critical situation that somebody must already be working on it. So not your problem.
In this case, does it make sense to even start the entire pipeline for a new change, if you know that the 10th step will fail?
An idea to resolve this would be to do a fast sanity check of all the needed services before starting. But even if you can’t put together something like that, clear error log messages in each step, pointing to whether the application or the infrastructure failed, can be a real help in addressing the unexpected situation. Minimising the surprise of external services not being available, or having enough information on how to get them back, can help to address their failures sooner rather than later.
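Such a pre-flight sanity check can be very small. The sketch below pings a set of health endpoints before the pipeline starts; the service names, URLs and timeout are illustrative assumptions:

```python
# Sketch of a pre-flight sanity check: ping required external services
# before starting the pipeline. Names, URLs and timeout are illustrative.
from urllib.error import URLError
from urllib.request import urlopen

REQUIRED_SERVICES = {
    "artifact-repository": "https://nexus.example.com/health",
    "test-environment": "https://test.example.com/health",
}

def unavailable_services(services, timeout=5):
    """Return the names of services that do not respond in time."""
    down = []
    for name, url in services.items():
        try:
            urlopen(url, timeout=timeout)
        except (URLError, OSError):
            down.append(name)
    return down
```

If `unavailable_services(REQUIRED_SERVICES)` comes back non-empty, the pipeline can refuse to start and report exactly which dependency is down, instead of failing at the 10th step with an obscure connection error.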
New, unannounced steps
Establish a new-steps ritual
So far, the reasons mentioned here about step failures have mainly to do with technical aspects. Nevertheless, lack of communication can also cause problems.
Consider, for example, a change of yours breaking a step that you were not even aware existed. Spending the time to investigate what the step is doing and how you can fix it is something you might not find that appealing. Nobody told you about it, it was not supposed to be there, it’s somebody else’s problem.
Luckily, this can be an easy one to address by simply making a bit of a fuss, each time that a new step is introduced into the pipeline. You can establish a small, fun ritual about it, like voting for a nickname for it. Whoever gives the best name gets a sticker.
Attaching a ritual to the announcement of a new step makes it more memorable, thus removing the element of surprise.
Being the n-th person notified of the failed step
Create public awareness of who is resolving the failure
One of the most common situations is receiving a notification asking you to fix a failing step, but seeing that you and six other people received the same call to action. You don’t expect to be receiving this email, since surely somebody else caused the failure before you; otherwise, why are they on the mailing list? Again, it’s somebody else’s problem!
A solution would be to check if the technology of your pipeline offers any options to visualise if someone is already working on a fix for a failure.
The Claim Jenkins plugin is a great example of this. This plugin allows users to take responsibility for failed steps and creates a dashboard with all the claimed and unclaimed failing steps. It gives a good overview of what is being fixed as well as what needs attention.
Creating this kind of team awareness makes it visible that you should do something, and that is a reason for you to act.
Things you can’t explain
Steps with unclear purpose
Make the value and information of the step visible
Most people don’t want to deal with things they find meaningless. They want to know “why” they need to work on something. Consider having to fix a step called "Scenario Tests". Scenarios of what? What level of the application are they exercising?
So instead of giving steps just a generic name, it is a good idea to use descriptive names and attach some further information to them. For example:
Make the description of the step as explicit as possible.
If you have any wiki or info pages that describe the implementation of the step, add them there as well.
Add the contact information and availability of people that have a deeper understanding of the step.
Having the right amount of information makes people less reluctant to start fixing failed steps. This is because they understand the value (the why) that the step brings, and they need to invest less time in figuring out what the steps are all about in the first place.
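One lightweight way to keep this information together is a small metadata record per step that the pipeline dashboard or failure notification can render. Every field name and value below is an illustrative assumption:

```python
# Illustrative step metadata; all names, URLs and contacts are assumptions.
STEP_INFO = {
    "checkout-scenario-tests": {
        "description": "End-to-end scenarios covering the checkout flow "
                       "at the API level.",
        "docs": "https://wiki.example.com/pipeline/checkout-scenario-tests",
        "contacts": ["team-checkout@example.com"],
    },
}

def describe(step_name):
    """Render a step's purpose and pointers for a failure notification."""
    info = STEP_INFO.get(step_name)
    if info is None:
        return f"No metadata for '{step_name}' - consider adding some."
    return f"{step_name}: {info['description']} (docs: {info['docs']})"
```

Whoever is asked to fix the step then sees what it checks, where it is documented and whom to ask, right in the notification.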
Steps with unclear failure implications
Make the impact and the reasons of the failure transparent
Consider as well what happens when you understand what the purpose of a step is but perhaps don’t fully comprehend the consequences of its failure.
For example, let’s say that you have a step in your pipeline checking the licenses of the open source libraries used in your commercial product. What would it mean if a type of license that is not allowed by your company has been detected? Why fail the step if only one "bad" license is found and then have to put in the extra effort to fix it?
The implications of having a failed step checking that a legal requirement of your company is fulfilled are different than one checking an internal quality criterion. In the example above, your employer might not be very happy if you include an open source component in your product with a license that effectively makes your product free for everyone. On the other hand, they would probably show more tolerance if code coverage doesn't reach an agreed mark.
So along with the information about the purpose of the step, it also makes sense to explicitly describe the failure thresholds and the background of why they are set that way, as the impact of the failure is important here.
Steps with unclear fix deadlines
Set a time limit for the fixes
Having a deployment pipeline doesn’t necessarily mean that all changes going through it end up in production right away. In fact, many teams deploy their changes periodically, for example daily or weekly. The pipeline may not be linear but rather asynchronous, with some steps running not on every change but periodically as well.
So, if a step is running once a day why should you fix it now, when you could do it two minutes before it runs again?
To handle this situation, you need to remove any ambiguity about how long a failure is allowed to stay in the pipeline. Ideally, a fix should follow as soon as the failure is known. But to remove the doubt about whether the fix should happen now or later, it is a good idea to define a small time interval for which a failed step is acceptable, and to automatically revert the change if it is exceeded. For example, if a failure cannot be fixed within 15 minutes, the change requires more rework, so it might make sense to revert it, get the pipeline running again, and take the time necessary to fix the issue locally.
Having the same time interval for fixes across all steps helps build muscle memory in your team and removes the uncertainty about when a fix is due.
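The revert rule itself is a one-liner once the failure timestamp is known. The sketch below assumes the 15-minute budget from the example above; the timestamps are illustrative, and the actual revert hook would depend on your pipeline tooling:

```python
# Sketch of a fixed fix budget: once a step has been red longer than the
# agreed limit, the change should be reverted automatically.
# The 15-minute budget comes from the example in the text; the revert
# mechanism itself is left to your pipeline tooling.
from datetime import datetime, timedelta

FIX_BUDGET = timedelta(minutes=15)

def should_revert(failed_at, now=None, budget=FIX_BUDGET):
    """True once a failure has outlived the agreed fix window."""
    now = now or datetime.now()
    return now - failed_at > budget

# Illustrative timeline: the step went red at 10:00.
failed_at = datetime(2024, 1, 1, 10, 0)
# At 10:20 the step is still red, so the change gets reverted and
# fixed locally while the pipeline runs again.
still_red_too_long = should_revert(failed_at, now=datetime(2024, 1, 1, 10, 20))
```

Because the rule is the same for every step, nobody has to wonder whether "fix it later" is still an option.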
Benefits of deactivating the field
The Somebody Else's Problem field can act as a fun heuristic to identify not only the reasons why a pipeline is broken, but more importantly, why it doesn’t get fixed. Each time a step of the pipeline is broken and there is no timely fix, you can start thinking about whether this is something you don’t want to see, don’t expect to see, or can’t explain. Then you can use some of the ideas mentioned here to address the situation as a team, regardless of the technical experience of each member, and deactivate the field.
By deactivating the Somebody Else's Problem field around your pipeline, you improve the quality of both the individual steps and the entire process. In the end, it helps you move closer to Continuous Delivery by having a reliable deployment pipeline, giving the team confidence in the quality of their changes.
Life, the Universe and Everything (1982) by Douglas Adams
Your Tests aren’t Flaky, You Are! by Richard Bradshaw
Eradicating Non-Determinism in Tests (2011) by Martin Fowler
How to design team rituals to accelerate change by Gustavo Razzetti
Claim Jenkins plugin
Start with Why by Simon Sinek
Areti Panou is a product owner of internal tools that help development teams comply with regulations at SAP SE. From tester to quality coach and recently to product owner, she likes finding out what the big picture is and how to bring the right people together to improve it.