Categories
Engineering Management

Reversing the Tide of Application Support

This is a quick story of recognizing and attacking a problem of ballooning application support in my department. View on Plato

Problem

We had a process on my team for dealing with critical user issues. Support would enter a ticket describing the problem, and we would have a rotation of engineers responsible for addressing these problems as they would come in.
 

When we started, we had one engineer on a part-time basis, and a few years later, we had three full-time engineers dealing with support issues with no end in sight to the proliferation of new issues.
 

I identified three main problems:

  • The high cost of Support.
    These three engineers were doing something that was not valuable to the business.
  • Employee satisfaction.
    Being on a rotation would make engineers reasonably unhappy. I dared to assume that there was a risk of retention every time an engineer was on a critical rotation.
  • Negative customer outcome.
    I assumed that these issues were tied to lost items or real delays in customer deliveries.
     

Actions taken

Before I could delve into solving burdensome technical support, I needed validation for the three problems mentioned above.
 

Negative customer outcomes
I did a sampling of the orders that had issues reported on them over several months and compared them across cancellation rate, the number of days to deliver, and a couple of other metrics. I found out that the delivery time was four days longer, and the cancellation rate was around 50 percent higher on those orders.
 

Employee satisfaction

I sent out a survey to all the developers on my team, asking them to gauge job satisfaction when they are on criticals. I did some back-of-the-napkin math on risk attrition and it showcased that every critical rotation entailed a 10 percent increase in the risk of attrition.
 

The high cost of Support

This assumption was the most tangible and thus most straightforward to prove empirically. I was able to show the exact amount of money we were wasting on paying three full-time engineers by multiplying their median salaries and adding opportunity costs.
 

I put together a one-page document outlining my findings and brought it to my Director of Engineering. He was convinced with no effort and encouraged me to take my old team and go solve this problem next quarter. I was excited to share the news with my team, whom I explained what I thought would be the best way to go after this problem.
 

First off, we needed to classify the issues coming in. I assumed that — following the Pareto principle — 80 percent of all issues were caused by 20 percent of bugs. This turned out to be true. We started by having someone manually classify the issues that allowed us to figure out the biggest root causes that accounted for the most issues. We were able to put together a roadmap to address these root causes. But, I didn’t feel it would be enough because we were going to get back in the same situation unless we address the process itself.
 

I couldn’t help but notice an overall lack of accountability. Engineers within that department were three to four times a year on the rotation, but not in succession, and were not incentivized to drive changes in the underlying products that were causing the issues in the first place.
 

I envisioned a centralized triage that would take the incoming tickets and then route them to the team responsible for the products causing these issues rather than having all the departments pull in for a standard rotation. I came up with an automated ticket routing system where users would fill out the information about their location and the problem they were dealing with, the tool that was impacted, and some general personal information. Using that information we were able to implement a rules engine on top of it and automatically route most issues that were coming in to the teams responsible for them.
 

The burden on the centralized rotation was thus reduced and it would take only one person to spend an hour a day at most double-checking or dealing with anything that the automation wouldn’t pick up. The newly established process increased accountability of individual teams as they could see the common problems coming to them and prioritize addressing root causes.
 

After building out the tools to do that automated ticket routing and classification, we had to present the new process and tools and get the department bought into the changes. Getting them on board was rather easy since most people were burnt out on the way things were operating.
 

Finally, we switched to the new process. We continued to measure over time if we would be able to keep a single person on rotation part-time and if the overall volume of critical tickets was reduced. Indeed, that is what happened — the number of tickets dropped from 400 tickets a week to 50 over the course of six months.
 

Lessons learned

  • Great ideas can come from anywhere in the organization. My organization is largely feature-driven — the business proliferates a lot of ideas and there are never enough people to do all of the feature work. But if you have a good idea that you could justify, you could drive your initiatives from anywhere in the organization.
  • Don’t sit idle when you see the problem. Even if you can’t address it on your own, get started and try to rally other people around it.
  • Problems are more likely to be dealt with when there’s accountability. If there is none, make sure to introduce it. Let data drive your decisions.

Leave a Reply

Your email address will not be published. Required fields are marked *