Categories
Engineering Management

Reversing the Tide of Application Support

This is a quick story of recognizing and attacking a problem of ballooning application support in my department.

Problem

We had a process on my team for dealing with critical user issues. Support would enter a ticket describing the problem, and a rotation of engineers was responsible for addressing these problems as they came in.
 

When we started, one engineer handled support on a part-time basis; a few years later, we had three full-time engineers dealing with support issues and no end in sight to the proliferation of new issues.
 

I identified three main problems:

  • The high cost of Support.
    These three engineers were spending their time on work that created no new value for the business.
  • Employee satisfaction.
    Being on the rotation made engineers reasonably unhappy. I suspected an attrition risk every time an engineer served a critical rotation.
  • Negative customer outcomes.
    I assumed that these issues were tied to lost items or real delays in customer deliveries.
     

Actions taken

Before I could delve into fixing the burdensome support process, I needed to validate the three problems mentioned above.
 

Negative customer outcomes
I sampled the orders that had issues reported against them over several months and compared them with unaffected orders on cancellation rate, number of days to deliver, and a couple of other metrics. Delivery time turned out to be four days longer, and the cancellation rate around 50 percent higher, on the orders with issues.
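The comparison above can be sketched in a few lines. The order records, field names, and numbers below are purely illustrative, not from the real system:

```python
from statistics import mean

# Hypothetical order records; "had_issue" marks orders with a support ticket.
orders = [
    {"id": 1, "had_issue": True,  "days_to_deliver": 12, "cancelled": False},
    {"id": 2, "had_issue": False, "days_to_deliver": 7,  "cancelled": False},
    {"id": 3, "had_issue": True,  "days_to_deliver": 10, "cancelled": True},
    {"id": 4, "had_issue": False, "days_to_deliver": 6,  "cancelled": False},
]

def summarize(sample):
    """Return (average delivery days, cancellation rate) for a sample of orders."""
    return (
        mean(o["days_to_deliver"] for o in sample),
        sum(o["cancelled"] for o in sample) / len(sample),
    )

with_issues = [o for o in orders if o["had_issue"]]
without = [o for o in orders if not o["had_issue"]]

days_issue, cancel_issue = summarize(with_issues)
days_ok, cancel_ok = summarize(without)
print(f"delivery delta: {days_issue - days_ok:+.1f} days")
print(f"cancellation rate: {cancel_issue:.0%} vs {cancel_ok:.0%}")
```

In practice the same split-and-compare would run over months of real order data rather than a hand-written list.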
 

Employee satisfaction

I sent out a survey to all the developers on my team, asking them to gauge their job satisfaction while on critical rotation. Some back-of-the-napkin math on attrition risk suggested that every critical rotation carried a 10 percent increase in the risk of attrition.
 

The high cost of Support

This assumption was the most tangible and thus the most straightforward to prove empirically. I was able to show the exact amount of money we were spending by taking three times the median engineer salary and adding opportunity cost.
 

I put together a one-page document outlining my findings and brought it to my Director of Engineering. He was convinced immediately and encouraged me to take my old team and go solve the problem the next quarter. I was excited to share the news with my team, to whom I explained what I thought would be the best way to attack the problem.
 

First off, we needed to classify the incoming issues. I assumed that, following the Pareto principle, 80 percent of all issues were caused by 20 percent of bugs, and this turned out to be true. We started by having someone manually classify the issues, which let us identify the root causes that accounted for the most tickets and put together a roadmap to address them. But I didn't feel that would be enough: we would end up back in the same situation unless we addressed the process itself.
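Once the tickets carry root-cause labels, finding the small set of causes behind most issues is a simple cumulative tally. A minimal sketch, with made-up cause labels and counts:

```python
from collections import Counter

# Hypothetical manually-assigned root-cause labels for a batch of tickets.
ticket_root_causes = (
    ["stale inventory sync"] * 40
    + ["address validation bug"] * 25
    + ["carrier API timeout"] * 15
    + ["misc"] * 20
)

counts = Counter(ticket_root_causes)
total = sum(counts.values())

# Walk causes from most to least common, tracking the cumulative share of tickets,
# and stop once roughly 80 percent of the volume is covered.
cumulative = 0
top_causes = []
for cause, n in counts.most_common():
    cumulative += n
    top_causes.append(cause)
    if cumulative / total >= 0.8:
        break

print(top_causes)  # the few causes behind ~80% of tickets
```

The roadmap then simply works through `top_causes` in order, fixing the biggest sources of tickets first.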
 

I couldn't help but notice an overall lack of accountability. Engineers in the department were on the rotation three or four times a year, but never in succession, and had no incentive to drive changes in the underlying products that were causing the issues in the first place.
 

I envisioned a centralized triage that would take incoming tickets and route them to the team responsible for the product causing the issue, rather than having all the departments pull people into a standard rotation. I came up with an automated ticket routing system in which users would fill out information about their location, the problem they were dealing with, the tool that was impacted, and some general personal details. On top of that information we implemented a rules engine that automatically routed most incoming issues to the teams responsible for them.
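A minimal sketch of such a rules engine. The team names, ticket fields, and rules here are assumptions for illustration; the real system's rules would be derived from the classification work above:

```python
# Ordered routing rules: the first predicate that matches a ticket wins.
RULES = [
    # (predicate over the ticket, owning team)
    (lambda t: t["tool"] == "label-printer", "fulfillment-tools"),
    (lambda t: t["tool"] == "order-portal" and "payment" in t["problem"], "payments"),
    (lambda t: t["tool"] == "order-portal", "storefront"),
]

def route(ticket, fallback="central-triage"):
    """Return the first matching team, or the central rotation if no rule fires."""
    for predicate, team in RULES:
        if predicate(ticket):
            return team
    return fallback

print(route({"tool": "order-portal", "problem": "payment declined", "location": "DC-3"}))
```

Anything no rule matches still lands on the central rotation, which is what keeps the fallback queue small enough for one part-time person.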
 

The burden on the centralized rotation was thus reduced: a single person spending at most an hour a day could double-check the routing and deal with anything the automation didn't pick up. The new process also increased the accountability of individual teams, since they could see the common problems coming to them and prioritize addressing root causes.
 

After building the tools for automated ticket classification and routing, we had to present the new process and get the department to buy into the changes. Getting them on board was rather easy, since most people were burnt out on the way things had been operating.
 

Finally, we switched to the new process. We continued to measure whether we could keep a single person on rotation part-time and whether the overall volume of critical tickets was falling. That is exactly what happened: the number of tickets dropped from 400 a week to 50 over the course of six months.
 

Lessons learned

  • Great ideas can come from anywhere in the organization. My organization is largely feature-driven: the business generates a lot of ideas, and there are never enough people to do all of the feature work. But if you have a good idea that you can justify, you can drive your initiative from anywhere in the organization.
  • Don't sit idle when you see a problem. Even if you can't address it on your own, get started and try to rally other people around it.
  • Problems are more likely to be dealt with when there's accountability. If there is none, introduce it, and let data drive your decisions.

Building a Culture of Automated Testing

This is a quick story of taking my team on a journey of enlightenment toward automated testing.

Problem

As we grew rapidly, it became increasingly hard to work with the codebase because of a lack of automated tests. The underlying problem was that most of the engineers didn't understand the value of automated testing and resisted the change. They hadn't seen it in practice before, and it took significant effort to get them to buy in and to build a culture that would support the practice.
 

Actions taken

To start, I made our position visible by creating a page that displayed, for each team, the number of tests written every sprint, and I set a target of having 50 percent of our commits accompanied by automated tests. This made progress visible and got people thinking about the problem.
 

We also celebrated the people who were writing automated tests. At all-hands, we would praise the systems and teams that were doing well on automated tests. I identified a testing champion on one of my teams who needed no convincing at all and embraced testing without reservation. Their advocacy on the ground was persuasive, and they were able to show the value to their peers in real time. Our testing champion also came up with something I ended up incorporating into my all-hands: the test-of-the-week contest.
 

Engineers would submit their automated tests, and before the meeting everybody would vote for their favorite based on a handful of criteria. I would announce the winner at all-hands, and they would be rewarded with $10. In addition, anytime I saw an opportunity to showcase a concrete benefit of testing, I communicated it broadly. For example, "This test prevented us from having a massive outage" or "This test showed us right away where we were missing key acceptance criteria." I would send out an email or announce it at all-hands, explaining how things could have blown up if our automated tests hadn't caught the problem before it reached production or before we left for the weekend.
 

Over the course of about a year, we continued to celebrate small wins and acknowledge our champions' efforts. By the end of the year, almost every change that went out was accompanied by automated tests. It had become a habit that engineers no longer questioned. Product managers also understood the value and stopped pushing back. Through all of this, we changed the culture and got the team to genuinely buy into automated testing.
 

Lessons learned

  • When you are trying to change your culture, it is vital to celebrate positive momentum. Celebrate broadly anything that aligns with the change that you’re trying to make.
  • Finding a champion or champions on your team who buy into the vision can help you advocate for the change efficiently.
  • Try to show the value wherever possible. People should buy into it for the intrinsic value, not just because they’re being asked to or being rewarded for it.

Setting a Code Review Culture

Code reviews are crucial for any successful engineering team, but without care the process can easily devolve into one of stress and hurt feelings. As an engineering leader, it's well worth your time to encourage a culture of respect and professionalism around code reviews. Outlined below is one way to frame this for your engineering team.

Treat each other with respect

Bringing respect to code reviews means being thoughtful and empathetic on both ends, reviewer and reviewee.

  • As a reviewee, you should assume good intent on the part of your reviewer. The reviewer wants to help you make the best decisions for the team and the codebase, and has taken time away from their own priorities to contribute to improving the team's systems. Your initial reaction to comments may be "this is a waste of time," "this is nitpicky and not a big deal," or "this is dumb," but challenge yourself to take a step back and see your reviewer not as an adversary but as a partner in the effort to make smart engineering decisions. Everyone who reviews code has a chance to add context, provide knowledge, and illuminate risks, and should never be immediately disregarded.
  • As a reviewer, you should assume the reviewee is not stupid. Sometimes simple mistakes are made; anyone who claims they've never made one is either lying to themselves or has never written any substantial code. Never treat a mistake with condescension or disdain. Additionally, recognize that in this profession there is rarely a single right answer on how to do something. Building systems is a series of tradeoffs: time, complexity, performance, cleanliness, and so on. Your immediate reaction to a review may be "that's bad, that's not how I would have done this." Challenge yourself to take a step back, consider the tradeoffs the reviewee may have made, and ask whether, in their shoes, you could clearly say it was the wrong decision.

So what does this mean?

  • As a reviewee-
    • Don’t be a jerk – if someone leaves a comment you don’t agree with, consider it an opportunity for a discussion, to learn from and build stronger ties to your peers.
    • Don't just drop someone's comments. Give the reviewer the respect they deserve: think critically about the comments they've provided, and also talk to them and learn why they wrote them. If you're positive a comment doesn't need to be resolved, reply and clearly explain why. Admit that you could be wrong, and welcome a discussion before dropping it.
  • As a reviewer-
    • Don't be a jerk – if there's time to write a comment, then there's time to make sure the comment isn't antagonistic.
    • Ask probing questions without ego. Assume the reviewee has thought about the problem at least as much as you have.
    • Offer your availability to talk things through. Don't just leave some comments and run; treat a comment as an invitation to a discussion. You are not an all-knowing code wizard, and just as a reviewee might make mistakes, you might make mistakes in your review.

Some tactical suggestions

  • As a reviewee-
    • Provide a good overview that gives context on what the patch is for, plus any additional framing, e.g., "this is a quick, scrappy MVP that we don't plan to use long-term" or "this is the new architecture pattern we are trying out to see how well it works."
    • Make your code understandable – many miscommunications happen because it's not clear what the code is doing, so be careful about unnecessary complexity.
    • If what you're doing is so complex that it's hard to understand without having written it, add descriptive, well-written comments to your code.
  • As a reviewer-
    • Don't leave comments that are simple assertions, e.g., "use foo.go() not foo.start()". Your comments should be teaching opportunities, e.g., "Based on what you're trying to do here, to make foo go you might want to use foo.go() instead of foo.start(); foo.start() does not run the {go subroutine}, which, based on the overview, it seems you want."
    • Point to good examples to help support better understanding.
    • Try not to harp too much on code style (whitespace, newlines, indentation).

On a final note, building a culture is hard. To make this stick, reinforce the message in meetings with your team, and encourage every engineer to call out instances of disrespectful review behavior.