
How to Deal With a Seismic Change in Strategy

This is a quick story of how I led my department through a massive, rapid, and uncomfortable change in direction. View on Plato

Problem

My department was organized by geography and by stage of the transportation journey. We had a team in North America focused on the North American transportation network and a team in Berlin, Germany, focused on the EU and UK network. Within each, we had teams working on the first-, middle-, and last-mile parts of the journey. The teams were relatively independent, and each looked after its own slice of the problem space.

Coming into 2020, our leadership team decided to eliminate the organizational structure by geography and miles. There would be one transportation team covering both North America and Europe, organized by function (for example, Planning, Drivers, or Warehousing Technology) rather than by miles. Not only did the leadership team announce this seismic change in strategy, but they also introduced an initiative to revamp our entire transportation network by the peak holiday season of 2020. My fellow leaders and I were left to figure out how to respond to all of these unfolding challenges.
 

Actions taken

This significant strategic change set off a series of reorgs that had to be executed at full tilt.
 

Reorg One

There was a team focused on middle-mile warehousing technology for North America and another focused on last-mile warehousing technology, also for North America. Furthermore, there was a team in Europe responsible for warehousing technology, and we had to figure out how to consolidate all of these. It was decided that the team in Europe would be responsible for global warehousing technology, which left us with 12 extra engineers here in North America. I had to find adequate engagement for them while also managing through the transition.
 

Reorg Two

The second reorg was subtler: we were not restructuring teams but changing their focus and ownership. Teams had to rework their roadmaps to incorporate the new global strategy, which was disruptive and felt disempowering; the effort they had put into those roadmaps was suddenly overwritten by leadership.
 

A month into the reorgs, the Covid-19 pandemic hit, and all of a sudden we were thrust into working remotely without a solid plan for executing against the fast-approaching deadline. Under more normal circumstances, we would all have come together for design sprints to dissect what we were trying to achieve and how. Because we were thrown into remote work unexpectedly, this became a communication challenge as well.
 

Finally, the plan itself was aggressive: it meant rebuilding everything about our system from top to bottom and attempting to roll it out within nine months.
 

Reorgs — What to do

We had to be fully transparent with the teams involved. I brought the team together and shared with them what was happening. It was important to impart why this was happening, why it mattered, and why it was the right thing to do for both the business and the team, and to help them buy into that.
 

We also had to bridge the gap between where we were and where we wanted to be. That meant doing two things. First, we had to transition ownership of what the team was building to the team in Europe. That included compiling a list of everything this team owned and ensuring a smooth transfer of ownership to the other team while maintaining continuity of service for our users.
 

We also had to keep running the legacy system and the new system in parallel for at least some time, so we needed a bridge solution that would allow us to operate both at once. To do so, I repurposed the warehousing technology team to build this bridge-layer solution, which I called the transportation interface. I wrote up a team charter, set priorities for the team, and got them bought into it. That went well; the team was excited about the new opportunities, but I had to shrink the team a bit. I found other teams within my department that had a strong need for engineers and identified engineers I could move over without sacrificing what we needed to do on the interface.
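To make the idea of such a bridge layer concrete, here is a minimal sketch of how a transportation interface could route calls to either system per warehouse. The class names, methods, and warehouse identifiers below are invented for illustration and are not the actual design.

```python
from typing import Protocol


class TransportationBackend(Protocol):
    """Operations both systems expose; method names are illustrative only."""
    def create_shipment(self, order_id: str) -> str: ...


class LegacySystem:
    def create_shipment(self, order_id: str) -> str:
        return f"legacy-shipment:{order_id}"


class NewGlobalSystem:
    def create_shipment(self, order_id: str) -> str:
        return f"new-shipment:{order_id}"


class TransportationInterface:
    """Bridge layer: routes each call to the legacy or the new system based on
    which warehouses have been migrated, so both can run in parallel."""

    def __init__(self, migrated_warehouses: set[str]):
        self.migrated = migrated_warehouses
        self.legacy = LegacySystem()
        self.new = NewGlobalSystem()

    def create_shipment(self, warehouse: str, order_id: str) -> str:
        backend: TransportationBackend = (
            self.new if warehouse in self.migrated else self.legacy
        )
        return backend.create_shipment(order_id)


# Pilot rollout: only one facility is on the new system so far.
interface = TransportationInterface(migrated_warehouses={"PILOT-1"})
print(interface.create_shipment("PILOT-1", "order-42"))   # routed to the new system
print(interface.create_shipment("LEGACY-9", "order-43"))  # still served by legacy
```

The value of this shape is that callers only ever talk to the interface, so warehouses can be moved from the legacy set to the migrated set one at a time.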
 

As for the other reorg, revisiting roadmaps was done on a case-by-case basis. I would sit with each team and discuss how the changes impacted their existing roadmap, how we could leverage what they had already done, how they should proceed with planning, and so on. Since they had already completed most of their planning, we looked at their objectives with an intent to make them global rather than focused on North America. That required a lot of talking with our European partners and learning about things they had previously been solely responsible for and knowledgeable about. We established close relationships with people in our company we had never worked closely with before and came to better understand their problem space. This intensive exchange allowed us to craft roadmaps that encompassed our global technology and to develop a holistic approach to the problem.
 

Changing the strategy

Considering the significance of the announced change, we had to rebuild our entire technology stack and re-architect everything in an exceedingly short amount of time, so we made sure to implement it without boiling the ocean. We had to be iterative and break the work into small pieces. The bridge solution allowed us to run the legacy and new systems in parallel and to launch at a single warehouse. Launching the new warehousing technology in our pilot facility helped us learn what we needed to learn before bringing it to the rest of the network.
 

That, too, enabled us to be methodical and roll it out over time since we would have the bridge solution in place to facilitate the change. We also had to bring together senior engineers and leads from all groups to look at the architecture and the architectural vision as a whole and provide feedback. That was something we tried to do early on and then continuously as our vision evolved.
 

Lessons learned

  • Leverage positive momentum. While we were trying to figure out how to execute the new strategy, we kept working and adapting to the change in the roadmap. We found some vital pieces that we could include in the new roadmap. For example, one of my teams was focused on building a platform for managing truckloads and we were able to use that platform and make it central to some of our new system designs. That gave us a real leg up and allowed us to build on top of the already existing framework.
  • Simplify the scope wherever possible. We wouldn’t have been able to deliver if we hadn’t cut some things. We could only complete scaled-down versions of the final solution, but that allowed us to test our hypotheses and deliver something tangible instead of aiming two years ahead.
  • The Covid-19 pandemic forced us to transition suddenly to working remotely, and we were not entirely prepared for that. We should have cleared everyone’s schedule, run a design sprint, and gotten everybody together up front to align before moving forward. Instead, we ended up having ad hoc 30-, 60-, and 90-minute calls over several months to get aligned and figure out how all the pieces would fit together. In hindsight, I wish I had spent more time up front on the design sprint.
  • Make sure that the team feels bought into the changes and understands the Why. By no means should the team feel ordered around for no good reason.

Reversing the Tide of Application Support

This is a quick story of recognizing and attacking a problem of ballooning application support in my department. View on Plato

Problem

We had a process on my team for dealing with critical user issues. Support would enter a ticket describing the problem, and we would have a rotation of engineers responsible for addressing these problems as they would come in.
 

When we started, we had one engineer on a part-time basis, and a few years later, we had three full-time engineers dealing with support issues with no end in sight to the proliferation of new issues.
 

I identified three main problems:

  • The high cost of Support.
    These three engineers were spending their time on work that delivered no direct value to the business.
  • Employee satisfaction.
    Being on the rotation made engineers reasonably unhappy. I dared to assume that there was an attrition risk every time an engineer was on a critical rotation.
  • Negative customer outcome.
    I assumed that these issues were tied to lost items or real delays in customer deliveries.
     

Actions taken

Before I could delve into solving burdensome technical support, I needed validation for the three problems mentioned above.
 

Negative customer outcomes

I did a sampling of the orders that had issues reported on them over several months and compared them across cancellation rate, the number of days to deliver, and a couple of other metrics. I found that the delivery time was four days longer and the cancellation rate was around 50 percent higher on those orders.
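As a rough illustration of that comparison, here is a minimal sketch assuming a hypothetical order export with invented column names (order_id, had_support_issue, cancelled, days_to_deliver); it is not the actual analysis or schema.

```python
import pandas as pd

# Hypothetical export of the sampled orders; had_support_issue is 1 if a
# critical ticket was filed against the order, 0 otherwise.
orders = pd.read_csv("orders_sample.csv")  # order_id, had_support_issue, cancelled, days_to_deliver

summary = orders.groupby("had_support_issue").agg(
    cancellation_rate=("cancelled", "mean"),
    avg_days_to_deliver=("days_to_deliver", "mean"),
    order_count=("order_id", "count"),
)
print(summary)

baseline, flagged = summary.loc[0], summary.loc[1]
print("Extra delivery days:",
      flagged["avg_days_to_deliver"] - baseline["avg_days_to_deliver"])
print("Cancellation rate increase: "
      f"{(flagged['cancellation_rate'] / baseline['cancellation_rate'] - 1):.0%}")
```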
 

Employee satisfaction

I sent out a survey to all the developers on my team, asking them to gauge their job satisfaction while on criticals. I did some back-of-the-napkin math on attrition risk, and it showed that every critical rotation entailed a 10 percent increase in the risk of attrition.
 

The high cost of Support

This assumption was the most tangible and thus the most straightforward to prove empirically. I was able to show the exact amount of money we were wasting by multiplying the median salary by the three full-time engineers involved and adding opportunity costs.
 

I put together a one-page document outlining my findings and brought it to my Director of Engineering. He needed no convincing and encouraged me to take my old team and go solve this problem the next quarter. I was excited to share the news with my team, to whom I explained what I thought would be the best way to go after the problem.
 

First off, we needed to classify the issues coming in. I assumed that, following the Pareto principle, 80 percent of all issues were caused by 20 percent of the bugs. This turned out to be true. We started by having someone manually classify the issues, which allowed us to figure out the biggest root causes, the ones accounting for the most issues, and we were able to put together a roadmap to address them. But I didn’t feel that would be enough, because we would end up back in the same situation unless we addressed the process itself.
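For illustration, a quick Pareto-style tally over manually assigned root-cause labels might look like the sketch below; the labels are invented placeholders, since the real classification was done by hand on our ticket data.

```python
from collections import Counter


def pareto_breakdown(root_causes):
    """Print each root cause, biggest first, with the cumulative share of
    tickets it explains, to show how few causes drive most of the volume."""
    counts = Counter(root_causes).most_common()
    total = sum(n for _, n in counts)
    cumulative = 0
    for cause, n in counts:
        cumulative += n
        print(f"{cause}: {n} tickets, cumulative {cumulative / total:.0%}")


# Placeholder labels only, one per classified ticket.
pareto_breakdown([
    "label_printing", "label_printing", "label_printing", "label_printing",
    "address_validation", "address_validation", "address_validation",
    "scan_mismatch", "late_manifest", "driver_app_crash",
])
```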
 

I couldn’t help but notice an overall lack of accountability. Engineers within the department were on the rotation three to four times a year, but never consecutively, and were not incentivized to drive changes in the underlying products that were causing the issues in the first place.
 

I envisioned a centralized triage that would take the incoming tickets and route them to the teams responsible for the products causing the issues, rather than pulling the whole department into a standard rotation. I came up with an automated ticket routing system in which users would fill out information about their location, the problem they were dealing with, the tool that was impacted, and some general personal information. Using that information, we implemented a rules engine on top of the intake form and automatically routed most incoming issues to the teams responsible for them.
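To give a sense of how simple such a rules engine can be, here is a minimal sketch under assumed intake fields and made-up team names; the actual system, fields, and rules were specific to our tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Ticket:
    # Mirrors the intake form described above; field names are illustrative.
    location: str
    tool: str
    description: str


# Each rule pairs a predicate with the team the ticket should route to.
# Team names and matching logic are hypothetical.
RULES: List[Tuple[Callable[[Ticket], bool], str]] = [
    (lambda t: t.tool == "label-printing", "warehouse-tools-team"),
    (lambda t: t.tool == "route-planner", "planning-team"),
    (lambda t: "driver" in t.description.lower(), "driver-experience-team"),
]

CENTRAL_TRIAGE = "central-rotation"  # the single part-time person handles the rest


def route(ticket: Ticket) -> str:
    """Return the owning team for a ticket, falling back to central triage
    when no rule matches."""
    for matches, team in RULES:
        if matches(ticket):
            return team
    return CENTRAL_TRIAGE


print(route(Ticket("DFW7", "label-printing", "labels print blank")))  # warehouse-tools-team
print(route(Ticket("SEA4", "unknown", "misc question")))              # central-rotation
```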
 

The burden on the centralized rotation was thus reduced: it would take only one person, spending at most an hour a day, to double-check or deal with anything the automation didn’t pick up. The newly established process also increased the accountability of individual teams, as they could see the common problems coming to them and prioritize addressing the root causes.
 

After building out the tools to do that automated ticket routing and classification, we had to present the new process and tools and get the department bought into the changes. Getting them on board was rather easy since most people were burnt out on the way things were operating.
 

Finally, we switched to the new process. We continued to measure over time whether we could keep a single person on rotation part-time and whether the overall volume of critical tickets was dropping. Indeed, that is what happened: the number of tickets fell from 400 a week to 50 over the course of six months.
 

Lessons learned

  • Great ideas can come from anywhere in the organization. My organization is largely feature-driven: the business generates a lot of ideas, and there are never enough people to do all of the feature work. But if you have a good idea that you can justify, you can drive your initiative from anywhere in the organization.
  • Don’t sit idle when you see the problem. Even if you can’t address it on your own, get started and try to rally other people around it.
  • Problems are more likely to be dealt with when there’s accountability. If there is none, make sure to introduce it. Let data drive your decisions.

Building a Culture of Automated Testing

This is a quick story of taking my team on a journey of enlightenment to automated testing. View story on Plato

Problem

As we were rapidly growing, it became increasingly hard to work with the codebase because of the lack of automated tests. The problem was that most of the engineers didn’t understand the value of automated testing and were resistant to the change. They hadn’t seen it in practice before, and it took significant effort to get them bought into it and to build a culture that would support the practice.
 

Actions taken

To start with, I developed a visible measure of where we were: I created a page that displayed, for each team, the number of tests being written every sprint, and I set a target of having 50 percent of our commits accompanied by automated tests. This made things more visible and got people thinking about the problem.
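A crude way to compute that kind of per-sprint number is sketched below, assuming test files can be recognized by name in the git history; the repository layout and naming convention are assumptions, and the real page pulled from our own tooling.

```python
import subprocess


def share_of_commits_with_tests(since="14 days ago", markers=("test", "spec")):
    """Rough measure: fraction of recent commits that touch at least one
    file whose path looks like a test (identified by name only)."""
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:@@COMMIT@@"],
        capture_output=True, text=True, check=True,
    ).stdout

    total = with_tests = 0
    for chunk in log.split("@@COMMIT@@"):
        files = [line.strip() for line in chunk.splitlines() if line.strip()]
        if not files:
            continue  # e.g. merge commits with no file list
        total += 1
        if any(any(m in path.lower() for m in markers) for path in files):
            with_tests += 1
    return with_tests, total


with_tests, total = share_of_commits_with_tests()
pct = with_tests / total if total else 0.0
print(f"{with_tests}/{total} commits touched a test file ({pct:.0%}, target: 50%)")
```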
 

We also celebrated the people writing the automated tests. In all-hands, we would praise and talk about systems and teams that were doing well on automated tests. I identified a testing champion on one of my teams who didn’t need any convincing at all and embraced testing wholeheartedly. Their advocacy on the ground was persuasive, and they were able to show the value to their peers in real time. Our testing champion also came up with something I ended up incorporating into my all-hands: the test-of-the-week contest.
 

We would have engineers submit their automated tests, and then, before the meeting, everybody would vote for their favorite based on a handful of criteria. I would announce the winner in all-hands, and they would be rewarded with $10. In addition, anytime I saw an opportunity to showcase a concrete benefit of testing, I would communicate it broadly, for example, “This test prevented us from having a massive outage” or “This test showed us right away where we were missing key acceptance criteria.” I would send out an email or announce it at all-hands, explaining how things could have blown up if our automated tests hadn’t caught the problem before it reached production or before we left for the weekend.
 

Over the course of about a year, we continued celebrating small wins and acknowledging our champions’ efforts. By the end of the year, almost every change that went out was accompanied by automated tests. It became a habit that engineers no longer questioned. Product managers also understood the value and weren’t pushing back against us. Through all of those things, we were able to change the culture and have the team genuinely buy into automated tests.
 

Lessons learned

  • When you are trying to change your culture, it is vital to celebrate positive momentum. Celebrate broadly anything that aligns with the change that you’re trying to make.
  • Finding a champion or champions on your team who buy into the vision can help you advocate for the change efficiently.
  • Try to show the value wherever possible. People should buy into it for the intrinsic value, not just because they’re being asked to or being rewarded for it.

Technology and Innovation Panel

In January 2019, I spoke to an audience of over a thousand students and retail industry veterans on Technology and Innovation and on my role as a software leader.


Setting a Code Review Culture

Doing code reviews is crucial for any successful engineering team, but without care it is easy for the process to devolve into one of stress and hurt feelings. As an engineering leader, it’s well worth your time to encourage a culture of respect and professionalism around code reviews. Outlined below is one way to frame this for your engineering team.

Treat each other with respect

Bringing respect to code reviews means being thoughtful and empathetic on both ends, reviewer and reviewee.

  • As a reviewee: you should assume good intent on the part of your reviewer. The reviewer wants to help you make the best decisions for our team and our code base. They have taken time away from their own priorities to contribute to the improvement of the team’s systems. Your initial reaction to a comment may be “this is a waste of time,” “this is nitpicky and not a big deal,” or “this is dumb,” but challenge yourself to take a step back and see your reviewer not as an adversary but as a partner in the effort to make smart engineering decisions. Everyone who reviews code has a chance to add context, provide knowledge, and illuminate risks, and should never be immediately disregarded.
  • As a reviewer: you should assume the reviewee is not stupid. Sometimes simple mistakes are made; anyone who says they’ve never made a simple mistake is either lying to themselves or has never written any substantial code. You should never treat a mistake with condescension or disdain. Additionally, you should recognize that in this profession there is rarely a single right answer for how to do something. Building systems is a series of tradeoffs – time, complexity, performance, cleanliness, etc. Your immediate reaction to seeing a review may be “that’s bad, that’s not how I would have done this.” Challenge yourself to take a step back, consider the tradeoffs the reviewee may have made, and ask whether, in their shoes, you could clearly say it was the wrong decision.

So what’s this mean?

  • As a reviewee:
    • Don’t be a jerk – if someone leaves a comment you don’t agree with, consider it an opportunity for a discussion, a chance to learn and to build stronger ties with your peers.
    • Don’t just drop someone’s comments. Give the reviewer the respect they deserve: not only think critically about the comments they’ve provided, but also talk to them and learn why they wrote them. If you’re positive a comment doesn’t need to be resolved, reply and clearly explain why. Admit that you could be wrong, and welcome a discussion if they want one before you drop the comment.
  • As a reviewer:
    • Don’t be a jerk – if there’s time to write a comment, then there’s time to make sure the comment isn’t antagonistic.
    • Ask probing questions without an ego. Assume the reviewee has thought about the problem at least as much as you have.
    • Offer your availability to talk things through – don’t just leave some comments and run; treat leaving a comment as an invitation for a discussion. You are not an all-knowing code wizard, and just as a reviewee might make mistakes, you might make mistakes in your review.

Some tactical suggestions

  • As a reviewee:
    • Provide a good overview that contains context on what the patch is for, and call out anything else the reviewer should know, e.g. “this is a quick, scrappy MVP that we don’t plan to use long-term” or “this is the new architecture pattern we are trying out to see how well it works.”
    • Make your code understandable – many miscommunications happen because it’s not clear what the code is doing; be careful about unnecessary complexity.
    • If what you’re doing is so complex that it’s hard to understand without having written it, add descriptive, well-written comments to your code.
  • As a reviewer:
    • Don’t leave comments that are simple assertions, e.g. “use foo.go() not foo.start()”. Your comments should be teaching opportunities, e.g. “Based on what you’re trying to do here, to make foo go you might want to use foo.go() instead of foo.start(); foo.start() does not run the {go subroutine}, which, based on the overview, it seems you want.”
    • Point to good examples to help support better understanding. 
    • Try not to harp too much on code style (whitespace, newlines, indentation).

On a final note, building a culture is hard. To get this to stick you should reinforce the message in meetings with your team. Encourage every engineer to call out instances of disrespectful review behavior.