Manage technical debt in a microservices architecture

Author | Glenn Engstrand

Translated by | Rayden

Planning | Ding Xiaoyun

Review | Zhang Weibin

At QCon Plus, Glenn Engstrand presented an approach that makes technical debt easier to manage. Most people involved in software development have struggled to get a product manager or project manager to agree to spend time fixing a project's technical debt. The approach Engstrand used at Optum Digital (formerly Rally Health) allows these competing priorities to be managed in a systematic and non-adversarial manner.

What is technical debt?

Broadly speaking, technical debt is a series of decisions made during software development that impair a team's ability to create value by building features.

You are probably familiar with the following exchange. The product manager describes the next feature they want to add to the product. The developers estimate how long the feature will take to implement, and the product manager considers that estimate too long. The developers explain that they have to modify a lot of hard-to-understand code, or work around various flaws in old code bases or frameworks, and they ask for more time to fix these problems. The product manager rejects the request, pointing out that a whole host of desired features is still waiting to be implemented.

If the situation cannot be resolved for a long time, this vicious circle may lead to a loss of market competitiveness and even the overall collapse of complex software systems.

Technical debt typically arises in two ways: by choosing a solution that is simple or fast but not optimal, or by adopting a technology stack that turns out to lack the required capabilities. In both cases, the engineering team ends up spending time on complications that get in the way of creating value or fixing defects.

Paying off technical debt while maintaining rapid delivery of functionality can be difficult, and the larger the system architecture, the harder it is. Managing the technical debt of dozens or hundreds of microservices is much more complex than managing it for a single service, and the risk of non-repayment grows faster.

Every software company encounters a time when it has to deal with technical debt.

At Optum Digital, a product set (also known as a software product line) is a combination of products that meet specific needs. Each product has multiple teams, often aligned with software clients or back-end services. There are also teams responsible for platform-oriented features that span multiple product sets. Each team is likely to be responsible for various software libraries. We have more than 700 engineers working on hundreds of microservices, and we attach great importance to technical debt because the risk of it getting out of control is very high.

In 2018, our Chief Technology Officers (CTOs) first published a blog post about the importance of investing in engineering, and over the course of more than two years, the entire business was split into microservices.

Technology Capability Plan

The company's engineers came up with a Technology Capability Plan (TCP) to address technical debt.

TCP is a community-based methodology for developing plans to repay technical debt. It collects, organizes, and communicates the changing needs of the technology landscape to both engineering and product, in order to ensure the longevity and adaptability of the architecture. In other words, it points out when the company will get into trouble if it does not take concrete steps in time.

The program encourages communities to develop plans to repay technical debt in a specific format. A risk score is recorded for the technical debt in each area, and the work is prioritized according to that score. With this prioritized plan, the engineering time needed for repaying technical debt can be negotiated with the product manager proactively and effectively.

Engineering community

Within the organization, engineering communities are formed horizontally; in other words, they are not tied to a particular team or product. Engineers usually join a community because they are enthusiastic about the same technology, so communities are open and grow organically.

However, communities that have strategic value can be made invitation-only. If members are screened, the focus should be on culture add rather than culture fit, and the membership should be representative and diverse.

Communities usually have dedicated communication channels (e.g., wiki threads, chat channels, and email lists) to facilitate ongoing discussion and resource sharing.

The policies of these engineering communities are made from the bottom up, which is key to maintaining the validity and authenticity of TCP.

Monthly community meetings are recorded and shared, and minutes are sent to all engineers. Meeting activity is kept in an updated record for each community, and these documents are collected each quarter and published internally as TCP.

Make a plan to repay technical debt

Each community's plan contains about a page of explanatory text and a table that records how the required technology evolves over time.

Each row in the table corresponds to a version of a programming language, framework, library, or platform-as-a-service that is relevant to the organization's technology stack, classified as preferred, acceptable, discouraged, or unacceptable (PADU).

Each column represents a time period (for example, a quarter or a year). The entire table can show data for the next three years.

Each cell in the table contains the lifecycle state of the technology for that time range: plan, deprecate, migrate, use, or remove.

The "Plan" state indicates that a plan needs to be made for the upgrade. "Deprecate" means that teams may no longer adopt that version of the technology. "Migrate" indicates that each team should proactively migrate to the appropriate version. "Use" indicates that this version of the technology should be used. "Remove" means that the technology may fail at any time during that period.
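One way to picture such a table is as a mapping from technology versions (rows) to a list of lifecycle states, one per time period (columns). The following is only a minimal sketch: the technology name `ExampleLang`, the quarter labels, and the helper `state_for` are hypothetical and are not taken from Optum Digital's actual TCP.

```python
from enum import Enum


class Lifecycle(Enum):
    """The five lifecycle states a table cell can hold."""
    PLAN = "plan"
    DEPRECATE = "deprecate"
    MIGRATE = "migrate"
    USE = "use"
    REMOVE = "remove"


# Hypothetical columns: one entry per time period.
QUARTERS = ["2022Q1", "2022Q2", "2022Q3", "2022Q4"]

# Hypothetical rows: one entry per technology version,
# holding one lifecycle state for each quarter above.
PLAN = {
    ("ExampleLang", "1.0"): [
        Lifecycle.USE, Lifecycle.DEPRECATE, Lifecycle.MIGRATE, Lifecycle.REMOVE,
    ],
    ("ExampleLang", "2.0"): [
        Lifecycle.PLAN, Lifecycle.USE, Lifecycle.USE, Lifecycle.USE,
    ],
}


def state_for(technology, version, quarter):
    """Look up the lifecycle state of one technology version in one quarter."""
    row = PLAN[(technology, version)]
    return row[QUARTERS.index(quarter)]
```

With this layout, `state_for("ExampleLang", "1.0", "2022Q3")` returns `Lifecycle.MIGRATE`, which reads as "teams still on version 1.0 should be migrating during that quarter".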

The explanatory pages describe the context in which each technology is used and the possible impact or consequences of not following the plan. These community-driven plans help the organization manage the technical debt that comes from using outdated, insecure, or unsupported versions of technology.

Each product set also submits a TCP plan. Product set-driven plans can be used to guide the repayment of other forms of technical debt, such as refactoring large code bases or splitting a single service into multiple smaller services.

In addition to the community and product set contributions, TCP includes an introduction to the vision of the program and a chapter describing the product areas that currently carry the most risk.

What is technical risk?

Once the engineering communities and the engineers of the major product sets have developed plans to repay technical debt, a series of engineering investments is required. With limited resources, how should the work be prioritized? Product managers cannot answer this, because the engineering investments do not originate with them. To answer the question, you need to understand what risks arise if the plan is not followed.

This risk is quantified by a risk score. The higher the score, the greater the risk and the higher the priority. The score of a technology in the Use state is always zero, and the risk score increases as the technology version moves into the Migrate, Deprecate, or Remove state.
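The scoring rule can be sketched as a simple lookup from lifecycle state to a number. The article does not give the actual weights, so the values below are purely hypothetical; the only constraints taken from the text are that "use" scores zero and that the score rises through the other states in the order listed. Scoring "plan" as zero is also an assumption.

```python
# Hypothetical weights. Only "use" == 0 and the rising order
# migrate < deprecate < remove come from the text.
RISK_BY_STATE = {
    "plan": 0,       # assumption: planning alone carries no risk yet
    "use": 0,
    "migrate": 1,
    "deprecate": 2,
    "remove": 3,
}


def risk_score(state):
    """Return the risk score for a lifecycle state name (case-insensitive)."""
    return RISK_BY_STATE[state.lower()]


def highest_priority(states_by_technology):
    """Order technologies from highest to lowest risk score."""
    return sorted(
        states_by_technology,
        key=lambda tech: risk_score(states_by_technology[tech]),
        reverse=True,
    )
```

For example, `highest_priority({"a": "use", "b": "remove", "c": "migrate"})` returns `["b", "c", "a"]`: the technology closest to removal is handled first.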

Planning is all bottom-up, while risk scoring is top-down. Technical debt repayment plans are developed by community engineers, while the prioritization of the list of plans is set by engineering managers.

Optum Digital's metrics are collected into a Balanced Scorecard, a strategic performance management tool that originated at Harvard Business School.

Technical debt across the various plans is aggregated per product. The risk score of each product is the sum of the risk scores of all the technologies in that product. If even one code base in a product uses or depends on a technology that is in the Deprecate, Migrate, or Remove state, the product's risk score is affected. If several code bases in a product are non-compliant for the same technology, that technology's risk score is counted only once. The median of the product risk scores is recorded in the Balanced Scorecard.
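The aggregation described here can be sketched in a few lines. This is a reading of the description rather than Optum Digital's actual implementation; the function names, the product names, and the per-technology scores are all hypothetical.

```python
from statistics import median


def product_risk(code_bases, tech_risk):
    """Sum the risk of every distinct technology found in a product.

    code_bases maps a code base name to the set of technology versions it uses.
    A technology used by several code bases is counted only once.
    """
    technologies = set().union(*code_bases.values())
    return sum(tech_risk.get(tech, 0) for tech in technologies)


def scorecard_value(products, tech_risk):
    """Median of the per-product risk scores."""
    return median(product_risk(cbs, tech_risk) for cbs in products.values())


# Hypothetical per-technology risk scores and product layout.
tech_risk = {"lang-1.0": 3, "framework-4": 1, "lang-2.0": 0}
products = {
    "web": {"api": {"lang-1.0", "framework-4"}, "worker": {"lang-1.0"}},
    "mobile": {"backend": {"lang-2.0"}},
    "billing": {"service": {"framework-4"}},
}
```

Here `product_risk(products["web"], tech_risk)` is 4 (3 + 1, with `lang-1.0` counted once even though two code bases use it), and `scorecard_value(products, tech_risk)` is 1, the median of 4, 0, and 1.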

It is valuable to run automated static code analysis on the repositories to identify technology dependencies. Support from CI/CD, DevOps, and GitOps makes it easier to calculate this metric quickly and reliably.

To help teams focus on their product, we also calculate the TCP risk score in a second way. Here, each technology in the plan is tallied per code base, and the risk score of each code base is the sum of the risk scores of all the technologies in that code base.

The risk scores of a product's code bases are then summed to give the total risk score of the product itself. In this way, we can track risk reduction for each product or compare products by TCP risk.
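A minimal sketch of this per-code-base view follows. The function names and the sample data are hypothetical, and the sketch assumes, as the description suggests, that a technology shared by two code bases contributes to both of their scores.

```python
def code_base_risk(technologies, tech_risk):
    """Sum the risk scores of all technologies used by one code base."""
    return sum(tech_risk.get(tech, 0) for tech in technologies)


def product_risk_by_code_base(code_bases, tech_risk):
    """Return the per-code-base scores and their total for one product."""
    per_code_base = {
        name: code_base_risk(techs, tech_risk)
        for name, techs in code_bases.items()
    }
    return per_code_base, sum(per_code_base.values())


# Hypothetical per-technology risk scores and code bases of one product.
tech_risk = {"lang-1.0": 3, "framework-4": 1}
web = {"api": {"lang-1.0", "framework-4"}, "worker": {"lang-1.0"}}

per_code_base, total = product_risk_by_code_base(web, tech_risk)
```

In this example `per_code_base` is `{"api": 4, "worker": 3}` and `total` is 7. The per-code-base breakdown shows each team where its own risk sits, which is the point of this second view.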

Get engineering investment onto the roadmap

Now that we have a priority plan to pay off technical debt, which was developed by a team of engineers and supported by leadership, how do we fund this plan and put it in the roadmap?

First, let's review what usually happens without TCP when the engineering manager and the product manager sit down to plan the next sprint: there are only the two of them, and the product manager can always invoke sales to get their way.

Let's revisit this situation with TCP in place. The product manager has just come from a meeting with executives about the importance of TCP and how to reduce the TCP risk score in the Balanced Scorecard, and now sits down with the engineering manager to plan the next sprint. The product manager asks for three new features. The engineering manager says, "If it were up to me, I would give you all three of these features. Unfortunately, all of our engineers and their managers have recognized that this extremely risky piece of technical debt needs to be repaid this quarter. If you have been following TCP for the past year, you should already be aware of this."

Do you see the difference? Because TCP is the authoritative, enterprise-wide consensus on technical debt and its risks, engineering managers do not have to argue or threaten; instead, they can use this collective bargaining power to get the engineering investment they deserve.

In the small number of cases where a product manager is still unwilling to approve the engineering investment, the problem is eventually escalated to management. Remember that the risk score is part of the Balanced Scorecard. For management, the Balanced Scorecard is the dashboard that shows where the company is headed. The indicators on that dashboard give them a more concrete sense of technical debt, making them more likely to approve the engineering investment needed to repay it.

An alternative to TCP

The only other systematic way of managing technical debt that I know of is documented in Google's book Site Reliability Engineering.

Let's take a quick look at this approach and explain why I think TCP is better.

First, consensus is reached on a set of service level objectives (SLOs). Each time the system fails to meet one of these SLOs, it is counted as an error. Each time window has an agreed number of acceptable errors, called the error budget. If the system exceeds its error budget before the next time window begins, no new features are released.

To avoid this, product managers are expected to be more willing to shift engineering resources toward paying off technical debt.
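The error budget rule can be illustrated with a short sketch. It is a simplified reading of the description above, not code from the SRE book; the latency-based SLO, the threshold values, and the function names are all hypothetical.

```python
def count_violations(latencies_ms, slo_ms):
    """Count the requests in a time window that were slower than the SLO."""
    return sum(1 for latency in latencies_ms if latency > slo_ms)


def can_release_features(violations, error_budget):
    """Feature releases stop once violations exceed the agreed budget."""
    return violations <= error_budget


# Hypothetical window: the SLO is 200 ms and the budget allows 2 slow requests.
window = [120, 250, 180, 310, 90, 400]
violations = count_violations(window, slo_ms=200)
release_allowed = can_release_features(violations, error_budget=2)
```

In this window three requests exceed 200 ms, so `violations` is 3 and `release_allowed` is `False`: the budget of 2 has been exceeded and feature work would pause until the next window.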

Let me explain why I consider Google's SRE approach less robust. For most product managers, the causal relationship between features and sales seems more real than the causal relationship between technical debt and system outages. The approach also assumes that eliminating technical debt always makes the system more stable. While this is true in the long run, it is not guaranteed in the short term.

Error budgets promote short-term thinking and are therefore not conducive to product managers approving such engineering investments. Because it is hard to predict when the error budget will be exceeded, it is hard to plan when to schedule the engineering investment.

This approach tends to set the product manager against the engineering manager, and this stalemate makes the investment look riskier to accept, so management approval is harder to obtain. Finally, this approach tends to politicize the repayment of technical debt: product managers may try to game the system by convincing executives not to count certain outages against the error budget, or by renegotiating SLOs or recalculating error budgets to delay the consequences of underinvesting in engineering.

The TCP approach, on the other hand, focuses on reaching a true consensus between product and engineering. TCP-driven development is more predictable in terms of the roadmap, so the parties involved run into fewer problems.

Summary

Can the Technology Capability Plan solve all engineering problems? Of course not.

Will you still have technical debt? Absolutely.

Will you still need to take shortcuts to deliver customer-driven functionality? I'm sure you will.

TCP is not intended to prevent or restrict engineers and product managers from doing what they do best in developing and releasing software. TCP signals to engineers and product managers that taking shortcuts incurs additional costs, and that they cannot ignore those costs indefinitely.

With TCP, you don't have to wait until the outage is severe before you start paying off your technical debt. No process, policy, technology or tool can be an effective alternative to quality engineering.

TCP documents the engineers' consensus on which technical debt is the riskiest and when it is reasonable to repay it. For TCP to be respected, its plans must be relevant, accurate, persuasive, and credible. This can only be achieved if its contributors are experienced and mature professionals with strong engineering skills and integrity.

I think this quote from our TCP documentation is the best summary:

Designing long-lasting and adaptable products requires a deep understanding of today's realities and tomorrow's possibilities. It requires understanding the technologies and market forces that drive them, and a commitment to sustained focus and continuous progress over the long term.

https://www.infoq.com/articles/managing-technical-debt-microservices/
