
Why change awareness is important for troubleshooting modern applications

Author | Mickael Alliel

Translated by | Wang Qiang

Planning | Ding Xiaoyun

Microservices and highly distributed systems are very complex. There are many moving parts in the system, including the application itself, the infrastructure, versions, and configuration. This often makes it difficult for operators to keep track of what's happening in production or in other environments (QA, development, pre-production), which becomes a problem when you need to troubleshoot the system.

In this article, I'll clarify some of the different use cases for monitoring and observability, talk about when to use each of these two concepts, and how to use them correctly. I'll then focus on the topic of change intelligence, also called change awareness: a new way to turn metrics, logs, and traces into actionable insight for troubleshooting incidents in real time. Until now, observability has focused on aggregating the data that is relevant to your system, while monitoring is a standardized set of checks to verify, based on that data, that everything is working properly. Change awareness is an enhancement to existing telemetry, providing more context in a world where everything is distributed and companies have dozens or even hundreds of services communicating with each other. This allows operators to make connections between recent changes and ultimately understand their impact on the overall system.

Demystifying observability and monitoring

Let's start with monitoring and observability, and how they are generally implemented today in engineering organizations operating complex microservices.

You'll often hear people say that observability is the sum of metrics, logs, and traces, but in fact this kind of telemetry is only a prerequisite for getting observability right. For your data to be truly usable, you need to make sure it's connected to your business needs and to how your application works.

When it comes to monitoring, most people think of dashboards. They can be very pretty, but most developers don't really want to spend the whole day staring at them. These tend to be static alerts that require developers to pre-determine what might go wrong.

One interesting trend I'm seeing is that operations people actually use observability tools to find the root cause of problems, and then feed that learning back into monitoring tools to catch recurring problems in the future. If you have proper documentation and runbooks, the combination of monitoring and observability can work wonders for your reliability program.

These checks can take various forms: a simple example is ensuring that a system's latency stays below a certain threshold; a more complex one is checking that a complete business flow behaves as expected (adding items to the shopping cart and successfully checking out). You might think you already have the whole picture of what's going on in the system because you have monitoring and observability, but you are still missing a key piece of the puzzle.
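
To make these two kinds of checks concrete, here is a minimal sketch in Python; the endpoint URLs, payloads, and the 500 ms threshold are assumptions invented for the example, not part of any particular monitoring product.

```python
# Minimal sketch of two monitoring checks: a latency threshold and a business-flow check.
# The URLs, payloads, and threshold are made up for illustration.
import time
import requests

BASE_URL = "https://shop.example.com"      # hypothetical service under test
LATENCY_THRESHOLD_S = 0.5                  # alert if a request takes longer than 500 ms

def check_latency() -> bool:
    """Simple check: is the health endpoint responding fast enough?"""
    start = time.monotonic()
    resp = requests.get(f"{BASE_URL}/health", timeout=5)
    elapsed = time.monotonic() - start
    return resp.ok and elapsed < LATENCY_THRESHOLD_S

def check_checkout_flow() -> bool:
    """More complex check: does the whole add-to-cart -> checkout flow still work?"""
    session = requests.Session()
    cart = session.post(f"{BASE_URL}/cart/items", json={"sku": "demo-item", "qty": 1}, timeout=5)
    if not cart.ok:
        return False
    checkout = session.post(f"{BASE_URL}/checkout", timeout=5)
    return checkout.ok

if __name__ == "__main__":
    print("latency ok:", check_latency())
    print("checkout ok:", check_checkout_flow())
```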

The missing puzzle piece: change awareness

To really get the insight you need from the system when problems arise, you need to add another piece to your puzzle, and that is change awareness. Change awareness involves not only understanding when something has changed, but also why it changed, who changed it, and what impact the change had on your system.

For operations engineers, the flood of existing data is often overwhelming. Change awareness therefore steps in to provide broader context around the telemetry and other information you already have. For example, if you have three services communicating with each other and one of them shows an elevated error rate, your telemetry gives you a good indication that something went wrong.

This is a good basis for suspecting that something is wrong with your system. However, the next and more critical step is that you still have to start digging to find the root cause behind this anomalous telemetry data. Without change awareness, you need to analyze countless possibilities to understand the root cause, even if you follow a guide step by step as you did before, because at the end of the day every incident is unique.

So, what does change awareness involve?

First, if you implement change awareness correctly, it should include all the context related to which service is communicating with which service (such as a service map), how changes made in service A affect other services (downstream and upstream, due to dependencies), configuration changes made at the application level as well as at the infrastructure layer or in the cloud environment, unpinned versions that can cause a snowball effect you can't control, cloud outages, and anything else that could affect business continuity.
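
As a hypothetical sketch of the context described above (the field names, services, and data are invented for illustration, not taken from any specific tool), a single change record and a tiny service map might look like this:

```python
# Hypothetical shape of a change record plus a tiny service map,
# illustrating the context change awareness needs to capture.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChangeEvent:
    service: str            # which service changed
    kind: str               # "code", "config", "infrastructure", "cloud-outage", ...
    author: str             # who made the change
    version: str            # version or revision identifier
    timestamp: datetime     # when the change happened
    summary: str = ""       # why it was made / what it touched

# A tiny service map: which services each service calls (downstream dependencies).
SERVICE_MAP: dict[str, list[str]] = {
    "checkout": ["payments", "inventory"],
    "payments": ["billing-db"],
    "inventory": [],
    "billing-db": [],
}

change = ChangeEvent(
    service="payments",
    kind="config",
    author="alice",
    version="v2.4.1",
    timestamp=datetime(2022, 3, 1, 14, 30),
    summary="Raised the HTTP client timeout from 2s to 10s",
)
# Anything upstream of "payments" (here: "checkout") may be affected by this change.
```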

So, tying these three concepts together:

Observability gives you data in a format that fits your business needs.

Monitoring is a set of checks to make sure your business is running properly.

Change awareness links all of this information, enabling you to understand the what/why/who/how of changes in the system, so you can find the root cause of an incident faster.

Incident troubleshooting in practice

I worked in monolithic environments before moving to microservices, so I have first-hand experience with the huge differences between these types of environments. In monolithic systems, monitoring and observability are nice to have; in microservices, they are absolutely necessary.

Trying to troubleshoot multiple services in a system, each with a different purpose and performing different tasks, is a very complex undertaking. These services are often split into smaller chunks and run several operations at the same time, so information constantly needs to be exchanged between them.

So, for example, when you get an alert (usually a page or a Slack message) that a certain part of your business isn't working properly, a lot of the time, in a truly massively distributed environment, the cause can lie in many different services. Without proper monitoring and observability, you won't immediately be able to tell which service is down. With them, you can understand what's going on in this microservice pipeline and which component is failing.

When you're in the vortex of an incident, you'll spend most of your time trying to understand the root cause of the problem so you can clear the obstacle. You first have to figure out where the problem is happening among your hundreds of applications or servers, and then, once you've isolated the failing service or application, you want to know exactly what's going on. This assumes some prerequisites that are not always met:

You have the necessary permissions for all relevant systems

You understand the whole stack and all the technologies in all of these systems

You have enough experience to fully understand the problem and then solve it

As a DevOps engineer (today at Komodor, previously at Rookout), I often encounter these types of scenarios, so here's a short story from the front line. I remember one time my team and I started getting a lot of errors from a key service in our system [spoiler: we were receiving numeric values and, when we tried to insert them into our database, the column type didn't match].

The only error message we had to work with was: invalid value. We then had to search our systems and recent changes, trying to understand the data and the error we were dealing with. We spent the whole day investigating and finally learned that it came from a change implemented seven months earlier. The column type in the database was integer, and we were trying to insert a larger number, so a bigint data/column type was needed. Without any platform or system to help us associate these errors with the related change, made more than seven months earlier, even something as simple as a number being too big cost an entire experienced team a whole day, because we didn't have change awareness to help us solve the problem.
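
As a rough reconstruction of that bug (the article doesn't name the database, so a Postgres-style 32-bit integer column is an assumption), the whole failure comes down to a range check:

```python
# A Postgres-style "integer" column holds 32-bit signed values;
# anything larger needs a "bigint" (64-bit) column.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1   # -2,147,483,648 .. 2,147,483,647

def fits_int32(value: int) -> bool:
    """Would this value fit in a 32-bit integer column?"""
    return INT32_MIN <= value <= INT32_MAX

print(fits_int32(2_000_000_000))   # True:  fits in the existing integer column
print(fits_int32(3_000_000_000))   # False: produces an "invalid value"-style error,
                                   #        so the column needs migrating to bigint
```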

So, keeping this example in mind, let's look at some other examples showing what troubleshooting can look like with and without change awareness involved.

An observability example

In this example, you'll see a dashboard that shows the number of requests, the number of errors, and the response time (latency).

[Screenshot: an observability dashboard showing request count, error count, and response time]

Observability-based monitoring

Next, monitoring comes in to ingest this data and apply appropriate thresholds, checking whether the current values are acceptable in the context of their history.

[Screenshot: a monitor built on top of these metrics]

This is the result of a "query" that periodically checks historical and real-time data to alert you to any activity that exceeds the 2% error, such as:

What happens next without change awareness:

Without a change-aware solution, you'll receive an alert from Datadog (based on the example above) telling you that activity exceeded the 2% error threshold. You start thinking "Why is this happening?" and come up with theories: the application code might have been modified, there could be network issues, cloud provider or third-party tool issues, or even a problem in another service that this one depends on. To find the real answer, you need to go through a lot of metrics and logs and try to piece together the full picture of the problem in order to find the root cause, but there is little indication of what happened in the system, where it happened, and how it happened, because that kind of data is missing from current monitoring and observability tools.

Monitoring and observability with change awareness:

You'll still get the alert from Datadog, but the difference is that your next troubleshooting steps will be greatly simplified, because your change-aware solution already gives you the necessary context for all of the theories above. With a change-aware solution as your single source of truth, you can immediately see recent historical changes, correlate those changes with the factors that might affect the service (such as code changes, configuration changes, upstream resources, or related services), and quickly find the root cause, instead of scouring multiple tools and their logs and metrics and trying to piece them together into a complete picture, like looking for a needle in a haystack.

This change awareness can be built from release notes, audit logs, version diffs, and attribution (who made the change). The change is then cross-referenced against the map of the many different connected services to find the most likely culprits of the failure, enabling faster recovery.
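
As a hedged sketch of that cross-referencing step (invented data structures and data, not Komodor's actual implementation): given the alerting service, walk its dependencies in the service map and keep the recent changes in any related service, most recent first.

```python
# Sketch (not a real product's logic): rank recent changes by how likely they are
# to explain an alert, using the service's dependencies and the change timestamps.
from datetime import datetime, timedelta

# Which services each service depends on (downstream calls); invented for illustration.
DEPENDENCIES = {"checkout": ["payments", "inventory"], "payments": ["billing-db"]}

# Recent changes: (service, description, when it happened) -- invented data.
CHANGES = [
    ("payments", "deploy v2.4.1", datetime(2022, 3, 1, 14, 30)),
    ("inventory", "config change: cache TTL", datetime(2022, 3, 1, 9, 0)),
    ("billing-db", "schema migration", datetime(2022, 2, 27, 18, 0)),
]

def suspects(alerting_service: str, alert_time: datetime, lookback: timedelta):
    """Return recent changes in the alerting service or anything it depends on,
    most recent first -- these are the most likely culprits."""
    related = {alerting_service}
    queue = [alerting_service]
    while queue:                                   # walk the dependency graph
        for dep in DEPENDENCIES.get(queue.pop(), []):
            if dep not in related:
                related.add(dep)
                queue.append(dep)
    recent = [c for c in CHANGES
              if c[0] in related and alert_time - lookback <= c[2] <= alert_time]
    return sorted(recent, key=lambda c: c[2], reverse=True)

for svc, what, when in suspects("checkout", datetime(2022, 3, 1, 15, 0), timedelta(days=1)):
    print(f"{when:%Y-%m-%d %H:%M}  {svc}: {what}")
```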

Presenting data in the form of a timeline and a service map, rather than just dashboards with thresholds and limits, gives better context for the overall system.

[Screenshot: Komodor's timeline view for a Kubernetes service, showing a Datadog-triggered alert alongside the changes that preceded it]

The screenshot above shows an example from Komodor's change-aware solution for Kubernetes (K8s), displaying an alert triggered by Datadog. Now you have a timeline that shows all the changes that happened in a particular service before the problem occurred, so you have the context to find the root cause faster.

As the screenshot above shows, we can use this information to trace back from the point where the Datadog monitor was triggered and see what exactly happened or changed in the system, to determine the root cause of the problem faster. In this simple case, just before the Datadog alert fired, we can see a health-change event indicating that not enough replicas were available for the application. Just before that, a new version of the app had been deployed. It could be that availability was not maintained during the deployment, or that a code change affected the application and introduced a bug or a breaking change. By drilling into the details of that deployment, we can figure out exactly what triggered the alert in seconds.
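
Without a dedicated tool, a rough manual equivalent of that timeline view is to pull the recent deployment revision and Kubernetes events for the affected service yourself. Below is a minimal sketch using the official Kubernetes Python client; the namespace and deployment name are assumptions.

```python
# Minimal sketch: manually reconstruct a "what changed recently?" view
# for one service using the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()                          # or load_incluster_config() inside a pod
NAMESPACE, DEPLOYMENT = "production", "checkout"   # assumed names for illustration

apps, core = client.AppsV1Api(), client.CoreV1Api()

# 1. Deployment status and current revision (a fresh rollout is a prime suspect).
dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
revision = (dep.metadata.annotations or {}).get("deployment.kubernetes.io/revision")
print(f"{DEPLOYMENT}: revision={revision}, "
      f"available={dep.status.available_replicas}/{dep.spec.replicas} replicas")

# 2. Recent events for the service, e.g. "not enough available replicas" warnings.
for ev in core.list_namespaced_event(NAMESPACE).items:
    if ev.involved_object.name.startswith(DEPLOYMENT):
        print(ev.last_timestamp, ev.type, ev.reason, "-", ev.message)
```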

When data + automation isn't enough

Systems are becoming more complex, with many different moving parts: services, applications, servers, infrastructure, versions, and so on, all at a scale that was previously unheard of. The tools that got organizations to where they are today may not be enough to power tomorrow's systems and stacks.

First we had logs, then traces, and then metrics, all of which were brought together into dashboards to provide visual indications of operational health. Over time, more and more tools have been added to this chain to help process and manage the vast amounts of data, alerts, and information pouring in.

Change awareness will be a key part of enhancing tomorrow's stacks, providing an additional, actionable layer of insight on top of existing monitoring and observability tools. This approach will ultimately help us recover quickly, maintain today's stringent SLAs, and reduce costly, painful, and potentially lengthy downtime.

About the Author:

Mickael Alliel is a self-taught developer who later became a DevOps engineer with a passion for automation, innovation, and creative problem solving. Alliel loves to challenge himself and try new techniques and methods. He is currently developing the next-generation K8s troubleshooting platform at Komodor and is also a connoisseur of French cuisine.

https://www.infoq.com/articles/beyond-monitoring-change-intelligence/
