
Resolve concurrency issues in event-driven microservices

Author | Hugo Rocha

Translated by | Hirakawa

Planning | Yan Garden

Franklin D. Roosevelt once said that we think too much about the good luck of the early bird and not enough about the bad luck of the early worm. I never play the lottery. The odds of winning are staggeringly small; in fact, you are more likely to become a saint or president of the United States than to win one of the big lotteries (such as EuroMillions in Europe or Powerball in the United States).

Concurrency issues in event-driven services are like a lottery you are guaranteed to win, only in a bad way. The probability of any particular concurrency issue occurring may be low, but it all comes down to the number of attempts, and since these services handle such a large volume of events, an unlikely event becomes almost certain to happen. For example, we once had a problem with roughly a one-in-a-million chance of occurring. The service processed about a hundred messages per second, which means the problem showed up roughly once every three hours. By design, event-driven services need to cope with enormous scale and throughput, which makes them particularly prone to concurrency issues.

Concurrency issues, or race conditions, are unexpected behaviors that appear when code runs in parallel and that would not occur if the same code ran single-threaded. Dealing with concurrency doesn't come naturally to most programmers; we are used to reasoning about our code in a single-threaded way. Detecting issues and ensuring that code is safe to run in parallel often requires extensive experience or specialized training. Concurrency issues are also not obvious and are often only exposed in production, because the throughput of a local or development environment is very different from that of the real environment.

Mars Rover

NASA, for example, has very strict coding guidelines and a very detailed, meticulous quality assurance process. After all, debugging something that is no longer on Earth is not quite the same as analyzing the average production problem (although sometimes it can feel that way). A transient error that most developers would be tempted to ignore is often the symptom of a race condition. NASA doesn't let such problems slide; it may trace them beyond the application, and developers may have to dig all the way down into the operating system layer to find the root cause. After all, there are millions of dollars at risk. But even with such a tireless process, race conditions still happen; I remember, for example, the episode where NASA lost contact with a rover due to a race condition.

The inevitability of concurrency issues and the high throughput of event-driven services make it all the more imperative to develop a well-thought-out strategy that addresses concurrency at its root. An important property of event-driven services is the ability to scale out by adding multiple instances of the same service. This invalidates traditional concurrency handling, because different requests may be routed to different instances, rendering in-memory locks such as mutexes, locks, or semaphores useless. Distributed systems typically resort to external tools to manage distributed concurrency, such as Consul or ZooKeeper. With event-driven services, however, we can introduce an essentially different way of handling concurrency. End-to-end message routing is a highly efficient and scalable approach that handles concurrency by design (through the architecture) rather than by implementation (by resorting to external tools or in-service locking).

Over the years, with the help of RabbitMQ and Kafka, we have tried several different approaches in several different production use cases. We ultimately decided to deal with concurrency by design whenever possible, rather than through implementation. Here are some of the solutions we have used in production; hopefully they give you some inspiration for dealing with your own concurrency issues.

1

Example of a concurrency problem

Let's illustrate this with an example. Imagine we have an online product sales platform where users can subscribe to notifications for new and back-in-stock products. Whenever the inventory of a desired product increases, users receive notifications by email, SMS, and so on. The service that owns product and inventory information publishes an event every time the inventory changes. The subscription service must detect when a product's inventory changes from 0 to 1 and send notifications when that happens. The following diagram illustrates this situation.

The subscription service handles ProductStockIn events and reacts when product inventory changes. Because notifications are only relevant when inventory changes from 0 to 1, the service keeps the current inventory of each product as internal state. The flow for handling a ProductStockIn event includes the following steps:

1. The product service publishes the event;

2. The subscription service handles the event;

3. Get the local inventory and check if the inventory changes from 0 to 1;

4. Get the current subscription information;

5. Send notifications for each subscription;

6. Update local inventory data.

In a single-threaded mindset this approach makes sense and doesn't create any problems. However, to fully use the service's resources and achieve reasonable performance, we should add parallelism to the service. What happens if the service handles two or more events at the same time? A race condition can cause the service to notify the same subscription twice. If the service processes two inventory-change events (for example, inventory going from 0 to 1 and from 1 to 2) and runs the check in step 3 for both at the same time, the check passes for both events, creating a race condition and sending the same notification twice.
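To make the race concrete, here is a minimal sketch of what the handler's read-check-notify-update sequence might look like (the in-memory dictionaries and the notifier object are illustrative stand-ins, not code from the original article). Two events for the same product processed in parallel can both read a stock of 0 before either one writes the update, so both send the notification.

```python
from collections import defaultdict

class SubscriptionHandler:
    """Illustrative handler; in-memory dicts stand in for the service's real storage."""

    def __init__(self, notifier):
        self.stock = defaultdict(int)            # product_id -> last known inventory
        self.subscriptions = defaultdict(list)   # product_id -> subscribers
        self.notifier = notifier

    def handle_product_stock_in(self, product_id, new_stock):
        current = self.stock[product_id]                         # step 3: read local inventory
        if current == 0 and new_stock > 0:                       # step 3: check the 0 -> 1 transition
            for subscriber in self.subscriptions[product_id]:    # step 4: current subscriptions
                self.notifier.send(subscriber, product_id)       # step 5: notify (can run twice)
        self.stock[product_id] = new_stock                       # step 6: update local inventory
```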

To deal with this problem, we could simply serialize the thread execution with traditional concurrency primitives (locks, mutexes, semaphores, and so on). However, this traditional approach only works for single-instance services, as shown in the following figure.

Because in-memory locks are shared only within the instance that acquired them, other instances can still handle other events at the same time. Two inventory-change events for the same product can be handled by different instances, and even if both instances lock their execution, the locks are only valid inside each instance; nothing prevents a concurrency problem between the two instances. Since an important property of event-driven services is the ability to scale horizontally, this traditional approach is clearly inadequate here.
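For illustration, a per-product lock like the one below (a sketch reusing the hypothetical handler above) serializes the critical section within a single process. A second instance of the service, however, holds its own product_locks dictionary, so the race across instances remains.

```python
import threading
from collections import defaultdict

# One lock per product, but only within this process: another instance of the
# service holds its own, unrelated locks, so cross-instance races are untouched.
product_locks = defaultdict(threading.Lock)

def handle_with_lock(handler, product_id, new_stock):
    with product_locks[product_id]:
        handler.handle_product_stock_in(product_id, new_stock)
```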

An alternative to local locks is to use a database to prevent concurrency issues. A typical pessimistic approach to concurrency (more on pessimistic and optimistic approaches below) is to wrap the operation in a transaction. In general, however, there is no straightforward way to guarantee transactional consistency in the presence of external dependencies without entering the realm of distributed transactions, which we most want to, and should, avoid. Relying on transactional consistency is also limited by the technology that supports it; many NoSQL databases do not offer the same guarantees as traditional relational databases.
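As a rough illustration of the database-backed pessimistic variant (assuming PostgreSQL with the psycopg2 driver; the table and column names are made up for the example), SELECT ... FOR UPDATE locks the product row so that a concurrent handler blocks until the first transaction commits. Note that the notification itself is an external call the transaction cannot roll back, which is exactly the limitation described above.

```python
import psycopg2  # assumed driver; any relational database with row locking works similarly

def handle_stock_in(conn, product_id, new_stock):
    with conn:                                    # commit on success, roll back on error
        with conn.cursor() as cur:
            cur.execute(
                "SELECT stock FROM product_stock WHERE product_id = %s FOR UPDATE",
                (product_id,),
            )
            (current,) = cur.fetchone()
            if current == 0 and new_stock > 0:
                # look up subscriptions and send notifications here:
                # an external side effect the database cannot roll back
                pass
            cur.execute(
                "UPDATE product_stock SET stock = %s WHERE product_id = %s",
                (new_stock, product_id),
            )
```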

2

Pessimistic approach vs optimistic approach

There are two ways to deal with concurrency: the pessimistic approach and the optimistic approach.

A pessimistic concurrency strategy prevents concurrency issues by blocking parallel access to the required resources. This type of strategy assumes contention will happen and therefore restricts access to resources up front. It is suitable for use cases with a high likelihood of concurrency, where two processes are likely to access the same resource at the same time.

An optimistic concurrency strategy assumes contention will rarely happen. Instead of blocking access, it provides a policy for when a concurrency problem does occur: handle the failed operation, throw an error, or retry the operation. Optimistic concurrency is most effective in environments with a low likelihood of contention.

Pessimistic concurrency affects performance and limits the overall parallelism of the solution. Optimistic concurrency can deliver good performance because it doesn't lock anything and simply reacts to failures; in a low-contention environment it behaves almost as if there were no concurrency-handling strategy at all. However, when the likelihood of contention is high, the cost of retrying operations is typically much higher than the cost of restricting access to the resource. In those cases it is best to use pessimistic concurrency.
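Here is a minimal sketch of the optimistic style, assuming a hypothetical store that exposes a versioned read and a compare-and-set operation (neither is from the article): the write only succeeds if nobody else has changed the row since we read it, and otherwise we retry.

```python
def update_stock_optimistically(store, product_id, new_stock, max_retries=3):
    for _ in range(max_retries):
        current_stock, version = store.read(product_id)       # read the value plus its version
        if store.compare_and_set(product_id, new_stock, expected_version=version):
            return True                                       # nobody wrote in between
        # somebody else updated the row first; re-read and try again
    return False                                              # contention too high, escalate
```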

3

Anatomy of a Kafka topic

Kafka is a popular event streaming platform. If you use it only for simple publish-subscribe and event streaming use cases and don't pay much attention to how it works internally, you might be missing out on some powerful capabilities, such as its event-routing features.

Published events are sent to a topic. Kafka topics (similar to queues, but they retain every event even after it has been consumed, like a distributed event log) are divided into partitions. The following image shows the anatomy of a Kafka topic:

When an application publishes an event to a given topic, the event is stored in a specific partition. To assign events to partitions, Kafka hashes the event key to compute the partition; when there is no key, it round-robins across partitions. Note, however, that by using keys we can ensure that all events with the same key are routed to the same partition. As we will see, this is a key property.
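As a sketch of what keyed publishing can look like with the kafka-python client (the topic name, broker address, and payload shape are assumptions for the example), using the product ID as the key guarantees that every event for that product lands in the same partition:

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The key (the product ID) is hashed by Kafka to pick the partition,
# so all events for product 251 end up in the same partition, in order.
producer.send("product-stock-in", key="251", value={"productId": 251, "newStock": 1})
producer.flush()
```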

Consumers handle events from the topic. Typically, event-driven services scale out: we increase their throughput by adding instances of the same service. A service like the subscription service in our example can therefore have multiple instances consuming from the same topic at the same time, which exposes it to the concurrency issues we discussed earlier. Each partition, however, is consumed by one and only one service instance.

Kafka guarantees ordering per partition, not per topic. That is, if you publish messages to a topic, there is no guarantee that consumers will receive them in order (although they likely will, unless a network partition or a rebalance occurs, which is uncommon). However, Kafka does guarantee the order of messages within a single partition, and each partition is consumed by only one instance in a consumer group.
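On the consuming side, a sketch along these lines (again with kafka-python; the topic and group names are assumptions) shows the relevant point: every instance of the subscription service joins the same consumer group, and Kafka assigns each partition to exactly one instance of that group.

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "product-stock-in",
    group_id="subscription-service",          # all instances share this group id
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Records with the same key always come from the same partition, in order,
    # and within the group that partition is owned by this instance alone.
    print(message.partition, message.value)
```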

Kafka is a distributed event streaming platform, and the key word is "distributed". Each partition is assigned to a machine, which means a topic can be physically stored across several machines (along with its fault-tolerant replicas). This enables high scalability and high availability. However, if you've been dealing with distributed systems long enough, you know how hard it is to guarantee ordering across several machines, which is why Kafka only guarantees order within a partition and not across the whole topic.

This trade-off is not for nothing, though; it gives us the following three properties:

Each partition is consumed by one and only one service instance.

Events with the same routing key are routed to the same partition.

Order is guaranteed within a partition.

These three properties lay the foundation for a truly useful solution: they give us the tools to consume events sequentially and without concurrency issues, as we'll see next.

4

Handling concurrency by design

As mentioned above, we can apply pessimistic or optimistic strategies to deal with concurrency. However, there is a completely different approach: dealing with concurrency through design. Instead of applying strategies to handle concurrency, we design the system so that there is no concurrency at all. Of course this is the ideal approach, and it is often not feasible in non-event-driven solutions; but by leveraging the three properties we just discussed, event-driven services are the prime beneficiaries of handling concurrency by design.

In event-driven services, a very efficient way to handle concurrency by design is to use the ability to route events to specific partitions. Since each partition is consumed by only one instance, we can route each set of events to a specific instance based on a routing key. With the right routing keys, we can design our systems to avoid concurrency within the same entity.

How do we apply this philosophy to the product and subscription services we've been discussing? Let's say we use the product ID as the routing key. According to the properties we just discussed, all events for the same product will be routed to the same partition, and since a partition is consumed by only one instance, all events for that product will be handled by a single instance, as follows:

All inventory events for product 251 are guaranteed to be consumed by subscription service instance #1 and only by that instance. Since no other instance can handle events for this product, we can fall back on traditional methods for whatever concurrency remains, that is, in-process concurrency-handling strategies such as locks. We have turned a distributed concurrency problem into an in-process concurrency problem, which is comparatively simple to deal with. Inside the subscription service we can even apply the same strategy again and route events to specific threads. This end-to-end event routing eliminates concurrency in a highly scalable and sustainable way.
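A sketch of that last idea (an assumption about how one might implement it, not code from the article): hash the product ID to pick a worker thread, the same way Kafka hashes the key to pick a partition, so that events for one product are always processed by the same thread, in order.

```python
import queue
import threading

NUM_WORKERS = 4
work_queues = [queue.Queue() for _ in range(NUM_WORKERS)]

def process_event(product_id, new_stock):
    """Per-product business logic, e.g. the handler from the earlier sketch."""

def worker(q):
    while True:
        product_id, new_stock = q.get()
        process_event(product_id, new_stock)   # only this thread ever sees this product
        q.task_done()

for q in work_queues:
    threading.Thread(target=worker, args=(q,), daemon=True).start()

def dispatch(product_id, new_stock):
    # Same idea as Kafka's key hashing, applied to threads instead of partitions.
    work_queues[hash(str(product_id)) % NUM_WORKERS].put((product_id, new_stock))
```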

Because Kafka guarantees order within a single partition, the events also arrive in order, so we avoid the additional complexity of dealing with out-of-order events.

By solving concurrency through design, we build a system with no concurrency on any given entity. This performs better and produces fewer errors, because it involves neither locking specific resources, as the pessimistic approach does, nor retrying operations, as the optimistic approach does. It also helps when developing new features, because developers don't have to reason about concurrency edge cases; we can assume concurrency simply doesn't exist.

5

Brief summary

Concurrency in distributed systems is a tricky problem. Both pessimistic and optimistic approaches are options, but they usually come with a performance penalty. While useful in some use cases, they affect the scalability of microservices because they involve locking or retrying. Event-driven services, with their ability to route events to specific service instances, provide an elegant way to eliminate concurrency from the solution by design, laying the foundation for true horizontal scalability.

View the original English text:

https://itnext.io/solving-concurrency-in-event-driven-microservices-79bbc13b597c
