
How do I resolve common concurrency issues?

Written by | Beekums

Translated by | Rayden

Planning | Wang Qiang

Concurrency errors are notorious for causing very frustrating bugs. Most software bugs are deterministic: if you do X first, then Y, then Z, you get Bug A.

But with concurrency, you get race conditions. With a race condition, if you do X and then Y, you might have only a 10% chance of hitting Bug A. The bug is intermittent, which makes it hard to find the root cause because you cannot reliably reproduce it. It also makes it hard to prove that you actually fixed the problem: if Bug A occurs only 10% of the time, you need to attempt to reproduce it many times before you can be confident your fix worked.

Dealing with concurrency issues was a big part of my job early in my career. I loved using multithreading and fixing race conditions that senior developers had missed, and that kind of work gave my confidence a tremendous boost.

Then I went to an interview and was asked a concurrency question. It did not go well.

It was then that I realized that I was only good at one particular type of concurrency problem, a type that happens to be the cause of most concurrency problems.

A simple concurrency problem in practice

First, let's talk a little bit about what concurrency is. Then we'll walk through a simple concurrency problem, followed by a more complex one.

Concurrency is basically about having multiple independent pieces of code run at the same time. Let's start with a hypothetical example and then move on to a real situation.

Let's say I need to make 5 different requests to an API. Each request takes 100 milliseconds to complete. If I wait for each one to finish before starting the next, I will wait 500 milliseconds in total.


If I execute all 5 web requests at the same time, I will end up waiting 100 milliseconds plus a small amount of extra overhead.


That is a considerable performance difference, and it is why people often use concurrency.
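To make the difference concrete, here is a minimal JavaScript sketch. `fakeApiRequest` is a stand-in of my own for a real network call, not code from the original article:

```javascript
// A stand-in for a real network call that takes ~100 milliseconds.
function fakeApiRequest(id) {
  return new Promise((resolve) => setTimeout(() => resolve(id), 100));
}

// One request at a time: total wait is ~500 ms.
async function sequential() {
  const results = [];
  for (let i = 0; i < 5; i++) {
    results.push(await fakeApiRequest(i));
  }
  return results;
}

// All five requests in flight at once: total wait is ~100 ms plus overhead.
function concurrent() {
  return Promise.all([0, 1, 2, 3, 4].map((i) => fakeApiRequest(i)));
}
```

`Promise.all` starts every request before waiting on any of them, which is where the speedup comes from.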

That sounds like a simple concept, right? That's because it is a simple concept.

The problem is in the execution. Those API requests take about 100 milliseconds each, not exactly 100 milliseconds.

This means that even if you start the API requests in order, the responses can come back out of order:


Each time you run code that makes these API requests, the responses come back in a different order.

You get performance improvements through concurrency, but you give up consistency.

Bugs occur if the code that handles these API request responses uses shared data.

Let's look at a more detailed example of how this happens: a bug in Dynomantle's search-suggestion feature.


The feature works like this: every time you type a character, an API request is made. This lets you see suggestions update smoothly as you type. You type "i" and notes/bookmarks/messages starting with "i" pop up. You type "in" and the list smoothly changes to items that start with "in".

When you know what you're searching for, how long does it take to enter 5 characters? 2 seconds? 1 second? Half a second?

I still need to optimize this service; right now it takes half a second to a second to process each API request.

Having users wait a second after typing each character before typing the next character is a poor user experience. So I make an API request as the user types each character. The problem is that the order in which requests are returned is inconsistent. A request with 2 characters may be returned after a request with 5 characters.

Search suggestions are stored as a list. Whenever a response comes in, the entire list refreshes. In this case, when the last request returns, the entire list is refreshed with the correct suggestions, but when the old request returns, the list is populated with incorrect suggestions.


Fortunately, this is a very easy problem to solve because the requests are made sequentially.

1) Generate a timestamp or hash value each time a request is made; this becomes the request ID.

2) Store that request ID in a variable alongside the suggestion list. Because we submit requests in order, this variable always holds the ID of the latest request.

3) Pass the request ID into the success callback of each API call.

4) When a response arrives, verify that it is the response to the expected (latest) request before refreshing the list.
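The original article's code for these steps did not survive extraction, so here is a hedged sketch of the four steps. All names (`fetchSuggestions`, `onUserTyped`, `currentSuggestions`) are my placeholders, not Dynomantle's actual code, and a simple counter stands in for the timestamp/hash request ID:

```javascript
// Placeholder for the real API call: resolves after `delayMs` with
// fake suggestions for `query`.
function fetchSuggestions(query, delayMs) {
  return new Promise((resolve) =>
    setTimeout(() => resolve([query + "..."]), delayMs)
  );
}

let requestCounter = 0;   // stands in for a timestamp/hash request ID
let lastRequestId = 0;    // always the ID of the most recent request
let currentSuggestions = [];

function onUserTyped(query, delayMs) {
  const requestId = ++requestCounter; // step 1: generate a request ID
  lastRequestId = requestId;          // step 2: remember the latest request
  // step 3: requestId is captured by the success callback's closure
  return fetchSuggestions(query, delayMs).then((suggestions) => {
    // step 4: refresh only if this is the response we expect
    if (requestId === lastRequestId) {
      currentSuggestions = suggestions;
    }
  });
}
```

Any response whose ID no longer matches the latest request is simply dropped, so an old response can never overwrite a newer one.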

Unfortunately, with this fix, users who type very quickly may see the suggestion list lag: intermediate responses are discarded, so the list doesn't update until the response to the latest request arrives. Even if the user won't use the 2-character suggestions, seeing the list update gives the impression that the app is doing something instead of just waiting.


Solving this problem requires a small modification to the approach above.

We will keep using timestamps, rather than hashes, as request IDs.

Next, we'll store the last request ID received, not the last request ID made.

Finally, we refresh the list only if the response's request ID is higher than the last request ID received. Because we use timestamps as request IDs, all requests are ordered: the larger the ID, the newer the request.

Note: this mechanism works on the premise that the user does not type multiple characters within the same millisecond. If they do, they are pasting content, in which case we only need to make a single API request.
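As a hedged sketch of this improved version (again, all names are my placeholders), the key change is tracking the last request ID *received* and rendering any response newer than it, so responses that arrive in order still update the list:

```javascript
// Placeholder for the real API call.
function fetchSuggestions(query, delayMs) {
  return new Promise((resolve) =>
    setTimeout(() => resolve([query + "..."]), delayMs)
  );
}

// Monotonic timestamps: nudged forward so two calls in the same
// millisecond still get increasing IDs (the note above assumes real
// typing never produces two characters in the same millisecond).
let lastIssued = 0;
function nextTimestamp() {
  lastIssued = Math.max(lastIssued + 1, Date.now());
  return lastIssued;
}

let lastReceivedId = 0;   // newest request ID whose response we rendered
let currentSuggestions = [];
let renderCount = 0;

function onUserTyped(query, delayMs) {
  const requestId = nextTimestamp();
  return fetchSuggestions(query, delayMs).then((suggestions) => {
    // Render only responses newer than the newest one already rendered;
    // stale (out-of-order) responses are dropped.
    if (requestId > lastReceivedId) {
      lastReceivedId = requestId;
      currentSuggestions = suggestions;
      renderCount++;
    }
  });
}
```

With this version, a fast "in" response renders while the "inv" request is still in flight, and a slow, stale "i" response is still discarded.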

It's also important to note that this only works because of the way JavaScript handles concurrency. It is not true concurrency: each function runs to completion before another function runs.

Try similar code in Java and you'll have a bad time, because multiple calls to suggestionsReceived() may execute at the same time. That means the responses for both "in" and "inv" can pass the check in the if statement and then execute the rest of the function.

The behavior you see will be very inconsistent, depending on how long the rest of the function takes and the relative timing of the two calls. For this to work properly in a truly concurrent language, you need to learn how to use that language's locks. If you're dealing with concurrency across multiple servers, you can also consider Redis's distributed locks.

When one function holds a lock, the lock prevents other functions from executing until it is released. If we needed a lock in JavaScript, it might look like this:
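The article's code sample did not survive here, so as an illustration, a minimal promise-based lock might look like the sketch below. The `Lock` class is my own construction, not the author's code. Synchronous JavaScript never needs this because of run-to-completion; it only matters when a critical section contains an `await`:

```javascript
// A minimal promise-based lock (mutex) sketch.
class Lock {
  constructor() {
    // The tail of the queue: resolves when the last holder releases.
    this.last = Promise.resolve();
  }

  // Runs `fn` only after every previously acquired holder has finished,
  // and releases the lock automatically when `fn` settles.
  acquire(fn) {
    const run = this.last.then(() => fn());
    // Keep the chain alive even if fn rejects, so the lock is released.
    this.last = run.catch(() => {});
    return run;
  }
}
```

Usage: wrap each response handler in `lock.acquire(async () => { ... })`; a handler that contains an `await` can then no longer interleave with another handler.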

Of course, there is a risk in doing this: if we never release the lock, other functions will never execute. And if we use multiple locks across multiple functions, we can end up with two functions each waiting for a lock that the other holds. The program is now stuck because neither function can proceed. This is called a deadlock.

The search suggestion bug in Dynomantle is a simple concurrency problem because it's in Javascript. Let's explore a more complex problem, which occurs in Java, but its lessons should be helpful for many other problems.

A more complex concurrency problem in practice

My first job after graduating from college was developing a network management application. For example: you visit a company and connect to its wifi network. Our app let system administrators configure your network access based on login credentials, location in the office, time of day, and so on. They could enable or block a port based on company policy.

Concurrency came into play because we supported multiple protocols. We supported 802.1x for wifi, but we also supported authentication based on the Ethernet port the user was connected to, the Kerberos authentication protocol, and a few others.

As soon as you turned on your computer, it would try to connect using as many protocols as were configured, all at the same time.

Suppose an administrator sets a restrictive policy for Ethernet-port access, through which you can only reach ports 80 and 443 (basically just web browsing), while authenticating with Kerberos grants broader network access. Authentication attempts can complete in any order, so if a user passes multiple authentications, the administrator configures which protocol takes priority.

When I started on this project, the code handed to me stored authentication status in a database table with a single row per MAC address:

primary_key - int

mac_address - varchar(255), with a unique key

authentication_protocol - enum

status - enum

(The real table was much more complicated, but this was 15 years ago, so please forgive me.)

The authentication_protocol column stored the highest-priority protocol that authenticated successfully. If all authentication attempts failed, it still stored the highest-priority protocol attempted.

I've simplified the problem here. In reality we needed thousands of lines of code to coordinate all the different protocols, figure out which had the highest priority, handle protocols with multiple authentication steps, manage all sorts of locks, and work around firmware quirks of various switch and router manufacturers. Customers were very unhappy because it rarely worked properly, and users were often assigned the wrong network policy.

In the first few months of my career, I spent most of my time fixing one bug after another. Eventually I realized that the problem wasn't that our users' needs were complex; it was that we had built a bad data model that made the code unnecessarily complex.

The solution was simple. Take the same database table as above, but instead of one row per MAC address, store one row per MAC address and protocol combination. Previously the program had to coordinate which protocol was reflected in a MAC address's single row; after the change, each protocol got its own row.

Concurrency still happens, but you no longer need to reconcile what data gets saved while it does. Each thread/process has its own row that it alone is responsible for modifying; no other thread/process touches it. When determining a user's network access, you simply look up all the rows for that user and select the relevant one.

There are no locks and no shared data to modify.
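The read-time selection can be sketched like this. Field names and the priority order are illustrative, not from the original system:

```javascript
// Each concurrent authentication flow writes only its own row; network
// access is decided afterward by a read over all of a user's rows.
const PRIORITY = ["kerberos", "802.1x", "ethernet_port"]; // highest first

function effectiveProtocol(rows) {
  // rows: [{ macAddress, protocol, status }], one row per protocol
  const succeeded = rows.filter((row) => row.status === "success");
  for (const protocol of PRIORITY) {
    if (succeeded.some((row) => row.protocol === protocol)) {
      return protocol; // highest-priority successful authentication
    }
  }
  return null; // nothing succeeded: fall back to the default policy
}
```

Because the selection is a pure function of the stored rows, it doesn't matter in what order the protocol handlers finished writing them.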

The code ends up being simpler because you can ignore most concurrency. The developers are happy. The code is working properly and the customer is happy.

In practice, administrators only set 2-3 policies per person, so I ended up increasing the size of the table by 2-3 times. But that is linear growth, and databases handle linear growth easily. Growing from 1,000 rows to 3,000 rows is irrelevant; even growing from 1 million rows to 3 million is easy for modern hardware.

Growing from 1 billion rows to 3 billion rows could be a problem, but long before you reach 1 billion rows you should already have started scaling your database to support 3 billion.

All of this is a long way of saying: tripling the size of the table was worth it, because it meant we no longer had to worry about concurrency.

This type of problem is a common one in concurrency. A lot of data seems to need to be accessed and modified simultaneously by different concurrent processes, but in most cases that isn't actually true. Adjusting your data model, and taking advantage of the fact that storage is cheap, can save your team a lot of work.

https://blog.professorbeekums.com/2021/solving-concurrency-problems/
