
Landing a high-concurrency project

Author: DBAplus Community

Preface

Recently, with some free time on my hands, I read a lot of articles about high concurrency, and they reminded me of a project I worked on a few years ago that had everything to do with it.

Here I'll recall the details of how it went from design to launch and share them with you; perhaps they will give you some inspiration.

1. Requirements and background

Let me start with the requirements. The project is a college application preference-filling system for exam candidates; since it involves high concurrency, you can probably guess which kind of preference filling it is.

There are two core functions: candidates fill in their application preferences, and teachers maintain the candidates' data.

Originally this project wasn't ours. But last year the team in charge of it received serious complaints from the client (Party A): many candidates experienced freezes, and some even blamed the system for their failure to submit their preferences.

The client made it clear that if the same thing happened again this year, every one of the company's projects in the province would be at risk of being replaced.

The discussions dragged on, and by the time the company finally dropped the task on us, several months had passed and less than half a year remained before the critical filling period.

Our direct manager told us not to feel burdened: do it well and the credit is ours; do it badly and the blame isn't. But we could clearly feel the pressure on him, after all, one careless slip and this could end up on the news.

2. Analysis

Since we had taken it on, complaining was pointless, so we went straight into analyzing the requirements.

The business logic itself is not complicated; the difficulty lies in concurrency and data accuracy. After communicating with the customer and getting a general picture of the concurrency requirements, I sorted them out as follows:

  • The candidate-side login interface and the preference-information query interface must sustain 40k QPS
  • The interface for saving candidates' preferences must sustain 20k TPS
  • Admission information query: 40k QPS
  • The teacher side needs 4k QPS
  • Import and similar interfaces have no hard limit and can be processed asynchronously, as long as all information is updated within 20 minutes; fault recovery must also be completed within 20 minutes (hard requirement)
  • Candidate-side data must be absolutely accurate: no omissions, errors, or anything inconsistent with the candidate's actual operations
  • Data masking and anti-forgery
  • Resources are limited: only a few physical machines are provided

Those are the main requirements. The crux is achieving such high concurrency with limited resources; that takes real thought, and ordinary CRUD simply won't cut it.

3. Solution discussion

Next, starting from the angle we took at the time, I'll walk step by step through the preliminary design, the thinking, and the implementation, to show how the project was actually realized.

We didn't start by designing tables or interfaces; we started by testing. Testing what? Whether the middleware we needed, or might need, could meet the requirements.

MySQL

First, MySQL. We tested the read and write performance of a single MySQL node against a newly created user table.

Inserting and querying concurrently, TPS came in at about 5k and QPS at about 12k. The queries were by primary-key ID; querying by an indexed column was slightly worse than by ID, with a gap of roughly 1k.

Insert and update differ slightly in throughput, but the gap is basically negligible; the biggest factor hurting write performance is indexing. With indexes on the table, TPS drops by 1k-1.5k.

The preliminary conclusion: a single MySQL node doesn't meet the requirements. Other architectures could be considered, such as master-slave replication with read/write splitting.

After testing, we gave that up too: the master-slave replication structure hurts updates (a drop of a few hundred TPS), and the write TPS on its own still couldn't meet the requirement.

At this point the conclusion was clear: hitting MySQL directly is definitely not feasible.

Redis

Since direct MySQL reads and writes can't meet the requirements, the natural next thought is adding a Redis cache. So we moved on to testing the cache, again starting from a single Redis node.

The GET command reached an astonishing 100k QPS, and SET reached 80k TPS; it was expected and yet still a pleasant surprise, like seeing the first light of dawn.

However, Redis can lose data, so a high-availability scheme has to be considered.

Implementation plan

Since Redis meets the requirements, all reads come from Redis, while persistence is still left to MySQL: on writes, a message is sent first and the data is then written to the database asynchronously.

So the solution is essentially Redis + RocketMQ + MySQL. It looks simple, and that's what we thought at the time; reality proved we were naïve.

Below I'll focus on the most important and most demanding interface: saving a candidate's preference information.

1) Fault recovery

The first question is: what happens if one of these nodes goes down?

MySQL is the simple case: its own mechanisms ensure that even after a crash it can recover its data on restart, so we can set it aside.

RocketMQ, in general, can lose data when it crashes, and testing confirmed that messages were indeed lost under high concurrency. The reason is that for efficiency it flushes to disk asynchronously by default; to guarantee that no message is ever lost, we switched it to synchronous flush.
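For reference, this is the kind of change involved, a minimal sketch of the relevant broker.conf setting (the rest of the broker configuration is omitted):

```properties
# Default is ASYNC_FLUSH; SYNC_FLUSH makes the broker acknowledge a message only
# after it has been written to disk, trading throughput for durability.
flushDiskType = SYNC_FLUSH
```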

Then there is the most critical piece, Redis: no matter which mode it runs in, if Redis goes down under high concurrency there is a risk of losing data. For this project data loss was fatal, ranking even above the concurrency requirements.

So the problem became how to guarantee that the Redis data is correct, and after discussion we decided to use Redis transactions.

The save interface then works in the following steps (a sketch follows the list):

  • Open a Redis transaction and queue the update to the candidate's data
  • RocketMQ flushes the message to disk synchronously
  • Commit the Redis transaction
  • Write to MySQL asynchronously (a consumer of the message persists the data)
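To make the flow concrete, here is a minimal sketch of those four steps, assuming the Jedis client and the RocketMQ Java producer; the key names, topic name, group name, and addresses are illustrative, not taken from the project:

```java
import java.nio.charset.StandardCharsets;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.client.producer.SendStatus;
import org.apache.rocketmq.common.message.Message;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class SavePreferenceSketch {

    public void save(String candidateId, String preferenceJson) throws Exception {
        long ts = System.currentTimeMillis();    // timestamp reused later for reconciliation

        // Built inline for brevity; in practice the producer is a shared singleton.
        DefaultMQProducer producer = new DefaultMQProducer("preference-producer-group");
        producer.setNamesrvAddr("mq-nameserver:9876");
        producer.start();

        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            // Step 1: open a Redis transaction and queue the updates (nothing executes yet).
            Transaction tx = jedis.multi();
            tx.set("preference:" + candidateId, preferenceJson);
            tx.set("preference:ts:" + candidateId, String.valueOf(ts));

            // Step 2: send the message; with the broker on SYNC_FLUSH, SEND_OK means
            // the message has been persisted to disk.
            Message msg = new Message("PREFERENCE_SAVE",
                    (candidateId + "|" + ts + "|" + preferenceJson).getBytes(StandardCharsets.UTF_8));
            SendResult result = producer.send(msg);
            if (result == null || result.getSendStatus() != SendStatus.SEND_OK) {
                tx.discard();                     // abort: Redis is left untouched
                throw new IllegalStateException("MQ send failed");
            }

            // Step 3: commit the Redis transaction only after the message is safe.
            tx.exec();
        } finally {
            producer.shutdown();
        }
        // Step 4: a separate consumer reads the message and writes MySQL asynchronously.
    }
}
```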

Now let's look at what can go wrong in this flow.

Step 1: if opening the transaction or updating the Redis data fails, the page shows an error, and data accuracy is unaffected.

Step 2: if the RocketMQ write goes wrong, there are two scenarios.

Scenario 1: the flush fails and the message is not sent. This has no real impact; we can simply report an error.

Scenario 2: the message is actually sent successfully, but the call reports a failure (for whatever reason). In that case the MySQL and Redis data will end up inconsistent.

How do we handle that? How do we even know whether Redis or MySQL has the problem? And if the candidate makes no further changes, the wrong data will never be corrected.

With this in mind, we decided to introduce a timestamp field and run a scheduled task that compares MySQL and Redis and repairs inconsistencies automatically.

First, the timestamp is written to Redis; the same timestamp is carried in the message and recorded in the table when the data is persisted.

Then a scheduled task runs every 30 minutes, checking whether the timestamp in Redis is smaller than the one in MySQL; if so, the Redis data is refreshed from MySQL, and if it is larger, nothing is done.

On top of that there is another safeguard: a scheduled task runs in the early morning looking for records whose Redis timestamp is greater than the MySQL one; if such a record persists for two consecutive days with no further updates, it is flagged for manual handling by ops. A sketch of the reconciliation job follows.
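A minimal sketch of what the 30-minute reconciliation job might look like, assuming Spring's @Scheduled and Jedis; PreferenceMapper and PreferenceRow are hypothetical stand-ins for the MySQL-side DAO, not the project's actual classes:

```java
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import redis.clients.jedis.Jedis;

@Component
public class PreferenceReconcileJob {            // assumes @EnableScheduling is configured elsewhere

    @Autowired
    private PreferenceMapper preferenceMapper;   // hypothetical DAO over the MySQL table

    @Scheduled(fixedRate = 30 * 60 * 1000)       // every 30 minutes
    public void reconcile() {
        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            for (PreferenceRow row : preferenceMapper.selectAll()) {
                String tsStr = jedis.get("preference:ts:" + row.getCandidateId());
                long redisTs = (tsStr == null) ? 0L : Long.parseLong(tsStr);
                if (redisTs < row.getTimestamp()) {
                    // Redis is behind MySQL (e.g. the Redis commit was lost): refresh it.
                    jedis.set("preference:" + row.getCandidateId(), row.getPreferenceJson());
                    jedis.set("preference:ts:" + row.getCandidateId(),
                            String.valueOf(row.getTimestamp()));
                }
                // redisTs greater than MySQL is ignored here; the separate early-morning job
                // flags records that stay that way for two days with no new writes.
            }
        }
    }

    // Hypothetical MySQL-side types, stubbed so the sketch is self-contained.
    public interface PreferenceMapper {
        List<PreferenceRow> selectAll();
    }

    public interface PreferenceRow {
        String getCandidateId();
        long getTimestamp();
        String getPreferenceJson();
    }
}
```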

Step 3: the message is committed successfully but the Redis transaction fails to commit. This is the same situation as scenario 2 above and is handled by the scheduled reconciliation task.

This way, even if Redis goes down, no data is lost.

The first round of stress testing

With the interface implemented, we went into the stress test full of anticipation and confidence, and the start was indeed promising.

First, data accuracy really was fine: no matter which component we killed abruptly, eventual consistency of the data was guaranteed.

But TPS was under 4k. Was it because there were too few nodes?

So we added a few more nodes, but there was still no improvement. Clearly the problem was not that simple.

Re-analysis

This round of stress testing raised a key question: what exactly determines the interface's TPS?

After some discussion, the first conclusion was that an interface's response time is the sum of the time spent in its individual steps, and the total is dominated by the slowest one.

So I used Arthas to trace the interface and see exactly which step was slow.

It turned out the slowest step was the Redis data update, completely contrary to the benchmarks we ran during selection. So we dug deeper into this step.

The conclusion is:

Redis itself is excellent middleware and its throughput really is there; the benchmarks during selection were not wrong.

The problem is IO. We store each candidate's information in Redis as a JSON string (why not another data structure? Because we had tested the other viable structures beforehand and found that plain JSON strings gave Redis the best performance). Although a single candidate's data is not large, under high concurrency the uplink bandwidth was saturated.

So, to address this, we gzip-compress the string before saving it to Redis.

Why gzip? The preference information is an array, and much of the repeated content is actually field names. We weighed gzip against several other compression algorithms, and considering both compression ratio and performance, gzip is what we chose at the time.

For strings that still exceed the size limit, we additionally split them across multiple keys (no more than three). A compression sketch follows.
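A minimal sketch of the compression step using the JDK's built-in GZIP streams; how the compressed bytes are keyed and split across keys is up to the caller, and the names here are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class GzipCodec {

    // Compress the preference JSON before writing the bytes to Redis.
    public static byte[] compress(String json) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Decompress what was read back from Redis.
    public static String decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            for (int n; (n = gzip.read(buf)) != -1; ) {
                bos.write(buf, 0, n);
            }
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```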

Continuing the stress tests

Another round of stress testing showed a clear improvement: TPS went from 4k to 8k. Not bad, but far from enough; the target is 20k and we weren't even halfway there.

We added a few more nodes; it helped, but not much, and in the end we couldn't break 10k.

Digging further into why it was slow, we finally found it was stuck on RocketMQ's synchronous flush.

Was synchronous flushing simply too slow? Another round of stress tests confirmed that this was indeed the case.

No matter how we tuned it, with synchronous flushing everything bottlenecked where RocketMQ writes to disk; and since the strings were already compressed, bandwidth was not the issue. The problem stalled right there, and for a while we didn't know what to do with RocketMQ.

Meanwhile a colleague testing the query interface brought more bad news: it topped out at about 12k, and the culprit was again bandwidth; even with compressed strings, the bandwidth was saturated.

After mulling it over for a long time, we finally settled on a fairly conventional approach: data partitioning. If a single RocketMQ instance can't deliver the performance, scale horizontally and add more RocketMQ instances.

Different candidates are routed to different MQ instances, and Redis can partition data as well; conveniently, Redis's hash-slot architecture supports exactly this.

The remaining question was how to partition the candidates. At first we considered taking the remainder of the ID, but found that this produced an extremely uneven distribution.

We then tweaked it slightly: partitioning by the remainder of the last few digits of the exam certificate number gave a much more even distribution (a sketch follows). With the rough plan in place, and after another burst of work, we resumed stress testing.
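A minimal sketch of that routing rule, assuming 4 partitions and the last four digits of the certificate number; the constant and the digit count are illustrative, not the project's actual values:

```java
public final class PartitionRouter {

    private static final int PARTITIONS = 4;   // one RocketMQ instance / Redis group per partition

    // Route a candidate by the tail digits of the exam certificate number; unlike a
    // sequential ID, the tail digits spread candidates across partitions fairly evenly.
    public static int partitionOf(String certNo) {
        String tail = certNo.substring(Math.max(0, certNo.length() - 4));
        return Integer.parseInt(tail) % PARTITIONS;
    }
}
```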

A small surprise

This stress test was again disappointing: both TPS and QPS went down rather than up. We kept investigating with Arthas.

We eventually found that when accessing Redis via hash slots, the node first hit has to compute the key's slot and then redirect the request to the node that owns it, which cost 20%-30% of the performance.

So we changed the code again: compute the hash slot in Java, in memory, and go directly to the Redis node owning that slot (a sketch follows). With that, QPS hit an impressive 20k and TPS reached about 12k.
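A minimal sketch of computing the slot on the client side. Redis Cluster maps a key to CRC16(key) mod 16384 (the CRC16/XMODEM variant); hash-tag handling and the slot-to-node lookup are omitted here, and the class name is illustrative:

```java
import java.nio.charset.StandardCharsets;

public final class SlotRouter {

    // Compute the Redis hash slot for a key so the request can go straight
    // to the node that owns it, avoiding the redirect hop.
    public static int slotOf(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    // CRC16/XMODEM: polynomial 0x1021, initial value 0x0000.
    private static int crc16(byte[] bytes) {
        int crc = 0x0000;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
            }
        }
        return crc & 0xFFFF;
    }
}
```

With the slot in hand, the service looks up its own slot-to-node mapping and talks to that Redis instance directly.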

Not bad, but it was still only 20k, and pushing higher ran into yet another bottleneck.

This time, with experience on our side, we found the problem immediately: it had moved to nginx, and it was the same old problem, bandwidth!

Knowing the cause made it easier to fix: we placed two nginx nodes on the one physical machine with large bandwidth, exposed them through a VIP proxy, and had candidates access different addresses according to their partition.

Stress testing

I've lost count of which round of stress testing this was, but this time the results were quite satisfying: the main query interface reached an impressive 40k QPS, and some interfaces even hit 60k or more.

Victory was in sight; the only problem was that TPS wouldn't budge past 14k.

Why? Checking the main performance metrics of each Redis node showed that Redis had not hit its limits (uplink bandwidth at 65%, CPU usage only around 50%).

The same was true for MQ, so the problem most likely lay in the Java services. Analysis showed their CPUs were basically pegged at 100% and each node's maximum connection count was essentially exhausted, while bandwidth still had headroom.

The reason is that with Spring Boot's embedded Tomcat, no matter how we configured it, the maximum number of connections it could effectively sustain was only a little over 1k.

From that, a very simple formula follows: if a node holds about 1,000 connections and a request takes 100 ms, then 1000 × 1000 ms ÷ 100 ms = 10,000 requests per second.

In other words, a single node can support at most about 10k concurrent requests per second. Our save interface's response time was 300 ms, so each node topped out at a bit over 3k, and with 4 partitions that explains where the 14k TPS ceiling came from.

The next step was to optimize the interface's response time, step by step, optimizing everything that could be optimized, until it was finally under 100 ms.

By that logic, TPS should now be able to reach an impressive 40k.

Stress test again

Apprehensive and excited, we entered stress testing again, and TPS came out at an impressive 25k.

We were genuinely thrilled at the time, but once we calmed down it seemed odd: by the logic above, TPS should have been able to reach around 36k.

To find whatever hidden pitfalls remained (for fear of nasty surprises after launch), we analyzed further and finally found some clues in the logs.

Connection timeouts were being reported when requests connected to Redis, and about 0.01% of requests had response times well above the average.

So we turned our attention to the number of Redis connections, ran another round of monitoring, and finally found the answer in the business implementation itself.

A single call to the save-preferences API performs five Redis operations, acquiring a lock, fetching candidate information, fetching preference information, updating preference information, and releasing the lock, plus the Redis transaction.

By contrast, the query interface performs only two operations, so for the save operation a single Redis node can sustain at most a bit over 6k concurrent requests.

To verify this, we removed the Redis transaction and the lock operations and ran a control-group stress test; concurrency did improve as expected (there is, in fact, one more thing to worry about: lock-acquisition timeouts). A sketch of the lock usage is below.
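For reference, a minimal sketch of a per-candidate lock around a save, using SET NX EX to acquire and a compare-and-delete Lua script to release; key names, TTL, and the token scheme are illustrative, not the project's actual implementation:

```java
import java.util.Collections;
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public final class CandidateLock {

    private static final String RELEASE_SCRIPT =
            "if redis.call('get', KEYS[1]) == ARGV[1] then "
          + "return redis.call('del', KEYS[1]) else return 0 end";

    // Returns a token on success, or null if another request already holds the lock.
    public static String tryLock(Jedis jedis, String candidateId, int ttlSeconds) {
        String token = UUID.randomUUID().toString();
        String reply = jedis.set("lock:preference:" + candidateId, token,
                SetParams.setParams().nx().ex(ttlSeconds));
        return "OK".equals(reply) ? token : null;
    }

    // Only the holder of the token may delete the lock (compare-and-delete in Lua).
    public static void unlock(Jedis jedis, String candidateId, String token) {
        jedis.eval(RELEASE_SCRIPT,
                Collections.singletonList("lock:preference:" + candidateId),
                Collections.singletonList(token));
    }
}
```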

Ready to call it a day

At this point the project's high-concurrency requirements looked essentially satisfied; what remained was polishing the details.

So we ran one more stress test, and this time it lived up to expectations: both of the critical interfaces met their targets.

After that we really got into implementing the business logic, and once the full feature set was done, a month and a half plus two weeks of overtime later, it finally went into testing.

Problems found during testing

Once functional testing was done, the first problem appeared in Redis again: one of the Redis nodes was abruptly killed under high concurrency.

Because we were using the hash-slot approach, if a node goes down, recomputing and reassigning slots during recovery is troublesome and slow, and if we don't recover it, concurrency suffers badly.

So after discussion we decided to partition Redis manually ourselves, using the same partitioning logic as for MQ.

This does affect the management side, though: its queries are list queries, so it has to pull data from several nodes at the same time.

So the management side got its own scheduling logic for fetching data across partitions.

The second problem was the performance of the management interfaces. The requirements there are nowhere near the candidate side's, but the pages still couldn't cope: they are paginated, 10 records at a time, and each record has to be stitched together from several pieces of data.

With the earlier experience, though, we quickly saw the issue: the number of Redis round trips. To reduce them, we used pipelining to batch multiple commands together (a sketch follows).
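A minimal sketch of batching those per-record reads with a Jedis pipeline, so a page of 10 records costs one round trip instead of ten; key names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

public class AdminPageLoader {

    public List<String> loadPreferences(List<String> candidateIds) {
        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            Pipeline pipeline = jedis.pipelined();
            List<Response<String>> pending = new ArrayList<>();
            for (String id : candidateIds) {
                pending.add(pipeline.get("preference:" + id));   // queued locally, not yet sent
            }
            pipeline.sync();                                     // one round trip for the whole page
            List<String> page = new ArrayList<>();
            for (Response<String> r : pending) {
                page.add(r.get());
            }
            return page;
        }
    }
}
```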

Going live

With everything in place, we were ready to go live. A word on the deployment layout: 8+4+1+2 Java service nodes, that is 8 for the candidate side, 4 for the management side, 1 for scheduled tasks, and 2 for the consumer services.

3 nginx nodes, covering the candidate side and the management side.

4 RocketMQ nodes.

4 Redis nodes.

2 MySQL instances, one master and one slave, plus one scheduled-task service.

1 Elasticsearch node.

In the end the launch was a success. There were a few isolated issues in production, but nothing dangerous overall; the real-world concurrency reported back was far below our system's limits, and the horizontal-scaling plan we had prepared was never needed. We had rehearsed node failures and partition expansion countless times, generally restoring the system within 10 minutes, and thankfully none of it ever had to be used.

Finally

Looking back after the project was done, I felt we had drifted further and further from the standard high-concurrency playbook you hear about in interviews; honestly, that was forced on us. The main reason for the deviation, I think, is that this project's demands on data accuracy were even higher than its demands on concurrency.

But this project was a real baptism, and I gained a lot from it: I learned how to monitor service performance metrics, and I deepened my understanding of the middleware and the various technologies involved.

It was exhausting, but it also made me happy; after all, there is real satisfaction in tracking down performance bottlenecks and meeting the project's requirements with limited resources.

As an aside: although the project restored the company's image in the province and earned praise from head office and the leadership, in the end that was all it amounted to, and nothing substantive came of it for us, which was the main reason I eventually left the company. Still, looking back, the experience was truly memorable and helped me a great deal in my later work.

I went from writing plain CRUD to being able to actually solve interface performance problems. Where I once would have been at a loss, or just tried my luck, I can now follow the clues and work my way toward an optimization plan.

I don't know whether my experience on this project resonates with you, but I hope you find this article helpful.

Author | Blue Bird 218

Source | juejin.cn/post/7346021356679675967
