A painful Java post-mortem: how I improved the concurrency of one API endpoint (very practical)

I was recently developing a check-in endpoint. All it really has to do is run a few checks and save the check-in result, and we expect more than 1,000 people to check in online during the same period. But the stress test on the first version was terrible: it could only handle a dozen or so concurrent requests. I nearly spat out my water. How can such a simple endpoint take this long? I had estimated that each request would finish within 100 ms, which should allow a few hundred concurrent requests. And so my optimization journey began.

Look at the main code first. After the controller receives the parameters, it calls the service method and returns. The service method just queries the activity, checks the activity time, accumulates points, and saves the result. I really couldn't see why concurrency was so poor, and all I could do was shake my head in frustration. The code looked fine to me, so was something wrong with the machine it was deployed on? But those are all decent servers.

@Override
  public RelationActivityRecord clock(String userId, ClockDto clockDto) {
    // Check whether the activity is currently valid
    RelationActivity relationActivity =
        relationActivityService.getRelationActivity(clockDto.getActivityId(),
                clockDto.getActivityType());
    Date now = new Date();
    if (now.before(relationActivity.getStartTime())) {
      throw new BadRequestException(ErrorEnum.WRONG_ARGUMENTS, "The activity has not started yet, please be patient~");
    }
    if (now.after(relationActivity.getEndTime())) {
      throw new BadRequestException(ErrorEnum.WRONG_ARGUMENTS, "The activity has ended; you can view your work on the home page~");
    }
    // Points earned by this check-in
    int score = 0;
    // Midnight (00:00) of the current day
    Date toDay = DateUtil.beginOfDay(now);
    if (Objects.equals(clockDto.getActivityType(), 1)
        && Objects.equals(clockDto.getClockType(), 1)) {
      // Query the user's cumulative check-in statistics
      CurrentClockRecordVo clockRecordStat = this.currentRecord(userId,
              clockDto.getActivityId(), clockDto.getActivityType());
      // The first check-in of the day earns one point (this condition requires
      // a previous check-in before today and fewer than 100 check-ins in total)
      if (clockRecordStat.getLastTime() != null && clockRecordStat.getLastTime().before(toDay)
              && clockRecordStat.getNum() < 100) {
        score = 1;
      }
    } else if (Objects.equals(clockDto.getActivityType(), 1)) {
      // Other kinds of check-in for this activity type...
      throw new BadRequestException(ErrorEnum.WRONG_ARGUMENTS, "This activity is not open yet");
    } else {
      throw new BadRequestException(ErrorEnum.WRONG_ARGUMENTS, "Unknown activity");
    }
    // Save the check-in record
    RelationActivityRecord newClockRecord = new RelationActivityRecord();
    newClockRecord.setStudentId(userId);
    newClockRecord.setActivityId(clockDto.getActivityId());
    newClockRecord.setActivityType(clockDto.getActivityType());
    newClockRecord.setClockType(clockDto.getClockType());
    newClockRecord.setCreateTime(now);
    newClockRecord.setContent(clockDto.getContent());
    newClockRecord.setEnergy(score);
    newClockRecord.setVideoId(clockDto.getVideoId());
    newClockRecord.setIntro(clockDto.getIntro());
    this.save(newClockRecord);
    // If points were earned, add them to the student's account
    if (score > 0) {
      activityStudentService.addScore(clockDto.getActivityId(), userId, score);
    }
    return newClockRecord;
  }

Calm down and take another sip of water. When you hit a problem, look for external causes first. Were Spring Boot's defaults for the embedded Tomcat, such as the maximum number of connections, too small? The project had not overridden them. But I remembered a video from a few days earlier saying that the default maximum number of connections is 8192, so the unchanged configuration should be more than enough. To be safe, I wrote a test endpoint that executes a few arbitrary statements without touching the database and stress-tested it. At first it handled more than 1,000 concurrent requests; then, as queued requests piled up, concurrency gradually dropped to around 600. That is normal and reasonable: requests that cannot be processed right away have to queue, and there is no way around that unless the configured number of threads is increased.

Then I suddenly remembered that the project does not run on Tomcat at all; it had been switched to Undertow. Was Undertow the problem? But according to the search engine, Undertow performs even better than Tomcat and handles at least 8,000 concurrent connections by default, so that seemed to be ruled out too.

Could the JVM heap be too small, so that garbage collection was causing stalls? The GC count did look a bit high. The startup configuration had been copied from another project and allowed only 512 MB, so I immediately changed it to 4 GB and tried again. The actual result did not change. In fact, although there were many GCs, the total time they took added up to less than 3 s, so GC could be ruled out. (There is still an open question here: even with the larger heap, the project goes through 3 full GCs and several young GCs every time it starts. I can't yet work out why, so if any experienced readers can suggest where to look, please do.)
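To check how much time GC really adds up to, the JVM's own management beans can be read from inside the process. The following is only a minimal sketch: the class name GcTimeReport is made up for illustration, and it reports on the JVM it runs in, so in practice the same calls would have to be made from within the service (or the equivalent figures read externally with a tool such as jstat).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: prints per-collector GC counts and accumulated GC time. */
class GcTimeReport {

    /** Sums the accumulated GC time in milliseconds across all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 means "undefined" for this collector
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("total GC time %d ms out of %d ms uptime%n",
                totalGcMillis(), uptimeMillis);
    }
}
```

Comparing the total GC time with the JVM uptime makes it easy to see whether GC could plausibly explain the slow requests.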

From this I concluded that the most likely cause was the endpoint itself: each request takes too long, so throughput cannot go up.
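The link between per-request time and achievable throughput can be made concrete with a rough calculation. The sketch below assumes that each request occupies one worker thread for its whole duration; the pool size of 64 and the latencies are illustrative numbers, not values measured in this project.

```java
/** Back-of-the-envelope throughput estimate for a fixed pool of worker threads. */
class ThroughputEstimate {

    /**
     * Upper bound on requests per second when every request holds one of
     * workerThreads threads for latencyMillis milliseconds.
     */
    static double maxRequestsPerSecond(int workerThreads, double latencyMillis) {
        if (workerThreads <= 0 || latencyMillis <= 0) {
            throw new IllegalArgumentException("workerThreads and latencyMillis must be positive");
        }
        return workerThreads * 1000.0 / latencyMillis;
    }

    public static void main(String[] args) {
        int workerThreads = 64; // hypothetical pool size, not taken from the project
        for (double latencyMillis : new double[] {100, 300, 5000}) {
            System.out.printf("latency %.0f ms -> at most %.1f requests/s%n",
                    latencyMillis, maxRequestsPerSecond(workerThreads, latencyMillis));
        }
    }
}
```

Under these assumptions, a request that takes seconds instead of 100 ms cuts the ceiling from hundreds of requests per second to roughly a dozen, whatever the server defaults are.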

Apart from a few conditionals and object conversions, the code only queries, inserts into, and updates the database. Could object conversion be the problem? Object conversion happens everywhere in the codebase, so that seemed unlikely. The database looked like the real suspect. But to speed up the query I had already added a composite index, so it made no sense for it to be this slow.

So let's probe it with SkyWalking!

Well... Under stress testing, about 70 ms went to resource allocation and waiting, but executing the endpoint itself took 308 ms. Yet SkyWalking only showed the time spent in the query statements, and those took just a few milliseconds.

So I added p6spy to see how long each statement takes to execute.

Shocking: the insert and update statements actually take two hundred milliseconds.

What is going on? I thought it over and couldn't figure it out, so I turned to search engines and large language models.

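Point 2 in the answer caught my attention. I had not expected this: my thinking had been that this endpoint could tolerate the data inconsistency that comes with having no transaction, and that skipping the transaction would improve concurrency. But since that was the advice, I tried adding a transaction. You never know until you try, and the result startled me: with the transaction added, the time for those statements dropped to basically 0.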

I searched again for the reason. Some people say that without a transaction every insert is written to disk on its own, whereas inside a transaction the writes are accumulated and written to disk together, which is more efficient. Perhaps this could be optimized further by accumulating more writes per batch, but that has to be weighed against the cost: the one request that triggers the disk write will be slow.

That raised another question: was my transaction too large? After all, it queries first and then inserts and updates, and those steps could be separated. So I tried splitting them: the query runs without a transaction, and the insert and update are extracted into a separate method with a transaction. Strangely, concurrency got worse. Something else was clearly going on. I also noticed that the stress test had been using the same user all along. In reality the requests come from different users, and requests for the same user may compete more fiercely for resources, so I changed the test to generate a random userId for each request.

At this point I felt that SkyWalking 8.x really couldn't give a detailed enough breakdown of where the time was going. I remembered that SkyWalking 9.x had come out not long before, so I downloaded it to try. I started with the latest version, but it would not start: the window kept flashing and closing, and changing the port it reported as occupied did not help. A big pitfall. I then went straight to version 9.0, which started normally. Version 9.0 really is different: the interface layout is comfortable and the monitoring data is richer, which is great.

Better still, the timing breakdown in the trace is finally much more complete! Database connection acquisition now shows up. Looking at the earlier run without a transaction, you can see that every query and every write had to obtain a database connection separately! The insert statements are now monitored properly too.

With the query outside the transaction and the insert and update inside one, three database connection acquisitions are needed. The first is for the select 1 check that verifies the database is reachable, and that check can be removed.

With the query, insert, and update all in one big transaction, a database connection is obtained only once, which is naturally more efficient.

At this point it was basically settled: putting all the operations in a single transaction is more efficient because it reduces the number of database connection acquisitions.
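In plain JDBC terms, "one transaction" means one connection, auto-commit switched off, and a single commit at the end. The following is only a minimal sketch of that pattern, not the project's actual code: the class name, the table names, and the column names are all hypothetical. In a Spring service like the one above, the same effect is normally obtained by marking the service method with @Transactional.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical sketch: insert and update share one connection and one commit. */
class ClockInTransactionSketch {

    static void saveRecordAndAddScore(Connection conn, String studentId,
                                      long activityId, int score) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // the statements below now belong to one transaction
        try (PreparedStatement insert = conn.prepareStatement(
                     // table and column names are assumptions for illustration
                     "INSERT INTO relation_activity_record (student_id, activity_id, energy) "
                             + "VALUES (?, ?, ?)");
             PreparedStatement update = conn.prepareStatement(
                     "UPDATE activity_student SET score = score + ? "
                             + "WHERE activity_id = ? AND student_id = ?")) {
            insert.setString(1, studentId);
            insert.setLong(2, activityId);
            insert.setInt(3, score);
            insert.executeUpdate();

            if (score > 0) {
                update.setInt(1, score);
                update.setLong(2, activityId);
                update.setString(3, studentId);
                update.executeUpdate();
            }
            conn.commit(); // both writes are committed together
        } catch (SQLException e) {
            conn.rollback(); // undo the partial work if anything failed
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit); // restore the caller's setting
        }
    }
}
```

Because the caller passes in a single Connection, the pool is asked for a connection once per request instead of once per statement.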

In addition, the SkyWalking trace showed that getConnection was particularly slow: obtaining a connection could take several seconds.

The connection pool is Druid with its default parameters. I checked, and the default maximum number of connections is not high, so the problem is probably here. Let's try raising it. The PG database is currently configured for 250 connections. In a production environment where the database is dedicated to this service, the maximum number of connections in both the YML and the PG configuration could be set higher. For now I tried 200. If concurrency improves, this is another point that can be optimized. The actual value depends on the situation, of course; set it too high and the database will buckle.
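Why a small pool can turn getConnection into a wait of several seconds can be estimated with simple arithmetic. The sketch below is a simplified model with made-up numbers: it assumes all callers arrive at the same moment and each holds its connection for exactly the same time, and the pool size of 8 and hold time of 200 ms are illustrative, not measurements.

```java
/** Simplified model of queueing for a connection pool (illustrative numbers only). */
class PoolWaitEstimate {

    /**
     * Worst-case wait for a connection when callers requests arrive at once,
     * the pool has poolSize connections, and each is held for holdMillis.
     * Callers are served in rounds of poolSize; the last caller waits for
     * all earlier rounds to finish.
     */
    static long worstCaseWaitMillis(int callers, int poolSize, long holdMillis) {
        if (callers <= 0 || poolSize <= 0 || holdMillis < 0) {
            throw new IllegalArgumentException("callers and poolSize must be positive");
        }
        long roundsBeforeLastCaller = (callers - 1) / poolSize;
        return roundsBeforeLastCaller * holdMillis;
    }

    public static void main(String[] args) {
        // 200 simultaneous callers, 200 ms per connection hold (hypothetical)
        System.out.println(worstCaseWaitMillis(200, 8, 200));   // prints 4800
        System.out.println(worstCaseWaitMillis(200, 200, 200)); // prints 0
    }
}
```

In this model the last of 200 callers waits almost five seconds behind a pool of 8, while a pool of 200 removes the wait entirely, which is consistent with the direction of the change tried here. A real pool under real load will behave less neatly.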

With 200 concurrent requests per second sustained for 30 seconds, the result was not bad: a little over 700 ms per request on average, with concurrency ending up at about 250.

The Undertow and database connection settings need to be tuned gradually against the actual workload to get the best out of a single machine.

To summarize:

Low concurrency can come from configuration problems as well as code problems.
When adding a transaction, measure the effect, because it is sometimes not what you expect. At first I thought the transaction scope was inappropriate, yet it turned out to perform better than having no transaction.
Good tools matter enormously, SkyWalking 9.0 for example.
Analyze the problem from multiple angles.
All in all, this experience showed that even basic single-machine concurrency has plenty of pitfalls.