
BIGO's Apache Pulsar Optimization Series: Bookie Load Balancing


Background

Built on its audio and video processing technology, global real-time audio and video transmission technology, artificial intelligence technology, and CDN technology, BIGO has developed a range of audio and video social and content products, including Bigo Live and Likee. BIGO now has 100 million monthly active users worldwide, and its products and services cover more than 150 countries and regions. With rapid business growth, the volume of data carried by BIGO's message queue platform has grown exponentially, and downstream services such as online model training, online recommendation, real-time data analysis, and real-time data warehousing place ever higher demands on message latency and stability.

Apache Pulsar is a top-level project of the Apache Software Foundation: a next-generation cloud-native distributed message streaming platform with strong consistency, high throughput, and low latency. Pulsar uses a layered architecture that separates compute from storage; the storage layer is provided by the Apache BookKeeper project, and a single BookKeeper server node is called a bookie.

To achieve read-write isolation, each bookie is assigned a ledger disk and a journal disk at deployment time. When data is written, it is appended sequentially to the journal disk, and the write can be acknowledged as soon as it is persisted there, which gives low-latency writes.

It follows that the journal disk needs high write speed but little storage space, while the ledger disk needs large storage space but only moderate write speed. In theory, an SSD would give better write performance for the journal disk, and a large-capacity HDD would suffice for the ledger disk. However, the data written by a message queue is all hot data: the write volume is large, the retention does not exceed 7 days, and the flash medium used by SSDs has a limited number of program/erase cycles. So in practice, for cost reasons, we do not use SSDs for journal disks. We use large-capacity HDDs throughout: each bookie has one disk as its ledger disk, and three bookies share one disk as their journal disk.
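For illustration, this disk layout maps onto the standard bookkeeper.conf keys roughly as follows (a sketch; the mount points are hypothetical, with three bookies on one machine pointing their journals at a single shared HDD while each keeps an exclusive ledger disk):

    # bookie1's bookkeeper.conf: journal on the shared HDD, exclusive ledger disk
    journalDirectories=/mnt/journal-shared/bookie1
    ledgerDirectories=/mnt/ledger-hdd1/ledgers

    # bookie2's bookkeeper.conf
    journalDirectories=/mnt/journal-shared/bookie2
    ledgerDirectories=/mnt/ledger-hdd2/ledgers

    # bookie3's bookkeeper.conf
    journalDirectories=/mnt/journal-shared/bookie3
    ledgerDirectories=/mnt/ledger-hdd3/ledgers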

Note: even though multiple bookies share one disk as the journal disk, the journal disk's space usage is still extremely low, which wastes storage capacity. We therefore plan to remove the journal disk to free up more space for the scarce ledger disks, which can also further alleviate high write latency.

Challenge

As service traffic keeps growing, pressure on the cluster increases, which shows up directly as rising P99 write latency.

However, high write latency is not necessarily caused by insufficient hardware resources; it can have various other causes. For example, the traffic of a single topic partition may be too large, reaching hundreds of MB/s, while the maximum sequential write speed of a single HDD is only a little over 100 MB/s, and the journal disk is still shared by three bookies. In that case, every partition served by that journal disk will certainly suffer high write latency. The fix is simple: increase the number of topic partitions so the traffic is spread across multiple journal disks.

There may also be a problem in the system design itself: write pressure is distributed unevenly across journal disks. We collected journal disk write throughput statistics, with the horizontal axis being the throughput interval and the vertical axis being the number of journal disks falling into that interval, as shown in the following figure:

[Figure: distribution of journal disk write throughput (x-axis: throughput interval; y-axis: number of journal disks)]

Instance: instance4 device:sdw, value: 154.90 MB/s

Instance: instance7 device:sdw, value: 130.20 MB/s

Instance: instance5 device:sdw, value: 124.98 MB/s

Instance: instance1 device:sdt, value: 120.98 MB/s

Instance: instance15 device:sdw, value: 119.92 MB/s

......

Instance: instance10 device:sds, value: 47.44 MB/s

Instance: instance13 device:sds, value: 40.51 MB/s

Instance: instance12 device:sdu, value: 37.99 MB/s

Instance: instance14 device:sds, value: 23.67 MB/s

The write throughput load of the journal disks is clearly quite uneven. The most stressed journal disk reaches about 150 MB/s, close to the performance ceiling of an HDD, while many journal disks sit below 50 MB/s. This is our basic motivation for bookie write load balancing.

Architectural design

When BookKeeper creates a ledger, it selects bookies according to a placement policy. For example, RackawareEnsemblePlacementPolicy forces different replicas onto different racks for rack-level disaster recovery, while RegionAwareEnsemblePlacementPolicy forces replicas onto racks in different regions for region-level fault tolerance. Even after the placement policy has filtered the candidates, there are still quite a few candidate bookies for the new ledger's ensemble, and that is where a load balancing algorithm has room to act.
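As a minimal sketch of how a BookKeeper client selects a placement policy (the ZK address and ledger parameters here are illustrative, not our production values):

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;
    import org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy;
    import org.apache.bookkeeper.conf.ClientConfiguration;

    public class PlacementExample {
        public static void main(String[] args) throws Exception {
            ClientConfiguration conf = new ClientConfiguration();
            conf.setZkServers("zk1:2181"); // illustrative ZK address
            // Force the replicas of each ledger onto bookies in different racks.
            conf.setEnsemblePlacementPolicy(RackawareEnsemblePlacementPolicy.class);

            BookKeeper bk = BookKeeper.forConfig(conf).build();
            // ensemble=3, write quorum=3, ack quorum=2: the policy picks 3 bookies,
            // and the remaining candidates are where a load balancer has room to act.
            LedgerHandle lh = bk.createLedger(3, 3, 2, DigestType.CRC32,
                    "password".getBytes());
            lh.close();
            bk.close();
        }
    }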

We only add load balancing to the ordinary write path, not to reads or recovery writes. Here's why:

• Read load balancing

On the one hand, read latency is currently not a serious problem; on the other hand, when reading a ledger you can only choose among its three replicas (sometimes only two), so there is little room for selection. We therefore do not consider read-side load balancing.

• Recovery

During auto recovery, the default behavior is that the bookie that grabs the replication task copies the data to itself, so adding a load balancing mechanism there is not advisable.

Comparison with the broker load balancer

BookKeeper currently has no load balancing framework, unlike the broker, which already has an embedded load balancer whose algorithms users can replace with their own designs. We therefore had to build a new framework before we could implement load balancing. Below we compare against the Pulsar broker's load balancing mechanism to guide the design of BookKeeper's load balancing architecture.

Load unit

• The broker's load unit is the bundle

A bundle's life cycle is long-lived, so the load it exerts can be considered persistent. After the broker uses AvgShedder to make a load balancing decision, it rarely needs to decide again, because traffic is stable over a given period.

Note: AvgShedder is a load balancing algorithm implemented internally at BIGO; the code has been contributed to the community:

https://github.com/apache/pulsar/pull/18186

A bundle's long-term and short-term load information is aggregated over time; the load information includes the write/consume message rates and throughput.

• The bookie's load unit is the ledger

A ledger's life cycle is relatively short (though at least 10 min, as we will see later), and once closed it exerts no further write pressure. Creating ledgers is an ongoing affair for bookies, requiring constant decisions. A ledger's load information can include its estimated space size and estimated write throughput, which decision makers can use.

Resource units

• The broker's resource unit is a single machine/node

Because only one broker is deployed per machine, the machine's various load metrics (CPU usage, NIC usage, and so on) can be treated as that broker's own load information.

• The bookie's resource unit should be a disk

Because a journal disk is actually shared by multiple bookies, what matters for the journal disk is its write throughput and write latency; whether writes from different bookies on the same journal disk are balanced among themselves does not matter.

For the ledger disk, each bookie has one exclusively, and the main concern is its disk space usage.

When selecting a bookie, we need to consider both the write throughput of its journal disk and the space usage of its ledger disk, so a single bookie's load information can be represented as a pair (2-tuple).
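For illustration, such a pair could be modeled like this (the names are hypothetical, not taken from the actual patch):

    // One bookie's load: the write throughput of its (possibly shared) journal
    // disk, plus the space usage ratio of its exclusive ledger disk.
    public record BookieLoad(double journalWriteMBps, double ledgerDiskUsage) {}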

Frequency of decision-making

• Brokers basically only need to make a decision when brokers join or exit

Of course, if the load balancing algorithm is poor or badly configured, frequent bundle unloading and loading may occur, causing client traffic jitter; that is a problem of the broker load balancing algorithm and will not be discussed further here.

• Bookies need to make decisions constantly

A topic is composed of a list of ledgers; when a threshold is reached, the broker actively closes the current ledger and creates a new one for the topic. The cluster therefore needs to create new ledgers constantly.

The frequency of decisions will affect the design of the architecture, and specifically, whether we choose centralized or distributed decision-making.

- If each broker makes its own decisions (distributed decision-making): because every broker sees the same data and runs the same algorithm, traffic will skew toward a small set of bookies, and the load balancing result may be poor. This can be alleviated at the algorithm level by introducing a randomization factor (see the sketch after this comparison).

- If decision-making is centralized, there is no skew problem.

The higher the frequency of load balancing decisions in a cluster, the riskier a distributed decision-making architecture becomes.
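A minimal sketch of such a randomization factor, reusing the hypothetical BookieLoad pair from earlier: each broker picks bookies at random, weighted by remaining journal-disk headroom, so that brokers running the same algorithm on the same data do not all converge on the same least-loaded bookies.

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    public class RandomizedPicker {
        // Weighted random pick over a non-empty candidate list: bookies with more
        // journal-disk headroom get a proportionally higher probability, so the
        // independent decisions of many brokers do not all land on one bookie.
        static int pickBookie(List<BookieLoad> candidates, double diskMaxMBps) {
            double[] weights = new double[candidates.size()];
            double total = 0;
            for (int i = 0; i < weights.length; i++) {
                // remaining headroom on the journal disk, floored at a small epsilon
                weights[i] = Math.max(diskMaxMBps - candidates.get(i).journalWriteMBps(), 1e-3);
                total += weights[i];
            }
            double r = ThreadLocalRandom.current().nextDouble(total);
            for (int i = 0; i < weights.length; i++) {
                r -= weights[i];
                if (r < 0) {
                    return i;
                }
            }
            return weights.length - 1;
        }
    }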

We collected statistics from an online cluster. The figure below shows a traffic peak; each column is the number of ledgers created within 1 s. Even without a cluster restart, the creation rate routinely exceeds a dozen ledgers per second.

[Figure: ledgers created per second at peak traffic]

The graph below shows the creation frequency at off-peak traffic, which is slightly lower.

[Figure: ledgers created per second at off-peak traffic]

This decision frequency is relatively low, so we ultimately chose a distributed decision-making architecture in which every broker makes its own decisions.

How fast the load changes

• The broker can treat traffic as constant over a given period

As mentioned earlier, after the broker uses AvgShedder to make a load balancing decision, there is basically no need to decide again, because traffic is essentially stable over a given period, fluctuating periodically on a daily cycle.

• The bookie's load changes constantly

The bookie's load unit is the ledger, and a ledger has a life cycle: when a threshold is reached, the ledger is closed and a new one is created. A partition is served by exactly one ledger at any point in time, so the pressure a ledger exerts is the pressure of a partition. Whether a journal disk serves one ledger more or one ledger fewer therefore makes an obvious difference; a journal disk may even be almost fully occupied by a single partition's traffic.

If a bookie's load changed too rapidly, load balancing would be impossible, because the load data just collected might already be stale.

We collected logs (log data starting from 11:00 a.m.), measured the interval between two consecutive ledger creations for each topic/partition, and obtained the distribution of ledger lifetimes shown below; the x-axis is the ledger lifetime, the y-axis is the number of ledgers. (The figure only shows data below 800 s; showing more would make the chart unreadable.)

[Figure: distribution of ledger lifetimes (x-axis: lifetime in seconds; y-axis: number of ledgers)]

Note: because the number of ledgers with a lifetime of 600 s is significantly higher than at other lengths, the y-axis below uses a log scale.

[Figure: the same distribution with a log-scale y-axis]

There must be a reason why so many ledgers live exactly 600 s, i.e. 10 min, and why none live less than 600 s.

Looking at the broker configuration, managedLedgerMinLedgerRolloverTimeMinutes defaults to 10 min; that is, a ledger takes at least 10 min before it can roll over to the next ledger.

Ledger rollover is triggered once managedLedgerMinLedgerRolloverTimeMinutes has elapsed and one of the other three thresholds is hit.

[Figure: the rollover threshold checks in the broker code]

So a ledger's life cycle is not very short: its traffic pressure on a bookie lasts at least 10 min (ignoring ledgers closed by bookie rolling restarts). If the bookie's load still seems to change too fast, managedLedgerMinLedgerRolloverTimeMinutes can be raised, say to 15 min.
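For reference, these rollover thresholds live in the broker configuration; a sketch with Pulsar's default values (not our production settings):

    # a ledger must live at least this long before it may roll over
    managedLedgerMinLedgerRolloverTimeMinutes=10
    # after the minimum, rollover fires on whichever threshold is hit first:
    managedLedgerMaxLedgerRolloverTimeMinutes=240
    managedLedgerMaxEntriesPerLedger=50000
    managedLedgerMaxSizePerLedgerMbytes=2048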

Of course, although a ledger's life cycle is a core factor in whether a journal disk's write pressure is stable, we still have to look at the real fluctuation of journal disk write throughput: a journal disk serves more than one ledger, and in a real environment with many ledgers changing dynamically, the journal disk's write throughput does not vary with the period of any single ledger's life cycle.

As shown below:

[Figure: journal disk write throughput over time]

Since Prometheus scrapes monitoring data every 15 s, we can only estimate at that granularity, and the data shows that throughput within a one-minute window is still reasonably stable. Load balancing bookies is therefore feasible.
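With node_exporter, the per-disk write throughput behind a chart like this can be derived with a PromQL query along these lines (the device label is illustrative):

    rate(node_disk_written_bytes_total{device="sdw"}[1m])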

Load information reporting method and storage location

broker

Each broker reports its load information to ZK. However, every broker and every bundle corresponds to a znode; once the number of bundles exceeds 1k, and given that every broker currently reads all of this information, each pull puts considerable pressure on ZK. This is greatly alleviated by lowering the reporting and pulling frequency (once every 3 min).

The broker's latest load balancer has migrated the load data into a topic for storage. The load balancer is not a core module and must not abuse ZK's resources: when ZK latency rises, the whole cluster faces the risk of going down completely.

bookie

Should the bookie report the data itself, or should an external component pull it? And where should it go: ZK, or the brokers?

As mentioned earlier, the load data consists only of disk usage and write throughput/latency. The collection logic is simple and does not even require the bookie's participation: all we really need is the load of every disk plus the mapping between disks and bookies. So there is a lot of room for choice.

Here are some possible ways to do this:

1. The most traditional way: report to ZK.

On the one hand, frequent data updates would pressure ZK, and both the broker and bookie clusters depend strongly on ZK: if ZK fails, the whole cluster goes down, so this introduces new risk. Moreover, ZK read latency can reach the second level (especially when replication is involved, such delay may be unacceptable for bookie load balancing). This option can effectively be ruled out.

2. Each bookie writes its own load data to a corresponding ledger, and the decision-maker reads these ledgers.

Note: a ledger can have only one writer, so multiple bookies cannot write to the same ledger; each bookie writes to its own ledger.

Advantages: it leverages BookKeeper's own excellent architecture, with latency down to a few milliseconds; and with multiple decision makers, they can all read the ledgers, making good use of the read/write cache.

Concern: each bookie gets its own ledger, so hundreds of ledgers are added just to record load information, which may increase the pressure on cluster metadata.

3. Each bookie maintains its own load data only in memory and provides an interface for external access.

Advantages: the load data does not need to be stored at all; it changes quickly (see the load-change-speed charts above), so persisting it has little value.

Concern: if the interface is exposed to external calls, will latency rise under a high request rate? With distributed decision-making, for example, every broker must send requests to every bookie, and in a large cluster this may mean high request pressure and high latency.

To keep the cluster architecture simple, and because our cluster is not very large, this is the approach we ultimately took (a minimal sketch appears after this list).

4. An external component scrapes the disk load of every machine in the cluster, maps disks to bookies using the information in ZK, and provides an interface for external access.

Advantages:

- Minimal pressure and resource consumption: the component only collects per-machine disk load, and each broker only needs a single request to the component, so the latency should also be acceptable.

- It's also the easiest to implement.

Disadvantages:

- It adds a cluster component and makes the architecture more complex.

- If we ever wanted to collect bookie-internal indicators, an external component could hardly do so. For now, though, bookie-internal indicators should not be considered anyway: the current indicators are sufficient, and using too many not only complicates the logic but easily worsens the result. Pulsar's load balancer taking direct memory usage into account is a negative example; the new version of the Pulsar code has disabled direct memory consideration by default:

https://github.com/apache/pulsar/pull/21168
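As promised in option 3 above, here is a minimal sketch of that approach (the port, path, payload format, and the way the counters are updated are all hypothetical): the bookie keeps its latest load pair only in memory and serves it over a trivial HTTP endpoint.

    import com.sun.net.httpserver.HttpServer;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class BookieLoadEndpoint {
        // updated in memory by the bookie's own stats thread; never persisted
        private volatile double journalWriteMBps;
        private volatile double ledgerDiskUsage;

        public void start() throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8089), 0);
            server.createContext("/load", exchange -> {
                String body = String.format(
                        "{\"journalWriteMBps\":%.2f,\"ledgerDiskUsage\":%.2f}",
                        journalWriteMBps, ledgerDiskUsage);
                byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, bytes.length);
                exchange.getResponseBody().write(bytes);
                exchange.close();
            });
            server.start();
        }
    }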

Decision maker

broker

The broker's decision maker is the leader broker. It collects the load information of all brokers and all bundles and makes centralized decisions. Since brokers essentially need to decide only once, it makes the optimal decision in one shot; AvgShedder then requires its threshold to be hit multiple times before load balancing is triggered again, which avoids wrong decisions.

bookie

The bookie's decision maker can be implemented in the bookie client, or an external "advisor" component can be introduced. The pros and cons of the two designs are discussed below:

1. Implement it in the bookie client

Advantages: no external component is introduced, which avoids increasing system complexity.

Disadvantages:

- Each broker makes its own decisions, so the traffic skew problem mentioned earlier appears;

- Moreover, every broker has to read the load data, which pressures the storage side, especially if ZK is used for storage.

- It intrudes into the bookie client's code.

- Resources must be spent on frequently pulling and updating load data, otherwise the load balancing effect is poor.

2. Introduce an advisor role

Advantages:

- It does not need high availability; it is a best-effort service that suggests ensembles to the bookie client and helps pick them. If it hangs or times out, the bookie client simply falls back to the original logic.

- Only minor changes are made to the bookie client's code, and there is little risk.

- The round-trip time of packets within the same data center is only a few tens of microseconds, so the latency is acceptable.

- Centralized decision-making avoids the problem of traffic skew, and the same load data only needs to be read once.

Disadvantages: Adding a component is troublesome for O&M.

We ended up implementing it in the Bookie client.

Final architecture

Based on all the preceding analysis and trade-offs, we arrived at the following architecture: each bookie opens a new thread to collect its load data and exposes it through an interface; each broker also opens a new thread that continuously collects the data by polling that interface (e.g. every 3 or 5 seconds) and runs the load balancing algorithm locally each time a ledger is created.

[Figure: load information collection architecture]

The figure above shows how the broker and bookie collect load information and how the broker uses it to run the load balancing algorithm locally.
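A broker-side sketch of the collection thread in that architecture (the endpoint and payload are the hypothetical ones from the earlier option 3 sketch): one scheduled task refreshes an in-memory snapshot of every bookie's load; the ledger-creation path only reads the snapshot, so a slow or dead bookie never blocks ledger creation.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class BookieLoadCollector {
        private final Map<String, BookieLoad> snapshot = new ConcurrentHashMap<>();
        private final HttpClient http = HttpClient.newHttpClient();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // Poll each bookie's (hypothetical) /load endpoint every 5 seconds.
        public void start(List<String> bookieHosts) {
            scheduler.scheduleAtFixedRate(() -> {
                for (String host : bookieHosts) {
                    try {
                        HttpRequest req = HttpRequest
                                .newBuilder(URI.create("http://" + host + ":8089/load"))
                                .timeout(Duration.ofSeconds(1))
                                .build();
                        String body =
                                http.send(req, HttpResponse.BodyHandlers.ofString()).body();
                        snapshot.put(host, parse(body));
                    } catch (Exception e) {
                        // best effort: keep the last known value for this bookie
                    }
                }
            }, 0, 5, TimeUnit.SECONDS);
        }

        // Read by the ledger-creation path; never blocks on the network.
        public Map<String, BookieLoad> snapshot() {
            return snapshot;
        }

        // crude parse of the two-number payload sketched earlier
        private static BookieLoad parse(String json) {
            String[] parts = json.replaceAll("[^0-9.,]", "").split(",");
            return new BookieLoad(Double.parseDouble(parts[0]),
                    Double.parseDouble(parts[1]));
        }
    }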

[Figure: the ledger creation flow]

The diagram above illustrates the process of creating a Ledger.

When asking the bookie client to create a ledger, the caller can supply some data, namely the ledger's estimated space size and estimated write throughput. Since we only consider the broker write path (not auto recovery writes), and org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl (a class in the Pulsar project) is what strings multiple ledgers together into a topic/partition, only ManagedLedgerImpl can know this information. So this data would be collected in ManagedLedgerImpl and passed as a parameter when ManagedLedgerImpl calls the bookie client to create a ledger.

However, this has not been implemented, because adding an input parameter to the ledger-creation API would require too large a change.

Technical Risks

Adding load balancing support to BookKeeper is not very complicated. It does not touch the distributed replication protocol; it only inserts one step when the bookie client creates a ledger: ask the load balancer for a more reasonable ensemble. This inserted step introduces no risk, because even if it fails, the client simply falls back to the original randomized logic. The figures below show the current and modified logic (a code sketch follows them):

[Figure: current logic]

[Figure: modified logic]
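A minimal sketch of the modified logic (LoadBalancer and the simplified newEnsemble signature below are hypothetical stand-ins for the real EnsemblePlacementPolicy API):

    import java.util.List;
    import org.apache.bookkeeper.net.BookieId;

    class EnsembleSelector {
        private final LoadBalancer loadBalancer;       // hypothetical balancer handle
        private final PlacementPolicy placementPolicy; // stand-in for the real policy

        EnsembleSelector(LoadBalancer lb, PlacementPolicy pp) {
            this.loadBalancer = lb;
            this.placementPolicy = pp;
        }

        // Ask the balancer for a better ensemble; on any failure fall back to the
        // unmodified placement policy, so the feature can never make ledger
        // creation less reliable than before.
        List<BookieId> selectEnsemble(int ensembleSize) {
            try {
                List<BookieId> advised = loadBalancer.adviseEnsemble(ensembleSize);
                if (advised != null && advised.size() == ensembleSize) {
                    return advised;
                }
            } catch (Exception e) {
                // balancer unavailable or timed out: best effort only
            }
            return placementPolicy.newEnsemble(ensembleSize); // original randomized logic
        }

        interface LoadBalancer { List<BookieId> adviseEnsemble(int size); }
        interface PlacementPolicy { List<BookieId> newEnsemble(int size); }
    }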

Benefits

The bookie load balancing feature has been running in the production environment with good results: with cluster traffic pressure unchanged, P99 write latency dropped from seconds, or even tens of seconds, to under 1 s. We have contributed the code to the community, and interested readers can pick it up: https://github.com/apache/bookkeeper/issues/4247

To judge the quality of a load balancing algorithm, the most direct way is to look at the indicator we care about, here the P99 write latency. Going a little deeper, we can look at the cluster's overall load distribution, specifically the standard deviation of journal disk write throughput, its range (max minus min), and the top 10 journal disks by write throughput.
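All three distribution metrics are straightforward to compute from the per-disk throughput samples; a sketch:

    import java.util.Arrays;
    import java.util.Comparator;

    public class ThroughputSummary {
        // Given per-journal-disk write throughput samples (MB/s), compute the
        // standard deviation, the range (max minus min), and the top-10 values.
        static void summarize(double[] mbps) {
            double mean = Arrays.stream(mbps).average().orElse(0);
            double stddev = Math.sqrt(
                    Arrays.stream(mbps).map(v -> (v - mean) * (v - mean)).average().orElse(0));
            double range = Arrays.stream(mbps).max().orElse(0)
                    - Arrays.stream(mbps).min().orElse(0);
            double[] top10 = Arrays.stream(mbps).boxed()
                    .sorted(Comparator.reverseOrder())
                    .limit(10)
                    .mapToDouble(Double::doubleValue)
                    .toArray();
            System.out.printf("stddev=%.2f MB/s, range=%.2f MB/s, top10=%s%n",
                    stddev, range, Arrays.toString(top10));
        }
    }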

Here are the before-and-after comparisons of these indicators (the feature went live on 1.10):

• P99 write latency

[Figure: P99 write latency before and after]

The write latency drops markedly, from peaks of ten or even twenty seconds to under 1 s.

• Standard deviation of journal disk write throughput

[Figure: standard deviation of journal disk write throughput]

The peak standard deviation of journal disk write throughput dropped from 25 MB/s to 21 MB/s.

• Range of journal disk write throughput

[Figure: range (max minus min) of journal disk write throughput]

The peak write throughput range dropped from 150 MB/s to 130 MB/s.

• Top 10 journal disks by write throughput

[Figure: top 10 journal disks by write throughput]

The top 10 journal disk write throughput also decreased significantly.

Summary

BIGO has gained significant benefits from adding a load balancing mechanism to bookies: it removed the need to expand the cluster just to relieve a few overloaded nodes, significantly reducing costs. Indeed, adding machines would not necessarily have solved the high latency problem, because the essence of the problem was not insufficient resources but load imbalance.

Adding a load balancing mechanism to bookies invited challenges: this is no small feature, and the community was not working on it. Could we pull it off? Would it pay off? After detailed research and analysis, these questions were answered one by one with solid data and facts, and the expected results were ultimately achieved.

Author: BIGO programmer

Source: WeChat public account BIGO Technology

Source: https://mp.weixin.qq.com/s/doNz8tbE-GNZmqzY5PQqaA
