
How to improve low latency performance in large-scale HBase clusters

Author | Bryan Beaudreault

Translated by | Sambodhi

Planning | Tina

HubSpot's data infrastructure team, which serves more than 2.5PB of low-latency traffic per day, has seen firsthand how important Locality is to HBase performance. Read on to learn what Locality is, why it matters so much, and how we made maintaining Locality a non-issue across our growing HBase clusters.

Some of HubSpot's largest datasets are stored in HBase, an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. We have nearly 100 production HBase clusters comprising more than 7,000 RegionServers across two Amazon Web Services (AWS) regions. These clusters serve more than 2.5PB of low-latency traffic per day, and because each AWS region is made up of multiple data centers, we believe Locality is key to keeping those latencies low.

What is Locality?

HBase data is stored in HDFS, which by default keeps 3 replicas of your data:

The first replica is written locally to the client (the HBase RegionServer), if possible.

The second replica is written to a host on a different rack from the first.

The third replica is written to a different host on that second rack.

All of this is good practice, but HBase data is also split into contiguous chunks called regions. Regions must be able to move quickly between hosts so that availability is maintained when a hosting RegionServer crashes. To keep moves fast, the underlying data blocks are not moved when a region moves: HBase can still serve the region's data by fetching it remotely from any of the 3 replica hosts that remain available.

In a single, highly optimized data center, reading from a remote host has minimal impact on latency. In the cloud, those replica hosts may not be in the same building as the requesting RegionServer, or even in the same part of the region. When running a latency-sensitive application, this extra network hop can have a significant impact on end-user performance.

Locality is a measure of how much of a RegionServer's data is stored locally at a given point in time, and it's a metric we monitor very closely at HubSpot. Beyond avoiding the network hop, Locality also matters because of HDFS's "short-circuit read" feature. When data is local, a short-circuit read lets the client (HBase) read the data file directly from disk without going through the centralized HDFS DataNode process. This removes latency that would otherwise come from the TCP stack or the DataNode process itself.
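For reference, short-circuit reads are a standard HDFS client feature that must be enabled on both the DataNodes and the clients. A minimal sketch of the client-side settings, assuming the DataNode exposes a matching domain socket (the socket path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitReadConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Enable HDFS short-circuit reads on the client side.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // Domain socket shared with the DataNode; the path is illustrative and must
    // match the DataNode's own dfs.domain.socket.path setting.
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
    // Pass this Configuration to the HDFS/HBase client that opens the files.
  }
}
```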

Over the past few years, we have run regular performance tests to confirm the impact of Locality on latency. Here are some results from our most recent test:


All four charts cover the same time window. At the start of that window, I deliberately disrupted Locality on the cluster. As you can see, single-Get latency (top left) was not greatly affected, but throughput (top right) dropped significantly. MultiGets were visibly impacted in both latency (bottom left) and throughput (bottom right). We've found that, in general, MultiGets are especially sensitive to latency regressions because they hit multiple RegionServers and are generally at least as slow as the slowest target.

Here's an example of a production incident we encountered a few months ago:

This cluster has a relatively small dataset but a heavy load. The left axis (light purple) is 99th-percentile latency, and the right axis (blue dashed line) is Locality. During this incident, Locality was around 10% and latency ranged between 2 and 8 seconds. Using the tools we'll discuss in this article, we solved the problem around 11:00: within just a few minutes we raised Locality to 100% and brought latency down to under 1 second.

Resolve Locality issues

Locality can drop for different reasons, all of which stem from regions moving:

A RegionServer can crash, causing all of its regions to be redistributed randomly across the cluster.

The Balancer may move some regions to better distribute request load.

You might scale the cluster up or down, causing regions to move to accommodate the new size.

All three of these happen routinely for us. When Locality drops, you have two options:

Use the Balancer to move regions back to hosts where they have good Locality. This is rarely a good option.


Rewrite the data locally with a "major compaction" (a minimal example of triggering one follows this list).
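One common way to request option 2 is through the standard HBase Admin API. A minimal sketch, assuming an HBase 2.x client (the table name is illustrative; the call is asynchronous, with the RegionServers queuing and running the compactions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TriggerMajorCompaction {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      // Request a major compaction of every region in the table.
      admin.majorCompact(TableName.valueOf("my_table"));
    }
  }
}
```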

Data in HBase is initially written to memory. When the in-memory data reaches a certain threshold, it is flushed to disk, forming an immutable StoreFile. Because StoreFiles are immutable, updates and deletes don't modify the data in place; instead they are written to new StoreFiles, along with other new data. Over time you accumulate many StoreFiles that must be reconciled against older data at read time. This on-the-fly reconciliation slows down reads, so background maintenance tasks run to merge StoreFiles. These tasks are called "compactions," and they come in two types: minor compactions and major compactions.

Minor compactions simply merge smaller, adjacent StoreFiles into larger ones to reduce the number of files a read has to search. Major compactions rewrite all of a region's StoreFiles, folding all updated and deleted data into a single new StoreFile.

Going back to the definition of Locality, our goal is to ensure that the new hosting server has a local replica of every block in every StoreFile. With existing tools, the only way to do that is to rewrite the data, which runs it back through the block placement policy described above. Major compactions can accomplish this because they rewrite all the data. Unfortunately, they are also very expensive:

The compaction must read the entire region's data, filter out irrelevant data, and then write the data back out.

Reading involves decompressing and decoding, which costs CPU. Filtering costs CPU. Writing requires encoding and compressing the new data, which costs still more CPU.

Writing also incurs 3x replication, so compacting a 10GB region results in 10GB of reads and 30GB of writes.

Whatever our Locality goals, this cost applies to every major compaction and shows up in long-tail latencies. Typically you only want to run major compactions during off-peak hours to minimize the impact on end users. But Locality matters most during peak traffic, which means you may suffer for hours while waiting for off-peak compaction work to begin.

When compacting for Locality there's also hidden waste: it's quite possible that only some of a region's StoreFiles have bad Locality. A major compaction rewrites all StoreFiles, so if only 1GB of a 10GB region is non-local, then from a Locality perspective 9GB of that effort is wasted.

Below is a chart from one of our clusters where we tried to fix Locality by compacting:

This chart shows a relatively large HBase cluster, with each line representing the Locality of a single RegionServer. Over about 6 hours we slowly raised Locality to a peak of only about 85% before running out of time. A few hours later, an incident undid part of that work, and because of the cluster's load pattern we had to wait until the next night to resume compacting.

This scenario has repeated itself over the years. As we grew, we found that compactions simply couldn't repair Locality fast enough to meet our SLOs. We set out to find a better way.

Cut costs and turn hours into minutes

I've been working with HBase on and off for years, and fixing Locality with compactions has always been disappointing. I've long known about tools like the HDFS Balancer and Mover, which do low-level block moves. A similar tool that used low-level block moves to fix Locality would be attractive for several reasons:

You're just moving bytes from one host to another without having to deal with compaction, encoding, or expensive filtering.

You're doing a 1-to-1 transfer rather than writing new blocks, so the 3x replication cost doesn't apply.

You can choose to move only the replicas of non-local blocks and leave all other blocks alone.

Block-transfer bandwidth can be tuned in real time with dfsadmin -setBalancerBandwidth, and it scales nicely with cluster size (see the sketch after this list).
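As an aside, the same knob that the hdfs dfsadmin -setBalancerBandwidth command adjusts is also exposed through the HDFS Java client. A minimal sketch, assuming fs.defaultFS points at the target HDFS cluster (the 100 MB/s value is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SetBalancerBandwidth {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      // Works when the default filesystem is HDFS.
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // Ask every DataNode to cap balancer/mover block transfers at ~100 MB/s.
      dfs.setBalancerBandwidth(100L * 1024 * 1024);
    }
  }
}
```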

With this project, I wanted to see whether we could build something similar to improve Locality for our low-latency applications.

Existing components

To build our own block mover, the first step was to understand all of the components involved in moving blocks, reading blocks, and computing Locality. This section is a bit deep in the weeds, so if you just want to hear about our solution and results, skip ahead to the next section.

Dispatcher

At the core of both the Balancer and the Mover is the Dispatcher. Both tools pass PendingMove objects to the Dispatcher, which handles issuing replaceBlock calls to remote DataNodes. A typical block size for an HBase cluster is 128MB, while regions are typically multiple gigabytes. So a region may have a dozen to hundreds of blocks, a RegionServer may host 50 to hundreds of regions, and a cluster may have hundreds or thousands of RegionServers. The Dispatcher's job is to execute many of these replaceBlock calls in parallel, tracking progress as the remote DataNodes copy the data. The Dispatcher does not choose which replicas to move, so I had to build a process that detects low-Locality regions and converts them into PendingMove objects.

Replacing blocks

A remote DataNode receives the replaceBlock call from the Dispatcher, which includes a block ID, a source, and a destination. The block data is then streamed from the source to the destination through a proxying DataNode.

When the destination DataNode finishes receiving the block, it notifies the NameNode with a RECEIVED_BLOCK status update. That status update includes a reference to the DataNode that used to host the block. The NameNode updates its in-memory block records and marks the replica on the old DataNode as PendingDeletion. At this point, a request for the block's locations will return both the new and the old DataNode.

The next time the old DataNode reports in, the NameNode responds, "Thanks, now please delete this block." When that DataNode finishes deleting the block, it sends a DELETED_BLOCK status update to the NameNode, which then removes the replica from its in-memory records. At this point, a request for the block's locations will return only the new DataNode.

Reading data with DFSInputStream

When HBase opens a StoreFile, it creates a persistent DFSInputStream that serves all ReadType.PREAD reads of that file. When STREAM reads come in, it opens an additional DFSInputStream, but PREAD reads are the most latency-sensitive. When a DFSInputStream is opened, it fetches the current block locations for the first few blocks of the file. As data is read, those block locations are used to decide which DataNode to fetch block data from.
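For context, a positioned ("pread") read against an open HDFS stream looks roughly like the following sketch; the file path, offset, and buffer size are illustrative, and HBase keeps one such stream open per StoreFile:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf);
         // Illustrative path; in HBase this would be a StoreFile under /hbase/data.
         FSDataInputStream in = fs.open(new Path("/hbase/data/default/t1/r1/cf/storefile"))) {
      byte[] buf = new byte[64 * 1024];
      // Positioned read: fetch 64KB starting at byte offset 1MB without moving the
      // stream's seek position, using the stream's cached block locations.
      int n = in.read(1024L * 1024, buf, 0, buf.length);
      System.out.println("read " + n + " bytes");
    }
  }
}
```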

If a DFSInputStream tries to serve data from a DataNode that has crashed, that DataNode is added to a deadNodes list and excluded from future requests; the DFSInputStream then retries from one of the other block locations not on the deadNodes list. A similar process happens if the DFSInputStream tries to fetch a block from a DataNode that no longer serves it: a ReplicaNotFoundException is thrown, and that DataNode is likewise added to the deadNodes list.

A single StoreFile can range from a few kilobytes to a dozen or more gigabytes. With a block size of 128MB, that means a single StoreFile may have hundreds of blocks. If a StoreFile has low Locality (few local replicas), those blocks are scattered across the rest of the cluster. As DataNodes get added to the deadNodes list, it becomes more likely that you'll hit a block whose locations are all on the deadNodes list. At that point, the DFSInputStream backs off (3 seconds by default) and re-fetches all of the block locations from the NameNode.

Absent a more systemic problem, all of this error handling results in only a transient increase in latency and no client-facing exceptions. Unfortunately, the latency impact is noticeable (especially the 3-second backoff), so there was room for improvement here.

Reporting and making decisions based on Locality

When a StoreFile is opened on a RegionServer, the RegionServer itself calls the NameNode to fetch all of the block locations in that StoreFile. Those locations are accumulated into a per-StoreFile HDFSBlockDistribution object. These objects are used to calculate a localityIndex, which is reported through JMX, the RegionServer web UI, and the Admin API. The RegionServer also uses the localityIndex for certain compaction-related decisions, and it reports the localityIndex of each region to the HMaster.
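Conceptually, the localityIndex is the fraction of a file's block bytes that have a replica on the local host. A self-contained illustration of that calculation (not HBase's actual block-distribution class, just the same idea):

```java
import java.util.List;

public class LocalityIndex {
  /** One HDFS block: its size in bytes and the hosts holding a replica. */
  record BlockInfo(long sizeBytes, List<String> replicaHosts) {}

  /** Fraction of block bytes with a replica on localHost (0.0 to 1.0). */
  static double localityIndex(List<BlockInfo> blocks, String localHost) {
    long totalBytes = 0;
    long localBytes = 0;
    for (BlockInfo block : blocks) {
      totalBytes += block.sizeBytes();
      if (block.replicaHosts().contains(localHost)) {
        localBytes += block.sizeBytes();
      }
    }
    return totalBytes == 0 ? 1.0 : (double) localBytes / totalBytes;
  }
}
```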

The HMaster is the HBase process that runs HBase's own Balancer. The Balancer tries to balance regions across the HBase cluster based on a number of cost functions: read requests, write requests, StoreFile count, StoreFile size, and so on. One key metric it tries to balance on is Locality.

The Balancer works by calculating the cost of the cluster, pretending to move a region to a random server, and then recalculating the cost. If the cost goes down, the move is accepted; otherwise it tries a different move. To balance on Locality, it can't simply use the localityIndex reported by RegionServers, because it needs to compute what the Locality cost would be if a region moved to a different server. So the Balancer also maintains its own cache of HDFSBlockDistribution objects.

LocalityHealer

With an understanding of the existing components, I started work on a new daemon, which we affectionately call the LocalityHealer. Digging further into the Mover tool, I came up with a design for how the daemon would work. The key to this work comes in two parts:

Discovering which regions need healing.

Converting those regions into PendingMoves for the Dispatcher.

HBase provides an Admin#getClusterMetrics() method for polling the state of the cluster. The return value includes, among lots of other data, RegionMetrics for every region in the cluster. RegionMetrics includes a getDataLocality() method, which is exactly what we want to monitor. So the first component of the daemon is a monitoring thread that continually polls for regions whose getDataLocality() is below our threshold.
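A minimal sketch of that monitoring step using the standard HBase 2.x client API; the 0.9 threshold is illustrative, and a real daemon would run this in a loop rather than a main method:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.ServerMetrics;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class LowLocalityScanner {
  public static void main(String[] args) throws Exception {
    float threshold = 0.9f; // illustrative threshold
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      List<String> lowLocalityRegions = new ArrayList<>();
      // ClusterMetrics includes per-server, per-region metrics.
      for (ServerMetrics server : admin.getClusterMetrics().getLiveServerMetrics().values()) {
        for (RegionMetrics region : server.getRegionMetrics().values()) {
          if (region.getDataLocality() < threshold) {
            lowLocalityRegions.add(region.getNameAsString());
          }
        }
      }
      System.out.println("Regions below threshold: " + lowLocalityRegions.size());
    }
  }
}
```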

Once we know which regions need healing, we have the more complicated task of turning them into PendingMoves. A PendingMove needs a block, a source, and a destination, but so far all we have is a list of regions. Each region consists of 1 or more column families, each with 1 or more StoreFiles. So the next step is to recursively search that region's directory on HDFS for StoreFiles. For each StoreFile, we fetch the current block locations, choose one replica of each block as the source, and create a PendingMove for each block whose destination is the currently hosting RegionServer. We choose the move source so that we respect the BlockPlacementPolicy and minimize cross-rack network traffic.
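A rough sketch of how such a scan might look with public HDFS client APIs. The paths and hostnames are illustrative, and the Dispatcher-facing PendingMove types are internal to HDFS, so this only shows the "find non-local blocks" half of the idea:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class NonLocalBlockScanner {
  public static void main(String[] args) throws Exception {
    String regionServerHost = "rs-host.example.com";          // illustrative destination host
    Path regionDir = new Path("/hbase/data/default/t1/abc");  // illustrative region directory
    try (FileSystem fs = FileSystem.get(new Configuration())) {
      RemoteIterator<LocatedFileStatus> files = fs.listFiles(regionDir, true);
      while (files.hasNext()) {
        LocatedFileStatus file = files.next();
        for (BlockLocation block : file.getBlockLocations()) {
          // A block is non-local if no replica lives on the hosting RegionServer.
          boolean local = Arrays.asList(block.getHosts()).contains(regionServerHost);
          if (!local) {
            // In the real tool this would become a PendingMove handed to the Dispatcher;
            // here we just print which block needs to move.
            System.out.println(file.getPath() + " @ offset " + block.getOffset()
                + " has no replica on " + regionServerHost);
          }
        }
      }
    }
  }
}
```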

Once we hand all of the generated PendingMoves to the Dispatcher, we just wait for it to finish. When it's done, we wait a further grace period for our Locality monitor to notice the updated Locality metrics, and then repeat the whole process. The daemon continues this loop until it is shut down. If Locality is 100% (as it usually is now), it sits idle until the monitor thread notices a drop.

Ensuring that reads benefit from the newly improved Locality

So the daemon runs and ensures that the DataNode on each RegionServer hosts a replica of every block for every region on that RegionServer. But a DFSInputStream only fetches block locations when it is opened or under certain conditions, and every StoreFile has a persistent DFSInputStream. If you've been following along, you might realize that with all of these blocks moving around, we would be hitting a lot of ReplicaNotFoundExceptions. That's a pain best avoided.

Shipping quickly with v1

When I first built this system in March, I decided to use a callback into HBase to refresh the open readers. HBase is what I'm most familiar with, so it was the path of least resistance. I added new RPC endpoints to our internal forks of the HMaster and RegionServer. When the LocalityHealer finishes processing all of the StoreFiles for a region, it invokes these new RPCs. The RegionServer RPC is particularly tricky and requires some complicated locking. What it ultimately does is reopen the StoreFiles and then transparently close the old ones in the background. This reopening creates new DFSInputStreams with the correct block locations and updates the reported Locality value.

This system has been deployed and working great ever since, but we are in the middle of a major version upgrade, and the approach needed to be reworked for the new version. That turned out to be much more complicated, so I decided to design a different approach for this part. The rest of this post refers to the new v2 approach, which has been fully deployed since October.

Iterate and adapt

While investigating our major version upgrade, I found that HDFS-15199 had added a feature to DFSInputStream that periodically re-fetches block locations while a stream is open. This seemed like exactly what I wanted, but reading the implementation, I realized the re-fetch was built directly into the read path and happened whether it was needed or not. For that issue's original goal of refreshing locations every few hours, this seems fine, but I need to refresh every few minutes at most. In HDFS-16262, I took the idea and made it asynchronous and conditional.

With that change, a DFSInputStream only re-fetches block locations if there are deadNodes or any non-local blocks. The re-fetch happens outside of any locks, and the new locations are then swapped in quickly under a lock. This should have a very small impact on reads, especially relative to the existing locking semantics of DFSInputStream. With the asynchronous approach, I felt comfortable refreshing on a 30-second timer so we can adapt quickly to block moves.
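As far as I recall, the periodic refresh from HDFS-15199 is controlled by a client-side setting along the following lines; the exact key name and default are an assumption here and should be confirmed against your Hadoop release:

```java
import org.apache.hadoop.conf.Configuration;

public class RefreshIntervalConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed key from HDFS-15199/HDFS-16262; verify against your Hadoop version.
    // A positive value enables periodic block-location refresh on open streams.
    conf.setLong("dfs.client.refresh.read-block-locations.ms", 30_000L);
  }
}
```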

Load testing

This new asynchronous location refresh means that a bunch of DFSInputStreams hit the NameNode at various times. If Locality is good, the number of requests should be zero or close to it. When the LocalityHealer is running, overall cluster Locality stays above 98% almost all of the time, so under normal circumstances I wasn't worried. What I did care about was what things would look like if something went badly wrong and Locality dropped to nearly zero.

We tend to split clusters rather than let any single one grow too large, so our biggest cluster has about 350k StoreFiles. In the worst case, each of those would request locations from the NameNode every 30 seconds, which works out to roughly 12,000 requests per second. I had a hunch this wouldn't be a big problem, since the data is entirely in memory. We run our NameNodes with 8 CPUs and enough memory to cover the block capacity.

HDFS has a built-in NNThroughputBenchmark that simulates exactly the workload I expected. I first tested a 4-CPU NameNode in our QA environment, using 500 threads and 500,000 files. That single load-test instance was able to drive 22k req/s while still leaving 30%-40% of the NameNode's CPU idle. That's nearly twice our worst-case load, which was very promising.

I was curious what production could handle, so I ran the benchmark against a NameNode with 8 CPUs. It easily pushed 24k req/s, but I noticed the NameNode CPU was still mostly idle: I had hit the maximum throughput of the benchmark on the test host I was using. I started a second concurrent benchmark against the same NameNode from another host and saw total throughput jump to over 40k req/s. I kept scaling up and eventually stopped at more than 60k req/s. Even at that level, the NameNode still had 30% to 45% of its CPU idle. I'm confident this load won't be a problem for our NameNodes.

Relieving pain points

Running the LocalityHealer as originally deployed did cause some minor pain. It all comes back to the ReplicaNotFoundException I mentioned earlier, which sometimes leads to costly backoffs. When I first did this work, I submitted HDFS-16155, which adds exponential backoff and let us reduce that 3 seconds to 50 milliseconds. This went a long way toward making the problem manageable, and it was worth it for the long-term improvement in Locality.

While investigating HDFS-16262, I learned more about what happens when a block is replaced and its old replica is invalidated. I touched on this briefly when describing the components above, and it made me realize I could eliminate this pain entirely. What if there were a grace period before the NameNode sends its "please delete this block" message? That idea became HDFS-16261, where I implemented such a grace period.

With that feature, I configured a 1-minute grace period on our clusters. That gives the 30-second refresh in DFSInputStream enough time to pick up the new block locations before the blocks are removed from their old locations. This eliminates the ReplicaNotFoundExceptions, along with any associated retries or costly backoffs.

Reflecting the updated Locality in metrics and the Balancer

The last piece of the puzzle is updating the localityIndex metric I mentioned above, as well as the Balancer's own cache. This work is covered by HBASE-26304.

For the Balancer, I took advantage of the fact that RegionServers report their localityIndex to the HMaster every few seconds. These reports build the ClusterMetrics object you get when calling getClusterMetrics, and that object is also injected into the Balancer. The solution here is simple: when a new ClusterMetrics is injected, compare it with the existing one. Any region whose reported localityIndex has changed is a good sign that our cached HDFSBlockDistribution for it is stale, so treat that as a signal to refresh the cache.

That left making sure RegionServers report the correct localityIndex in the first place. Here, I decided to derive a StoreFile's HDFSBlockDistribution from the underlying persistent DFSInputStream that serves PREAD reads. DFSInputStream exposes a getAllBlocks method whose result we can easily convert into an HDFSBlockDistribution. Previously, a StoreFile's block distribution was computed once when the StoreFile was opened and never changed. Now we derive it from the underlying DFSInputStream, so it automatically changes over time as the DFSInputStream itself reacts to block moves (as described above).
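A rough sketch of that derivation using public HDFS client types. HBase's actual implementation lives in its StoreFile and block-distribution code; a plain map of per-host byte weights is used here just to keep the sketch self-contained:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.hdfs.DFSInputStream;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;

public class BlockDistributionFromStream {
  /**
   * Accumulate per-host block weights from the stream's current view of block
   * locations. The host with the highest weight relative to the file size is
   * the one with the best Locality for this file.
   */
  static Map<String, Long> hostWeights(FSDataInputStream in) throws Exception {
    Map<String, Long> weights = new HashMap<>();
    // The wrapped stream is a DFSInputStream when the file lives on HDFS.
    DFSInputStream dfsIn = (DFSInputStream) in.getWrappedStream();
    for (LocatedBlock block : dfsIn.getAllBlocks()) {
      for (DatanodeInfo location : block.getLocations()) {
        weights.merge(location.getHostName(), block.getBlockSize(), Long::sum);
      }
    }
    return weights;
  }
}
```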

Results

Case study: Managing Locality across 7000 servers

First, I'm going to let the data tell the story. We started rolling out LocalityHealer to some of the more problematic clusters in mid-March and completed the rollout to all clusters in early May.

This chart shows the 25th-percentile Locality across all of our production clusters from March 1, 2021 to June 1, 2021. Before March, we saw many dips below 90%, with some clusters dropping to almost 0%. As we rolled out the LocalityHealer, those dips became less frequent and less severe. Once the LocalityHealer was fully rolled out, they were eliminated entirely.

We like to keep Locality above 90%, and we've noticed that real problems start to emerge when it drops below 80%. Another way to look at this is to count the number of RegionServers with Locality below 80% in a given interval.

This chart covers the same time period as above; you can see that we used to have hundreds of servers below 80% Locality at any given moment. Since the beginning of May, that number has been 0, and it remains there today.

The best part about these charts is that all of this is now automatic. Unfortunately, we don't have metrics on alert volume caused by Locality, but anyone on the HBase team can tell you that they used to get paged almost daily for some cluster's Locality. It was always a nasty alert, because the only thing you could do was kick off major compactions that would take hours to finish. Over time we reduced the sensitivity of the Locality alerts to avoid alert fatigue, but that hurt cluster stability.

With the LocalityHealer, we don't think about Locality anymore. We can make our alerts very sensitive, yet they never fire. Locality is always close to 100%, and we can focus on other operational issues or higher-value work.

Case study: Quickly resolving timeouts caused by poor Locality

Here's an example from a specific cluster hosting only about 15TB. Near the start of the timeline, the brown line is a new server that started at 0% Locality. Compactions ran for about 7 hours and reached only about 75% Locality before time ran out. Later that night more servers were added, but it was too late for them to start compacting (because of other tasks, such as backups, that run in the early morning hours).

When we hit peak traffic the next day, the HBase team was paged by a product team that was running into timeouts at 500 milliseconds, which surfaced as failures for customers. At that point, the HBase team had two options:

Start major compactions, which would further increase latency and take 8+ hours to resolve the issue.

Try the newly built LocalityHealer, which was not yet running as a daemon on this cluster.

They chose the latter, which brought the entire cluster's Locality to 100% in about 3 minutes. Zooming in, you can see the impact below.

In this chart, I've summarized the first chart into a single average Locality line (left axis, blue). Overlaid is the cluster's 99th-percentile latency (right axis, yellow). Throughout the morning, we were hitting more and more timeouts (500 milliseconds, indicated by the gray dotted line). We knew this would become critical as we approached peak traffic, so we ran the LocalityHealer at 11:30. The jump to 100% Locality immediately reduced the latency fluctuations and the timeouts.

Conclusion

The LocalityHealer has changed the game for how HubSpot manages a key performance indicator for its fast-growing clusters. We are currently working on contributing this work to the open-source community, under the umbrella issue HBASE-26250.

That's what our data infrastructure team does every day.

About the Author:

Bryan Beaudreault is a Principal Engineer at HubSpot. He has led several teams there, including founding the data infrastructure team, and helped HubSpot achieve 99.99% uptime across multiple data stores in a highly multi-tenant cloud environment. He later moved back to the product side, working on conversation automation for one of HubSpot's upcoming products.

https://product.hubspot.com/blog/healing-hbase-locality-at-scale
