Demystifying adaptive circuit breaking and rate limiting

Author: Flash Gene

Preface

Adaptive circuit breaking and rate limiting are common mechanisms in distributed systems for protecting services against cascading failures (service avalanches) and traffic bursts. They automatically adjust the protection policy based on the system's load and performance metrics so that the system keeps providing stable, reliable service. The approach has already been explored and put into practice across the industry.

Use cases for circuit breaking and rate limiting

Rate limiting

Rate limiting mainly addresses traffic surges, such as promotional events and malicious external traffic. When traffic spikes and nothing limits it, the system can become saturated, crash, and end up unavailable.

With rate limiting in place, the service accepts only the share of requests it can handle, even during a traffic burst. System load stays under control and the service remains stable and available.
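
To make the idea concrete, here is a minimal sketch of a fixed-threshold limiter using a token bucket. The article does not say which limiter it has in mind, so the class name, the parameters, and the choice of token bucket are all assumptions made for illustration. Time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Illustrative token-bucket limiter (hypothetical, not the article's code)."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start with a full bucket
        self.last = 0.0               # timestamp of the previous call

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # over the limit: reject immediately


bucket = TokenBucket(rate_per_sec=100, capacity=100)
# A burst of 1000 requests arriving at the same instant.
admitted = sum(bucket.allow(now=0.0) for _ in range(1000))
```

Only as many requests as the bucket holds are admitted during the burst; the rest fail fast instead of piling up inside the service.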

Circuit breaking

Circuit breaking mainly prevents service avalanches when a dependency fails, for example when a downstream service times out or a database goes down. When a downstream failure causes timeouts, upstream requests block while waiting for a response that does not arrive, and they hold on to system resources such as the Tomcat thread pool and connection pools. If the fault persists, blocked requests keep accumulating until the upstream thread pool is exhausted and the upstream service itself becomes unavailable. The failure then cascades further up the call chain, and in the end a fault in one underlying service brings down the entire link.

With a circuit breaker, calls to a failed downstream fail fast, so the downstream failure cannot exhaust the caller's resources and the system stays available. After the breaker trips, it probes the downstream periodically; once the downstream has recovered, the breaker closes and downstream calls resume automatically.
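
The fail-fast and periodic-probe behavior described above can be sketched as a small state machine. This is a generic illustration of a traditional breaker, not code from Hystrix or Sentinel; the class name, the consecutive-failure threshold, and the probe interval are assumptions.

```python
class CircuitBreaker:
    """Illustrative closed / open / half-open breaker (hypothetical)."""

    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold: int = 5, probe_interval: float = 10.0) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.probe_interval = probe_interval        # seconds to wait before probing
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self, now: float) -> bool:
        if self.state == self.CLOSED:
            return True
        if self.state == self.OPEN and now - self.opened_at >= self.probe_interval:
            self.state = self.HALF_OPEN
            return True    # this request is the probe
        return False       # fail fast: still open, or a probe is already pending

    def record(self, success: bool, now: float) -> None:
        if success:
            # Downstream answered: close the breaker and reset the counter.
            self.state = self.CLOSED
            self.failures = 0
            return
        self.failures += 1
        if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = self.OPEN
            self.opened_at = now
```

A caller checks `allow(now)` before each downstream call and reports the outcome with `record(success, now)`. While the breaker is open, requests are rejected without touching the downstream, so threads and connections are not tied up.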

Problems with traditional circuit breaking and rate limiting

Traditional circuit breaking and rate limiting rely mainly on client-side libraries such as Hystrix and Sentinel, which have several drawbacks:

  1. Manual assessment is error-prone: things get overlooked, and intervention usually happens only after the system has already failed, by which point the business may have been affected.
  2. Thresholds are hard to set and costly to configure: the right values depend on stress testing, which is expensive in production, and the rules have many parameters, so there is a learning curve.
  3. Thresholds go stale: the system keeps iterating, so a threshold that was right for one version can become wrong for the next and leave the system unprotected.
  4. Cross-language support is poor: the existing components mainly target the Java ecosystem and offer limited support for Python, Go, and other languages.

Advantages of adaptive circuit breaking and rate limiting

  1. It works as a fallback strategy: the adaptive policy only triggers at a relatively high threshold, so it can be configured and enabled in advance as a safety net.
  2. The barrier to entry is low: the policy is adaptive, with no complex configuration items and no stress testing required. Users only choose whether to enable it. Because the policy is driven by system load, it does not go stale as versions iterate.
  3. Adoption costs nothing and works across languages: the feature is implemented in the service mesh, and the actual rate limiting and circuit breaking happen in the sidecar. Applications need no client library, so it applies to any language ecosystem.

Adaptive circuit breaking and rate limiting strategies

Adaptive rate limiting

  • Core logic: the policy adapts to the resource watermark. It adjusts the allowed QPS based on the error between current CPU utilization and the target value, so that CPU utilization converges on the target.
  • Core algorithm: PID algorithm
  • Industry practices: Taobao NOAH, Ant MOSN, and others
  • Algorithm overview: The PID algorithm uses feedback to measure a deviation (error) signal and uses that signal to drive the controlled quantity. The controller output is the sum of proportional, integral, and derivative terms. PID control is widely used in quadcopters, self-balancing scooters, automotive cruise control, temperature controllers, and similar systems. The standard form is u(t) = Kp·e(t) + Ki·∫e(t)dt + Kd·de(t)/dt, where Kp is the proportional gain, which controls the size of each adjustment; Ki is the integral coefficient, which compensates for accumulated error; Kd is the derivative coefficient, which damps fluctuations; e(t) is the error between CPU utilization and the baseline; and u(t) is the adjusted QPS. The baseline is set to 80%.
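
As a rough illustration of the PID idea, the sketch below computes a new QPS limit from the CPU error in discrete time steps. The class name, the gain values, and the decision to apply u(t) as a relative change to the current limit are assumptions made for this example; the article does not give its tuning or the exact way u(t) is applied.

```python
class PIDLimiter:
    """Illustrative discrete PID controller for a QPS limit (hypothetical)."""

    def __init__(self, kp: float, ki: float, kd: float,
                 target_cpu: float = 0.80, min_qps: float = 1.0) -> None:
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target_cpu = target_cpu   # the 80% baseline from the text
        self.min_qps = min_qps         # never throttle to zero
        self.integral = 0.0
        self.prev_error = None

    def next_limit(self, current_limit: float, cpu: float, dt: float = 1.0) -> float:
        # e(t): positive when CPU is below target (room to admit more traffic).
        error = self.target_cpu - cpu
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        # u(t) = Kp*e + Ki*integral(e) + Kd*de/dt
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Assumption: u is applied as a relative change to the current limit.
        return max(self.min_qps, current_limit * (1.0 + u))


pid = PIDLimiter(kp=1.0, ki=0.1, kd=0.05)
limit = pid.next_limit(current_limit=1000.0, cpu=0.95)  # CPU above 80%: limit drops
```

Calling `next_limit` once per sampling period with the latest CPU reading lowers the limit while CPU is above the baseline and raises it again once CPU falls below.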

Adaptive circuit breaking

  • Core logic: the policy calculates the probability of a request being rejected by the downstream and uses it to decide whether to break the request.
  • Core algorithm: SRE algorithm
  • Industry practices: Bilibili Kratos, go-zero, Xiaomi, QQ Music microservices, and others
  • Algorithm overview: The SRE algorithm is an adaptive circuit breaking algorithm described by Google in the "Handling Overload" chapter of its SRE book, where it is called client-side throttling. Unlike traditional circuit breakers, it has no half-open state and never opens fully. Instead it controls how much traffic is sent by calculating the downstream's rejection rate: while protecting the caller from being dragged down by the downstream, it lets as many requests through as possible to keep the business as intact as possible. Each request is rejected locally with probability max(0, (requests - K × accepts) / (requests + 1)), where requests is the total number of requests in the time window, accepts is the number of successful requests, and K is a multiplier: the smaller K is, the more readily requests are dropped. The recommended range for K is [1.5, 2].
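
The rejection probability from the "Handling Overload" chapter, max(0, (requests - K × accepts) / (requests + 1)), fits in a few lines. The function names and the injectable random source below are illustrative choices, and the sliding time window that maintains `requests` and `accepts` is left out.

```python
import random


def reject_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Client-side rejection probability for the current window."""
    return max(0.0, (requests - k * accepts) / (requests + 1))


def should_reject(requests: int, accepts: int, k: float = 2.0, rng=random.random) -> bool:
    # Drop the request locally with the computed probability.
    return rng() < reject_probability(requests, accepts, k)


healthy = reject_probability(requests=100, accepts=100)       # downstream fine: 0.0
degraded = reject_probability(requests=100, accepts=40)       # (100 - 80) / 101
stricter = reject_probability(requests=100, accepts=40, k=1.5)  # (100 - 60) / 101
```

With a healthy downstream the probability is zero, so nothing is dropped. As the success count falls, the probability rises smoothly instead of flipping between open and closed, and a smaller K makes the breaker react earlier.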

Adaptive circuit breaking and rate limiting architecture

  • Metrics collection and dumping: OpenTelemetry (OTEL) collects the circuit breaking and rate limiting metrics. During collection, OTEL periodically discovers nodes through the adaptive circuit breaking and rate limiting platform, and it collects metrics only from instances that have the feature enabled. The collected metrics are pushed to Kafka in batches, consumed by the data dump module, preprocessed, and stored in Redis.
  • Adaptive computing engine: Scans for trigger conditions based on the enabled rules. It reads the relevant monitoring metrics from Redis and applies the default adaptive policy to decide whether circuit breaking or rate limiting should trigger. The rules include many guards against false triggers, such as filtering out load increases that are not caused by traffic. Once triggered, the adaptive policy periodically recalculates and updates the circuit breaking and rate limiting thresholds until the recovery conditions are met.
  • Rule conversion: The rule conversion module listens for issued circuit breaking and rate limiting instructions and converts them into flow control rules for the sidecar, packaged as an EnvoyFilter CRD. The mesh control plane then delivers the EnvoyFilter to the sidecar over the xDS protocol.
  • Sidecar: Enforces flow control on ingress (rate limiting) or egress (circuit breaking) according to the flow control xDS configuration.
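
The false-trigger guard mentioned for the computing engine can be sketched as follows. Everything here is hypothetical: the article does not list its actual conditions, so the `Snapshot` fields, the baseline-QPS metric, and the 1.2× growth factor only stand in for "load increases that are not caused by traffic".

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    cpu: float           # CPU utilization in the range 0..1
    qps: float           # current inbound QPS
    baseline_qps: float  # recent "normal" QPS (hypothetical metric)


def should_trigger_rate_limiting(s: Snapshot, cpu_target: float = 0.80,
                                 min_qps_growth: float = 1.2) -> bool:
    """Trigger only when high CPU coincides with a traffic increase."""
    if s.cpu <= cpu_target:
        return False
    # High CPU without traffic growth (for example a batch job or GC pressure)
    # would not be relieved by shedding requests, so it is filtered out.
    return s.qps >= s.baseline_qps * min_qps_growth


traffic_spike = Snapshot(cpu=0.93, qps=2400.0, baseline_qps=1000.0)
background_job = Snapshot(cpu=0.93, qps=1010.0, baseline_qps=1000.0)
```

Here `traffic_spike` would trigger rate limiting while `background_job` would not, even though both show the same CPU reading.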

Problems and optimizations

1. xDS delivery performance

Adding an EnvoyFilter triggers an xDS push to every pod in the same namespace. Because adaptive tuning changes the EnvoyFilter frequently, pushes become very frequent and the load on the Istio control plane rises significantly.

2. Implementing adaptive circuit breaking on Envoy

There was no open-source precedent to draw on for implementing adaptive circuit breaking on Envoy. The SRE algorithm needs something like probabilistic rate limiting, which Envoy does not provide natively. After research and testing, we simulated probabilistic rate limiting by combining a very long token fill interval with the percentage of requests on which rate limiting takes effect. Across multiple stress tests the error stayed within 2%, which meets the practical requirements for circuit breaking.
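
One way to read this trick is in terms of Envoy's local rate limit HTTP filter: a token bucket that almost never refills is effectively empty, so any request on which the limit is enforced gets rejected, and the enforcement percentage becomes the rejection probability. The sketch below builds such a filter fragment as a plain dictionary. The field names come from Envoy's `LocalRateLimit` v3 configuration, but whether the author used this filter or these values is an assumption, and the function name, `stat_prefix`, and one-hour interval are made up for illustration. Wrapping the fragment in a full EnvoyFilter CRD is omitted.

```python
def probabilistic_reject_filter(reject_percent: int,
                                stat_prefix: str = "adaptive_breaker") -> dict:
    """Build a local rate limit filter fragment that rejects about reject_percent% of requests."""
    if not 0 <= reject_percent <= 100:
        raise ValueError("reject_percent must be between 0 and 100")
    return {
        "name": "envoy.filters.http.local_ratelimit",
        "typed_config": {
            "@type": "type.googleapis.com/"
                     "envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit",
            "stat_prefix": stat_prefix,
            # One token refilled per hour: the bucket is empty almost all the time.
            "token_bucket": {"max_tokens": 1, "tokens_per_fill": 1,
                             "fill_interval": "3600s"},
            # Evaluate the limiter on every request...
            "filter_enabled": {
                "default_value": {"numerator": 100, "denominator": "HUNDRED"}},
            # ...but enforce the (always exceeded) limit on only a fraction of them.
            "filter_enforced": {
                "default_value": {"numerator": reject_percent, "denominator": "HUNDRED"}},
        },
    }


cfg = probabilistic_reject_filter(30)
```

The computing engine would then only need to update `reject_percent` as the SRE rejection probability changes.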

3. Dimension explosion in API monitoring metrics

Adaptive circuit breaking needs fine-grained targets, down to the API level. It mainly applies to egress calls, and when circuit breaking is needed the downstream usually returns no response, so the interface path cannot be obtained with Spring's servletRequest.getAttribute(HandlerMapping.BEST_MATCHING_PATTERN_ATTRIBUTE). When paths contain variables, using the raw path as a metric dimension makes the number of distinct values explode. To solve this, we developed a template-matching EnvoyFilter. When circuit breaking is enabled for an interface, the template-matching EnvoyFilter is issued. Each downstream call is then matched against the templates, and adaptive circuit breaking applies only to requests that match.
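
The effect of template matching can be shown outside Envoy with a small sketch. The `{name}` placeholder syntax and the function names are assumptions for illustration; the article does not describe the template format its EnvoyFilter uses.

```python
import re
from typing import Iterable, Optional


def compile_template(template: str) -> "re.Pattern[str]":
    """Turn '/users/{id}/orders/{orderId}' into a regex with one segment per variable."""
    literal_parts = re.split(r"\{[^/{}]+\}", template)
    return re.compile("[^/]+".join(re.escape(p) for p in literal_parts))


def match_template(path: str, templates: Iterable[str]) -> Optional[str]:
    """Return the first template the request path fits, or None."""
    path = path.split("?", 1)[0]  # ignore the query string
    for template in templates:
        if compile_template(template).fullmatch(path):
            return template
    return None


templates = ["/users/{id}", "/users/{id}/orders/{orderId}"]
label = match_template("/users/42/orders/7?expand=items", templates)
```

Every concrete path such as `/users/42/orders/7` collapses to one template label, so the metric has one value per API rather than one per user or order ID. Paths that match no template return `None` and are left alone.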

We also worked through and optimized many other difficult problems, including algorithm tuning, customizing Istio monitoring metrics, fine-grained rate limiting and benchmarking on Envoy, and API Server performance.

Adaptive circuit breaking and rate limiting platform

Adaptive circuit breaking and rate limiting settings

Adaptive rate limiting has no complicated rule parameters. You only turn the feature switch on or off, and gradual (canary) rollout at the instance level is supported.

Adaptive circuit breaking can be configured at the API level or the site level. The API and site lists are extracted automatically from monitoring data, and there are no additional rule parameters: users only need to turn the switch on or off.

Adaptive circuit breaking and rate limiting details and snapshots

In addition to basic site and instance information, each adaptive rate limiting record contains the cause of the trigger, a description of the recovery conditions, and snapshots of monitoring metrics before and after both trigger and recovery, which makes troubleshooting and analysis straightforward. When traffic causes a CPU spike, adaptive rate limiting steps in and reduces CPU load until it approaches the target watermark, keeping the service available. Because some requests fail fast, overall throughput actually improves.

Like the rate limiting details, each adaptive circuit breaking record consists of basic information and snapshots of monitoring metrics. The snapshots show that when a large number of downstream timeouts occur, adaptive circuit breaking steps in and most requests fail fast, which keeps the service from being dragged down by the downstream.

Unlike a traditional circuit breaker, however, the adaptive breaker always lets as many requests through to the downstream as it can. When the downstream recovers from the timeouts, traffic recovers quickly too. The service is protected from the downstream while the business stays as intact as possible.

Conclusion

We have completed the first phase of adaptive rate limiting and circuit breaking, including pilots and rollout on some sites. Going forward we plan to explore further: refining adaptive rate limiting to the interface level so that it targets only the interfaces causing the load increase, making metrics more fine-grained, and improving how quickly adaptive policies trigger and recover, so that the service protection mechanism becomes more accurate, efficient, and stable.

About the Author:

Chasen is currently a development expert working on foundational frameworks.

Source: WeChat official account "Auction yard"

Source: https://mp.weixin.qq.com/s/m9ujAciRoIrs5GmiYSX-TQ