Introduction
Why introduce a circuit-breaking and isolation mechanism such as Hystrix into a project, and in which scenarios does it help? In a distributed system, a single application usually depends on many external services, typically various RPC services internally and various HTTP services externally. Calls to these dependencies will inevitably fail at some point, through timeouts, exceptions, and so on. Keeping your own application stable when an external dependency misbehaves is the job of service-protection frameworks such as Hystrix. As shown in the following figure, application X depends on services A, B, and C. A and B respond normally while C fails; the question is how to keep the failure of C from affecting calls to A and B, which brings in the concept of isolation.
Hystrix
Hystrix [hɪst'rɪks] means porcupine, an animal that protects itself with the quills on its back. The Hystrix discussed in this article is an open-source fault-tolerance framework from Netflix that likewise provides self-protection.
Hystrix Design Goals
• Protect against and control latency and failures from dependencies that are typically accessed over the network
• Prevent the ripple effect of failures
• Fail fast and recover quickly
• Fall back and degrade gracefully
• Provide near real-time monitoring and alerting
Design principles followed by Hystrix
• Prevent any individual dependency from exhausting resources (threads)
• Shed load and fail fast instead of queuing
• Provide fallbacks wherever possible to protect users from failures
• Use isolation techniques such as bulkheads, swimlanes, and the circuit breaker pattern to limit the impact of any one dependency
• Detect failures promptly through near real-time metrics, monitoring, and alerting
• Recover from failures promptly by changing configuration properties dynamically
• Protect against failures in the entire dependency client execution, not just in network communication
Main process
• Use the command pattern to wrap every call to an external service (or dependency) in a HystrixCommand or HystrixObservableCommand object, and run that object in a separate thread.
• Maintain a thread pool (or semaphore) per dependency; when it is exhausted, reject requests instead of queuing them.
• Record request successes, failures, timeouts, and thread rejections.
• When a service's error percentage exceeds the threshold, trip the circuit breaker automatically and stop all requests to that service for a period of time.
• Run the fallback (degradation) logic when a request fails, is rejected, times out, or is short-circuited.
• Monitor metrics and configuration changes in near real time.
Wrapping requests in Command objects
Class Structure:
Command execution method:
There are 4 ways to execute a Hystrix command.
execute() and queue() are available only on HystrixCommand, while observe() and toObservable() are available on both HystrixCommand and HystrixObservableCommand.
• execute(): blocks, then returns the single response received from the dependency (or throws an exception if an error occurs).
• queue(): returns a Future from which the single response from the dependency can be obtained.
• observe(): subscribes to the Observable that represents the response(s) from the dependency and returns an Observable that replicates that source.
• toObservable(): returns an Observable that, once you subscribe to it, executes the Hystrix command and emits its responses.
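The relationship between the blocking and the asynchronous entry points can be illustrated with plain JDK classes. The sketch below is not Hystrix code: SimpleCommand, its shared pool, and its method bodies are hypothetical stand-ins that only mirror the idea of wrapping a call in a command object with a fallback, where the blocking form waits on the Future produced by the asynchronous form.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a Hystrix-style command; not the real HystrixCommand.
abstract class SimpleCommand<R> {
    // One shared daemon pool keeps the sketch short (an assumption of this example).
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    /** The call to the dependency. */
    protected abstract R run() throws Exception;

    /** Degraded result returned when run() fails. */
    protected abstract R getFallback();

    /** Asynchronous form: starts the call on the pool and hands back a Future. */
    public Future<R> queue() {
        return POOL.submit(() -> {
            try {
                return run();
            } catch (Exception e) {
                return getFallback();
            }
        });
    }

    /** Blocking form: waits on the Future returned by queue(). */
    public R execute() {
        try {
            return queue().get();
        } catch (Exception e) {
            return getFallback();
        }
    }
}
```

A caller subclasses SimpleCommand once per dependency call, so the failure handling stays in one place rather than at every call site.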
Core code (AbstractCommand):
public Observable<R> toObservable() {
    ...
    final Func0<Observable<R>> applyHystrixSemantics = new Func0<Observable<R>>() {
        @Override
        public Observable<R> call() {
            if (commandState.get().equals(CommandState.UNSUBSCRIBED)) {
                return Observable.never();
            }
            return applyHystrixSemantics(_cmd); // 1. Key step: process the command
        }
    };
    ...
private Observable<R> applyHystrixSemantics(final AbstractCommand<R> _cmd) {
    ...
    if (circuitBreaker.attemptExecution()) { // 1. Circuit breaker check, shown later under HystrixCircuitBreaker
        ..
        if (executionSemaphore.tryAcquire()) { // 2. Acquire a semaphore permit; with the THREAD (thread pool) strategy this always returns true, otherwise the flow could not continue
            try {
                executionResult = executionResult.setInvocationStartTime(System.currentTimeMillis());
                return executeCommandAndObserve(_cmd) // 3. Core execution method
                        .doOnError(markExceptionThrown)
                        .doOnTerminate(singleSemaphoreRelease)
                        .doOnUnsubscribe(singleSemaphoreRelease);
            } ...
}
Circuit breaker implementation logic
The figure below shows how HystrixCommand and HystrixObservableCommand interact with HystrixCircuitBreaker.
The circuit breaker opens and closes as follows:
1. Assuming the request volume across the circuit meets a certain threshold (HystrixCommandProperties.circuitBreakerRequestVolumeThreshold())
2. and assuming the error percentage exceeds the configured threshold (HystrixCommandProperties.circuitBreakerErrorThresholdPercentage())
3. the circuit breaker transitions from CLOSED to OPEN.
4. While it is open, it short-circuits all requests made against it.
5. After the sleep window (HystrixCommandProperties.circuitBreakerSleepWindowInMilliseconds()) has passed, the next request is let through (the half-open state). If that request fails, the circuit breaker returns to OPEN for another sleep window; if it succeeds, the circuit breaker transitions to CLOSED and the logic in step 1 takes over again.
Hystrix circuit breaker states:
The circuit breaker has three states: CLOSED, OPEN, and HALF_OPEN. It is CLOSED by default. When it trips, the state changes to OPEN. After the configured sleep window, Hystrix lets one request through to check whether the service has recovered, and during that check the breaker is HALF_OPEN. If the trial request succeeds, the state changes back to CLOSED; if it fails, the breaker returns to OPEN.
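The three states and their transitions can be sketched as a small state machine. This is a simplified illustration, not Hystrix's implementation: the class SimpleCircuitBreaker and its fields are hypothetical, the counters are cumulative instead of a rolling window, and time is passed in explicitly so that the behavior is deterministic.

```java
// Hypothetical, simplified breaker; names and thresholds are illustrative only.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold;
    private final int errorThresholdPercentage;
    private final long sleepWindowMs;

    private State state = State.CLOSED;
    private int total;      // calls seen while CLOSED (cumulative, no rolling window)
    private int errors;     // failed calls seen while CLOSED
    private long openedAt;  // time at which the breaker last tripped

    SimpleCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercentage, long sleepWindowMs) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.sleepWindowMs = sleepWindowMs;
    }

    synchronized State state() { return state; }

    /** Returns true if a call may proceed at time nowMs. */
    synchronized boolean attemptExecution(long nowMs) {
        if (state == State.CLOSED) return true;
        if (state == State.OPEN && nowMs - openedAt >= sleepWindowMs) {
            state = State.HALF_OPEN;  // let exactly one trial request through
            return true;
        }
        return false;                 // OPEN inside the sleep window, or trial in flight
    }

    synchronized void markSuccess() {
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;     // trial succeeded: close and reset the counters
            total = 0;
            errors = 0;
        } else {
            total++;
        }
    }

    synchronized void markFailure(long nowMs) {
        if (state == State.HALF_OPEN) {
            trip(nowMs);              // trial failed: back to OPEN for another window
            return;
        }
        total++;
        errors++;
        if (total >= requestVolumeThreshold && errors * 100 / total >= errorThresholdPercentage) {
            trip(nowMs);
        }
    }

    private void trip(long nowMs) {
        state = State.OPEN;
        openedAt = nowMs;
    }
}
```

Callers wrap each dependency call in attemptExecution, then report the outcome with markSuccess or markFailure.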
Circuit breaker implementation class:
Core Code:
public boolean allowRequest() {
    if (properties.circuitBreakerForceOpen().get()) {
        // properties have asked us to force the circuit open so we will allow NO requests
        return false;
    }
    if (properties.circuitBreakerForceClosed().get()) {
        // we still want to allow isOpen() to perform its calculations so we simulate normal behavior
        isOpen();
        // properties have asked us to ignore errors so we will ignore the results of isOpen and just allow all traffic through
        return true;
    }
    return !isOpen() || allowSingleTest();
}
The code applies the following checks:
1. If the circuit breaker is forced open, return false; the command cannot execute.
2. If the circuit breaker is forced closed, return true; the command can execute.
3. If the circuit breaker is not open, return true; the command can execute. In Hystrix versions whose breaker tracks the time it was opened, this is the circuitOpened.get() == -1 check, where -1 means it has not been opened.
4. Otherwise the circuit breaker is open, so decide whether a single trial request may go through; if it may, the breaker state is changed to HALF_OPEN at the same time.
Circuit breaker parameters:
Isolation:
Hystrix uses the bulkhead pattern to isolate dependencies from each other and to limit concurrent access to any one of them.
Isolation method:
• Thread pool isolation: suited to high-concurrency requests that take a long time (typically heavy computation or database reads). Running them in a dedicated thread pool keeps plenty of container threads available, so container threads are not blocked or left waiting because of a slow service, and failures return quickly.
• Semaphore isolation: suited to high-concurrency requests that take little time (typically light computation or cache reads). Such services usually return very quickly, so they do not hold a container thread for long, and skipping the thread switch removes some overhead, which improves the efficiency of cache-like services.
Item | Thread pool isolation | Semaphore isolation |
Thread | The request thread and the thread that invokes the provider are different threads | The request thread and the thread that invokes the provider are the same thread |
Overhead | Queuing, scheduling, context switching, etc. | No thread switching, low overhead |
Asynchronous | Supported | Not supported |
Concurrency limit | Supported: maximum thread pool size | Supported: maximum semaphore count |
Passing headers | Not supported | Supported |
Timeout support | Supported | Not supported |
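The two isolation styles compared above can be sketched with JDK concurrency primitives. This is an illustration under stated assumptions, not Hystrix's implementation: ThreadPoolBulkhead and SemaphoreBulkhead are hypothetical names, and the pool uses a SynchronousQueue so that a request is rejected rather than queued once every worker is busy.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical thread pool bulkhead: the call runs on a pool thread, not the caller's.
class ThreadPoolBulkhead {
    private final ThreadPoolExecutor pool;

    ThreadPoolBulkhead(int size) {
        pool = new ThreadPoolExecutor(size, size, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),                 // no queue: reject when all threads are busy
                r -> { Thread t = new Thread(r); t.setDaemon(true); return t; },
                new ThreadPoolExecutor.AbortPolicy());
    }

    <T> T call(Callable<T> work, Supplier<T> fallback, long timeoutMs) {
        Future<T> future;
        try {
            future = pool.submit(work);
        } catch (RejectedExecutionException e) {
            return fallback.get();                        // pool exhausted: fail fast
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            future.cancel(true);                          // timeout or failure: stop waiting
            return fallback.get();
        }
    }
}

// Hypothetical semaphore bulkhead: the call runs on the caller's own thread.
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        permits = new Semaphore(maxConcurrent);
    }

    <T> T call(Supplier<T> work, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();                        // over the concurrency limit
        }
        try {
            return work.get();
        } finally {
            permits.release();
        }
    }
}
```

The thread pool version can abandon a slow call because the caller is only waiting on a Future, whereas the semaphore version has no separate thread to abandon, which matches the timeout row in the comparison.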
Timeout implementation
HystrixCommand holds a TimedOutStatus field that tracks the timeout state.
Implementation process:
Two threads are involved: one executes the HystrixCommand task, and the other waits to decide whether the HystrixCommand has timed out. The two threads race to change the command's status first. As soon as either thread succeeds in setting the status, the timeout decision is final.
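The race between the two threads comes down to a single compare-and-set on a shared status. A minimal sketch, assuming a hypothetical TimeoutRace class with the same three status values: whichever thread moves the status away from NOT_EXECUTED first wins, and the other thread's compare-and-set fails.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of the race; not the Hystrix class itself.
class TimeoutRace {
    enum TimedOutStatus { NOT_EXECUTED, COMPLETED, TIMED_OUT }

    private final AtomicReference<TimedOutStatus> status =
            new AtomicReference<>(TimedOutStatus.NOT_EXECUTED);

    /** Called by the execution thread when the task finishes. */
    boolean tryComplete() {
        return status.compareAndSet(TimedOutStatus.NOT_EXECUTED, TimedOutStatus.COMPLETED);
    }

    /** Called by the timer thread when the timeout elapses. */
    boolean tryTimeout() {
        return status.compareAndSet(TimedOutStatus.NOT_EXECUTED, TimedOutStatus.TIMED_OUT);
    }

    TimedOutStatus current() {
        return status.get();
    }

    /** Starts both threads and counts how many of them won the race. */
    static int winners() {
        TimeoutRace race = new TimeoutRace();
        AtomicInteger wins = new AtomicInteger();
        Thread worker = new Thread(() -> { if (race.tryComplete()) wins.incrementAndGet(); });
        Thread timer = new Thread(() -> { if (race.tryTimeout()) wins.incrementAndGet(); });
        worker.start();
        timer.start();
        try {
            worker.join();
            timer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return wins.get();
    }
}
```

Because the status changes exactly once, the losing side knows it must not deliver its result.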
Timeout implementation class
HystrixObservableTimeoutOperator.call(), the TimerListener implementation:
TimerListener listener = new TimerListener() {
    @Override
    public void tick() {
        if (originalCommand.isCommandTimedOut.compareAndSet(TimedOutStatus.NOT_EXECUTED, TimedOutStatus.TIMED_OUT)) {
            // Report the timeout event; this is effectively a hook and can be ignored here
            originalCommand.eventNotifier.markEvent(HystrixEventType.TIMEOUT, originalCommand.commandKey);
            // Unsubscribe from the original Observable
            s.unsubscribe();
            final HystrixContextRunnable timeoutRunnable = new HystrixContextRunnable(originalCommand.concurrencyStrategy, hystrixRequestContext, new Runnable() {
                @Override
                public void run() {
                    child.onError(new HystrixTimeoutException());
                }
            });
            timeoutRunnable.run();
        }
    }
    // Return the configured timeout
    @Override
    public int getIntervalTimeInMilliseconds() {
        return originalCommand.properties.executionTimeoutInMilliseconds().get();
    }
};
Application monitoring
Monitoring metrics:
Solid circle: carries two pieces of information. Its color indicates the health of the instance, degrading from green through yellow and orange to red. Its size tracks request traffic: the more traffic, the larger the circle, and vice versa.
Curve: plots the change in request traffic over the last 2 minutes, so that upward and downward trends can be seen.
Implementation logic:
After subscribing to the completion event of an execution, the execution result is reported to HystrixThreadEventStream. As the name suggests, it is a stream of events.
The next step is easy to guess: a subscriber is needed to consume these events and aggregate them. The processed results are eventually written to two streams, HystrixCommandCompletionStream and HystrixThreadPoolCompletionStream.
Statistics implementation: HealthCountsStream (the subscriber)
The core statistics logic lives in HealthCountsStream.
Sliding window:
Class Diagram:
Core Code:
protected BucketedRollingCounterStream(HystrixEventStream<Event> stream, final int numBuckets, int bucketSizeInMs,
                                       final Func2<Bucket, Event, Bucket> appendRawEventToBucket,
                                       final Func2<Output, Bucket, Output> reduceBucket) {
    super(stream, numBuckets, bucketSizeInMs, appendRawEventToBucket);
    Func1<Observable<Bucket>, Observable<Output>> reduceWindowToSummary = new Func1<Observable<Bucket>, Observable<Output>>() {
        @Override
        public Observable<Output> call(Observable<Bucket> window) {
            return window.scan(getEmptyOutputValue(), reduceBucket).skip(numBuckets);
        }
    };
    this.sourceStream = bucketedStream      //stream broken up into buckets
            .window(numBuckets, 1)          //emit overlapping windows of buckets
            .flatMap(reduceWindowToSummary) //convert a window of bucket-summaries into a single summary
            .doOnSubscribe(new Action0() {
                @Override
                public void call() {
                    isSourceCurrentlySubscribed.set(true);
                }
            })
            .doOnUnsubscribe(new Action0() {
                @Override
                public void call() {
                    isSourceCurrentlySubscribed.set(false);
                }
            })
            .share()                        //multiple subscribers should get same data
            .onBackpressureDrop();          //if there are slow consumers, data should not buffer
}
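Stripped of the reactive operators, the pipeline above turns a sequence of buckets into one summary per full window of numBuckets consecutive buckets, with each window shifted by one bucket. The sketch below reproduces that arithmetic on a plain list; RollingSum and rollingSums are hypothetical names, and an integer per bucket stands in for the real bucket and summary types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, non-reactive illustration of overlapping bucket windows.
class RollingSum {
    /** For each full window of numBuckets consecutive buckets, emit the sum of that window. */
    static List<Integer> rollingSums(List<Integer> buckets, int numBuckets) {
        List<Integer> out = new ArrayList<>();
        int sum = 0;
        for (int i = 0; i < buckets.size(); i++) {
            sum += buckets.get(i);                  // newest bucket enters the window
            if (i >= numBuckets) {
                sum -= buckets.get(i - numBuckets); // oldest bucket leaves the window
            }
            if (i >= numBuckets - 1) {
                out.add(sum);                       // only full windows produce a summary
            }
        }
        return out;
    }
}
```

With buckets 1, 2, 3, 4, 5 and a window of 3 buckets, the summaries are 6, 9, and 12.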
Ring array data structure:
Data Structure Classes:
class ListState {
    /*
     * data is an AtomicReferenceArray rather than a plain array because the same
     * array is referenced by different ListState objects across threads, which
     * requires visibility and concurrency guarantees.
     */
    private final AtomicReferenceArray<Bucket> data;
    private final int size;
    private final int tail;
    private final int head;

    private ListState(AtomicReferenceArray<Bucket> data, int head, int tail) {
        this.head = head;
        this.tail = tail;
        if (head == 0 && tail == 0) {
            size = 0;
        } else {
            // dataLength is a field of the enclosing class (the length of the ring array)
            this.size = (tail + dataLength - head) % dataLength;
        }
        this.data = data;
    }
}
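The size formula (tail + dataLength - head) % dataLength is the usual way to measure a ring array whose head and tail indices wrap around. The sketch below shows it on a small single-threaded ring of ints; RingBuffer is a hypothetical class, and sizing the array as capacity + 1 is an assumption of this example that keeps a full ring distinguishable from an empty one.

```java
// Hypothetical single-threaded ring array; illustrative only.
class RingBuffer {
    private final int[] data;
    private final int dataLength; // capacity + 1, so head == tail always means "empty"
    private int head;             // index of the oldest element
    private int tail;             // index where the next element will be written

    RingBuffer(int capacity) {
        this.dataLength = capacity + 1;
        this.data = new int[dataLength];
    }

    int size() {
        return (tail + dataLength - head) % dataLength;
    }

    void addLast(int value) {
        data[tail] = value;
        tail = (tail + 1) % dataLength;
        if (tail == head) {
            head = (head + 1) % dataLength; // ring is full: drop the oldest element
        }
    }

    int[] toArray() {
        int[] out = new int[size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = data[(head + i) % dataLength];
        }
        return out;
    }
}
```

Adding 1 through 5 to a ring of capacity 3 leaves 3, 4, 5: the two oldest values were overwritten as the indices wrapped.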
Sentinel
Sentinel is an open-source, lightweight, highly available flow-control component for distributed service architectures. It takes traffic as its entry point and helps protect service stability along several dimensions, including flow control, circuit breaking and degradation, and system load protection.
Hystrix vs Sentinel
Hystrix focuses on isolation and circuit breaking: calls that time out or are short-circuited fail fast, and a fallback mechanism can be provided.
Sentinel focuses on:
• Diversified flow control
• Circuit breaking and degradation
• System load protection
• Real-time monitoring and a console
There is still a big difference between the problems solved by the two.
Comparison between the resource model and the execution model
Sentinel provides several ways to configure rules. Besides registering rules directly into in-memory state through the loadRules API, users can register various external data sources that supply rules dynamically. Users can change the rule configuration based on the system's current state, and the data source pushes the changes to Sentinel, where they take effect immediately.
Contrast in isolation design
Thread pool isolation can fragment machine resources.
The more thorough isolation of thread pool mode lets Hystrix handle queuing and timeouts for each resource's thread pool separately. That, however, is really a problem for timeout-based circuit breaking and flow control to solve; if a component already offers both, thread pool isolation is much less necessary.
Hystrix's semaphore isolation has little overhead and still works well. Its drawback is that it cannot automatically degrade slow calls; it can only wait for the client's own timeout, so cascading blocking can still occur.
Sentinel provides semaphore-style isolation through flow control on the number of concurrent threads. Combined with response-time-based circuit breaking and degradation, it can degrade automatically when the average response time of an unstable resource is high, which stops slow calls from using up the concurrency quota and affecting the whole system.
Comparison of circuit breaker degradation
Both Sentinel and Hystrix support circuit breaker degradation based on the failure rate (exception rate).
Sentinel also supports circuit breaking and degradation based on average response time: when a service's response time keeps rising, the circuit trips automatically and further requests are rejected until a set period has passed. This prevents cascading blocking caused by very slow calls.
• Degradation criteria: average response time, exception ratio, number of exceptions
• SystemRule (system load protection): Sentinel provides protection at the system level. Its load-protection algorithm borrows ideas from TCP BBR to balance the system's inbound traffic against its load, so that the system handles as many requests as it can within its capacity.
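Degradation based on average response time can be sketched without any library. The class below is a hypothetical simplification and not Sentinel's algorithm: it averages the response time over a fixed number of calls, opens for a fixed period when the average exceeds a limit, and takes the current time as a parameter so the behavior is deterministic.

```java
// Hypothetical slow-call breaker; sampling scheme and names are assumptions of this sketch.
class SlowCallBreaker {
    private final long maxAvgRtMs;  // average response time that triggers degradation
    private final int sampleSize;   // number of calls per sample
    private final long openMs;      // how long requests are rejected after tripping

    private long rtSum;
    private int calls;
    private long openUntil;         // requests are rejected while nowMs < openUntil

    SlowCallBreaker(long maxAvgRtMs, int sampleSize, long openMs) {
        this.maxAvgRtMs = maxAvgRtMs;
        this.sampleSize = sampleSize;
        this.openMs = openMs;
    }

    /** Returns true if a request may pass at time nowMs. */
    synchronized boolean allow(long nowMs) {
        return nowMs >= openUntil;
    }

    /** Records the response time of a finished call. */
    synchronized void record(long rtMs, long nowMs) {
        rtSum += rtMs;
        calls++;
        if (calls >= sampleSize) {
            if (rtSum / calls > maxAvgRtMs) {
                openUntil = nowMs + openMs; // too slow on average: reject for a while
            }
            rtSum = 0;                      // start a fresh sample
            calls = 0;
        }
    }
}
```

Unlike an exception-ratio breaker, this one trips even when every call succeeds, which is the case a plain semaphore limit cannot handle on its own.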
Sentinel console UI:
Sentinel flow control
Sentinel's design philosophy is to let developers choose the angle from which to control traffic and combine the angles flexibly to get the effect they want.
Flow control can be approached from the following angles:
• Resource invocation relationship:
  - throttling by caller
  - throttling by the entry point of the call chain (chain throttling)
  - throttling a resource based on a related resource (associated flow control)
• Runtime metrics: such as QPS, thread pool, and system load
• Control effects: such as direct rejection, cold start (warm-up), and queuing
Sentinel traffic shaping
Sentinel supports several traffic shaping strategies.
When QPS is too high, traffic can be automatically adjusted into a suitable shape. Commonly used modes are:
• Direct rejection mode: requests over the threshold are rejected immediately.
• Slow-start warm-up mode: when traffic surges, the pass rate is controlled so that passing traffic ramps up slowly and reaches the threshold within a set period, giving a cold system time to warm up instead of being overwhelmed.
• Constant-rate mode: implemented with the leaky bucket algorithm; it strictly controls the interval between passing requests, queues the requests that pile up, and rejects those whose wait would exceed the timeout.
Sentinel also supports rate limiting based on call relationships, including limiting by caller, limiting by call-chain entry point, and associated flow limiting.
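The constant-rate mode can be illustrated with a small pacing calculation. The sketch below is a hypothetical simplification rather than Sentinel's code: PacingLimiter spaces requests one fixed interval apart, tells the caller how long to wait instead of sleeping itself, and rejects a request whose wait would exceed the queueing timeout.

```java
// Hypothetical leaky-bucket style pacer; names and return convention are assumptions.
class PacingLimiter {
    private final long intervalMs;      // required gap between two passing requests
    private final long maxQueueMs;      // longest wait a request may be asked to accept
    private long latestPassedMs = Long.MIN_VALUE / 2; // "long ago", safe from overflow

    PacingLimiter(int qps, long maxQueueMs) {
        this.intervalMs = 1000L / qps;
        this.maxQueueMs = maxQueueMs;
    }

    /** Returns how many ms the caller must wait before proceeding, or -1 if rejected. */
    synchronized long tryPass(long nowMs) {
        long expected = latestPassedMs + intervalMs;
        if (expected <= nowMs) {
            latestPassedMs = nowMs;     // the slot is already free: pass immediately
            return 0;
        }
        long wait = expected - nowMs;
        if (wait > maxQueueMs) {
            return -1;                  // would queue too long: reject
        }
        latestPassedMs = expected;      // reserve the next slot for this request
        return wait;
    }
}
```

At 10 QPS with a 150 ms queueing limit, three requests arriving at the same instant get 0 ms, 100 ms, and a rejection respectively.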
Comparison of real-time metric statistics implementations
Prior to Hystrix 1.5, a sliding window was implemented through a ring array, with locks and CAS operations to update the statistics for each bucket.
Hystrix 1.5 refactored the implementation of real-time metric statistics, abstracting the metrics structure into reactive streams so that consumers can use the metric information more easily. At the same time, the underlying layer became an event-driven model based on RxJava: it publishes an event whenever a service call succeeds, fails, or times out, and after a series of transformations and aggregations produces a real-time stream of metric statistics that the circuit breaker or the dashboard can consume.
Sentinel currently abstracts metric statistics behind a Metric interface whose underlying implementation can vary. The default implementation is a sliding window based on LeapArray, and other implementations, such as reactive streams, may be introduced later as needed.
Comparison summary
Item | Sentinel | Hystrix | Notes |
Isolation strategy | Semaphore isolation (concurrent-thread limiting, which emulates a semaphore) | Thread pool isolation / semaphore isolation | Sentinel does not create thread pools; its threads come from the Tomcat or Jetty container, so the container's thread count caps whatever limit is set in Sentinel. For example, with a Tomcat thread pool of 10, setting 100 in Sentinel is meaningless, and the isolation is weaker |
Circuit breaker degradation strategy | Based on response time, exception ratio, or number of exceptions | Based on exception ratio | Failing fast is an essential feature |
Real-time statistics implementation | Sliding window (LeapArray) | Sliding window (based on RxJava) | |
Dynamic rule configuration | Supports multiple data sources | Supports multiple data sources | |
Extensibility | Multiple extension points | Plug-ins | |
Annotation support | Supported | Supported | |
Rate limiting | QPS-based, with support for limiting by call relationship | Limited support (number of concurrent threads or semaphore size) | Failing fast is an essential feature |
Traffic shaping | Supports warm-up mode, constant-rate mode, and warm-up plus queuing mode | Not supported (queuing) | |
System adaptive protection | Supported (Linux/UNIX only) | Not supported | Sets a threshold for the maximum load a server is allowed to handle |
Console | Provides an out-of-the-box console for configuring rules, viewing second-level monitoring, machine discovery, and more | Simple monitoring view of near real-time data | The console is a strong selling point because rate-limit settings are easier to configure centrally, but its data display and real-time feel are less intuitive than Hystrix's |
Configuration persistence | ZooKeeper, Apollo, Nacos, local files | Git/SVN/local files | The Sentinel client connects directly to the persistent store, so the application pulls in more dependencies, and the same store connection may be configured more than once |
Dynamic configuration | Supported | Supported | |
Black and white lists | Supported | Not supported | |
Spring Cloud integration | High | Very high | Spring Boot integrates more tightly with Hystrix |
Overall strengths | Centralized configuration and monitoring plus more fine-grained control rules | Attractive interface plus near real-time statistics | After Docker containerized deployment, Sentinel may be more useful |
Author: Li Caiyun
Source-WeChat public account: Daojia trading platform technology
Source: https://mp.weixin.qq.com/s/TiuplYZBjV5u7h17G7fqhw