
Exploration and application of the downlink CDN architecture of Bilibili


Background

As shown in the figure below, the edge CDN nodes work closely with the central scheduling service: the scheduling service first balances traffic across the gateway component nodes, the back-to-origin component then converges back-to-origin requests within the cluster, and finally the selected back-to-origin node fetches from the origin. As business volume has grown, the risks of this model have been steadily exposed.

[Figure: old architecture — central scheduling service coordinating edge CDN clusters]

Note: "same cluster" means the same data center; a "node" is a physical machine or a container

There are several drawbacks to this model:

  • It is difficult to coordinate the central scheduling service's load-balancing strategy with the edge and in-cluster back-to-origin convergence strategies (inconsistent selection of back-to-origin nodes, inconsistent judgments of hot and cold resources, etc.), which causes frequent incidents
  • After an edge node fails, the central scheduling service needs at least 20 minutes to go from detecting the fault to cutting traffic and waiting for long-tail traffic to drain, which is even worse for live streaming
  • Resource utilization (CPU and cache) falls short of expectations, with problems such as frequent glitches, load imbalance, and a high back-to-origin rate, so SLO alarms are triggered frequently
  • Governance work that improves the user playback experience, such as resource warming, is difficult and complex to build, because you need to know in advance which gateway nodes a live stream will be assigned to

Based on the points above and the characteristics of the VOD/live business, we put forward several basic requirements to cope with growing traffic while pursuing playback quality:

  • The gateway component must be able to distribute traffic, with both Layer 4 and Layer 7 load balancing
  • An in-cluster component health-check mechanism must detect overloaded components in time for capacity expansion and quickly find faulty components to kick out (a minimal sketch follows this list)
  • All policy functions converge to the central control component, so the same resource never sees two different traffic policies on different nodes, and changes to the other components stay minimal (preserving a monolithic development mindset and improving development efficiency)
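
To make the health-check requirement concrete, here is a minimal sketch of per-node state tracking; the thresholds, state names, and fields are illustrative assumptions, not the actual implementation:

/* Illustrative health-state tracking for one component node. Thresholds and
 * field names are assumptions for the sketch, not Bilibili's real values. */
#include <stdbool.h>

#define FAIL_THRESHOLD   3    /* consecutive probe failures before kick-out */
#define LOAD_THRESHOLD   0.8  /* utilization above this flags "overloaded"  */

typedef enum { NODE_HEALTHY, NODE_OVERLOADED, NODE_KICKED } node_state_t;

typedef struct {
    node_state_t state;
    int          consecutive_fails;
    double       utilization;      /* 0.0 .. 1.0, reported by the component */
} node_health_t;

/* Fold one probe result into the node's state. */
static void node_health_update(node_health_t *n, bool probe_ok, double util)
{
    n->utilization = util;

    if (!probe_ok) {
        if (++n->consecutive_fails >= FAIL_THRESHOLD) {
            n->state = NODE_KICKED;           /* stop routing traffic here */
        }
        return;
    }

    n->consecutive_fails = 0;
    n->state = (util > LOAD_THRESHOLD) ? NODE_OVERLOADED  /* expand capacity */
                                       : NODE_HEALTHY;
}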

New architecture design

[Figure: new architecture — gateway, cache, back-to-origin, and central control components]

Gateway component: terminates the user access protocols (H1, H2, and H3) and handles functions such as authentication and convergence of access traffic

Cache component: focuses on the disk I/O stack, the file system, and the cache eviction policy

Back-to-origin component: focuses on the back-to-origin protocols (private/generic) and on how to make back-to-origin more efficient

Central control component: focuses on in-cluster traffic routing policies (such as hot-stream scattering and kicking out abnormal components) and service-specific optimization policies

Traffic distribution in the gateway component

This capability splits into Layer 4 load balancing and Layer 7 load balancing; the Layer 4 part was built by colleagues on another team

Layer 4 load balancing

It mainly solves node-load imbalance caused by uneven scheduling, and it can also cut off faulty Layer 7 nodes in time, improving the overall fault tolerance of the system

[Figure: Layer 4 load balancing]

Layer 7 load balancing

It mainly handles kicking out a faulty cache or back-to-origin component and retrying on another node when a problem occurs, to protect the user's playback experience (requests are still routed preferentially to the back-to-origin component node chosen previously: route 5-1 in the figure)

[Figure: Layer 7 load balancing]
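
The affinity rule behind route 5-1 can be sketched as follows; the node structure, hash, and probing order are a hypothetical illustration of "stick to the previous back-to-origin node while it is healthy, otherwise retry elsewhere", not the production code:

/* Hypothetical node-selection sketch: reuse the back-to-origin node that
 * already holds the resource (route 5-1), and only move away from it when
 * it is marked unhealthy. */
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    const char *addr;
    bool        healthy;
} origin_node_t;

static unsigned long djb2_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char) *s++;
    return h;
}

/* prev is the node used for this resource last time, or NULL. */
static const origin_node_t *
pick_origin(const char *resource, const origin_node_t *prev,
            const origin_node_t *nodes, size_t n)
{
    if (prev && prev->healthy) {
        return prev;                      /* affinity: reuse the warm node */
    }

    size_t start = djb2_hash(resource) % n;
    for (size_t i = 0; i < n; i++) {      /* linear probe to a healthy node */
        const origin_node_t *cand = &nodes[(start + i) % n];
        if (cand->healthy && cand != prev) {
            return cand;
        }
    }
    return NULL;                          /* no healthy node: fail the retry */
}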

Central control components

Now that traffic ingress is solved and individual node failures no longer affect service quality, let's look at how the central control component strings the other component nodes together (the central control component itself runs on multiple nodes whose query interfaces return consistent data)

Load balancing policy

Live and VOD business scenarios still differ in some ways, compounded by historical issues, so this redesign splits the VOD/live load-balancing policies into two separate modules; the plan is still to merge them later

Live load balancing policy

When the traffic a single resource (stream) occupies reaches N or more, the traffic-scattering policy is triggered to split the overflow across other components; in the figure, Component-1's load is cut into three medium rectangular bars

[Figure: live traffic scattering across components]
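
A rough sketch of that scattering decision, assuming a per-component cap N; the function, cap value, and units are hypothetical:

/* Hypothetical traffic-scattering sketch: cap how much of one stream a single
 * component may serve; the overflow spills onto the next components. */
#include <stdio.h>

#define COMPONENT_CAP_MBPS 1000.0   /* illustrative per-component cap (N) */

/* Split stream_bw across components, filling each up to the cap.
 * Returns the number of components used. */
static int scatter_stream(double stream_bw, double *alloc, int n_components)
{
    int used = 0;
    for (int i = 0; i < n_components && stream_bw > 0.0; i++) {
        double share = stream_bw < COMPONENT_CAP_MBPS
                     ? stream_bw : COMPONENT_CAP_MBPS;
        alloc[i] = share;
        stream_bw -= share;
        used++;
    }
    return used;
}

int main(void)
{
    double alloc[4] = {0};
    int used = scatter_stream(2500.0, alloc, 4);   /* hot stream: 2.5 Gbps */
    for (int i = 0; i < used; i++)
        printf("component-%d serves %.0f Mbps\n", i + 1, alloc[i]);
    return 0;
}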

Next, the gateway component synchronizes the real-time bandwidth of each request to the central control component; this is the data source for all policies. The central control component cleans, aggregates, and observes the data, and adjusts the gateway component's traffic routing in real time, as shown by the solid green links in the figure

[Figure: real-time bandwidth reporting to the central control component]
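
The article does not specify the report format; purely as an illustration, a per-request bandwidth report might carry fields like these (the layout and all names are assumptions):

/* Hypothetical per-request bandwidth report, pushed periodically from the
 * gateway component to the central control component. */
#include <stdint.h>

typedef struct {
    char     resource_id[64];   /* stream or file identifier          */
    char     component_id[32];  /* serving cache/back-to-origin node  */
    uint64_t bytes_sent;        /* since the previous report          */
    uint32_t interval_ms;       /* reporting interval                 */
    uint32_t bandwidth_kbps;    /* derived: bits sent per millisecond */
} bw_report_t;

/* Derive bandwidth from the counters before sending the report. */
static void bw_report_finalize(bw_report_t *r)
{
    if (r->interval_ms > 0) {
        /* bytes * 8 / ms == bits per ms == kbit/s */
        r->bandwidth_kbps = (uint32_t) (r->bytes_sent * 8ULL / r->interval_ms);
    }
}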

On-demand load balancing policy

Since a VOD file is generally larger than the cache component's shard size, the gateway component slices requests by shard size and load-balances the slices with consistent hashing, which is rough but sufficient

[Figure: consistent-hash slice routing for VOD]
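
A minimal illustration of slice-level placement: each slice index is hashed to a cache node. For brevity this sketch reduces the ring to hash-mod-N; a production consistent-hash ring with virtual nodes would limit reshuffling when nodes join or leave:

/* Simplified placement sketch: map each slice of a file to a cache node.
 * Slice size, hash choice, and the mod-N "ring" are illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define SLICE_SIZE (4u * 1024 * 1024)   /* illustrative 4 MB slice */

static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char) *s++; h *= 16777619u; }
    return h;
}

/* Pick a cache node for one slice of a file. */
static int pick_cache_node(const char *file, uint64_t offset, int n_nodes)
{
    char key[128];
    snprintf(key, sizeof(key), "%s/%llu",
             file, (unsigned long long) (offset / SLICE_SIZE));
    return (int) (fnv1a(key) % (uint32_t) n_nodes);
}

int main(void)
{
    /* consecutive slices of one file spread across the cache nodes */
    for (uint64_t off = 0; off < 5ULL * SLICE_SIZE; off += SLICE_SIZE)
        printf("offset %llu -> node %d\n", (unsigned long long) off,
               pick_cache_node("vod/abc.flv", off, 8));
    return 0;
}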

To support this, we modified nginx's http slice module; the stock http_slice_module has the following problems:

1) Aligning the offset causes read amplification (redundant data is fetched). That design suits caching, but our architecture has a dedicated cache component, so we do not need nginx to cache

2) The next subrequest can only be initiated after the previous one finishes, which causes unexpected "waiting" (if a user request is cut into N subrequests, you wait on N-1 extra HTTP header round trips)

Our approach adds a prefetch window: when subrequest-2 receives its HTTP header, subrequest-3 is triggered, and at most two subrequests are kept alive at the same time, so nginx memory does not balloon and the backend nodes are not put under too much pressure

[Figure: prefetch window for slice subrequests]

Here's the code:

typedef struct {
    ...
    ngx_http_request_t  *prefetch[2];  /* prefetch window: at most two in-flight subrequests */
} ngx_http_slice_ctx_t;

static ngx_int_t
ngx_http_slice_body_filter(ngx_http_request_t *r, ngx_chain_t *in)
{
...
    if (ctx == NULL || r != r->main) {
        /* once the previous prefetch subrequest has finished processing its
         * header, the next slice can be prefetched */
        if (ctx && ctx->active) {
            rc = ngx_http_slice_prefetch(r->main, ctx);
            if (rc != NGX_OK) {
                return rc;
            }
        }
        return ngx_http_next_body_filter(r, in);
    }

...
    rc = ngx_http_next_body_filter(r, in);

    if (rc == NGX_ERROR || !ctx->last) {
        return rc;
    }

    if (ctx->start >= ctx->end) {
        ngx_http_set_ctx(r, NULL, ngx_http_slice_filter_module);
        ngx_http_send_special(r, NGX_HTTP_LAST);
        return rc;
    }

    if (r->buffered) {
        return rc;
    }

    if (ctx->active) {
        /* prefetch the next slice */
        rc = ngx_http_slice_prefetch(r->main, ctx);
    }

    return rc;
}

static ngx_int_t
ngx_http_slice_prefetch(ngx_http_request_t *r, ngx_http_slice_ctx_t *ctx)
{
    /* control the prefetch window: wait until the older subrequest is done */
    if (ctx->prefetch[1]) {
        if (!ctx->prefetch[0]->done) {
            return NGX_OK;
        }
        ctx->prefetch[0] = ctx->prefetch[1];
        ctx->prefetch[1] = NULL;
    }

    if (ctx->start >= ctx->end) {
        return NGX_OK;
    }

...
    if (ngx_http_subrequest(r, &r->uri, &r->args, &ctx->prefetch[1], ps,
                            NGX_HTTP_SUBREQUEST_CLONE)
        != NGX_OK)
    {
        return NGX_ERROR;
    }

    ngx_http_set_ctx(ctx->prefetch[1], ctx, ngx_http_slice_filter_module);
    ctx->active = 0;

...
    /* initialize once */
    if (!ctx->prefetch[0]) {
        ctx->prefetch[0] = ctx->prefetch[1];
        ctx->prefetch[1] = NULL;
    }

    ngx_http_slice_loc_conf_t  *slcf =
        ngx_http_get_module_loc_conf(r, ngx_http_slice_filter_module);

    off_t cur_end = ctx->start + get_slice_size(ctx->start, slcf);

    /* if the aligned slice end exceeds the end the user requested, clamp it
     * to the user's end to avoid read amplification */
    if (slcf->shrink_to_fit) {
        cur_end = ngx_min(ctx->end, cur_end);
    }

    gen_range(ctx->start, cur_end - 1, &ctx->range);
...
}

We ran a simple scenario simulation: request processing time after the change is about 40% lower than before, so we can respond to user requests faster

[Figures: request processing time before and after the change]

The end result

With the VOD/live load policies in effect, stability is greatly improved compared with the old architecture

[Figure: stability comparison between old and new architectures]

Looking at the SLO monitoring data: before the new architecture, some clusters would trigger SLO alarms every weekend once they filled up; as the new architecture has gradually expanded, SLO jitter has decreased as well

[Figure: SLO monitoring data]

Let's talk about how the new architecture empowers the business

Mixed business deployment

VOD and live streaming have different machine-resource requirements. VOD leans toward more disks, for I/O throughput and horizontal storage expansion, while live streaming leans toward more CPU, for protocol remuxing and data migration and copying. In the VOD scenario, DMA can relieve the CPU of disk-data copies ("I have spare CPU"), while the live scenario barely touches the mechanical disks ("I have spare disks"), so the two are a good fit for each other
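
The article does not name the exact zero-copy mechanism; on Linux, sendfile(2) is one common way to let the kernel and DMA move disk data to a socket without user-space copies. A minimal sketch, with error handling trimmed:

/* Minimal zero-copy send sketch using Linux sendfile(2): the kernel moves
 * file pages to the socket without copying through user space, so the CPU
 * mostly just programs the DMA transfers. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Send the whole file at path to the connected socket sock_fd. */
static ssize_t send_file_zero_copy(int sock_fd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(sock_fd, fd, &offset, st.st_size - offset);
        if (n <= 0) { close(fd); return -1; }   /* real code: handle EINTR */
    }

    close(fd);
    return offset;
}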

With VOD and live running mixed, the VOD back-to-origin rate has dropped, live streaming has more CPU and can carry more streams, both can serve users better and faster, and our operations colleagues no longer need to worry about converting node roles (VOD or live?)

[Figures: results of mixed VOD/live deployment]

Live/VOD warm-up

In the old architecture, warm-up was very hard to enable: since we did not know which node a user would hit, the only option was to cast a wide net, which was inefficient and added load to the components, and warm-up tasks could only be delivered manually (we did not know what was hot)

[Figure: warm-up under the old architecture]

In the new architecture, each cluster elects a central control component that observes the cluster's hot streams in real time and triggers the warm-up policy, delivering the warm-up task to one of the gateway components

[Figure: warm-up under the new architecture]
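
A hypothetical sketch of that trigger: once a resource's observed heat crosses a threshold, one warm-up task goes to a single chosen gateway instead of the old broadcast approach. Names and the threshold are illustrative:

/* Hypothetical warm-up trigger in the central control component. */
#include <stdbool.h>
#include <stdio.h>

#define HOT_THRESHOLD_QPS 500.0   /* illustrative heat threshold */

typedef struct {
    const char *resource_id;
    double      qps;       /* cleaned/aggregated by the central component */
    bool        warmed;
} resource_heat_t;

/* Placeholder for the RPC that hands the task to the chosen gateway. */
static void deliver_warmup_task(const char *gateway, const char *resource)
{
    printf("warm-up %s on %s\n", resource, gateway);
}

static void maybe_warm_up(resource_heat_t *r, const char *chosen_gateway)
{
    if (!r->warmed && r->qps >= HOT_THRESHOLD_QPS) {
        deliver_warmup_task(chosen_gateway, r->resource_id);
        r->warmed = true;   /* warm once; the cluster cache then serves peers */
    }
}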

Maximize the use and orchestration of storage resources

Obviously, in the old architecture the gateway component could only access its own node's cache component, so cached content could only be kept mutually exclusive within a single node, not across the whole cluster. Under the new architecture, we re-plan which resources each node is assigned according to each cache node's storage capacity and disk I/O performance, so that the content cached on different cache nodes overlaps as little as possible. One more detail: since a mechanical hard disk's outer tracks deliver higher throughput than its inner tracks, during cluster idle time can we re-orchestrate storage resources, for example moving hot resources to the outer tracks, to obtain more stable service quality?

[Figure: storage resource re-orchestration]
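
The capacity-aware assignment can be sketched as weighted placement; the weights, structure, and function below are assumptions for illustration, not the actual planner:

/* Hypothetical weighted placement sketch: each cache node receives a share of
 * the resource key space proportional to its capacity/I/O weight, so cached
 * content across nodes overlaps as little as possible. */
#include <stdint.h>

typedef struct {
    const char *node_id;
    uint32_t    weight;   /* derived from storage capacity and disk I/O perf */
} cache_node_t;

/* Map a resource hash onto the weighted node list. */
static const cache_node_t *
assign_node(uint32_t resource_hash, const cache_node_t *nodes, int n)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++) total += nodes[i].weight;

    uint64_t point = resource_hash % total;
    for (int i = 0; i < n; i++) {
        if (point < nodes[i].weight) return &nodes[i];
        point -= nodes[i].weight;
    }
    return &nodes[n - 1];   /* unreachable if the weights are consistent */
}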

Author: Cai Shangzhi/Liu Yongjiang/Yang Chengjin/Zhang Jianfeng

Source: WeChat public account "Bilibili Technology"

Source: https://mp.weixin.qq.com/s/ccUR8mEBFltwlDvnUof9yA