聊一聊拥塞控制算法 BBR

本文共6607字，预计阅读时间25分钟。

TCP BBR（Bottleneck Bandwidth and Round-trip propagation time）是由Google 设计，于2016年发布的拥塞算法。

以往的大部分拥塞控制算法都是基于丢包来作为降低传输速率的信号，而BBR 则基于模型主动探测。

目前已经在Google 内部大范围使用并随着linux 4.9 版本正式发布。

Bufferbloat

在讨论BBR 之前不得不先说下Bufferbloat，在BBR 出来之前的拥塞控制算法无一例外都和网络缓存耦合在了一起，所以都会存在发生Bufferbloat 的情况。

关于Bufferbloat 的定义[1]

Bufferbloat is a cause of high latency and jitter in packet-switched networks caused by excess buffering of packets. Bufferbloat can also cause packet delay variation (also known as jitter), as well as reduce the overall network throughput. When a router or switch is configured to use excessively large buffers, even very high-speed networks can become practically unusable for many interactive applications like voice over IP (VoIP), audio streaming, online gaming, and even ordinary web browsing.

Some communications equipment manufacturers designed unnecessarily large buffers into some of their network products. In such equipment, bufferbloat occurs when a network link becomes congested, causing packets to become queued for long periods in these oversized buffers. In a first-in first-out queuing system, overly large buffers result in longer queues and higher latency, and do not improve network throughput. It can also be induced by specific slow-speed connections hindering the ontime delivery of other packets.

大多数的拥塞控制算法都是基于丢包来测算带宽，有丢包发生便会降低传输速率。

中间网络设备引入大的缓冲区虽然可以降低丢包的发生，但是会导致TCP 对网络阻塞敏感度的降低，直到缓冲区满了并开始出现丢包，TCP 才开始降窗。

另外，由于缓冲区的增大，会导致延迟增加。试着想象一下如果这个缓冲区是一个非常大的的数据池，那么延迟将会上升到一个非常糟糕的地步。

你可能会觉得Bufferbloat 的产生似乎是一个错误。

那么它的产生历史缘由是什么呢？[2]

在以前的互联网上，交换机可能总是拥有比服务器更快的网卡，所以这些位于互联网中间层的服务器也可能比客户端更快，并且并不能对客户端发送信息包的速率有多大影响。

很显然今天已经不是这样的了！众所周知，今天的计算机相比于 5 年前的计算机在速度上并没有多大的提升（我们遇到了某些有关光速的问题）。所以我想路由器上的大型交换机并不会在速度上大幅领先于数据中心里服务器上的网卡。

这听起来有些糟糕，因为这意味着客户端更容易在中间层的连接中达到饱和，而这将导致互联网变慢（而且缓冲膨胀将带来更高的延迟）。

所以为了提高互联网的性能且不让每个路由上的任务队列都达到饱和，客户端需要表现得更好并且在发送信息包的时候慢一点。

以更慢的速率发送更多的信息包以达到更好的性能

(摘自Van Jacobson 在netdev 会议上的演讲)

如何理解这段话呢？

早期终端设备和网络核心交换设备性能上存在明显的悬殊，像思科这类厂商，他们的中间路由设备的处理能力是能够甩开一众终端设备的。

在这个时期，中间设备的缓存就显得不那么重要了，可能只需很少的缓存就够用了，对于终端设备而言，只需要考虑发送的尽可能的快就行！

随着时代的发展，终端处理器以及网卡性能的不断提升，不断缩小了和中间转发设备性能上的差距。

这个时候问题就出现了:

中间转发设备似乎不太来得及处理数据包了!

这就需要想办法尽快解决没有来得及处理的数据包，提升处理器性能或增加处理器都是办法，但这些方法都没有直接搞一个大缓存来的简单又有性价比。

这样一来，引入Buffer 其实违背了中间转发设备本身的角色定位，它的角色应该只是转发，而不应该成为一个缓存设备。

对于Bufferbloat，你可以鄙视它，甚至厌恶它，但它却是时代变迁下的产物。

BBR 原理介绍

BBR 的诞生解决了Bufferbloat 问题，关于它的诞生缘由，BBR 论文[3]是这样描述的：

These problems result from a design choice made when TCP congestion control was created in the 1980s—interpreting packet loss as "congestion."13 This equivalence was true at the time but was because of technology limitations, not first principles. As NICs (network interface controllers) evolved from Mbps to Gbps and memory chips from KB to GB, the relationship between packet loss and congestion became more tenuous.

Today TCP's loss-based congestion control—even with the current best of breed, CUBIC11—is the primary cause of these problems. When bottleneck buffers are large, loss-based congestion control keeps them full, causing bufferbloat. When bottleneck buffers are small, loss-based congestion control misinterprets loss as a signal of congestion, leading to low throughput. Fixing these problems requires an alternative to loss-based congestion control. Finding this alternative requires an understanding of where and how network congestion originates.

简单的理解上面这段话就是：

基于丢包的拥塞控制算法，在缓冲区缓存比较大的时候，会使得缓冲区被填满最终造成Bufferbloat。
我们知道并非所有的丢包都是由于网络阻塞，当缓冲区缓存较小时，基于丢包的拥塞算法会将丢包判定为网络拥塞，并降低吞吐量，不能准确的判断是否真正的阻塞。

为了解决上面两个问题，于是，BBR 诞生了。

BBR 的诞生，可以说是拥塞控制算法的分界点。

为何这样说呢？

在BBR 诞生前的如Reno/Cubic 等拥塞算法都是事件驱动，是一种被动的模型。

一旦丢包或RTT增加，即使不是真的因为拥塞导致，也会导致发送速率的下降。

而BBR的窗口调整是主动行为，有点像轮询，这种"轮询"是基于事实反馈的。

还有一个最大的不同之处就是：

BBR 会尽可能少的占用或不用缓存，这样的话就不会出现Bufferbloat 现象。

在介绍BBR 原理细节之前，我们先了解几个名词，这几个词将贯穿本文。

RTprop（Round-trippropagation time）：
最小时延，一个来回所以其实是两倍时延。
BtlBw（bottleneck bandwidth）：
瓶颈带宽
BDP = BtlBw * RTprop：
整条链路所能存储的数据值(不含路由缓存)。
BtlBufSize ：
链路上所有的中间路由缓存之和

BBR 论文中对RTprop 及BtlBw 之间关系[4]，用了一个比喻来说明：

If the network path were a physical pipe, RTprop would be its length and BtlBw its minimum diameter.

如果将网络通路比作物理管道，RTprop 就是它的长度，而BtlBw 则是它的最小直径。

所以我们自然而然能够想到：

如果当前实际发送速率乘以延迟得到的值越接近BDP 即链路所能存储的数据值，则说明算法的效率越高。

下图是这几个参数之间的关系

(摘自BBR 论文，笔者加了一些自己理解在上面)

关于这张图的理解如下：

数据刚开始传输时，这个时候显然数据量未填充满，也就是未达到BDP。
此时传输速率持续上升，RTT 处于最优状态时延最低，这是一个传输通畅的阶段。
当数据填充超过BDP 时，数据会使用到路由的缓存区，此时RTT 开始上升，但是发送速率还维持不变，这个阶段处于一个缓冲阶段，一旦越过BDP 分界线就说明拥塞开始出现了。
当数据填充超过（BDP+BtlBufSize）大小后，此时没有地方能够存储下没有来得及发送的数据，丢包开始产生！

注意一点的是图中笔者标记丢包临界点的位置，基于丢包的拥塞算法如cubic/Reno 等算法几乎都是以此为收敛点，会尽可能的填充满网络和缓存，认为此举是可以提高网络利用率和减少丢包。

一旦出现丢包，这类传统算法的做法是开始降低数据的发送量并再次向这个错误的收敛靠近，而BBR的目标收敛点是围绕图中笔者标记理想收敛点的位置。

关于图中的斜率，理解起来其实也比较简单：

假设某一时刻实际发送数据大小为S，实际带宽为R，那么R必定是 <= BtlBw，实际时延设为 T>=RTprop。
此时数据量S=T*R，所以这里T/S = 1/R >= 1/BtlBw，所以实际情况下斜率往往是比上半图中的更陡峭。
同理下半图的斜率 R/S = 1/T <= 1/RTprop，所以实际情况下下半图的斜率往往会更平缓些，图中的阴影部分是不可达的。

关于发送速率的计算方式[5]，BBR 论文中给出的测算方式如下：

Unlike RTT, nothing in the TCP spec requires implementations to track bottleneck bandwidth, but a good estimate results from tracking delivery rate.

When the ack for some packet arrives back at the sender, it conveys that packet’s RTT and announces the delivery of data inflight when that packet departed. Average delivery rate between send and ack is the ratio of data delivered to time elapsed: deliveryRate = Δdelivered/Δt.

This rate must be ≤ the bottleneck rate (the arrival amount is known exactly so all the uncertainty is in the Δt, which must be ≥ the true arrival interval; thus, the ratio must be ≤ the true delivery rate, which is, in turn, upper-bounded by the bottleneck capacity).

上面这段话简单理解为如下：

应答了多少数据，记为delivered；
应答delivered 这么多数据所用时间，记为interval_us。

二者相除得到带宽：bw = delivered/interval_us

从前面的图中我们可以发现，如果测量BtlBw 需要填满buffer，可测量RTprop 又需要排空buffer，也就是不能使用buffer，所以BtlBw 和RTprop 是无法同时测量的。

那么BBR是怎么做的呢？下面结合源码进行分析

Startup

/* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain

* that will allow a smoothly increasing pacing rate that will double each RTT

* and send the same number of packets per RTT that an un-paced, slow-starting

* Reno or CUBIC flow would:

static const int bbr_high_gain = BBR_UNIT * 2885 / 1000 + 1;

这个值主要用于计算pacing_rate

/* Initialize pacing rate to: high_gain * init_cwnd / RTT. */

static void bbr_init_pacing_rate_from_rtt(struct sock *sk)

{

struct tcp_sock *tp = tcp_sk(sk);

struct bbr *bbr = inet_csk_ca(sk);

u64 bw;

u32 rtt_us;

if (tp->srtt_us) {/* any RTT sample yet? */

rtt_us = max(tp->srtt_us >> 3, 1U);

bbr->has_seen_rtt = 1;

} else {/* no RTT sample yet */

rtt_us = USEC_PER_MSEC;/* use nominal default RTT*/

}

bw = (u64)tp->snd_cwnd * BW_UNIT;

do_div(bw, rtt_us);

sk->sk_pacing_rate = bbr_bw_to_pacing_rate(sk, bw, bbr_high_gain);

}

在Startup 阶段，pacing_rate 在每次收到一个ACK 后，都会提高bbr_high_gain 值。

BBR 计算拥塞窗口是用“当前采集到的速率”乘以“当前采集到的最小RTT” 来计算的，这就造成了“当前发送窗口”和“当前已经提高的速率”之间的不匹配。

所以，计算拥塞窗口的时候，gain 因子也必须是bbr_high_gain，从而可以吸收掉速率的实际提升。

ProbeRTT

我们知道要想达到理想收敛点，就需要找到最小时延和瓶颈带宽。

BBR是基于时延找最小RTT的吗？并不是，下面我们会讲述

最小RTT 有效时间是10s

/* Window length of min_rtt filter (in sec): */

static const u32 bbr_min_rtt_win_sec = 10;

/* Minimum time (in ms) spent at bbr_cwnd_min_target in BBR_PROBE_RTT mode: */

static const u32 bbr_probe_rtt_mode_ms = 200;

探测最小RTT需要最少4个数据包，这是为了兼顾延迟ACK。

/* Try to keep at least this many packets in flight,

* if things go smoothly. For smooth functioning, a sliding

* window protocol ACKing every other packet needs at least

* 4 packets in flight:

static const u32 bbr_cwnd_min_target = 4;

更新最小RTT 核心函数：bbr_update_min_rtt

该函数内容较长，下面我们将其拆分开来进行讨论

如本次RTT 小于最小RTT 或最小RTT 有效时间到，则更新最小RTT值

/* Track min RTT seen in the min_rtt_win_sec filter window: */

filter_expired = after(tcp_jiffies32,

bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);

if (rs->rtt_us >= 0 &&

(rs->rtt_us < bbr->min_rtt_us ||

(filter_expired && !rs->is_ack_delayed))) {

bbr->min_rtt_us = rs->rtt_us;

bbr->min_rtt_stamp = tcp_jiffies32;

}

最小RTT 时间过期且此时未处于PROBE_RTT 模式，则切到PROBE RTT模式，降低发送速率

if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&

!bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {

bbr->mode = BBR_PROBE_RTT; /* dip, drain queue */

bbr_save_cwnd(sk); /* note cwnd so we can restore it */

bbr->probe_rtt_done_stamp = 0;

}

如果此时处于PROBE_RTT 模式下，inflight 数据包小于等于bbr_cwnd_min_target即4个包。
同时满足一些其它条件后，开始更新本次最小RTT 侦测结束时间戳，也就是当前时间加上200ms，并将本次已delivered 数据赋值给下个RTT。
if (bbr->mode == BBR_PROBE_RTT) {

/* Ignore low rate samples during this mode. */

tp->app_limited =

(tp->delivered + tcp_packets_in_flight(tp)) ? : 1;

/* Maintain min packets in flight for max(200 ms, 1 round). */

if (!bbr->probe_rtt_done_stamp &&

tcp_packets_in_flight(tp) <= bbr_cwnd_min_target) {

bbr->probe_rtt_done_stamp = tcp_jiffies32 +

msecs_to_jiffies(bbr_probe_rtt_mode_ms);

bbr->probe_rtt_round_done = 0;

bbr->next_rtt_delivered = tp->delivered;

} else if (bbr->probe_rtt_done_stamp) {

if (bbr->round_start)

bbr->probe_rtt_round_done = 1;

if (bbr->probe_rtt_round_done)

bbr_check_probe_rtt_done(sk);

}

}

ProbeBW

增益系数数组

/* The pacing_gain values for the PROBE_BW gain cycle, to discover/share bw: */

static const int bbr_pacing_gain[] = {

BBR_UNIT * 5 / 4, /* probe for more available bw */

BBR_UNIT * 3 / 4, /* drain queue and/or yield bw to other flows */

BBR_UNIT, BBR_UNIT, BBR_UNIT, /* cruise at 1.0*bw to utilize pipe, */

BBR_UNIT, BBR_UNIT, BBR_UNIT /* without creating excess queue... */

};

bbr_pacing_gain[0]
这个阶段gain 值是1.25，该阶段意味着BBR 会试图增加发送量，尽可能多的占用带宽，占满新进的多余带宽，目的是提升资源利用率，不过这个阶段可能会出现队列的产生以及RTT变长的情况。
bbr_pacing_gain[0] 退出条件是：
已经运行超过了一个最小RTT 时间并且要么发生了丢包，要么本次ACK到来前的inflight的值已经等于窗口值了。
bbr_pacing_gain[1]
bbr_pacing_gain[1] 这个阶段gain的值为0.75。
假设此时正处于增益阶段，此时一个新的连接发起，可能会导致正处于增益阶段的连接的inflight满载。
这个时候就需要切换到bbr_pacing_gain[1] 状态，在出让部分带宽后尽快进入平稳期，这个阶段体现了BBR的公平性。
这个阶段时间较短，一旦完成出让带宽就退出，最多停留不超过一个最短RTT时间。
bbr_pacing_gain[2].... bbr_pacing_gain[7]
这个阶段代表的是平稳期，至少停留6个RTT
进入PROBE_BW 状态后会反复的在上面三个步骤之间循环：
加速、减速、匀速。。。。

上述的过程，BBR 论文中通过如下图形来解释：

(摘自BBR论文)

图中标记(1) 指的是增益期，这个阶段增加发送量

图中标记(2) 指的是减损期，这个阶段减少排队降低inflight

图中标记(3) 指的是平稳期

Q：如果移除数组后面的六个元素，也就是移除平稳期，会出现什么样的情况呢？

A：一旦进入平稳期由于至少停留6个RTT 的限制，导致在这期间如果有富余带宽，BBR是无法抢占的。

所以如果移除到平稳期，带来的结果是能够更快的发现富余带宽，代价就是网络比较颠簸，对于一些如下载类的业务有益，但是不适用于如直播类业务。

bbr_pacing_gain 增益数组的作用，归纳下来就是：

BBR 会尽可能的抢占更多带宽，一旦有了新的连接之后，如果出现本次窗口估值等于上次的inflight 值，说明这个时候带宽被填满了。
此时旧连接的gain 值下降进而让出部分带宽给新连接，在至少6个RTT 周期内也就是平稳期，这个阶段旧连接不会抢占更多带宽，直到6个RTT 后，旧连接会试图抢占更多带宽，周而复始。

bbr_update_bw 函数

/* Estimate the bandwidth based on how fast packets are delivered */

static void bbr_update_bw(struct sock *sk, const struct rate_sample *rs)

{

struct tcp_sock *tp = tcp_sk(sk);

struct bbr *bbr = inet_csk_ca(sk);

u64 bw;

bbr->round_start = 0;

if (rs->delivered < 0 || rs->interval_us <= 0)

return; /* Not a valid observation */

/* See if we've reached the next RTT */

if (!before(rs->prior_delivered, bbr->next_rtt_delivered)) {

bbr->next_rtt_delivered = tp->delivered;

bbr->rtt_cnt++;

bbr->round_start = 1;

bbr->packet_conservation = 0;

}

bbr_lt_bw_sampling(sk, rs);

/* Divide delivered by the interval to find a (lower bound) bottleneck

* bandwidth sample. Delivered is in packets and interval_us in uS and

* ratio will be <<1 for most connections. So delivered is first scaled.

//计算当前实际带宽

bw = div64_long((u64)rs->delivered * BW_UNIT, rs->interval_us);

/* If this sample is application-limited, it is likely to have a very

* low delivered count that represents application behavior rather than

* the available network rate. Such a sample could drag down estimated

* bw, causing needless slow-down. Thus, to continue to send at the

* last measured network rate, we filter out app-limited samples unless

* they describe the path bw at least as well as our bw model.

* So the goal during app-limited phase is to proceed with the best

* network rate no matter how long. We automatically leave this

* phase when app writes faster than the network can deliver :)

//加入新的样本

if (!rs->is_app_limited || bw >= bbr_max_bw(sk)) {

/* Incorporate new sample into our max bw filter. */

minmax_running_max(&bbr->bw, bbr_bw_rtts, bbr->rtt_cnt, bw);

}

bbr_update_bw 函数做的事比较清晰：

更新RTT 周期、通过已delivered 数据 * BW_UNIT/采样时间得出带宽值、将带宽和min rtt 加入到新的RTT 和BW 样本。

通过对ProbeRTT 和ProbeBW 的分析，可以知道BBR 其实大部分时候都是在ProbeBW和ProbeRTT 状态之间切换，并且绝大部分时间是停留在ProbeBW 上。

根据代码可以得到如下一个简要的状态机[6]：

可以看出大部分时候都停留在ProbeBW 上，在ProbeRTT 停留的时间很短，这其实符合BBR 的初衷：尽可能的多占用带宽并且尽量不占用缓存。

根据前面的代码梳理，绘制了一张状态机的流转图

用一个更加精简的图来描述这个过程的话，如下：

BBR 优缺点

前面的讨论中，更多说的是其相比传统拥塞算法的优势，当理解其原理之后，也不难发现其不足之处。

BBR 优势

在长传即大RTT 场景下优势明显
有一种典型的场景便是"长肥管道"，在TCP 连接中，较长的距离意味着RTT 的增加，进而意味着窗口打开的更慢，对于像Cubic 这类AIMD 的拥塞算法而言，它们的慢启动过程会越长，这就是Reno 家族算法的典型弊端。
另外，对于Cubic 这类算法而言，更容易出现缓冲区被占满的情况，也就是我们前面提到的bufferbloat，这是为何呢？
一窗数据只有在得到ACK 确认后才会清除，RTT 越小意味着缓冲区能够越快腾出空间，发生bufferbloat概率也越低。反之RTT越大的话，越容易出现bufferbloat。
而BBR 设计之初，就没有将中间缓存考虑在内，从本文前面的收敛图可以看出

BBR的理想带宽收敛点是在缓冲区即将被填充的时候，此时BtlBW最大，RTT最小
而对于Reno家族算法而言，其收敛点是缓冲区被填满的时候，此时BtlBW 最大，但是RTT 也最大。
归纳下来：
对于长肥管道这种场景，Cubic 这类传统算法表现会异常脆弱，一旦有丢包就会立马缴械投降，执行MD 过程，这个过程是一个快速降窗的过程，前功尽弃。
而BBR 优势在于不会导致bufferbloat，且对丢包不敏感使其不会出现激进降窗行为。

抢占带宽能力强
BBR 其实大部分时候都是在ProbeBW 和ProbeRTT 状态之间切换，并且绝大部分时间是停留在ProbeBW 上。
ProbeBW 阶段会不断的试探剩余可用带宽，短时间内占据更多的带宽。

平稳发送
由于其平稳期至少待够6个RTT 周期才行，这使得该阶段的发送速率比较平稳，比较适合音视频领域比如视频直播这种对稳定性要求较高的场景。

BBR 不足之处

对带宽变化不敏感
从前面原理分析环节，我们知道BBR 减损期会让出部分带宽，一旦这个时候被其它流占据了话，BBR 要等到度过本地"漫长"的平稳期，等到下个增益周期才能发现。
也就是说BBR 在ProbeBW 状态下，只有在ProbeMore 周期才能感受到带宽的变化，后面的ProbeDrain 以及平稳期对于带宽的变化都无法感知，如果这个阶段实际带宽降低，BBR 需要多个RTT 周期才能收敛到实际的带宽位置。

ProbeRTT 阶段导致发送速率下降太多
这个阶段由于只发四个包，导致发送速率迅速下降，虽然时间很短，但是如果应用在一些对实时性要求很高的音视频场景下，就会出现卡顿现象。
解决方案有多种，根据业务需要可以减少ProbeRTT 时间甚至直接拿掉ProbeRTT。
抗RTT抖动能力差
一个典型场景是BBR 在WIFI 弱网下表现不佳，这里的弱网不是指的是网络质量差，速度慢。
而是指WIFI 场景下的稳定性相比于有线网络而言，逊色太多。由于无线网络是共享介质的，几乎无法在信号不冲突的前提下同时发送和接收。
所以在WIFI 场景下，接入的终端设备一多，RTT 抖动会越明显，这种情况下BBR 测算出的BDP很可能就不准确了，可能会低于实际带宽数据。

小结

实际项目上选择什么样的拥塞控制算法，需要根据实际业务场景以及网络环境来确定。

至此，我们关于BBR 的一个简单讨论到此结束。

参考文章

BBR 论文：https://www.cis.upenn.edu/~cis553/files/BBR.pdf

[引用1] https://en.wikipedia.org/wiki/Bufferbloat

[引用2] https://linux.cn/article-9935-1.html

[引用3-5] https://www.cis.upenn.edu/~cis553/files/BBR.pdf

[引用6] 从TCP拥塞本质看BBR算法及其收敛性

作者:pangdaibao

来源-微信公众号:酷派技术团队

出处:https://mp.weixin.qq.com/s/12XmLexC8DEIjNdPCAVEfQ