
MongoDB Oplog Stones: Implementation Analysis and Startup Loading Optimization

Contents: Background · Oplog Stones Initialization · Oplog Reclamation · Time Cost Analysis of Oplog Stones Initialization · Oplog Stones Reload Optimization

This article is based on an analysis of the MongoDB 4.2 code.

Background

The oplog collection is implemented first and foremost as a capped collection. A plain capped collection, however, checks after every write operation whether the collection size exceeds the configured maximum, and if so deletes documents synchronously.

The deletion procedure is roughly as follows:

  • Compute: position an end cursor at the head of the capped collection, then repeatedly call cursor->next(), accumulating value sizes to determine how many documents need to be deleted
  • Delete:
    • If no more than 3 documents need to be deleted, remove them directly by calling cursor->remove() in a loop
    • Otherwise, set up two cursors, start and end, with start positioned at the head of the capped collection, and call session->truncate(session, start, end) to batch-delete the documents; session->truncate() ultimately calls cursor->remove() as well

See the code for details:

WiredTigerRecordStore::_cappedDeleteAsNeeded_inlock
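To make the two steps concrete, here is a minimal sketch against the WiredTiger C API (error handling elided, names simplified); it illustrates the logic described above and is not the verbatim _cappedDeleteAsNeeded_inlock source:

#include <wiredtiger.h>
#include <cstdint>

// Sketch of the synchronous capped-delete logic described above.
void cappedDeleteAsNeeded(WT_SESSION* session, WT_CURSOR* end,
                          int64_t sizeTotal, int64_t cappedMaxSize) {
    // Step 1 (compute): walk forward from the head, accumulating value
    // sizes, until enough bytes are counted to get back under the cap.
    int64_t excessBytes = sizeTotal - cappedMaxSize;
    int docsToDelete = 0;
    while (excessBytes > 0 && end->next(end) == 0) {
        WT_ITEM value;
        end->get_value(end, &value);  // this read is what makes the step costly
        excessBytes -= static_cast<int64_t>(value.size);
        ++docsToDelete;
    }
    // Step 2 (delete).
    if (docsToDelete <= 3) {
        // Few documents: re-position a cursor at the head and call
        // cursor->remove() in a loop (omitted here).
    } else {
        // Many documents: range-truncate from the head of the table up to
        // and including 'end'; start == nullptr means "from the beginning".
        session->truncate(session, nullptr, /*start=*/nullptr, /*stop=*/end, nullptr);
    }
}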

As you can see, determining the number of documents to delete is the expensive part: using cursor->next() to total up value sizes is really just doing reads over and over. For the oplog collection, every user write is recorded in the oplog, so once the oplog is full, every single user write triggers a synchronous delete, which is clearly inefficient.

MongoDB therefore solves this problem with a strategy of marking truncate points and then deleting in batches.

Oplog Stones Initialization

A so-called Oplog Stone is really just a logical region of the oplog delimited by truncate points. A truncate point is essentially the ts field of an oplog entry, which also serves as the RecordId, i.e. the key of the corresponding record in the underlying WiredTiger table.
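As a quick illustration of that correspondence (a sketch with simplified types, mirroring the idea behind oploghack::keyForOptime() in the 4.2 tree, not the verbatim source):

#include <cstdint>

// The oplog entry's ts (a Timestamp) packs seconds and an increment into
// 64 bits; that value is used directly as the RecordId, i.e. the key of
// the record in the WiredTiger oplog table.
struct Timestamp {
    uint32_t secs;
    uint32_t inc;
    uint64_t asULL() const { return (static_cast<uint64_t>(secs) << 32) | inc; }
};

struct RecordId {
    int64_t repr;
};

RecordId keyForOptime(const Timestamp& ts) {
    return RecordId{static_cast<int64_t>(ts.asULL())};
}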

MongoDB does not persist Oplog Stone information; instead, it re-initializes it on every restart.

The overall initialization logic for Oplog Stones is fairly simple. First, compute how many stones are needed based on cappedMaxSize:

// The minimum oplog stone size should be BSONObjMaxInternalSize.
    const unsigned int oplogStoneSize =
        std::max(gOplogStoneSizeMB * 1024 * 1024, BSONObjMaxInternalSize);

    // IDL does not support unsigned long long types.
    const unsigned long long kMinStonesToKeep = static_cast<unsigned long long>(gMinOplogStones);
    const unsigned long long kMaxStonesToKeep =
        static_cast<unsigned long long>(gMaxOplogStonesDuringStartup);

    unsigned long long numStones = maxSize / oplogStoneSize;
    size_t numStonesToKeep = std::min(kMaxStonesToKeep, std::max(kMinStonesToKeep, numStones));           

Several defaults take part in this calculation. What we need to know is: there are at most 100 oplog stones and at least 10, and if oplogSizeMB is configured above 2GB, the default settings basically yield 100 stones. (This baseline is computed from the capped collection's configured maximum size; if the capped collection is not actually full yet, the count is derived from its actual size and there will be fewer stones, but in any case the upper bound is 100.)
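As a worked example (assuming gOplogStoneSizeMB is left small enough that the BSONObjMaxInternalSize floor of 16MB + 16KB applies, which is the default behavior), a 2GB oplog comes out right at the cap:

#include <algorithm>

// Hypothetical worked example of the computation shown above.
const unsigned long long maxSize = 2ULL * 1024 * 1024 * 1024;          // 2GB oplog
const unsigned long long oplogStoneSize = 16ULL * 1024 * 1024 + 16 * 1024;  // BSONObjMaxInternalSize
const unsigned long long numStones = maxSize / oplogStoneSize;         // = 127
const unsigned long long numStonesToKeep =
    std::min(100ULL, std::max(10ULL, numStones));                      // clamped to 100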

Once the number of Oplog Stones is determined, the next step is to determine each stone's boundary. There are two cases:

  • The oplog collection currently has too few records: fewer than 20 times the number of samples needed (with 100 stones and the default of 10 samples per stone, any collection below 20,000 records takes this path). In that case the Oplog Stone boundaries are determined by a full collection scan. The logic is simple: accumulate the size of each scanned oplog entry, and every time the total exceeds the per-stone size limit, emit and save a stone, until the scan completes. (Code: WiredTigerRecordStore::OplogStones::_calculateStonesByScanning)

  • Otherwise, a full scan would be too inefficient. MongoDB instead uses the random cursor provided by WiredTiger to take samples and quickly determine each oplog stone's boundary. (Code: WiredTigerRecordStore::OplogStones::_calculateStonesBySampling)

    1. Normally, for 100 oplog stones, 100 samples would seem sufficient to determine all the boundaries. But to keep the boundaries as accurate as possible, MongoDB oversamples: it takes 10 samples per stone (by default), i.e. 1000 samples for 100 stones. The samples are then sorted by key (opTime), and each oplog stone takes the last of its group of 10 samples as its boundary; counting from 0, samples 9, 19, 29, ... correspond in order to the stone boundaries. (See the sketch after this list.)
    2. In addition, later MongoDB versions iterated further with an optimized random cursor that samples more accurately: next_random_sample_size tells WT how many random samples are wanted, and WT divides the underlying table into next_random_sample_size slices, taking one sample from each. With that partitioning, every 10th sample lands much closer to the intended oplog stone boundary.
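A minimal sketch of the boundary selection just described (illustrative names, not the verbatim _calculateStonesBySampling source; in the source the random cursor is opened with a WiredTiger config along the lines of next_random=true, plus next_random_sample_size on versions with that optimization):

#include <algorithm>
#include <cstdint>
#include <vector>

// Given the sampled RecordIds (oplog ts values), pick the last sample of
// each group of 'samplesPerStone' as a stone boundary: indices 9, 19, 29,
// ... for the default of 10 samples per stone.
std::vector<int64_t> pickStoneBoundaries(std::vector<int64_t> samples,
                                         size_t samplesPerStone = 10) {
    std::sort(samples.begin(), samples.end());  // sort by key (opTime)
    std::vector<int64_t> boundaries;
    for (size_t i = samplesPerStone - 1; i < samples.size(); i += samplesPerStone) {
        boundaries.push_back(samples[i]);
    }
    return boundaries;
}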

This is how oplog stones are initialized at process startup. As new writes arrive, new oplog stones are created, and at that point the stone sizes are guaranteed to be exact: at write time it is easy to track precisely how much data has been written into the latest stone. So even if imprecise stone boundaries at startup cause slightly too much or too little oplog to be deleted, it is not a big problem; the error is washed out by subsequent updates.
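A sketch of that write-path bookkeeping (illustrative, with simplified types, not the verbatim source): the running counters make the per-stone size exact, which is what gradually corrects any imprecision in the startup boundaries.

#include <cstdint>
#include <vector>

struct Stone {
    int64_t lastRecord;  // boundary: RecordId of the last record in the stone
    int64_t bytes;
    int64_t records;
};

struct OplogStones {
    int64_t currentBytes = 0;    // bytes written into the newest stone so far
    int64_t currentRecords = 0;
    int64_t minBytesPerStone;    // roughly maxSize / numStonesToKeep
    std::vector<Stone> stones;

    void updateOnInsert(int64_t recordId, int64_t sizeBytes) {
        currentBytes += sizeBytes;
        ++currentRecords;
        if (currentBytes >= minBytesPerStone) {
            // Cut a new stone at the record just written; the counters
            // reset, so the next stone's size is tracked exactly as well.
            stones.push_back({recordId, currentBytes, currentRecords});
            currentBytes = 0;
            currentRecords = 0;
        }
    }
};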

Oplog Reclamation

With Oplog Stones in place, reclaiming the oplog no longer requires computing a document count and deleting synchronously on every write. Instead, when the current oplog stone fills up and a new one is created, the oldest oplog stone is simply dropped.

Deleting by stone typically removes a fairly large number of documents at once, so the oplog deletion work is done in the background by the OplogTruncaterThread. It calls session->truncate directly, using the oldest oplog stone's boundary as the upper bound of the range truncation.

//WiredTigerRecordStore::reclaimOplog
            setKey(cursor, stone->lastRecord);
            invariantWTOK(session->truncate(session, nullptr, nullptr, cursor, nullptr));           
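Expanded slightly, the background truncation amounts to the following (a sketch against the WiredTiger C API with an illustrative table URI; error handling elided):

#include <wiredtiger.h>
#include <cstdint>

// Position a cursor at the oldest stone's boundary and range-truncate
// everything from the beginning of the table up to and including it.
void truncateOldestStone(WT_SESSION* session, int64_t stoneLastRecord) {
    WT_CURSOR* cursor;
    session->open_cursor(session, "table:oplog" /* illustrative URI */,
                         nullptr, nullptr, &cursor);
    cursor->set_key(cursor, stoneLastRecord);   // oldest stone's lastRecord
    // start == nullptr: truncate from the head of the table up to 'cursor'.
    session->truncate(session, nullptr, /*start=*/nullptr, /*stop=*/cursor, nullptr);
    cursor->close(cursor);
}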

Truncation also needs to account for the stable checkpoint's dependency on the oplog; that logic will be analyzed in a follow-up post.

Time Cost Analysis of Oplog Stones Initialization

Looking purely at the code, the vast majority of Oplog Stones initialization time is spent on random cursor sampling; all the other steps are simple in-memory computation that finishes within a few ms. The total initialization time is proportional to the number of samples, which is capped at 1000, so the initialization time has an upper bound and does not grow without limit with the size of the oplog collection.

To verify this conclusion, we construct a test scenario.

First, build the data set. Create replica set 1 and generate data with the benchmark tool built into the mongo shell:

benchRun(
  {
  "host": "127.0.0.1:9111",
  "ops": [
    {
      "ns": "test.foo",
      "op": "insert",
      "doc": {
        "x": {
          "#RAND_STRING": [
            64
          ]
        },
        "y": {
          "#RAND_STRING": [
            128
          ]
        }
      }
    }
  ],
  "parallel": 16,
  "seconds": $seconds
  }
)           

This generates roughly 150 million oplog entries, 33GB on disk (compressed; the raw data is around 50GB).

xdjmgset-dbfs1:PRIMARY> show dbs
admin    0.000GB
config   0.000GB
local   33.363GB
test    35.284GB
xdjmgset-dbfs1:PRIMARY> use local
switched to db local
xdjmgset-dbfs1:PRIMARY> db.oplog.rs.count()
150531637           

Some logging was added to the code to measure the time spent on cursor random sampling, both per sample and in total:

diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
index f2c3d1c220..7f029b788d 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp
@@ -460,8 +460,14 @@ void WiredTigerRecordStore::OplogStones::_calculateStonesBySampling(OperationCon
     // each logical section.
     auto cursor = _rs->getRandomCursorWithOptions(opCtx, extraConfig);
     std::vector<RecordId> oplogEstimates;
+    const std::uint64_t startWaitTime = curTimeMicros64();
     for (int i = 0; i < numSamples; ++i) {
+        const std::uint64_t startWaitTime = curTimeMicros64();
         auto record = cursor->next();
+        auto waitTime = curTimeMicros64() - startWaitTime;
+        LOG(1) << "WT cursor random sample " << i << ", "
+               << Timestamp(record->id.repr()).toStringPretty() << ", took "
+               << waitTime / 1000.0 << "ms";
         if (!record) {
             // This shouldn't really happen unless the size storer values are far off from reality.
             // The collection is probably empty, but fall back to scanning the oplog just in case.
@@ -471,6 +477,8 @@ void WiredTigerRecordStore::OplogStones::_calculateStonesBySampling(OperationCon
         }
         oplogEstimates.push_back(record->id);
     }
+    auto waitTime = curTimeMicros64() - startWaitTime;
+    LOG(1) << "WT cursor random sampling total took " << waitTime/1000.0 << "ms";
     std::sort(oplogEstimates.begin(), oplogEstimates.end());

     for (int i = 1; i <= wholeStones; ++i) {           

Restart replica set 1 and observe the log:

2020-10-27T15:34:09.058+0800 I  STORAGE  [initandlisten] Taking 755 samples and assuming that each section of oplog contains approximately 1991955 records totaling to 687194924 bytes
2020-10-27T15:34:09.058+0800 D1 STORAGE  [initandlisten] WT cursor random sample 0, Oct 26 19:54:57:7347, took 0.415ms
2020-10-27T15:34:09.063+0800 D1 STORAGE  [initandlisten] WT cursor random sample 1, Oct 26 19:57:47:6215, took 4.488ms
2020-10-27T15:34:09.067+0800 D1 STORAGE  [initandlisten] WT cursor random sample 2, Oct 26 15:37:47:1030, took 4.608ms
2020-10-27T15:34:09.072+0800 D1 STORAGE  [initandlisten] WT cursor random sample 3, Oct 26 15:44:15:4619, took 4.471ms
2020-10-27T15:34:09.076+0800 D1 STORAGE  [initandlisten] WT cursor random sample 4, Oct 26 15:46:51:2640, took 4.597ms
2020-10-27T15:34:09.081+0800 D1 STORAGE  [initandlisten] WT cursor random sample 5, Oct 26 15:49:22:10335, took 4.556ms
2020-10-27T15:34:09.086+0800 D1 STORAGE  [initandlisten] WT cursor random sample 6, Oct 26 15:52:03:10684, took 4.746ms
2020-10-27T15:34:09.090+0800 D1 STORAGE  [initandlisten] WT cursor random sample 7, Oct 26 15:54:14:4494, took 4.586ms
2020-10-27T15:34:09.095+0800 D1 STORAGE  [initandlisten] WT cursor random sample 8, Oct 26 15:56:46:1960, took 4.889ms
2020-10-27T15:34:09.100+0800 D1 STORAGE  [initandlisten] WT cursor random sample 9, Oct 26 15:59:18:7246, took 4.695ms
2020-10-27T15:34:09.105+0800 D1 STORAGE  [initandlisten] WT cursor random sample 10, Oct 26 16:02:05:4727, took 4.895ms
2020-10-27T15:34:09.110+0800 D1 STORAGE  [initandlisten] WT cursor random sample 11, Oct 26 16:04:30:5742, took 4.673ms
2020-10-27T15:34:09.115+0800 D1 STORAGE  [initandlisten] WT cursor random sample 12, Oct 26 16:06:45:1917, took 4.881ms
2020-10-27T15:34:09.119+0800 D1 STORAGE  [initandlisten] WT cursor random sample 13, Oct 26 16:08:50:5188, took 4.786ms
2020-10-27T15:34:09.124+0800 D1 STORAGE  [initandlisten] WT cursor random sample 14, Oct 26 16:11:13:7634, took 4.449ms
2020-10-27T15:34:09.129+0800 D1 STORAGE  [initandlisten] WT cursor random sample 15, Oct 26 16:13:25:6775, took 5.204ms
...
2020-10-27T15:34:12.463+0800 D1 STORAGE  [initandlisten] WT cursor random sample 752, Oct 26 15:56:15:232, took 4.923ms
2020-10-27T15:34:12.467+0800 D1 STORAGE  [initandlisten] WT cursor random sample 753, Oct 26 15:58:47:1953, took 4.399ms
2020-10-27T15:34:12.472+0800 D1 STORAGE  [initandlisten] WT cursor random sample 754, Oct 26 16:01:28:5317, took 4.598ms
2020-10-27T15:34:12.472+0800 D1 STORAGE  [initandlisten] WT cursor random sampling total took 3414.51ms           

You can see this instance took 755 samples, 3414ms in total, with each sample costing a fairly stable 4ms to 5ms.

MongoDB's serverStatus command also provides a section showing the total time spent initializing oplog stones at startup, along with the initialization method (scan or sampling):

xdjmgset-dbfs1:PRIMARY> db.serverStatus().oplogTruncation
{
        "totalTimeProcessingMicros" : NumberLong(3418164),
        "processingMethod" : "sampling",
        "totalTimeTruncatingMicros" : NumberLong(0),
        "truncateCount" : NumberLong(0)
}           

As you can see, everything apart from the sampling costs about 4ms, less than 1% of the total.

To verify that the initialization time is proportional to the number of samples, another data set was built the same way: 25GB, 113 million oplog entries.

xdjmgset-dbfs1:PRIMARY> show dbs
admin    0.000GB
config   0.000GB
local   25.145GB
test    26.517GB
xdjmgset-dbfs1:PRIMARY> use local
switched to db local
xdjmgset-dbfs1:PRIMARY> db.oplog.rs.count()
113211477           

Check the log output after the restart:

2020-10-27T15:43:02.121+0800 I  STORAGE  [initandlisten] Taking 568 samples and assuming that each section of oplog contains approximately 1991875 records totaling to 687195044 bytes
2020-10-27T15:43:02.121+0800 D1 STORAGE  [initandlisten] WT cursor random sample 0, Oct 27 12:33:29:5201, took 0.216ms
2020-10-27T15:43:02.125+0800 D1 STORAGE  [initandlisten] WT cursor random sample 1, Oct 27 12:36:06:5577, took 4.489ms
2020-10-27T15:43:02.130+0800 D1 STORAGE  [initandlisten] WT cursor random sample 2, Oct 27 12:38:30:1191, took 4.417ms
2020-10-27T15:43:02.134+0800 D1 STORAGE  [initandlisten] WT cursor random sample 3, Oct 27 12:40:51:1654, took 4.526ms
2020-10-27T15:43:02.139+0800 D1 STORAGE  [initandlisten] WT cursor random sample 4, Oct 27 12:43:12:9085, took 4.51ms
2020-10-27T15:43:02.144+0800 D1 STORAGE  [initandlisten] WT cursor random sample 5, Oct 27 12:45:36:3523, took 4.465ms
2020-10-27T15:43:02.148+0800 D1 STORAGE  [initandlisten] WT cursor random sample 6, Oct 27 12:48:09:6883, took 4.63ms
2020-10-27T15:43:02.153+0800 D1 STORAGE  [initandlisten] WT cursor random sample 7, Oct 27 12:50:09:6716, took 4.484ms
2020-10-27T15:43:02.157+0800 D1 STORAGE  [initandlisten] WT cursor random sample 8, Oct 27 12:52:24:1495, took 4.531ms
2020-10-27T15:43:02.162+0800 D1 STORAGE  [initandlisten] WT cursor random sample 9, Oct 27 12:54:39:3871, took 4.705ms
2020-10-27T15:43:02.167+0800 D1 STORAGE  [initandlisten] WT cursor random sample 10, Oct 27 12:57:15:3946, took 4.661ms
2020-10-27T15:43:02.171+0800 D1 STORAGE  [initandlisten] WT cursor random sample 11, Oct 27 12:59:36:5033, took 4.74ms
2020-10-27T15:43:02.176+0800 D1 STORAGE  [initandlisten] WT cursor random sample 12, Oct 27 13:01:52:6908, took 4.424ms
2020-10-27T15:43:02.181+0800 D1 STORAGE  [initandlisten] WT cursor random sample 13, Oct 27 13:04:22:2838, took 4.637ms
2020-10-27T15:43:02.186+0800 D1 STORAGE  [initandlisten] WT cursor random sample 14, Oct 27 13:06:42:6574, took 5.21ms
...
2020-10-27T15:43:04.771+0800 D1 STORAGE  [initandlisten] WT cursor random sample 567, Oct 27 12:17:32:2820, took 4.397ms
2020-10-27T15:43:04.771+0800 D1 STORAGE  [initandlisten] WT cursor random sampling total took 2650.65ms           

568 samples were taken, at a total cost of 2650ms. That is roughly (568.0 / 755) * 3414ms = 2568ms, so the conclusion that the cost is proportional to the number of samples holds.

In summary, since a single random cursor sample costs about 4-5ms and the number of samples is capped at 1000, the upper bound on oplog stones initialization time is around 5s (on NVMe SSD).

Update: the analysis above overlooks one thing. Each random cursor next call only needs the record id, since that is the oplog truncate point, but it reads the value along with it, and WiredTiger reads are done at page (extent) granularity. If the average oplog value is large, each disk page is correspondingly large, and on a restart all of these page reads are cold reads from disk, so a single random cursor next takes noticeably longer.

So, after further testing, the final conclusion is: when the average oplog entry is small (a few hundred bytes to a few KB), oplog stones initialization is bounded at around 5s. When the average oplog entry is large (tens of KB to several MB), the large page size makes cold disk page reads expensive, and oplog stones initialization can take tens of seconds or even longer, with the exact time strongly correlated with the average oplog entry size.

Oplog Stones Reload Optimization

The Alibaba Cloud database platform has a comprehensive log collection system. Analysis of production logs still showed quite a few instances writing large values, with oplog entries reaching several hundred KB or even several MB. For such instances, a restart spends tens of seconds or more initializing oplog stones, so the problem was still worth optimizing.

Before/After Comparison

With 1MB oplog entries constructed, a restart with 100 stones and 994 samples takes about 27.7s to load.

xdjmgset-dbfs1:PRIMARY> db.serverStatus().oplogTruncation
{
        "totalTimeProcessingMicros" : NumberLong(27791069),
        "processingMethod" : "sampling",
        "totalTimeTruncatingMicros" : NumberLong(0),
        "truncateCount" : NumberLong(0)
}

2020-11-13T14:47:52.391+0800 I  STORAGE  [initandlisten] The size storer reports that the oplog contains 20378 records totaling to 21355524826 bytes
2020-11-13T14:47:52.391+0800 D1 COMMAND  [WT-OplogStonesSaverThread-local.oplogstones.rs] BackgroundJob starting: WT-OplogStonesSaverThread-local.oplogstones.rs
2020-11-13T14:47:52.391+0800 D2 STORAGE  [WT-OplogTruncaterThread-local.oplog.rs] no global storage engine yet
2020-11-13T14:47:52.391+0800 I  STORAGE  [initandlisten] Sampling the oplog to determine where to place markers for truncation
2020-11-13T14:47:52.391+0800 D2 STORAGE  [WT-OplogStonesSaverThread-local.oplogstones.rs] WT-OplogStonesSaverThread-local.oplogstones.rs: no global storage engine yet
2020-11-13T14:47:52.393+0800 I  STORAGE  [initandlisten] Sampling from the oplog between Nov 13 14:43:08:618 and Nov 13 14:47:21:3 to determine where to place markers for truncation
2020-11-13T14:47:52.393+0800 I  STORAGE  [initandlisten] Taking 994 samples and assuming that each section of oplog contains approximately 205 records totaling to 214833771 bytes
2020-11-13T14:47:52.572+0800 D1 STORAGE  [initandlisten] WT cursor random sample 0, Nov 13 14:43:28:232, took 178.246ms
2020-11-13T14:47:52.620+0800 D1 STORAGE  [initandlisten] WT cursor random sample 1, Nov 13 14:45:13:143, took 48.437ms
2020-11-13T14:47:52.661+0800 D1 STORAGE  [initandlisten] WT cursor random sample 2, Nov 13 14:45:35:242, took 40.899ms
2020-11-13T14:47:52.868+0800 D1 STORAGE  [initandlisten] WT cursor random sample 3, Nov 13 14:43:19:114, took 206.935ms
...
2020-11-13T14:48:20.182+0800 I  STORAGE  [initandlisten] WiredTiger record store oplog processing took 27791ms           

After the optimization, loading takes about 7ms, a reduction of three to four orders of magnitude:

xdjmgset-dbfs1:PRIMARY> db.serverStatus().oplogTruncation
{
        "totalTimeProcessingMicros" : NumberLong(7261),
        "processingMethod" : "reloadingLocal",
        "totalTimeTruncatingMicros" : NumberLong(0),
        "truncateCount" : NumberLong(0)
}           

A Brief Outline of the Optimization

  1. Persist the in-memory oplog stones locally.
  2. A background thread updates the local file whenever the in-memory oplog stones change.
  3. On restart, load the saved oplog stones from the local file; a few boundary conditions need handling (a rough sketch follows this list).
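The actual implementation is not open source; the following is only a rough sketch of the idea under stated assumptions (the logs above hint at a local collection named local.oplogstones.rs; all names here are illustrative, not Alibaba's actual code):

#include <cstdint>
#include <utility>
#include <vector>

struct PersistedStone {
    int64_t lastRecord;
    int64_t bytes;
    int64_t records;
};

// Startup path: prefer the persisted stones; fall back to sampling/scanning
// when they are missing or fail sanity checks against the current oplog.
bool tryReloadStones(std::vector<PersistedStone> saved,  // read from local storage
                     int64_t oplogFirstRecord,
                     int64_t oplogLastRecord,
                     std::vector<PersistedStone>* out) {
    if (saved.empty()) {
        return false;  // nothing persisted: sample as before
    }
    for (const auto& s : saved) {
        // Boundary condition: a saved truncate point outside the current
        // oplog range (e.g. truncated or resized since the save) is stale.
        if (s.lastRecord < oplogFirstRecord || s.lastRecord > oplogLastRecord) {
            return false;
        }
    }
    *out = std::move(saved);
    return true;  // reported as processingMethod: "reloadingLocal"
}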

Finally, a small plug: Alibaba Cloud MongoDB has launched a Serverless edition that lets you try MongoDB at very low cost. During the launch promotion, the first month costs 1 CNY and renewals are 50% off. You are welcome to try it out:

https://www.aliyun.com/product/mongodb