背景

在内部針對 MongoDB 4.4 的測試中，我們發現在帶寫壓力的情況下對主庫進行 unclean shutdown （kill -9），主庫在重新開機之後少了很多資料，通過分析日志可以看到在加載完 stable checkpoint，進行 replication recovery，即 apply oplog 之前，有很多在 oplogTruncateAfterPoint 之後的 oplog 被 truncate 掉了，導緻隻有很少的 oplog 被加載。

{"t":{"$date":"2021-01-27T16:40:22.601+08:00"},"s":"I",  "c":"REPL",     "id":21557,   "ctx":"SwitchWarmStandbyThread","msg":"Removing unapplied oplog entries after oplogTruncateAfterPoint","attr":{"oplogTruncateAfterPoint":{"":{"$timestamp":{"t":1611736805,"i":1799}}}}}
{"t":{"$date":"2021-01-27T16:40:22.651+08:00"},"s":"I",  "c":"REPL",     "id":21553,   "ctx":"SwitchWarmStandbyThread","msg":"Truncating oplog from truncateAfterOplogEntryTimestamp (non-inclusive)","attr":{"truncateAfterOplogEntryTimestamp":{"$timestamp":{"t":1611736805,"i":1799}},"oplogTruncateAfterPoint":{"$timestamp":{"t":1611736805,"i":1799}}}}
{"t":{"$date":"2021-01-27T16:40:22.656+08:00"},"s":"I",  "c":"REPL",     "id":21554,   "ctx":"SwitchWarmStandbyThread","msg":"Replication recovery oplog truncation finished","attr":{"durationMillis":55}}
{"t":{"$date":"2021-01-27T16:40:22.657+08:00"},"s":"I",  "c":"REPL",     "id":21544,   "ctx":"SwitchWarmStandbyThread","msg":"Recovering from stable timestamp","attr":{"stableTimestamp":{"$timestamp":{"t":1611736795,"i":1259}},"topOfOplog":{"ts":{"$timestamp":{"t":1611736805,"i":1799}},"t":1},"appliedThrough":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1},"oplogTruncateAfterPoint":{"$timestamp":{"t":0,"i":0}}}}
...
{"t":{"$date":"2021-01-27T16:40:25.860+08:00"},"s":"I",  "c":"REPL",     "id":21536,   "ctx":"SwitchWarmStandbyThread","msg":"Completed oplog application for recovery","attr":{"numOpsApplied":14901,"numBatches":19,"applyThroughOpTime":{"ts":{"$timestamp":{"t":1611736805,"i":1799}},"t":1}}}

在 4.2 及之前的版本中，oplogTruncateAfterPoint 隻用于備庫，用來保證 oplog batch 應用的原子性，最終目的是確定主備資料的一緻性，但是在 4.4 中主庫也會維護 oplogTruncateAfterPoint ，而這個問題根本原因就是由 oplogTruncateAfterPoint 在 4.4 中的作用變化導緻，下面簡要分析一下。

分析

在 4.2 中 oplogTruncateAfterPoint 隻有備庫會維護，用來保證 oplog batch write 的原子性，主庫是不需要維護的（有 replset.oplogTruncateAfterPoint集合，但是對應的 value 總是TImestamp(0,0)空值），是以主庫在 replication recovery 時不會進行 oplog 的 truncate。

但是在 4.4 中，主庫也會維護 oplogTruncateAfterPoint，這個做的原因是和 oplog hole 相關。

oplog hole 是因為事務在 WiredTiger 層面的送出順序和 oplog 順序不一緻導緻，複制時如果要保證主備一緻性需要避開這個 hole，這個大家應該都比較熟悉了，不做展開。

但是我們考慮一個場景，op1，op2，op3 三條 oplog，對應三個事務，在 WT 層面的送出順序為 op1，op3，op2，那麼當 op2 也成功送出後，1~3 這三條 oplog 對應備庫來說顯然是 no hole，可以放心複制。但是即使 no hole，我們也需要需要區分是 in-memory no hole ，還是 on-disk no hole。

因為 op2 對應的 WAL 在 unclean shutdown 的情況下，因為在尾部是可能丢失的。如果備庫在 in-memory no hole 的條件下就對 op 1~3 做了複制，那麼在主庫在重新開機的時候，不管是繼續做主還是降級為從，都會和副本集其他成員資料不一緻（缺失了 op2）。

在分析 4.2 和 4.4 是怎麼避免上面這個問題之前，我們需要先了解下 in-memory no hole 是怎麼維護的，這個在 4.2 和 4.4 中的方式是一樣的。

在更早的版本，比如 3.2，3.4 ，是通過在 server 層自己維護一個隊列來跟蹤 oplog 對應的事務的送出情況來确定到哪個點是 no hole 的。由于在 4.0 之後，server 層都使用了帶時間戳的事務，而這個時間戳實際上是 oplog 中的 ts 字段（ PS：這個時間戳在事務開始前就申請好了），是以可以依靠引擎層（WT）來告知我們截止到哪個時間點之前的事務都送出了，這個事情對于引擎來說很容易做到，也更應該由它們來做。這個時間點在 WT 中被稱之為 all_durable，但是千萬不要被這個坑爹的 “durable” 給欺騙了，它并不是真正的說事務被持久化了，見這段注釋，

/**
 * Manages oplog visibility.
 *
 * On demand, queries WiredTiger's all_durable timestamp value and updates the oplog read timestamp.
 * This is done asynchronously on a thread that startVisibilityThread() will set up.
 *
 * The WT all_durable timestamp is the in-memory timestamp behind which there are no oplog holes
 * in-memory. Note, all_durable is the timestamp that has no holes in-memory, which may NOT be
 * the case on disk, despite 'durable' in the name.
 *
 * The oplog read timestamp is used to read from the oplog with forward cursors, in order to ensure
 * readers never see 'holes' in the oplog and thereby miss data that was not yet committed when
 * scanning passed. Out-of-order primary writes allow writes with later timestamps to be committed
 * before writes assigned earlier timestamps, creating oplog 'holes'.
 */

4.2 和 4.4 正是依靠從 WT 查詢 all_durable timestamp 來獲得 in-memory no hole point，并設定 oplogReadTimestamp，所有需要需要讀 opog 的場景（複制，changeStream 等）都需要基于 oplogReadTimestamp 來讀，oplogReadTimestamp 之後的 oplog 對讀者不可見（可能有 hole(坑) :-)）。

既然從 WT 拿到的是 in-memory no hole 的時間戳，4.2 是如何避免上面那個 unclean shutdown 導緻的問題的？4.2 在從 WT 擷取到 all_durable ts 之後，還會等待這個 ts 之前的事務全部都持久化（WAL 刷盤）才會設定新的 oplogReadTimestamp，

// 4.2: src/mongo/db/storage/wiredtiger/wiredtiger_oplog_manager.cpp
        const uint64_t newTimestamp = fetchAllDurableValue(sessionCache->conn());

        // The newTimestamp may actually go backward during secondary batch application,
        // where we commit data file changes separately from oplog changes, so ignore
        // a non-incrementing timestamp.
        if (newTimestamp <= _oplogReadTimestamp.load()) {
            LOG(2) << "No new oplog entries were made visible: " << Timestamp(newTimestamp);
            continue;
        }

        // In order to avoid oplog holes after an unclean shutdown, we must ensure this proposed
        // oplog read timestamp's documents are durable before publishing that timestamp.
        sessionCache->waitUntilDurable(/*forceCheckpoint=*/false, false); // 這裡顯式等待

這個方式實際上是杜絕了備庫可能複制到比主庫更多資料的可能，但是這麼做的缺點是會導緻複制延遲的增加（畢竟需要等到事務送出并且落盤才能複制到備庫），而 unclean shutdown 顯然是一種很不常見的 case，不應該因為這個影響通用場景。是以在 4.4 中對這裡做了優化，主庫更新 oplogReadTimestamp 不再調用 waitUntilDurable 等待刷盤，而是 WAL 刷盤的過程中，異步的去更新 oplogTruncateAfterPoint，主庫上的 oplogTruncateAfterPoint 表示這之前的 oplog 是 on-disk no hole 的。

// 4.4: src/mongo/db/repl/replication_coordinator_external_state_impl.cpp
JournalListener::Token ReplicationCoordinatorExternalStateImpl::getToken(OperationContext* opCtx) {
    // If in state PRIMARY, the oplogTruncateAfterPoint must be used for the Durable timestamp in
    // order to avoid majority confirming any writes that could later be truncated.
    //
    // TODO (SERVER-45847): temporary hack for the ephemeral storage engine that passes in a nullptr
    // for the opCtx. The ephemeral engine does not do parallel writes to cause oplog holes,
    // therefore it is safe to skip updating the oplogTruncateAfterPoint that tracks oplog holes.
    if (MONGO_likely(opCtx)) {
        auto truncatePoint = repl::ReplicationProcess::get(opCtx)
                                 ->getConsistencyMarkers()
                                 ->refreshOplogTruncateAfterPointIfPrimary(opCtx); // 更新主庫 oplogTruncateAfterPoint
        if (truncatePoint) {
            return truncatePoint.get();
        }
    }

    // All other repl states use the last applied.
    return repl::ReplicationCoordinator::get(_service)->getMyLastAppliedOpTimeAndWallTime();
}

在真的出現 unclean shutdown 時，主庫重新開機時會把 oplogTruncateAfterPoint 之後的 oplog 清理掉，避免出現和副本集其他成員不一緻，而自己還不感覺的情況。

至此就分析清楚了，最開始問題的原因，因為測試過程中總是 kill -9 的（unclean shutdown），在帶寫壓力的情況下切換，oplogTruncateAfterPoint 肯定是沒有更新到最新的 oplog ts，導緻新主庫部分 oplog 沒有加載。

驗證：通過 kill（clean shutdown），而不是 kill -9 來觸發切換，可以看到資料不會變少。

影響

初看之下，主庫允許備庫複制自己還沒有完全持久化的 oplog，似乎副本集的持久化能力降低了，但仔細思考一下，持久化能力實際是增強了。

考慮一下上面那個 op 1-3 的例子，op2 最後送出，如果說在 op2 持久化之前，主庫 crash 了，在 4.2 裡面，新的主庫實際上是沒有 op1-3 的資料的（雖然對使用者傳回寫成功了），但是在 4.4 裡面因為不等 op2 持久化就複制，那麼新主庫有這個資料。在分布式環境下，複制也是持久化能力的一部分，更低的複制延遲就意味着在整個分布式系統層面更好的持久性。

oplogTruncateAfterPoint 在 MongoDB 4.4 中的變化背景分析影響

背景

分析

影響

繼續閱讀

golang結構體tag編譯錯誤的問題

Golang結構體中的tag

GO語言筆記--結構體1.說明2.工廠函數

Go語言之map與結構體初入map結構體初入（場景map的value通常為struct）

Go初探 (5) – 結構體與接口

go(7)結構體，方法，接口

Go學習筆記: 結構體、方法、接口

06結構體、接口結構體接口

go語言筆記（結構體、方法、接口）go語言筆記（結構體、方法、接口）

Go中結構體和接口的定義

Rich Domain Model

Error: couldn‘t connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error

Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Erro

couldn‘t connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error conne

Linux之父警告全球程式員：我剛釋出的5.12核心有bug，你們千萬别用

Ubuntu14.04 LTS下安裝mongodb