How Nearest-Node Access Works in Multi-Datacenter Deployments of Tencent Cloud MongoDB

1. Background

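To guarantee service availability and data reliability, some critical businesses deploy their storage systems across multiple regions and datacenters. For example, one replica of the data is stored in a datacenter in each of Beijing, Shanghai, and Shenzhen, so that even if the datacenter in one region becomes unreachable, the business keeps running normally.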

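A multi-datacenter deployment has to account for network latency between datacenters. In the author's ping tests, the latency between Shanghai and Shenzhen was about 30 ms, whereas latency inside a single datacenter is only about 0.1 ms.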

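Architecturally, Tencent Cloud MongoDB combines L5 nearest-access routing with its internal "nearest" access mode so that a business reaches the closest datacenter, which avoids the network latency that multiple datacenters introduce. In the overall architecture, mongos is the access node and can be thought of as a proxy; mongod is the storage node that holds the user's actual data. The mongod nodes form multiple replicas in a one-Primary, multiple-Secondary layout and are spread across the datacenters.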

The rest of this article describes in detail how the nearest mode is implemented and used in Tencent Cloud MongoDB.

2. What Is the nearest Access Mode

In MongoDB, a replica set is a group of replica nodes that store the same data. Users can access a replica set directly through a driver or through mongos.

Inside a replica set, the primary is elected through a Raft-based protocol, and data is synchronized through the oplog.

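By default MongoDB executes both reads and writes on the Primary node, but it also provides an interface for separating read and write requests so that the capacity of the whole system is put to use. The key to read/write splitting is the readPreference of a request. This parameter identifies which kind of node the user wants to read from. Five modes can currently be configured: primary, primaryPreferred, secondary, secondaryPreferred, and nearest.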

2.3 Read/Write Consistency Guarantees

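Some readers may already be wondering: if a Secondary can lag behind the Primary while synchronizing data, how can a read on a Secondary be guaranteed to see data that was just written? The solution is to set the WriteConcern of the write so that it returns only after the data has been written to all nodes; a subsequent read from a Secondary will then see the latest data.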

Such writes have to replicate data across datacenters, so a write-heavy, read-light business model should weigh this carefully.

3. How nearest Is Implemented

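If the business connects through mongos (the usual setup in the Tencent Cloud MongoDB architecture), mongos performs the nearest-node access to mongod. If the business connects directly to the replica set, the driver layer performs it.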

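The sections below use mongos (Tencent Cloud MongoDB code), the mgo driver, and the recently released official go driver to analyze how nearest access is implemented, and offer some usage advice.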

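Every <code>5 seconds</code>, mongos starts a probe thread for each replica set in the cluster, runs the <code>isMaster</code> command against every node, and records its own network latency to that node.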

Each new sample is then folded into the stored latency as a smoothed update rather than replacing it outright.
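The smoothed update described above can be sketched as follows. This is a minimal illustration in Go, not the mongos source: the function name `smoothLatency`, the `unknownLatency` sentinel, and the 1/4 weight given to each new sample are all assumptions made for the example.

```go
package main

import "fmt"

// unknownLatency is a hypothetical sentinel for a node that has not been probed yet.
const unknownLatency int64 = -1

// smoothLatency folds a new sample into the running estimate by moving the
// estimate a quarter of the way toward the sample. The 1/4 weight is an
// assumption for this sketch.
func smoothLatency(currentMicros, sampleMicros int64) int64 {
	if currentMicros == unknownLatency {
		return sampleMicros // first sample is taken as-is
	}
	return currentMicros + (sampleMicros-currentMicros)/4
}

func main() {
	latency := unknownLatency
	// One 30 ms spike between otherwise stable 0.1 ms samples.
	for _, sample := range []int64{100, 100, 30100, 100} {
		latency = smoothLatency(latency, sample)
		fmt.Printf("sample=%dus smoothed=%dus\n", sample, latency)
	}
}
```

With this kind of smoothing a single slow probe raises the estimate only partially, so one network hiccup does not immediately push a nearby node outside the selection threshold.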

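The node selection algorithm is in the SetState::getMatchingHost method. Roughly: sort the nodes by latency in ascending order -> discard nodes whose latency is too high (15 ms or more above the closest node) -> return one of the remaining nodes at random.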

<code>case ReadPreference::SecondaryOnly:</code>

<code>case ReadPreference::Nearest: {</code>

<code> BSONForEach(tagElem, criteria.tags.getTagBSON()) {</code>

<code> uassert(16358, "Tags should be a BSON object", tagElem.isABSONObj());</code>

<code> BSONObj tag = tagElem.Obj();</code>

<code> std::vector<const Node*> matchingNodes;</code>

<code> // in SecondaryOnly mode, filter out nodes that do not match</code>

<code> if (nodes[i].matches(criteria.pref) && nodes[i].matches(tag)) {</code>

<code> matchingNodes.push_back(&nodes[i]);</code>

<code> // don't do more complicated selection if not needed</code>

<code> if (matchingNodes.empty())</code>

<code> continue;</code>

<code> if (matchingNodes.size() == 1)</code>

<code> return matchingNodes.front()->host;</code>

<code> // order by latency and don't consider hosts further than a threshold from the</code>

<code> // closest.</code>

<code> // sort the candidate nodes by latency</code>

<code> std::sort(matchingNodes.begin(), matchingNodes.end(), compareLatencies);</code>

<code> int64_t distance =</code>

<code> matchingNodes[i]->latencyMicros - matchingNodes[0]->latencyMicros;</code>

<code> if (distance >= latencyThresholdMicros) {</code>

<code> // this node and all remaining ones are too far away</code>

<code> // drop nodes whose latency exceeds the threshold (15 ms by default, configurable)</code>

<code> matchingNodes.erase(matchingNodes.begin() + i, matchingNodes.end());</code>

<code> break;</code>

<code> // of the remaining nodes, pick one at random (or use round-robin)</code>

<code> if (ReplicaSetMonitor::useDeterministicHostSelection) {</code>

<code> // only in tests</code>

<code> return matchingNodes[roundRobin++ % matchingNodes.size()]->host;</code>

<code> // normal case</code>

<code> // pick one of the remaining candidates at random</code>

<code> return matchingNodes[rand.nextInt32(matchingNodes.size())]->host;</code>

<code> return HostAndPort();</code>

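Note that the mongos code has a <code>default threshold of 15 ms</code>: if a node's latency is 15 ms or more above that of the closest node, the nearest policy will not select it. But 15 ms is not appropriate for every business. A latency-sensitive business can tune the value to its needs in the mongos configuration file; in stock MongoDB the corresponding option is replication.localPingThresholdMs.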

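Every <code>15 seconds</code>, the mgo driver uses the <code>ping</code> command to measure its network latency to each mongod node, and takes the maximum of the last 6 samples as the reference value for the current latency. The code is as follows: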

<code> time.Sleep(delay) // sample once per interval (15 seconds by default)</code>

<code> socket, _, err := server.AcquireSocket(0, delay)</code>

<code> start := time.Now()</code>

<code> _, _ = socket.SimpleQuery(&op) // run the ping command</code>

<code> delay := time.Now().Sub(start) // and measure the elapsed time</code>

<code> server.pingWindow[server.pingIndex] = delay</code>

<code> server.pingIndex = (server.pingIndex + 1) % len(server.pingWindow)</code>

<code> server.pingCount++</code>

<code> var max time.Duration</code>

<code> if server.pingWindow[i] > max {</code>

<code> max = server.pingWindow[i] // track the maximum of the last 6 samples (default)</code>

<code> socket.Release()</code>

<code> server.Lock()</code>

<code> if server.closed {</code>

<code> loop = false</code>

<code> server.pingValue = max // use the maximum as the latency estimate for later node selection</code>

<code> server.Unlock()</code>

<code> logf("Ping for %s is %d ms", server.Addr, max/time.Millisecond)</code>

<code> } else if err == errServerClosed {</code>

<code> return</code>

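Like mongos, the mgo driver excludes nodes whose latency is too high (>15 ms). The difference is that instead of returning a random qualifying node, it tries to return the node under the least load, judged by the number of connections currently in use, which helps balance load.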

Every <code>10 seconds</code>, the official go driver uses the <code>isMaster</code> command to measure its network latency to each mongod node:

<code>now := time.Now() // start timing</code>

<code>// run the isMaster command on the target node</code>

<code>isMasterCmd := &command.IsMaster{Compressors: s.cfg.compressionOpts}</code>

<code>isMaster, err := isMasterCmd.RoundTrip(ctx, conn)</code>

<code>delay := time.Since(now) // elapsed time</code>

<code>desc = description.NewServer(s.address, isMaster).SetAverageRTT(s.updateAverageRTT(delay)) // smooth the measurement</code>

After each sample is collected, it is combined with the historical data into a smoothed average.

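Taking the Find command as an example, the go driver builds a <code>composite selector</code>, which applies each selection algorithm in turn to produce a list of candidate nodes.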
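The idea of a composite selector, a chain of filters where each narrows the output of the previous one, can be sketched as follows. The types and function names here (`node`, `selector`, `composite`, `secondaryOnly`, `withinLatency`) are hypothetical simplifications, not the driver's API.

```go
package main

import "fmt"

// node is a hypothetical, simplified server description.
type node struct {
	addr      string
	secondary bool
	rttMs     int
}

// selector narrows a candidate list.
type selector func(candidates []node) []node

// composite runs the selectors in order, feeding each one the survivors
// of the previous one.
func composite(selectors ...selector) selector {
	return func(candidates []node) []node {
		for _, sel := range selectors {
			candidates = sel(candidates)
		}
		return candidates
	}
}

// secondaryOnly stands in for a read-preference filter.
func secondaryOnly(candidates []node) []node {
	var out []node
	for _, n := range candidates {
		if n.secondary {
			out = append(out, n)
		}
	}
	return out
}

// withinLatency keeps nodes no more than thresholdMs slower than the fastest.
func withinLatency(thresholdMs int) selector {
	return func(candidates []node) []node {
		if len(candidates) == 0 {
			return nil
		}
		fastest := candidates[0].rttMs
		for _, n := range candidates[1:] {
			if n.rttMs < fastest {
				fastest = n.rttMs
			}
		}
		var out []node
		for _, n := range candidates {
			if n.rttMs <= fastest+thresholdMs {
				out = append(out, n)
			}
		}
		return out
	}
}

func main() {
	nodes := []node{
		{"sh-primary", false, 1},
		{"sh-secondary", true, 2},
		{"bj-secondary", true, 10},
		{"sz-secondary", true, 30},
	}
	pick := composite(secondaryOnly, withinLatency(15))
	for _, n := range pick(nodes) {
		fmt.Println(n.addr, n.rttMs)
	}
}
```

Because the latency filter runs after the read-preference filter, the minimum latency is computed only over nodes that are already eligible, so an ineligible but very close node does not shrink the window.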

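Latency-based selection relies mainly on <code>LatencySelector</code>. The rough flow is: find the minimum latency min across all nodes --> compute the acceptable latency bound min + 15 ms (default) --> return every node that falls within the bound. The core code is as follows: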

<code>func (ls *latencySelector) SelectServer(t Topology, candidates []Server) ([]Server, error) {</code>

<code> if ls.latency < 0 {</code>

<code> return candidates, nil</code>

<code> switch len(candidates) {</code>

<code> default:</code>

<code> min := time.Duration(math.MaxInt64)</code>

<code> for _, candidate := range candidates {</code>

<code> if candidate.AverageRTTSet { // find the minimum latency among all candidates</code>

<code> if candidate.AverageRTT < min {</code>

<code> min = candidate.AverageRTT</code>

<code> if min == math.MaxInt64 {</code>

<code> return candidates, nil</code>

<code> // minimum latency plus the configured threshold (15 ms by default) is the maximum tolerated latency</code>

<code> max := min + ls.latency</code>

<code> var result []Server</code>

<code> if candidate.AverageRTTSet {</code>

<code> if candidate.AverageRTT <= max {</code>

<code> // keep every node that meets the latency bound (maximum tolerated latency)</code>

<code> result = append(result, candidate)</code>

<code> return result, nil</code>

Finally, a healthy node is picked at random from the candidate list as the target node. The core code is as follows:

<code> // build the candidate list with the "composite selector" described above</code>

<code> suitable, err := t.selectServer(ctx, sub.C, ss, ssTimeoutCh)</code>

<code> return nil, err</code>

<code> // pick one at random as the target node</code>

<code> selected := suitable[rand.Intn(len(suitable))]</code>

<code> selectedS, err := t.FindServer(selected)</code>

<code> switch {</code>

<code> case selectedS != nil:</code>

<code> return selectedS, nil</code>

<code> // We don't have an actual server for the provided description.</code>

<code> // This could happen for a number of reasons, including that the</code>

<code> // server has since stopped being a part of this topology, or that</code>

<code> // the server selector returned no suitable servers.</code>

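The official go driver also provides an interface for changing the 15 ms default described above. Latency-sensitive businesses can use it to configure ClientOptions with a lower threshold.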
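In the official driver the threshold is set on the client options (the `SetLocalThreshold` method), and the same setting can be carried in the connection string as `localThresholdMS` together with `readPreference=nearest`. To stay free of external dependencies, the sketch below only builds such a connection string; the `buildURI` helper and the host names are hypothetical, and the 5 ms value is just an example.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// buildURI assembles a MongoDB connection string that asks for nearest reads
// and lowers the latency window from the 15 ms default.
func buildURI(hosts []string, localThresholdMs int) string {
	q := url.Values{}
	q.Set("readPreference", "nearest")
	q.Set("localThresholdMS", strconv.Itoa(localThresholdMs))
	// Values.Encode sorts keys, so the output is deterministic.
	return fmt.Sprintf("mongodb://%s/?%s", strings.Join(hosts, ","), q.Encode())
}

func main() {
	// Hypothetical addresses of two nodes in the Shanghai datacenter.
	uri := buildURI([]string{"mongos-sh-1:27017", "mongos-sh-2:27017"}, 5)
	fmt.Println(uri)
}
```

A lower threshold keeps reads on the closest nodes more strictly, at the cost of fewer candidates to spread load across, so the value is worth choosing against the measured latency inside and between datacenters.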

4. Summary

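MongoDB's nearest mode supports nearest-node reads in multi-datacenter deployments, both from the client driver to mongod and from mongos to mongod. This article analyzed how nearest works, based on the Tencent Cloud MongoDB kernel code and commonly used go driver code, and offered some usage advice.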
