SpringBoot: Elasticsearch Routing (_routing) Mechanism

Preface

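When we create an index, we set three things: the index name, the number of shards, and the number of replicas.

The index name is comparable to a database name, the number of shards is the number of partitions the data is split across, and the number of replicas is the number of backup copies kept of the data.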

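When we store a document in an index, it ends up in one particular shard. How does Elasticsearch know which shard a document belongs in?

It certainly cannot be random, otherwise we would have no idea where to look when we later want to fetch the document. In fact, the shard is determined by the following formula: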

shard_num = hash(_routing) % num_primary_shards

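Here _routing is a variable value: it defaults to the document's _id, but it can also be set to a custom value. The _routing value is passed through a hash function to produce a number, and that number is divided by num_primary_shards (the number of primary shards) to obtain a remainder. The remainder, which always lies between 0 and num_primary_shards - 1, is the number of the shard that holds the document. This explains why the number of primary shards has to be decided when the index is created and is not changed afterwards: if it changed, every previously computed routing result would become invalid and the existing documents could no longer be found.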
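The formula can be tried out with a minimal Python sketch. It assumes zlib.crc32 as a stand-in hash, which is not the hash function Elasticsearch uses internally, and the helper name shard_for is made up for illustration. It computes the shard for a few ids under two different shard counts and lists the ids whose shard number would differ.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard: hash(_routing) % num_primary_shards.

    zlib.crc32 is only a stand-in hash for this sketch (an assumption);
    it is not the hash function Elasticsearch uses internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


doc_ids = ["a", "b", "c", "d", "e", "f"]

# Without custom routing, the routing value is the document _id.
placement_2 = {doc_id: shard_for(doc_id, 2) for doc_id in doc_ids}
# The same ids evaluated against a different number of primary shards.
placement_3 = {doc_id: shard_for(doc_id, 3) for doc_id in doc_ids}

# Ids whose computed shard differs between the two shard counts.
moved = [doc_id for doc_id in doc_ids if placement_2[doc_id] != placement_3[doc_id]]

print("2 shards:", placement_2)
print("3 shards:", placement_3)
print("ids that would be looked up on a different shard:", moved)
```

Any id that shows up in moved would be searched for on a shard that never stored it, which is the reason the shard count has to stay fixed.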

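Scenario

Suppose you have an index with 100 shards. What happens when a search request is executed on the cluster?

1. The search request is sent to one node.

2. The node that receives the request broadcasts the query to every shard of the index (either the primary or a replica copy of each shard).

3. Each shard executes the search and returns its results.

4. The results are merged and sorted on the node that received the request (the coordinating node) and returned to the user.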

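By default Elasticsearch routes on the document ID (comparable to an auto-increment ID in a relational database). With a large volume of inserts, the documents are spread roughly evenly across all shards, so for a search Elasticsearch cannot tell in advance which shards hold the matching documents.

It therefore has to broadcast the request to all N shards, which puts load on the cluster and increases network overhead.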
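The four steps above can be mimicked with a small in-memory model. This is only a sketch: the shards are plain Python lists, the document fields (user, score) are invented for the example, and round-robin placement stands in for the even spread that hashing the _id produces.

```python
NUM_SHARDS = 100

# Spread 1000 toy documents evenly over the shards (round-robin is an
# assumption standing in for the even distribution of hashed ids).
shards = [[] for _ in range(NUM_SHARDS)]
for i in range(1000):
    shards[i % NUM_SHARDS].append({"id": i, "user": f"u{i % 10}", "score": float(i % 7)})


def search_all_shards(predicate, size):
    """Broadcast a query to every shard, then merge and sort the partial results."""
    shards_visited = 0
    partial_hits = []
    for shard in shards:  # step 2: the query goes to every shard
        shards_visited += 1
        # step 3: each shard runs the query and returns its own hits
        partial_hits.extend(doc for doc in shard if predicate(doc))
    # step 4: merge and sort on the receiving node, then cut to the page size
    merged = sorted(partial_hits, key=lambda doc: doc["score"], reverse=True)
    return merged[:size], shards_visited


top_hits, shards_visited = search_all_shards(lambda doc: doc["user"] == "u7", size=5)
print("shards visited:", shards_visited)
print("top hits:", top_hits)
```

Even though only a fraction of the documents match, every one of the 100 shards has to be visited.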

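What routing is for

In Elasticsearch, custom routing counts as an advanced feature; most of the time we simply rely on the default routing. The purpose of routing is to store documents of the same kind in the same shard, so that a query can go straight to that shard to fetch the data.

Using the example above:

The problem in that scenario is obvious: because the data is scattered over many shards, queries become more expensive. The optimization is equally clear: store documents of the same kind in one shard, and at query time search only the shard that holds that kind of data.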
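A rough sketch of that optimization, again as an in-memory model rather than real Elasticsearch calls: documents are indexed with the user name as the routing value, and a search that passes the same routing value touches a single shard. The hash (zlib.crc32), the helper names and the user field are assumptions made for the example.

```python
import zlib

NUM_SHARDS = 100
shards = [[] for _ in range(NUM_SHARDS)]


def shard_for(routing: str) -> int:
    # Stand-in hash (assumption); Elasticsearch uses its own hash internally.
    return zlib.crc32(routing.encode("utf-8")) % NUM_SHARDS


def index_doc(doc: dict, routing: str) -> None:
    """Store the document on the shard selected by the routing value."""
    shards[shard_for(routing)].append(doc)


def search(predicate, routing=None):
    """Search all shards, or only the routed shard when routing is given."""
    targets = range(NUM_SHARDS) if routing is None else [shard_for(routing)]
    hits = [doc for s in targets for doc in shards[s] if predicate(doc)]
    return hits, len(targets)


# Route every document by its user, so one user's documents share a shard.
for i in range(1000):
    user = f"user{i % 10}"
    index_doc({"id": i, "user": user}, routing=user)

hits_all, visited_all = search(lambda doc: doc["user"] == "user7")
hits_routed, visited_routed = search(lambda doc: doc["user"] == "user7", routing="user7")
print("without routing:", len(hits_all), "hits from", visited_all, "shards")
print("with routing:   ", len(hits_routed), "hits from", visited_routed, "shard")
```

Both searches find the same documents; the routed one simply skips the shards that cannot contain them.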

Hands-on demo

Create the index

# Create an index named route_test with 2 shards and 0 replicas
PUT route_test/
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}

Check the shards

# List the shards: docs is 0 for both, so neither shard holds any data yet
GET _cat/shards/route_test?v
index      shard prirep state   docs store ip         node
route_test 1     p      STARTED    0  230b 172.19.0.2 es7_02
route_test 0     p      STARTED    0  230b 172.19.0.5 es7_01

Adding data without routing

Insert the first document, A

# Insert document 1
PUT route_test/_doc/a?refresh
{
  "data": "A"
}

Check the shards

# List the shards: docs is now 1 for shard 0
GET _cat/shards/route_test?v
index      shard prirep state   docs store ip         node
route_test 1     p      STARTED    0  230b 172.19.0.2 es7_02
route_test 0     p      STARTED    1 3.3kb 172.19.0.5 es7_01

Insert the second document, B

# Insert document 2
PUT route_test/_doc/b?refresh
{
  "data": "B"
}

Check the shards

# List the shards: shard 1 has gained one document as well
GET _cat/shards/route_test?v
index      shard prirep state   docs store ip         node
route_test 1     p      STARTED    1 3.3kb 172.19.0.2 es7_02
route_test 0     p      STARTED    1 3.3kb 172.19.0.5 es7_01

Adding data with routing

Insert the third document, C

# Insert document 3 with the routing parameter set to key1 (a custom value)
PUT route_test/_doc/c?routing=key1&refresh
{
  "data": "C"
}

Check the shards

# List the shards: the docs count shows that ES stored this document in shard 0
GET _cat/shards/route_test?v
index      shard prirep state   docs store ip         node
route_test 1     p      STARTED    1 3.4kb 172.19.0.2 es7_02
route_test 0     p      STARTED    2 6.9kb 172.19.0.5 es7_01

Search the whole index

# Search the index: the document with _id=c carries the extra field "_routing" : "key1"
GET route_test/_search
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "a",
        "_score" : 1.0,
        "_source" : {
          "data" : "A"
        }
      },
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "c",
        "_score" : 1.0,
        "_routing" : "key1",
        "_source" : {
          "data" : "C"
        }
      },
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "b",
        "_score" : 1.0,
        "_source" : {
          "data" : "B"
        }
      }
    ]
  }
}

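The duplicate data problem

Routing is meant to put documents of the same kind in the same shard. So what happens if we modify documents A and B and give them a routing value?

Modify document A

# Index a document with _id=a and routing=key1 (indexing with an existing _id overwrites the old document by default, i.e. it is an update)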
PUT route_test/_doc/a?routing=key1&refresh
{
  "data": "A with routing key1"
}

## ES response:
{
  "_index" : "route_test",
  "_type" : "_doc",
  "_id" : "a",
  "_version" : 2,
  "result" : "updated",        # note "updated": an update was performed
  "forced_refresh" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 2,
  "_primary_term" : 1
}

Check the shards

# List the shards: document A is still in shard 0, the same shard as document C; the docs counts are unchanged
GET _cat/shards/route_test?v
index      shard prirep state   docs  store ip         node
route_test 1     p      STARTED    1  3.4kb 172.19.0.2 es7_02
route_test 0     p      STARTED    2 10.5kb 172.19.0.5 es7_01

Search the whole index

# Search the index: document A now has "_routing" : "key1"
GET route_test/_search
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "c",
        "_score" : 1.0,
        "_routing" : "key1",
        "_source" : {
          "data" : "C"
        }
      },
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "a",
        "_score" : 1.0,
        "_routing" : "key1",
        "_source" : {
          "data" : "A with routing key1"
        }
      },
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "b",
        "_score" : 1.0,
        "_source" : {
          "data" : "B"
        }
      }
    ]
  }
}

Modify document B

# Index a document with _id=b and routing=key1
PUT route_test/_doc/b?routing=key1&refresh
{
  "data": "B with routing key1"
}

## ES response:
{
  "_index" : "route_test",
  "_type" : "_doc",
  "_id" : "b",
  "_version" : 1,
  "result" : "created",        # note: "created", not "updated"; this document was added as a new one!
  "forced_refresh" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}

Check the shards

# List the shards: this time something differs, shard 0 has gained one document
GET _cat/shards/route_test?v
index      shard prirep state   docs store ip         node
route_test 1     p      STARTED    1 3.4kb 172.19.0.2 es7_02
route_test 0     p      STARTED    3  11kb 172.19.0.5 es7_01

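Search the whole index

# Search the index: there are now two documents with _id=b, one with a routing value and one without
GET route_test/_search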
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "c",
        "_score" : 1.0,
        "_routing" : "key1",
        "_source" : {
          "data" : "C"
        }
      },
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "a",
        "_score" : 1.0,
        "_routing" : "key1",
        "_source" : {
          "data" : "A with routing key1"
        }
      },
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "b",
        "_score" : 1.0,
        "_routing" : "key1",        # unlike the other _id=b document below, this one carries a routing value
        "_source" : {
          "data" : "B with routing key1"
        }
      },
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "b",
        "_score" : 1.0,
        "_source" : {
          "data" : "B"
        }
      }
    ]
  }
}

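Let's analyse this. When we indexed document A, ES returned updated: the old document was overwritten, because the _id a already routed to shard 0, the same shard that key1 routes to. When we indexed document B, ES returned created: a new document was added and the old one was left untouched. The search afterwards confirms it: there are two documents with _id=b, one with a routing value and one without. From this we can conclude that the one with routing is in shard 0 and the one without is in shard 1.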
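The same sequence of requests can be replayed against a toy model in which each shard is a dictionary keyed by _id. This is a sketch under stated assumptions: toy_hash is a made-up hash chosen only so that the placement mirrors the walkthrough (a and key1 land on shard 0, b on shard 1), and put and get are hypothetical helpers, not Elasticsearch APIs.

```python
NUM_SHARDS = 2
shards = [{} for _ in range(NUM_SHARDS)]  # each shard maps _id -> stored document


def toy_hash(value: str) -> int:
    # Made-up stand-in hash (assumption), picked so that "a" and "key1"
    # map to shard 0 and "b" maps to shard 1, as in the walkthrough.
    return sum(value.encode("utf-8")) + len(value)


def shard_for(doc_id: str, routing=None) -> int:
    """Route on the custom routing value if given, otherwise on the _id."""
    key = routing if routing is not None else doc_id
    return toy_hash(key) % NUM_SHARDS


def put(doc_id: str, data: str, routing=None) -> str:
    """Index a document; uniqueness of _id is only checked inside one shard."""
    shard = shards[shard_for(doc_id, routing)]
    result = "updated" if doc_id in shard else "created"
    shard[doc_id] = {"data": data, "_routing": routing}
    return result


def get(doc_id: str, routing=None):
    """Look the document up on the shard the request routes to."""
    return shards[shard_for(doc_id, routing)].get(doc_id)


put("a", "A")
put("b", "B")
put("c", "C", routing="key1")
result_a = put("a", "A with routing key1", routing="key1")
result_b = put("b", "B with routing key1", routing="key1")

copies_of_b = [shard["b"]["data"] for shard in shards if "b" in shard]
print(result_a, result_b)
print(copies_of_b)
print(get("b"), get("b", routing="key1"))
```

Because each shard only checks its own entries, the second write of b lands on a shard that has never seen that _id and is stored as a new document.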

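This leads to a problem: _id is no longer globally unique

An ES shard is really a Lucene index, so every shard is a fully functional inverted index in its own right. ES can guarantee that the doc id (_id) is globally unique because it routes on the doc id by default: the same doc id is always routed to the same shard, and a repeated doc id results in either an update or an exception, so a doc id identifies exactly one document in the cluster. Once we set a custom routing value, that guarantee is gone; if we still need globally unique doc ids, we have to enforce them ourselves with strict constraints. Because the doc id is no longer globally unique, create, read, update and delete operations can misbehave, as the following lookups show:

Get the document with doc id b

# Get the document with _id=b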
GET route_test/_doc/b

## ES response:
{
  "_index" : "route_test",
  "_type" : "_doc",
  "_id" : "b",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "data" : "B"
  }
}



# Get _id=b again, this time with routing=key1
GET route_test/_doc/b?routing=key1

## ES response:
{
  "_index" : "route_test",
  "_type" : "_doc",
  "_id" : "b",
  "_version" : 1,
  "_seq_no" : 3,
  "_primary_term" : 1,
  "_routing" : "key1",
  "found" : true,
  "_source" : {
    "data" : "B with routing key1"  # the two lookups return completely different data
  }
}

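The two lookups return different data. So if you use custom routing, the create, read, update and delete APIs should generally all be called with the routing parameter to keep the results consistent.

Note that "generally" here concerns queries: not every query API has to carry routing. A _search without routing is simply sent to all shards.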

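When creating an index, ES offers a mapping option that enforces the routing parameter on document create, read, update and delete requests: if it is missing, the request fails with an error. (Whether to enable it depends on your own requirements.)

# _routing.required: true enforces the check, false disables it; the default is false
PUT <index_name>/
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  },
  "mappings": {
    "_routing": {
      "required": true
    }
  }
}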

Searching with routing

Search the data for routing key1

# To search several routing values, list them all: ?routing=key1,key2,key3
GET route_test/_search?routing=key1

# Search result
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "c",
        "_score" : 1.0,
        "_routing" : "key1",
        "_source" : {
          "data" : "C"
        }
      },
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "a",
        "_score" : 1.0,
        "_routing" : "key1",
        "_source" : {
          "data" : "A with routing key1"
        }
      },
      {
        "_index" : "route_test",
        "_type" : "_doc",
        "_id" : "b",
        "_score" : 1.0,
        "_routing" : "key1",        
        "_source" : {
          "data" : "B with routing key1"
        }
      }
    ]
  }
}

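Mitigating the load imbalance caused by routing

Another drawback of custom routing is that it easily leads to unbalanced load.

For this reason ES provides a way to route documents to a group of shards rather than to a single shard.

Set index.routing_partition_size when the index is created (it can only be set at creation time). The default is 1, meaning each routing value maps to exactly one shard; setting it to a value greater than 1 and less than the index's number of shards routes each value to a group of shards instead.

The larger the value, the more evenly the data is distributed. The setting applies per index, but it can be put into an index template so that it takes effect for multiple indices. Once it is set, the shard is computed as: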

shard_num = (hash(_routing) + hash(_id) % routing_partition_size) % num_primary_shards

For a given routing value, hash(_routing) is fixed, while hash(_id) % routing_partition_size can take routing_partition_size different values. Combined, different documents that share one routing value can land on routing_partition_size possible shard numbers, i.e. a set of shards.
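A short sketch of the partitioned formula, using zlib.crc32 as a stand-in for the real hash (an assumption) and made-up document ids. It collects the shards that 1000 documents with the same routing value land on, once with a partition size of 1 and once with 3.

```python
import zlib


def stand_in_hash(value: str) -> int:
    # Stand-in hash for illustration (assumption), not the one ES uses.
    return zlib.crc32(value.encode("utf-8"))


def shard_for(routing: str, doc_id: str, routing_partition_size: int,
              num_primary_shards: int) -> int:
    """(hash(_routing) + hash(_id) % routing_partition_size) % num_primary_shards"""
    offset = stand_in_hash(doc_id) % routing_partition_size
    return (stand_in_hash(routing) + offset) % num_primary_shards


NUM_SHARDS = 10
doc_ids = [f"doc-{i}" for i in range(1000)]  # hypothetical ids

# Partition size 1: every document with routing key1 goes to one shard.
single = {shard_for("key1", d, 1, NUM_SHARDS) for d in doc_ids}
# Partition size 3: the same routing value fans out over a group of shards.
group = {shard_for("key1", d, 3, NUM_SHARDS) for d in doc_ids}

# The group can only contain the base shard and the next two (wrapping around).
base = stand_in_hash("key1") % NUM_SHARDS
allowed = {(base + k) % NUM_SHARDS for k in range(3)}

print("partition size 1:", sorted(single))
print("partition size 3:", sorted(group), "allowed:", sorted(allowed))
```

A search with routing=key1 would then have to visit that small group of shards instead of one, which is still far fewer than all of them.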

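Note that setting index.routing_partition_size comes with two restrictions:

1. The index mapping can no longer define a join field, because join requires related documents to be routed to the same shard, which cannot be guaranteed once a set of shards is used.

2. _routing.required must be set to true in the index mapping.

When testing the second point, I found that leaving the mapping out also works, in which case _routing.required actually defaults to false. But if _routing is written out explicitly, required must be set to true, otherwise index creation fails.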

# Without an explicit mapping, the index is created successfully
PUT route_test_3/
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0,
    "routing_partition_size": 2
  }
}

# Lookups work without routing too, and so do index, update and delete requests
GET route_test_3/_doc/a

# If the mappings section is given explicitly with required set to false, index creation fails; it must be changed to true
PUT route_test_4/
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0,
    "routing_partition_size": 2
  },
  "mappings": {
    "_routing": {
      "required": false
    }
  }
}
