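2. Problem analysis

This question comes up often. It involves migrating or synchronizing index data across versions, across networks, and across clusters. Let's break it down:

2.1 Cross-version

7.X is the current mainstream version, while older business systems are still on 6.X, 5.X, or even 2.X and 1.X.

When synchronizing data, the question to ask is: how does 7.X differ from earlier versions?

7.X has already gone through more than 12 minor releases, from 7.0 to 7.12, and 7.0 was released on 2019-04-10, more than two years ago.

The core point to watch when synchronizing is how mapping types are handled.

The official documentation puts it most convincingly: “Before 7.0.0, the mapping definition included a type name. Elasticsearch 7.0.0 and later no longer accept a default mapping. ”

6.X: the type concept still exists, and you can define your own type name.

7.X: the type is always _doc.

A hands-on example: writing a document with an explicit type in 7.X: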

PUT test-002/mytype/1
{
  "title": "testing"
}

The response carries the following warning:

#! [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
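When documents are copied from a 6.X or earlier cluster into 7.X, the custom type has to be dropped from each write request. The following is a minimal Python sketch of that step, not an official migration routine. It assumes each source document arrives as a search hit with _index, _type, _id, and _source keys; the function name to_typeless_bulk_lines and the sample hit are hypothetical.

```python
import json


def to_typeless_bulk_lines(hit, dest_index):
    """Convert one search hit into the two NDJSON lines of a typeless bulk index action.

    The custom _type carried by the hit is dropped on purpose,
    because 7.X expects typeless requests.
    """
    action = {"index": {"_index": dest_index, "_id": hit["_id"]}}
    return [json.dumps(action), json.dumps(hit["_source"])]


# Hypothetical hit, shaped like a 6.X search result with a custom type.
old_hit = {
    "_index": "test-002",
    "_type": "mytype",
    "_id": "1",
    "_source": {"title": "testing"},
}

lines = to_typeless_bulk_lines(old_hit, "test-002")
print("\n".join(lines))
```

The two printed lines form one action of a _bulk request body; the type name never reaches the 7.X cluster, so the deprecation warning is not triggered.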

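2.2 Cross-network

The two clusters are not on the same LAN: one is hosted in the cloud and the other is on-premises.

This is a common business scenario; I have done it myself.

2.3 Cross-cluster

The source data and the destination data live in two different clusters.

3. Comparison of synchronization options

We walk through the following options hands-on and explain each as we go.

3.0 Preparing the test environment

To keep the demonstration simple, the environment is simplified. The same principles apply to more complex environments.

Cluster 1: cloud, single-node source cluster: 172.21.0.14:19022.

Cluster 2: cloud, single-node destination cluster: 172.21.0.14:19205.

Both clusters share one cloud server with 4 CPU cores and 8 GB of memory.

Both run the same version, 7.12.0.

Test data: 1,000,000 documents (generated by a script).

A single record looks like this: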

"_source" : {
  "name" : "9UCROh3",
  "age" : 16,
  "last_updated" : 1621579460000
}

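3.1 Option 1: cross-cluster synchronization with reindex

3.1.1 reindex prerequisite: configure the whitelist

On the destination cluster, whitelist the source cluster. This setting can only be configured in elasticsearch.yml.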

reindex.remote.whitelist: "172.21.0.14:19022"

Note: do not run the following test from Kibana Dev Tools unless you have already raised its default timeout.

3.1.2 reindex synchronization in practice

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://172.21.0.14:19022"
    },
    "index": "test_data",
    "size": 10000,
    "slice": {
      "id": 0,
      "max": 5
    }
  },
  "dest": {
    "index": "test_data_from_reindex"
  }
}

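The two key parameters are:

size: the scroll batch size, 1000 by default; here it is set 10 times larger, to 10000.

slice: splits one large request into smaller requests that run concurrently. (PS: my usage here is not rigorous; the official documentation states that reindex from a remote cluster does not support manual or automatic slicing.)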
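As a rough illustration of how the request body is put together, here is a minimal Python sketch that builds the reindex-from-remote body with the size parameter described above. The function name build_remote_reindex_body is hypothetical, and the slice field is left out of this sketch on purpose, given the caveat about its usage here.

```python
import json


def build_remote_reindex_body(remote_host, source_index, dest_index, batch_size=1000):
    """Build the JSON body for POST _reindex that pulls from a remote cluster."""
    return {
        "source": {
            "remote": {"host": remote_host},
            "index": source_index,
            "size": batch_size,  # scroll batch size; 1000 is the default noted above
        },
        "dest": {"index": dest_index},
    }


body = build_remote_reindex_body(
    "http://172.21.0.14:19022", "test_data", "test_data_from_reindex", batch_size=10000
)
print(json.dumps(body, indent=2))
```

The printed body can be sent to the _reindex endpoint of the destination cluster with curl or any HTTP client that has a long enough timeout.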

3.1.3 reindex result

In the scripted test, reindex synchronized 1,000,000 documents in 34 s.

3.2 Option 2: synchronization with elasticdump

https://github.com/elasticsearch-dump/elasticsearch-dump

3.2.1 elasticdump installation notes

elasticdump depends on Node.js, version 8.0 or later.

[root@VM-0-14-centos test]# node -v

v12.13.1

[root@VM-0-14-centos test]# npm -v

6.12.1

A successful installation looks like this:

[root@VM-0-14-centos test]# elasticdump --help

elasticdump: Import and export tools for elasticsearch

version: 6.71.0

Usage: elasticdump --input SOURCE --output DESTINATION [OPTIONS]

... ...

3.2.2 elasticdump synchronization in practice

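elasticdump \
  --input=http://172.21.0.14:19022/test_data \
  --output=http://172.21.0.14:19205/test_data_from_dump \
  --type=analyzer
elasticdump \
  --input=http://172.21.0.14:19022/test_data \
  --output=http://172.21.0.14:19205/test_data_from_dump \
  --type=mapping
elasticdump \
  --input=http://172.21.0.14:19022/test_data \
  --output=http://172.21.0.14:19205/test_data_from_dump \
  --type=data \
  --concurrency=5 \
  --limit=10000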

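The parameters above are largely self-explanatory:

input: the source cluster index.

output: the destination cluster index.

analyzer: synchronize the analyzers.

mapping: synchronize the mapping (schema).

data: synchronize the documents.

concurrency: the number of concurrent requests.

limit: the number of documents synchronized per request; the default is 100.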
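The parameters can also be assembled programmatically. The sketch below, in Python, builds one elasticdump argument list per stage, assuming each --type value is run as its own invocation. The function name build_elasticdump_commands is hypothetical; only flags that appear in this article are used.

```python
import shlex


def build_elasticdump_commands(source_url, dest_url, limit=100, concurrency=1):
    """Return one elasticdump argument list per stage: analyzer, mapping, data."""
    commands = []
    for dump_type in ("analyzer", "mapping", "data"):
        cmd = [
            "elasticdump",
            f"--input={source_url}",
            f"--output={dest_url}",
            f"--type={dump_type}",
        ]
        if dump_type == "data":
            # In this sketch, batch size and concurrency are applied to the data stage only.
            cmd += [f"--concurrency={concurrency}", f"--limit={limit}"]
        commands.append(cmd)
    return commands


commands = build_elasticdump_commands(
    "http://172.21.0.14:19022/test_data",
    "http://172.21.0.14:19205/test_data_from_dump",
    limit=10000,
    concurrency=5,
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list can be passed to subprocess.run in order, so that analyzers and mappings exist on the destination before documents are written.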

3.2.3 elasticdump result

elasticdump synchronized 1,000,000 documents in 106 s.

3.3 Option 3: synchronization with the ESM tool

ESM is an open-source tool by medcl, derived from Elasticsearch Dumper and written in Go.

Repository:

https://github.com/medcl/esm

3.3.1 ESM installation notes

Requires Go version >= 1.7.

3.3.2 ESM synchronization in practice

esm -s http://172.21.0.14:19022 -d http://172.21.0.14:19205 -x test_data -y test_data_from_esm -w=5 -b=10 -c 10000

w: the number of concurrent workers.

b: the bulk size, in MB.

c: the scroll batch size.

3.3.3 ESM result

1,000,000 documents were synchronized in 38 s, which is very fast.

test_data

[05-19 13:44:58] [INF] [main.go:474,main] start data migration..

Scroll 1000000 / 1000000 [================================================================================================================] 100.00% 38s

Bulk 999989 / 1000000 [===================================================================================================================] 100.00% 38s

[05-19 13:45:36] [INF] [main.go:505,main] data migration finished.

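During synchronization the CPU was saturated, which shows that the concurrency parameter took effect.

3.4 Option 4: synchronization with logstash

3.4.1 logstash notes

This article uses logstash 7.12.0. The relevant plugins, logstash-input-elasticsearch and logstash-output-elasticsearch, are bundled with it and do not need to be installed separately.

Note: the input and output names in the configuration are the plugin names and must be lowercase. Many blog posts get this wrong, so verify it by running the configuration yourself.

3.4.2 logstash synchronization in practice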

input {
  elasticsearch {
    hosts => ["172.21.0.14:19022"]
    index => "test_data"
    size => 10000
    scroll => "5m"
    codec => "json"
    docinfo => true
  }
}
filter {
}
output {
  elasticsearch {
    hosts => ["172.21.0.14:19205"]
    index => "test_data_from_logstash"
  }
}

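3.4.3 logstash result

1,000,000 documents were synchronized in 74 s.

3.5 Option 5: snapshot and restore

3.5.1 Snapshot and restore configuration notes

Configure the snapshot storage path in elasticsearch.yml beforehand.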

path.repo: ["/home/elasticsearch/elasticsearch-7.12.0/backup"]

For detailed configuration, see the article "Elasticsearch 7.X cluster and index backup and restore in practice".

3.5.2 Snapshot and restore in practice

# Create the snapshot on one cluster
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/home/elasticsearch/elasticsearch-7.12.0/backup"
  }
}

PUT /_snapshot/my_backup/snapshot_testdata_index?wait_for_completion=true
{
  "indices": "test_data_from_dump",
  "ignore_unavailable": true,
  "include_global_state": false,
  "metadata": {
    "taken_by": "mingyi",
    "taken_because": "backup before upgrading"
  }
}

# Restore the snapshot on the other cluster

curl -XPOST "http://172.21.0.14:19022/_snapshot/my_backup/snapshot_testdata_index/_restore"

3.5.3 Snapshot and restore result

Snapshot creation time: 2 s.

Restore time: under 1 s.

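4. Summary

This article presented 5 options for synchronizing Elasticsearch data across networks and across clusters (simulated here), and verified each of them in a hands-on environment.

The conclusions are not absolute and are for reference only.

In essence, every synchronization tool is a combination of scroll + bulk + multithreading.

Where they differ is the implementation language and the way concurrency is handled:

reindex is written in Java

esm is written in Go

logstash is written in Ruby + Java

elasticdump is written in JavaScript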
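The scroll + bulk + multithreading pattern can be sketched in a few lines of Python. This is an illustration of the general idea only, not how any of the tools above is actually implemented. The two clusters are replaced by in-memory lists, and the names scroll_batches, copy_index, and fake_bulk_write are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def scroll_batches(docs, batch_size):
    """Stand-in for the scroll API: yield the source documents in fixed-size batches."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


def copy_index(source_docs, bulk_write, batch_size=1000, workers=5):
    """Read batches one after another and hand each to a pool of bulk writers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the per-batch results in submission order.
        written = list(pool.map(bulk_write, scroll_batches(source_docs, batch_size)))
    return sum(written)


# In-memory stand-ins for the two clusters (hypothetical data, no network calls).
source = [{"name": f"user{i}", "age": i % 90} for i in range(25_000)]
destination = []
destination_lock = threading.Lock()


def fake_bulk_write(batch):
    """Stand-in for a bulk request: store the batch and report how many docs it held."""
    with destination_lock:
        destination.extend(batch)
    return len(batch)


total = copy_index(source, fake_bulk_write, batch_size=10_000, workers=5)
print(total)
```

In a real tool the batch size corresponds to the scroll size and the worker count to the concurrency setting, which is why both show up as tuning parameters in every option above.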

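Snapshots involve copying files to another site, and their speed is limited by network bandwidth, so they are not included in the comparison.

Which option should you choose? After reading this article you should have a clear idea.

The reindex option requires configuring the whitelist; snapshot and restore requires configuring a snapshot repository and transferring files.

esm, logstash, and elasticdump need no special configuration.

Elapsed time depends on cluster size, the hardware of each node, the data types, write optimizations, and so on.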
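For easier comparison, the elapsed times measured in this article can be converted to throughput. The short Python calculation below uses only the four timings reported above for 1,000,000 documents; the resulting figures apply to this single-server test environment and should not be read as general benchmarks.

```python
# Elapsed seconds reported in this article, each for 1,000,000 documents.
elapsed_seconds = {"reindex": 34, "elasticdump": 106, "esm": 38, "logstash": 74}
total_docs = 1_000_000

# Documents per second, rounded to the nearest integer.
throughput = {tool: round(total_docs / s) for tool, s in elapsed_seconds.items()}

for tool, docs_per_second in sorted(throughput.items(), key=lambda kv: -kv[1]):
    print(f"{tool:12s} {docs_per_second:>6d} docs/s")
```

In this environment reindex and esm land close together at roughly 26,000 to 29,000 documents per second, with logstash and elasticdump behind them.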

How do you synchronize data in your own projects? Feel free to leave a comment.
