Reindex API: Elastic Stack Practical Handbook


Author: 楊松柏

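What is Reindex

Reindex copies documents from a source index to a destination index.

During a reindex you can enrich or trim the data and change fields. Reindex can be thought of simply as Scroll + Bulk insert. Both source and dest can be an existing index, index alias, or data stream.

When using Reindex, keep the following points in mind:

The source and destination must differ; for example, you cannot reindex a data stream into itself.
The _source field must be enabled for the documents in the source index.
Reindex does not copy the source's settings or the templates that match the source, so set up the destination index before calling _reindex (this is mandatory when action.auto_create_index is false or -.*).
It is recommended to configure the destination index's mapping, number of primary shards, number of replicas, and so on in advance.

Main use cases for Reindex:

Cluster upgrade: remotely reindex data from the old cluster into the new cluster
Index backup
Data restructuring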

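Prerequisites

If the Elasticsearch cluster has security and privilege policies configured, running a reindex requires the following privileges:

The read index privilege for the source data stream, index, or index alias.
The write index privilege for the destination data stream, index, or index alias.
To create a data stream or index automatically with the Reindex API, the auto_configure, create_index, or manage index privilege for the destination data stream, index, or index alias.
If the source is a remote cluster, the source.remote.user user must have the monitor cluster privilege and the read privilege for the source index, index alias, or data stream.

If the reindex source is a remote cluster, the remote host must be whitelisted through reindex.remote.whitelist in the elasticsearch.yml file of the node in the current cluster that receives the request.

To create a data stream automatically, a matching index template for the data stream must be configured in advance.

For details, see Set up a data stream: https://www.elastic.co/guide/en/elasticsearch/reference/7.11/set-up-a-data-stream.html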

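API overview

RESTful API

POST /_reindex

Query parameters

refresh

Optional, enum (true, false, wait_for); defaults to false.

If true, Elasticsearch refreshes the data affected by the operation so that it is immediately searchable (an immediate refresh, at some cost to Elasticsearch performance). If wait_for, the request waits for a refresh to make the operation visible to search; the wait is governed by index.refresh_interval, which is 1s by default. If false, the request performs no refresh.

timeout

Optional, time units; defaults to 1 minute. The period each indexing operation waits for automatic index creation, dynamic mapping updates, and active shards. This guarantees that Elasticsearch waits at least the timeout before failing; the actual wait can be longer, particularly when multiple waits occur.

wait_for_active_shards

Optional, string; defaults to 1 (the operation proceeds as soon as one shard copy is active). The number of shard copies that must be active before the reindex proceeds. Set it to all or to any positive integer up to number_of_replicas+1. For example, for an index with 3 primary shards and 2 replicas, the largest allowed integer is 3, that is, the number of replicas plus 1 (the primary).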

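# The practice cluster has only three nodes, so the replica shards of the index below cannot be allocated:
# index.routing.allocation.total_shards_per_node
# allows each node to hold only one shard of this index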
PUT reindex_index-name-2
{
  "settings" :{
    "index" :{
      "number_of_shards" : "3",
      "number_of_replicas" : "2"
    },
    "index.routing.allocation.total_shards_per_node":1
  }
}
# Insert one document
PUT reindex_index-name-1/_bulk
{ "index":{ } }
{ "@timestamp": "2099-05-06T16:21:15.000Z", "message": "192.0.2.42 - - [06/May/2099:16:21:15 +0000] \"GET /images/bg.jpg HTTP/1.0\" 200 24736" }

# Reindex
POST _reindex?wait_for_active_shards=2&timeout=5s
{
  "source": {
    "index": "reindex_index-name-1"
  },
  "dest": {
    "index": "reindex_index-name-2"
  }
}

Because only the primary shards of reindex_index-name-2 were allocated successfully, the _reindex request above fails:

{
  "took" : 5002,
  "timed_out" : false,
  "total" : 1,
  "updated" : 0,
  "created" : 0,
  "deleted" : 0,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [
    {
      "index" : "reindex_index-name-2",
      "type" : "_doc",
      "id" : "U_i8VnkBYYHWy1KlBJjc",
      "cause" : {
        "type" : "unavailable_shards_exception",
        "reason" : "[reindex_index-name-2][0] Not enough active copies to meet shard count of [2] (have 1, needed 2). Timeout: [5s], request: [BulkShardRequest [[reindex_index-name-2][0]] containing [index {[reindex_index-name-2][_doc][U_i8VnkBYYHWy1KlBJjc], source[{ \"@timestamp\": \"2099-05-06T16:21:15.000Z\", \"message\": \"192.0.2.42 - - [06/May/2099:16:21:15 +0000] \\\"GET /images/bg.jpg HTTP/1.0\\\" 200 24736\" }]}]]"
      },
      "status" : 503
    }
  ]
}
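
The upper bound on wait_for_active_shards described above can be checked on the client before a request is sent. The following Python sketch is illustrative only: the function name is_valid_wait_for_active_shards is hypothetical and not part of any Elasticsearch client library.

```python
def is_valid_wait_for_active_shards(value, number_of_replicas):
    """Check a wait_for_active_shards value against the replica count."""
    if value == "all":
        return True
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    # Each shard has one primary plus number_of_replicas replica copies.
    return 1 <= value <= number_of_replicas + 1


# With 2 replicas, 3 is the largest accepted integer; 4 asks for more copies than exist.
print(is_valid_wait_for_active_shards(3, 2))
print(is_valid_wait_for_active_shards(4, 2))
```

A valid value can still fail at run time, as in the example above, when the required copies are not actually allocated.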

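wait_for_completion

Optional, Boolean; defaults to true. If true, the request blocks synchronously and returns only after the operation completes.

requests_per_second

Optional, integer; defaults to -1 (no throttling). The throttle for this request, in sub-requests per second.

require_alias

Optional, Boolean. If true, dest.index must be an index alias.

scroll

Optional, time units. How long a consistent view of the index is kept for the scrolled search.

slices

Optional, integer; defaults to 1 (the task is not divided into subtasks). The number of subtasks the task is divided into and run in parallel.

max_docs

Optional, integer; defaults to all documents in the index. The maximum number of documents to process.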

Request Body

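conflicts

Optional, enum; defaults to abort. Set to proceed to continue reindexing even if there are document conflicts.

source

index

Required, string. The name of a data stream, index, or index alias; with multiple sources, a comma-separated list of sources is also accepted.

max_docs

Optional, integer. The maximum number of documents to reindex.

query

Optional, query object. Filters the documents to reindex using the Query DSL.

remote

remote accepts the following sub-parameters:

Parameter	Required	Type	Description
host	No	string	The address of any node in the remote ES cluster that holds the source index; required when copying data from a remote cluster.
username	No	string	The user name for authenticating with the remote host; required when the remote cluster needs authentication.
password	No	string	The password for authenticating with the remote host; required when the remote cluster needs authentication.
socket_timeout	No	time units	Defaults to 30 seconds; the remote socket read timeout.
connect_timeout	No	time units	Defaults to 30 seconds; the remote connection timeout.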

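size

Optional, integer. The number of documents to index per batch. When reindexing from remote, make sure each batch fits in the on-heap buffer, which defaults to a maximum of 100 MB.

slice

slice accepts the following sub-parameters:

Parameter	Required	Type	Description
id	No	integer	The slice ID to use for manual slicing.
max	No	integer	The total number of slices.

sort

Optional, list. A comma-separated list of <field>:<direction> pairs (for example name:desc) to sort by when fetching documents from the source index. Typically used together with max_docs to control which documents are reindexed.

Note: the sort parameter has been deprecated since version 7.6, and sorting in Reindex is discouraged. Sorting in Reindex never guaranteed that documents were indexed in order, and it blocks further development of Reindex such as resilience and performance improvements. If you use it together with max_docs, consider using a query filter instead.

_source

Optional, string; defaults to true. Selects which fields of the documents are reindexed. If true, all fields of the document are reindexed.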

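dest

index

Required, string. The destination: the name of a data stream, index, or index alias.

version_type

Optional, enum. The versioning to use for the indexing operation. Valid values:

internal

external

external_gt

external_gte

For details, see Version types: https://www.elastic.co/guide/en/elasticsearch/reference/7.11/docs-index_.html#index-version-types

op_type

Optional, enum; defaults to index. Valid values are index and create. If set to create, a document is created only if it does not already exist in the destination index (useful for resuming an interrupted reindex). Note: if dest is a data stream, op_type must be create, because data streams are append-only.

type

Optional, string; defaults to _doc. The document type of the reindexed documents. Note: this parameter has been deprecated since Elasticsearch 6 and no longer has any practical meaning.

script

Optional, string. The script to run to update the document source or metadata when reindexing.

lang

Optional, enum. The supported script languages:

painless

expression

mustache

java

For more on script languages, see Scripting: https://www.elastic.co/guide/en/elasticsearch/reference/7.11/modules-scripting.html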

Response Body

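The fields of the response body returned by _reindex are explained below:

Field	Type	Description
took	integer	Total milliseconds the entire operation took.
timed_out	Boolean	Set to true if any request executed during the reindex timed out.
total	integer	The number of documents successfully processed.
updated	integer	The number of documents successfully updated, that is, reindexed documents for which a document with the same ID already existed in dest and was updated.
created	integer	The number of documents successfully created.
deleted	integer	The number of documents successfully deleted.
batches	integer	The number of scroll responses pulled back by the reindex.
noops	integer	The number of documents ignored because the reindex script returned noop for ctx.op.
version_conflicts	integer	The number of version conflicts the reindex hit.
retries	object	The number of retries attempted by the reindex; bulk is the number of bulk actions retried and search is the number of search actions retried.
throttled_millis	integer	Number of milliseconds the request slept to conform to requests_per_second.
requests_per_second	number	The number of requests per second effectively executed during the reindex.
throttled_until_millis	integer	This field should always be zero in a _reindex response. It only has meaning when using the Task API, where it indicates the next time a throttled request will be executed again in order to conform to requests_per_second.
failures	array	An array of failures if there were any unrecoverable errors during the process. If the array is non-empty, the request aborted because of those failures. Reindex is implemented using batches, and any failure causes the entire process to abort, but all failures in the current batch are collected into the array. You can use the conflicts parameter to prevent the reindex from aborting on version conflicts.

Reindex tips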

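Running Reindex asynchronously

If the request sets the query parameter wait_for_completion to false, Elasticsearch performs some preflight checks, launches a task to run the reindex, and immediately returns a task ID. You can use this task ID to look up the result of the task, which is recorded in the .tasks system index. Once the task has completed, you can delete that document so that Elasticsearch can reclaim the space.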

# Run the reindex task asynchronously
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "reindex_index-name-1"
  },
  "dest": {
    "index": "reindex_index-name-2"
  }
}

# The task ID is returned immediately
{
  "task" : "ydZx8i8HQBe69T4vbYm30g:20987804"
}

# Check how the task is running
GET _tasks/ydZx8i8HQBe69T4vbYm30g:20987804
# Response
{
  "completed" : true,
  "task" : {
    "node" : "ydZx8i8HQBe69T4vbYm30g",
    "id" : 20987804,
    "type" : "transport",
    "action" : "indices:data/write/reindex",
    "status" : {
      "total" : 2,
      "updated" : 0,
      "created" : 2,
      "deleted" : 0,
      "batches" : 1,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0
    },
    "description" : "reindex from [reindex_index-name-1] to [reindex_index-name-2][_doc]",
    "start_time_in_millis" : 1620539345400,
    "running_time_in_nanos" : 84854825,
    "cancellable" : true,
    "headers" : { }
  },
  "response" : {
    "took" : 82,
    "timed_out" : false,
    "total" : 2,
    "updated" : 0,
    "created" : 2,
    "deleted" : 0,
    "batches" : 1,
    "version_conflicts" : 0,
    "noops" : 0,
    "retries" : {
      "bulk" : 0,
      "search" : 0
    },
    "throttled" : "0s",
    "throttled_millis" : 0,
    "requests_per_second" : -1.0,
    "throttled_until" : "0s",
    "throttled_until_millis" : 0,
    "failures" : [ ]
  }
}

# The result of the task is recorded in the .tasks index
GET .tasks/_doc/ydZx8i8HQBe69T4vbYm30g:20987804
# The response is as follows
{
  "_index" : ".tasks",
  "_type" : "_doc",
  "_id" : "ydZx8i8HQBe69T4vbYm30g:20987804",
  "_version" : 1,
  "_seq_no" : 1,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "completed" : true,
    "task" : {
      "node" : "ydZx8i8HQBe69T4vbYm30g",
      "id" : 20987804,
      "type" : "transport",
      "action" : "indices:data/write/reindex",
      "status" : {
        "total" : 2,
        "updated" : 0,
        "created" : 2,
        "deleted" : 0,
        "batches" : 1,
        "version_conflicts" : 0,
        "noops" : 0,
        "retries" : {
          "bulk" : 0,
          "search" : 0
        },
        "throttled_millis" : 0,
        "requests_per_second" : -1.0,
        "throttled_until_millis" : 0
      },
      "description" : "reindex from [reindex_index-name-1] to [reindex_index-name-2][_doc]",
      "start_time_in_millis" : 1620539345400,
      "running_time_in_nanos" : 84854825,
      "cancellable" : true,
      "headers" : { }
    },
    "response" : {
      "took" : 82,
      "timed_out" : false,
      "total" : 2,
      "updated" : 0,
      "created" : 2,
      "deleted" : 0,
      "batches" : 1,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled" : "0s",
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until" : "0s",
      "throttled_until_millis" : 0,
      "failures" : [ ]
    }
  }
}
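
The task ID returned above has the form <node_id>:<task_number>, and the same string is used both in the Task API path and as the document ID in .tasks. The following Python sketch shows how a client might derive both paths from a task ID; the helper name task_paths and the dictionary keys are hypothetical and chosen only for illustration.

```python
def task_paths(task_id):
    """Split a task ID and build the two lookup paths used in the example above."""
    node, sep, number = task_id.partition(":")
    if not sep or not node or not number.isdigit():
        raise ValueError("expected '<node_id>:<task_number>', got %r" % task_id)
    return {
        "node": node,
        "id": int(number),
        # Live status of the task through the Task API.
        "status_path": "_tasks/" + task_id,
        # Stored result document in the .tasks system index.
        "stored_result_path": ".tasks/_doc/" + task_id,
    }


paths = task_paths("ydZx8i8HQBe69T4vbYm30g:20987804")
print(paths["status_path"])
print(paths["stored_result_path"])
```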

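Reindexing many sources

If you have many sources to reindex, it is generally better to reindex them one at a time rather than using a glob pattern to pick up multiple sources. That way, if anything goes wrong, you can remove the problematic part and reindex just that source (setting op_type to create in dest reindexes only the missing documents). Another benefit is that you can run these reindex tasks in parallel.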

#!/bin/bash
for index in i1 i2 i3 i4 i5; do
  curl -H Content-Type:application/json -XPOST localhost:9200/_reindex?pretty -d'{
    "source": {
      "index": "'$index'"
    },
    "dest": {
      "index": "'$index'-reindexed"
    }
  }'
done
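
The same one-request-per-source idea can be sketched in Python by generating the request bodies first and sending them with whatever HTTP client is at hand. The function name build_reindex_requests and the only_missing flag are hypothetical. When only_missing is set, the sketch also sets conflicts to proceed, on the assumption that documents already present in the destination should be skipped rather than abort the run.

```python
import json


def build_reindex_requests(indices, suffix="-reindexed", only_missing=False):
    """Build one _reindex request body per source index."""
    requests = []
    for name in indices:
        body = {"source": {"index": name}, "dest": {"index": name + suffix}}
        if only_missing:
            # op_type=create only creates documents missing from the destination;
            # existing ones raise version conflicts, which proceed tolerates.
            body["conflicts"] = "proceed"
            body["dest"]["op_type"] = "create"
        requests.append(body)
    return requests


for body in build_reindex_requests(["i1", "i2"], only_missing=True):
    print(json.dumps(body))
```

Keeping the bodies separate from the transport makes it easy to rerun a single failed source, or to submit the requests in parallel.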

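Throttling Reindex

Set requests_per_second to any positive decimal number (1.4, 6, 1000, and so on) to throttle the rate at which _reindex issues batches of index operations. Requests are throttled by padding each batch with a wait time; set requests_per_second=-1 to disable throttling.

Because throttling works by waiting between batches, the scroll timeout that _reindex uses internally should take this wait time into account. The wait time is the batch size divided by requests_per_second, minus the time spent writing the batch. By default the batch size is 1000, so if requests_per_second is set to 500: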

target_time = 1000 / 500 per second = 2 seconds
wait_time = target_time - write_time = 2 seconds - 0.5 seconds = 1.5 seconds
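
The arithmetic above can be written as a small Python function. This is only a sketch of the formula as stated in the text, not the Elasticsearch implementation; the name throttle_wait_seconds is hypothetical, and clamping the result at zero is an assumption for the case where the write takes longer than the target time.

```python
def throttle_wait_seconds(batch_size, requests_per_second, write_seconds):
    """Wait time between batches: batch_size / requests_per_second - write time."""
    if requests_per_second == -1:
        # -1 disables throttling, so there is nothing to wait for.
        return 0.0
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive or -1")
    target_seconds = batch_size / requests_per_second
    # Assumption: never wait a negative amount when the write was slow.
    return max(0.0, target_seconds - write_seconds)


# Reproduces the worked example: 1000 / 500 - 0.5 = 1.5 seconds.
print(throttle_wait_seconds(1000, 500, 0.5))
```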

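Because each batch is issued as a single _bulk request, a large batch size causes Elasticsearch to create many requests and then wait for a while before starting the next set; this can make the load on Elasticsearch bursty rather than smooth.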

Rethrottling

The _rethrottle API changes the value of requests_per_second on a running reindex:

POST _reindex/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1

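The task ID can be found with the Task API. A new requests_per_second that speeds up the query takes effect immediately, while one that slows it down takes effect only after the current batch completes, which prevents scroll timeouts.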

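Slicing

Reindex supports sliced scroll to parallelize the reindexing process, which improves the efficiency of Reindex.

Note: manual and automatic slicing are not supported when the source index is on a remote Elasticsearch cluster.

Manual slicing

Slice a reindex manually by providing a slice ID and the total number of slices in each request.

For example: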

POST _reindex
{
  "source": {
    "index": "my-index-000001",
    "slice": {
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

POST _reindex
{
  "source": {
    "index": "my-index-000001",
    "slice": {
      "id": 1,
      "max": 2
    }
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

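You can verify that this works as follows:

# Refresh so that documents not yet written to segments become visible
GET _refresh
# Check the number of documents
GET my-new-index-000001/_count
# or
POST my-new-index-000001/_search?size=0&filter_path=hits.total

The result looks like this: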

{
  "hits": {
    "total" : {
        "value": 120,
        "relation": "eq"
    }
  }
}

Automatic slicing

You can also let _reindex parallelize automatically, using sliced scroll to slice on the document _id; this is done by setting the slices parameter.

For example:

POST _reindex?slices=5&refresh
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

POST my-new-index-000001/_search?size=0&filter_path=hits.total

{
  "hits": {
    "total" : {
        "value": 120,
        "relation": "eq"
    }
  }
}

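Setting slices to auto lets Elasticsearch choose the number of slices to use. This setting uses one slice per shard, up to a certain limit. With multiple sources, the number of slices is based on the index or backing index with the smallest number of shards.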

POST _reindex?slices=auto&refresh
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}
# As the source code shows, slices=auto is actually mapped to 0
 if (slicesString.equals(AbstractBulkByScrollRequest.AUTO_SLICES_VALUE)) {
            return AbstractBulkByScrollRequest.AUTO_SLICES;
        }
public static final int AUTO_SLICES = 0;
public static final String AUTO_SLICES_VALUE = "auto";
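
The selection rule for slices=auto described above (one slice per shard, capped, using the source with the fewest shards) can be sketched in Python. The text does not state the limit, so the ceiling default of 20 here is an assumption, and auto_slice_count is a hypothetical name rather than an Elasticsearch function.

```python
def auto_slice_count(shard_counts, ceiling=20):
    """Pick a slice count from the primary shard counts of all sources."""
    if not shard_counts:
        raise ValueError("at least one source index is required")
    if min(shard_counts) < 1:
        raise ValueError("every index has at least one primary shard")
    # One slice per shard of the smallest source, never above the ceiling.
    return min(min(shard_counts), ceiling)


# Two sources with 5 and 3 primary shards.
print(auto_slice_count([5, 3]))
```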

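Adding slices to _reindex simply automates the manual slicing shown above by creating the sub-requests for you. The sub-requests created by automatic slicing have a few distinct characteristics:

You can see these sub-requests in the Tasks APIs; they are "child" tasks of the task for the request with slices.
Fetching the status of the task for the request with slices returns only the status of completed slices.
Each sub-request can be cancelled or rethrottled individually.
Rethrottling the request with slices rethrottles its sub-tasks proportionally.
Cancelling the request with slices cancels all of its sub-tasks.
Due to the nature of slicing, documents may not be evenly distributed across slices, so some slices may be larger than others; every document is nevertheless assigned to exactly one slice.
If a request with slices also specifies requests_per_second and max_docs, these are distributed proportionally to each sub-request. Combined with the point above about uneven distribution, using max_docs together with slices may result in fewer than max_docs documents being reindexed.
Each sub-request may get a slightly different snapshot of the source, although the snapshots are all taken at approximately the same time.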
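
The interaction between max_docs and uneven slices is easier to see with numbers. The Python sketch below assumes an even split of requests_per_second and an integer division of max_docs across slices; the exact rounding used by Elasticsearch is not given in the text, and both function names are hypothetical.

```python
def split_for_slices(slices, requests_per_second, max_docs):
    """Share of requests_per_second and max_docs given to each sub-request."""
    if slices < 1:
        raise ValueError("slices must be at least 1")
    if requests_per_second == -1:
        rate = -1  # throttling disabled stays disabled
    else:
        rate = requests_per_second / slices
    return {"requests_per_second": rate, "max_docs": max_docs // slices}


def documents_reindexed(docs_per_slice, max_docs):
    """Total copied when each slice stops at its own share of max_docs."""
    share = max_docs // len(docs_per_slice)
    return sum(min(count, share) for count in docs_per_slice)


print(split_for_slices(2, 500, 10))
# One slice matches 100 documents, the other only 2, and max_docs is 10.
print(documents_reindexed([100, 2], 10))
```

In the second call each slice may copy at most 5 documents, but the small slice only has 2, so the total falls short of max_docs even though the source holds far more matching documents.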

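Picking the number of slices

With automatic slicing, setting slices to auto chooses a reasonable number for most indices. If you slice manually or otherwise tune automatic slicing, keep the following in mind:

Query performance is best when the number of slices equals the number of shards in the index. Setting slices higher than the number of shards generally does not improve efficiency and adds overhead (CPU, disk I/O, and so on).
Indexing performance scales linearly across available resources with the number of slices.
Whether query or indexing performance dominates the runtime depends on the documents being reindexed and on cluster resources.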

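Reindex routing

By default, if _reindex sees a document with routing, the routing is preserved unless it is changed by the script. You can set routing in the dest JSON body to change the previous routing value. routing accepts the following values:

keep

The default. Sets the routing on the bulk request sent for each match to the routing on the match, that is, the old routing is kept.

discard

Sets the routing on the bulk request sent for each match to null.

=<some text>

Sets the routing on the bulk request sent for each match to all of the text after the =.

For example, the following request copies all documents from source_index whose company name is cat into dest_index with routing set to cat: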

POST _reindex
{
  "source": {
    "index": "source_index",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest_index",
    "routing": "=cat"
  }
}
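
The three routing values can be summarized as a small lookup. The Python sketch below only mirrors the rules described in this section; resolve_routing is a hypothetical helper, not an Elasticsearch API.

```python
def resolve_routing(setting, old_routing):
    """Routing applied to a reindexed document for a given dest.routing value."""
    if setting == "keep":
        return old_routing
    if setting == "discard":
        return None
    if setting.startswith("="):
        # Everything after the leading "=" becomes the new routing.
        return setting[1:]
    raise ValueError("routing must be keep, discard, or =<some text>")


print(resolve_routing("=cat", "old-routing"))
```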

By default, _reindex uses scroll batches of 1000. You can change the batch size with the size field in the source element:

POST _reindex
{
  "source": {
    "index": "source_index",
    "size": 100
  },
  "dest": {
    "index": "dest_index",
    "routing": "=cat"
  }
}

Reindexing with an ingest pipeline

Reindex can also use the ingest pipeline feature to enrich data. For example:

# some_ingest_pipeline must be defined in advance
POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "dest",
    "pipeline": "some_ingest_pipeline"
  }
}

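Practical examples

Reindex documents selected by a query

You can add a query to source to reindex only the documents you need.

For example, copy the documents whose user.id is kimchy into my-new-index-000001: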

POST _reindex
{
  "source": {
    "index": "my-index-000001",
    "query": {
      "term": {
        "user.id": "kimchy"
      }
    }
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

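Reindex a limited number of documents with max_docs

Set max_docs in the request body to control how many documents are reindexed.

For example, copy a single document from my-index-000001 to my-new-index-000001: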

POST _reindex
{
  "max_docs": 1,
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

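Reindex from multiple sources

The index attribute in source can be a list, which allows copying documents from multiple sources into the destination data stream or index. Note that the field types in the documents of the different sources must be consistent.

For example, copy the documents of the my-index-000001 and my-index-000002 indices: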

POST _reindex
{
  "source": {
    "index": ["my-index-000001", "my-index-000002"]
  },
  "dest": {
    "index": "my-new-index-000002"
  }
}

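Reindex selected fields

Reindex only a filtered set of fields from each document.

For example, the following request reindexes only the user.id and _doc fields of each document: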

POST _reindex
{
  "source": {
    "index": "my-index-000001",
    "_source": ["user.id", "_doc"]
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

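Rename a field while reindexing

_reindex can copy the documents of the source index and rename fields before they are written to the destination index. Suppose the my-index-000001 index contains the following document: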

POST my-index-000001/_doc/1?refresh
{
  "text": "words words",
  "flag": "foo"
}

To replace the field name flag with tag, proceed as follows (an ingest pipeline would also work):

POST _reindex
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  },
  "script": {
    "source": "ctx._source.tag = ctx._source.remove(\"flag\")"
  }
}

Now get the document from the new index:

GET my-new-index-000001/_doc/1

The response is as follows:

{
  "found": true,
  "_id": "1",
  "_index": "my-new-index-000001",
  "_type": "_doc",
  "_version": 1,
  "_seq_no": 44,
  "_primary_term": 1,
  "_source": {
    "text": "words words",
    "tag": "foo"
  }
}
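
The Painless one-liner works because remove returns the value it removed, which is then assigned to the new key. A rough Python equivalent on a plain dictionary is sketched below; rename_field is a hypothetical helper used only to illustrate the semantics.

```python
def rename_field(source, old, new):
    """Return a copy of a document source with one field renamed."""
    renamed = dict(source)
    # pop returns the removed value (None if the field is absent).
    renamed[new] = renamed.pop(old, None)
    return renamed


print(rename_field({"text": "words words", "flag": "foo"}, "flag", "tag"))
```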

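Reindex daily indices

_reindex can be combined with Painless scripts to reindex daily indices and apply a new template to existing documents. Suppose you have the following indices containing these documents: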

PUT metricbeat-2021.05.10/_doc/1?refresh
{"system.cpu.idle.pct": 0.908}

PUT metricbeat-2021.05.11/_doc/1?refresh
{"system.cpu.idle.pct": 0.105}

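A new template matching the metricbeat-* indices has already been loaded into Elasticsearch, but it applies only to newly created indices. Painless can be used to reindex the existing documents and apply the new template. The script below extracts the date from the index name and creates a new index whose name has -1 appended. All data from metricbeat-2021.05.10 is reindexed into metricbeat-2021.05.10-1: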

POST _reindex
{
  "source": {
    "index": "metricbeat-*"
  },
  "dest": {
    "index": "metricbeat"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._index = 'metricbeat-' + (ctx._index.substring('metricbeat-'.length(), ctx._index.length())) + '-1'"
  }
}

All documents from the previous metricbeat indices can now be found in the new indices:

GET metricbeat-2021.05.10-1/_doc/1
GET metricbeat-2021.05.11-1/_doc/1
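
To see what the Painless expression does to each index name, the same string manipulation can be reproduced in Python. The function name renamed_index is hypothetical, and the check on the prefix is an addition that the Painless script does not perform.

```python
def renamed_index(index, prefix="metricbeat-", suffix="-1"):
    """Mirror of the script: keep the date part and append the suffix."""
    if not index.startswith(prefix):
        raise ValueError("index name does not start with %r" % prefix)
    date_part = index[len(prefix):]
    return prefix + date_part + suffix


for name in ["metricbeat-2021.05.10", "metricbeat-2021.05.11"]:
    print(name, "->", renamed_index(name))
```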

Extract a random subset of the source

_reindex can be used to extract a random subset of the source for testing:

# Note 1: min_score may need adjusting (see the note after the request)
POST _reindex
{
  "max_docs": 10,
  "source": {
    "index": "my-index-000001",
    "query": {
      "function_score" : {
        "random_score" : {},
        "min_score" : 0.9
      }
    }
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

Note 1: you may need to adjust min_score depending on the relative amount of data to extract from the source.

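Modify documents while reindexing

Like _update_by_query, _reindex supports using a script to modify documents; unlike _update_by_query, a script in _reindex can also modify the document's metadata.

This example bumps the version of the source document: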

POST _reindex
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001",
    "version_type": "external"
  },
  "script": {
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}

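Just as with _update_by_query, you can set ctx.op to change the operation performed on dest:

noop

Set ctx.op = "noop" in the script if you decide that the document does not need to be indexed into the destination. The no-operation is reported in the noop counter in the response body.

delete

Set ctx.op = "delete" in the script if the document must be deleted from the destination (dest). The deletion is reported in the deleted counter in the response body.

Setting ctx.op to any other value returns an error, as does setting any other field in ctx.

ctx also lets you change some index metadata, but do so with care:

_id
_index
_version
_routing

Setting _version to null, or clearing it from the ctx map, is just like not sending the version in an indexing request: the document in the destination is overwritten regardless of the version on the target or the version type used in the _reindex request.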

Reindex from remote

Reindex supports copying data from a remote Elasticsearch cluster:

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "my-index-000001",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

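The host parameter must contain a scheme, host, and port (for example https://otherhost:9200), and may include a proxy path (for example https://otherhost:9200/proxy).

The username and password parameters are needed only when _reindex connects to the remote Elasticsearch cluster using basic authentication. When using basic authentication, be sure to use https, otherwise the password is sent in plain text.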
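
The host and authentication rules above lend themselves to a quick client-side check before the request is sent. The Python sketch below uses only the standard library; check_remote_host and password_in_plain_text are hypothetical helper names, and the checks approximate the rules in the text rather than Elasticsearch's own validation.

```python
from urllib.parse import urlsplit


def check_remote_host(host):
    """Return (scheme, hostname, port, path) or raise if a part is missing."""
    parts = urlsplit(host)
    if parts.scheme not in ("http", "https"):
        raise ValueError("host must start with http:// or https://")
    if not parts.hostname:
        raise ValueError("host name is missing")
    if parts.port is None:
        raise ValueError("port is missing")
    return parts.scheme, parts.hostname, parts.port, parts.path


def password_in_plain_text(host, username=None):
    """True when basic auth credentials would travel over plain http."""
    scheme = check_remote_host(host)[0]
    return username is not None and scheme == "http"


print(check_remote_host("https://otherhost:9200/proxy"))
print(password_in_plain_text("http://otherhost:9200", "user"))
```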

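When reindexing data from a remote cluster, the remote hosts must be whitelisted on the node of the current cluster that the request is sent to (the coordinating node): add the reindex.remote.whitelist property to elasticsearch.yml. Its value is the host:port of the remote cluster nodes; multiple entries can be separated by commas, and wildcards are allowed: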

reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"
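
To get a feel for which remote addresses such a whitelist admits, the matching can be approximated in Python. This sketch assumes that * matches any run of characters and that every other character is literal; it is not Elasticsearch's matcher, and parse_whitelist and is_whitelisted are hypothetical names.

```python
import re


def parse_whitelist(value):
    """Split the comma-separated setting into trimmed host:port patterns."""
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def is_whitelisted(host, port, whitelist):
    """Check host:port against each pattern, treating * as a wildcard."""
    candidate = "%s:%d" % (host, port)
    for pattern in whitelist:
        # Escape the literal pieces and join them with ".*" for each "*".
        regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
        if re.fullmatch(regex, candidate):
            return True
    return False


whitelist = parse_whitelist("otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*")
print(is_whitelisted("127.0.10.5", 9200, whitelist))
print(is_whitelisted("unknown", 9200, whitelist))
```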

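When reindexing from remote, also pay attention to version compatibility between the clusters. Elasticsearch does not support forward compatibility across major versions; for example, you cannot reindex from a 7.x cluster into a 6.x cluster.

Reindexing from a remote server uses an on-heap buffer that defaults to a maximum size of 100mb. If the remote index contains very large documents, use a smaller batch size.

The following request sets the batch size to 10: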

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "source",
    "size": 10,
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}
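
One way to pick a batch size for large documents is to estimate how many average-sized documents fit into the 100mb buffer and leave some headroom. The Python sketch below is only a rule of thumb: the 50% headroom, the cap at the default batch size of 1000, and the name suggested_batch_size are all assumptions made for illustration.

```python
def suggested_batch_size(avg_doc_bytes, buffer_bytes=100 * 1024 * 1024,
                         headroom=0.5, default_size=1000):
    """Estimate a batch size so that one batch stays well inside the buffer."""
    if avg_doc_bytes <= 0:
        raise ValueError("avg_doc_bytes must be positive")
    # Number of average documents that fit in the usable part of the buffer.
    fits = int(buffer_bytes * headroom // avg_doc_bytes)
    # Never go below 1 and never above the default batch size.
    return max(1, min(default_size, fits))


# Documents of about 10 MB each.
print(suggested_batch_size(10 * 1024 * 1024))
```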

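When connecting to a remote Elasticsearch cluster, socket_timeout and connect_timeout set the socket read timeout and the connection timeout respectively, which helps keep the reindex stable in the face of network latency. Both default to 30s.

The following example sets the socket read timeout to 1 minute and the connection timeout to 10s: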

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "socket_timeout": "1m",
      "connect_timeout": "10s"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}

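Configuring SSL parameters

Reindex from remote supports configuring SSL. These parameters cannot be set in the body of the _reindex request; they must be specified in the elasticsearch.yml file, with the exception of the secure settings, which are added to the Elasticsearch keystore.

The supported SSL settings are:

Setting	Description
reindex.ssl.certificate_authorities	List of paths to PEM encoded certificate files that should be trusted. You cannot specify both reindex.ssl.certificate_authorities and reindex.ssl.truststore.path.
reindex.ssl.truststore.path	The path to the Java Keystore file that contains the certificates to trust. This keystore can be in "JKS" or "PKCS#12" format. You cannot specify both reindex.ssl.certificate_authorities and reindex.ssl.truststore.path.
reindex.ssl.truststore.password	The password to the truststore (reindex.ssl.truststore.path). This setting cannot be used with reindex.ssl.truststore.secure_password.
reindex.ssl.truststore.secure_password	The password to the truststore (reindex.ssl.truststore.path). This setting cannot be used with reindex.ssl.truststore.password.
reindex.ssl.truststore.type	The type of the truststore (reindex.ssl.truststore.path). Must be either jks or PKCS12. If the truststore path ends in ".p12", ".pfx" or "pkcs12", this setting defaults to PKCS12. Otherwise, it defaults to jks.
reindex.ssl.verification_mode	The type of verification used to protect against man-in-the-middle attacks and certificate forgery. One of full (verify the hostname and the certificate path), certificate (verify the certificate path, but not the hostname) or none (perform no verification; this is strongly discouraged in production environments). Defaults to full.
reindex.ssl.certificate	The path to the PEM encoded certificate (or certificate chain) to be used for HTTP client authentication (if required by the remote cluster). This setting requires that reindex.ssl.key also be set. You cannot specify both reindex.ssl.certificate and reindex.ssl.keystore.path.
reindex.ssl.key	The path to the PEM encoded private key associated with the certificate used for client authentication (reindex.ssl.certificate). You cannot specify both reindex.ssl.key and reindex.ssl.keystore.path.
reindex.ssl.key_passphrase	The passphrase to decrypt the PEM encoded private key (reindex.ssl.key) if it is encrypted. Cannot be used with reindex.ssl.secure_key_passphrase.
reindex.ssl.secure_key_passphrase	The passphrase to decrypt the PEM encoded private key (reindex.ssl.key) if it is encrypted. Cannot be used with reindex.ssl.key_passphrase.
reindex.ssl.keystore.path	The path to the keystore that contains a private key and certificate to be used for HTTP client authentication (if required by the remote cluster). This keystore can be in "JKS" or "PKCS#12" format. You cannot specify both reindex.ssl.key and reindex.ssl.keystore.path.
reindex.ssl.keystore.type	The type of the keystore (reindex.ssl.keystore.path). Must be either jks or PKCS12. If the keystore path ends in ".p12", ".pfx" or "pkcs12", this setting defaults to PKCS12. Otherwise, it defaults to jks.
reindex.ssl.keystore.password	The password to the keystore (reindex.ssl.keystore.path). This setting cannot be used with reindex.ssl.keystore.secure_password.
reindex.ssl.keystore.secure_password	The password to the keystore (reindex.ssl.keystore.path). This setting cannot be used with reindex.ssl.keystore.password.
reindex.ssl.keystore.key_password	The password for the key in the keystore (reindex.ssl.keystore.path). Defaults to the keystore password. This setting cannot be used with reindex.ssl.keystore.secure_key_password.
reindex.ssl.keystore.secure_key_password	The password for the key in the keystore (reindex.ssl.keystore.path). Defaults to the keystore password. This setting cannot be used with reindex.ssl.keystore.key_password.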