分布式打分—Elastic Stack 實戰手冊

https://developer.aliyun.com/topic/download?id=1295 · 更多精彩内容，請下載下傳閱讀全本《Elastic Stack實戰手冊》 https://developer.aliyun.com/topic/download?id=1295 https://developer.aliyun.com/topic/es100 · 加入創作人行列，一起交流碰撞，參與技術圈年度盛事吧 https://developer.aliyun.com/topic/es100

創作人：趙震一

什麼是打分

搜尋引擎中的搜尋與資料庫中，正常的 SELECT 查詢語句，都能幫你從一大堆資料中，找到比對某個特定關鍵字的資料條目，但是這兩者最大的差別在于，搜尋引擎能夠基于查詢和結果的相關性，幫你做好結果集排序，即搜尋引擎會将它認為最符合你查詢訴求的資料條目，放在最前面，而資料庫的 SELECT 語句卻做不到。

那麼搜尋引擎是怎麼做到的呢？其關鍵在于打分，搜尋引擎在完成關鍵字比對後，會基于一定的機制對每條比對的資料（後稱文檔）進行打分，得分高的文檔表示與本次查詢相關度高，就會在最後的結果清單中排在靠前的位置，反之則排名靠後，進而幫助你快速找到你最想要的資料。

下面，我們來向 Elasticsearch 插入一些索引資料：

#删除已有索引
DELETE /my-index-000001

#建立索引，顯示在 settings 中指定2個 shard:"number_of_shards": "2"
PUT /my-index-000001
{
  "settings": {
    "number_of_shards": "2",
    "number_of_replicas": "1"
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "date": {
        "type": "date"
      },
      "content": {
        "type": "text"
      }
    }
  }
}

#插入記錄
PUT /my-index-000001/_doc/1
{
  "title":   "三國志",
  "date":    "2021-05-01",
  "content": "國别體史書"
}

PUT /my-index-000001/_doc/2
{
  "title":   "紅樓夢",
  "date":    "2021-05-02",
  "content": "黛玉葬花..."
}

PUT /my-index-000001/_doc/3
{
  "title":   "易中天品三國",
  "date":    "2021-05-03",
  "content": "草船借箭、空城計..."
}

PUT /my-index-000001/_doc/4
{
  "title":   "水浒傳",
  "date":    "2021-05-03",
  "content": "梁山好漢被團滅..."
}

PUT /my-index-000001/_doc/5
{
  "title":   "三國演義",
  "date":    "2021-05-03",
  "content": "三國時代，群雄逐鹿..."
}

接下去，我們采用關鍵詞“三國演義”進行搜尋：

GET /my-index-000001/_search
{
  "query": {
    "query_string": {
      "query": "三國演義"
    }
  }
}

檢視一下傳回記錄的排序以及打分情況：

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 3.1212955,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 3.1212955,
        "_source" : {
          "title" : "三國演義",
          "date" : "2021-05-03",
          "content" : "三國時代，群雄逐鹿..."
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.7946176,
        "_source" : {
          "title" : "三國志",
          "date" : "2021-05-01",
          "content" : "國别體史書"
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.59221494,
        "_source" : {
          "title" : "易中天品三國",
          "date" : "2021-05-03",
          "content" : "草船借箭、空城計..."
        }
      }
    ]
  }
}

可以看到 title 為“三國演義”的文檔排在第一，分數（ _score ）是3.1212955，其次是”三國志”，分數是0.7946176，最後是“易中天品三國”，分數是0.59221494，其餘沒有比對的文檔則沒有出現，該搜尋結果即打分排名基本符合預期。“三國志” 的得分較 ”易中天品三國” 的原因是因為 “三國志” 詞語較短。

Elasticsearch 搜尋的打分機制

衆所周知，Elasticsearch 是以 Lucene 作為其搜尋引擎技術的核心基石的。為了适應大資料時代的搜尋需求，Elasticsearch 對 Lucene 最大的增強在于，将原本的單機搜尋能力擴充到了分布式的叢集規模能力，即将原本單機無法支撐的索引資料，水準切分成多個可以獨立部署在不同機器上的 Shard，每個 Shard 由獨立的 Lucene 執行個體提供服務，進而以叢集的形式對外提供搜尋服務。

是以，為了便于了解 Elasticsearch 的分布式搜尋的打分機制，我們先來簡單回顧下單機情況下 Lucene 是如何打分的。

Lucene 打分機制

當我們向 Lucene 某個索引送出搜尋請求後，Lucene 會基于查詢完成比對，并得到一個文檔結果集，然後預設基于以下的評分公式，來對結果集中的每個條目計算相關度（評分公式可以基于配置調整）。

其中 q 表示查詢，d 表示目前文檔，t 表示 q 中的詞條，tf(t in d) 是計算詞條 t 在文檔 d 中的詞頻，idf(t) 是詞條t在整個索引中的逆文檔頻率。

我們介紹一下最關鍵的兩個概念，即詞頻（ TF ）和逆文檔頻率( IDF )。

詞頻（ TF ）：詞條在文檔中出現的次數

基于特定的 q 和文檔 d 來說，詞條 t 代指 q 分詞後的其中一個詞條，t 的詞頻指該 t 詞條在文檔 d 中的出現次數，出現次數越多，表示該文檔相對于該詞關聯度更高。

逆文檔頻率（IDF ）：在同一索引中存在該詞條的文檔數的倒數

包含某個詞條的文檔數越多，說明這個詞條的詞頻在整個索引中的影響力越弱。

對于該公式的其他各項的含義，本小節不作深入介紹，我們僅需了解，一旦給定查詢 q 和文檔 d，其得分即為查詢中每個詞條 t 的得分總和。

而每個詞條的得分，一個主要部分是該詞條在文檔 d 中的詞頻 ( TF ) 乘以逆文檔頻率 ( IDF ) 的平方。即詞條在文檔中出現的頻率越高，則得分越高。而索引中存在該詞條的文檔越少，逆文檔頻率則越高，表示該詞條越罕見，那麼對應的分數也将越高。

回顧完 Lucene 的打分機制，我們再回過來看下 Elasticsearch 的搜尋及打分機制。

Elasticsearch 打分機制

Elasticsearch 的搜尋類型有兩種，預設的稱為 QUERY_THEN_FETCH。顧名思義，它的搜尋流程分為兩個階段，分别稱之為 Query 和 Fetch 。

我們來看下 QUERY_THEN_FETCH 的流程：

Query 階段：

Elasticsearch 在收到用戶端搜尋請求後，會由協調節點将請求分發到對應索引的每個 Shard 上。
每個 Shard 的 Lucene 執行個體基于本地 Shard 内的 TF/IDF 統計資訊，獨立完成 Shard 内的索引比對和打分（基于上述公式），并根據打分結果完成單個 Shard 内的排序、分頁。
每個 Shard 将排序分頁後的結果集的中繼資料（文檔 ID 和分數，不包含具體的文檔内容）傳回給協調節點。
協調節點完成整體的彙總、排序以及分頁，篩選出最終确認傳回的搜尋結果。

Fetch 階段：

協調節點根據篩選結果去對應 shard 拉取完整的文檔資料
整合最終的結果傳回給使用者用戶端

分布式打分的權衡

我們再來看一個場景，先重建索引，但是我們将 Shard 建成 3：

#删除已有索引
DELETE /my-index-000001

#建立索引，顯示在 settings 中指定3個 shard:"number_of_shards": "3"
PUT /my-index-000001
{
  "settings": {
    "number_of_shards": "3",
    "number_of_replicas": "1"
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "date": {
        "type": "date"
      },
      "content": {
        "type": "text"
      }
    }
  }
}

#插入記錄
PUT /my-index-000001/_doc/1
{
  "title":   "三國志",
  "date":    "2021-05-01",
  "content": "國别體史書"
}

PUT /my-index-000001/_doc/2
{
  "title":   "紅樓夢",
  "date":    "2021-05-02",
  "content": "黛玉葬花..."
}

PUT /my-index-000001/_doc/3
{
  "title":   "易中天品三國",
  "date":    "2021-05-03",
  "content": "草船借箭、空城計..."
}

PUT /my-index-000001/_doc/4
{
  "title":   "水浒傳",
  "date":    "2021-05-03",
  "content": "梁山好漢被團滅..."
}

PUT /my-index-000001/_doc/5
{
  "title":   "三國演義",
  "date":    "2021-05-03",
  "content": "三國時代，群雄逐鹿..."
}

然後再次執行相同的搜尋：

GET /my-index-000001/_search
{
  "query": {
    "query_string": {
      "query": "三國演義"
    }
  }
}

檢視本次搜尋結果：

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.6285465,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.6285465,
        "_source" : {
          "title" : "易中天品三國",
          "date" : "2021-05-03",
          "content" : "草船借箭、空城計..."
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.1507283,
        "_source" : {
          "title" : "三國演義",
          "date" : "2021-05-03",
          "content" : "三國時代，群雄逐鹿..."
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "title" : "三國志",
          "date" : "2021-05-01",
          "content" : "國别體史書"
        }
      }
    ]
  }
}

搜尋結果的排名竟然發生了變化，我們期望排第一的”三國演義”排到了第二，得分為1.1507283，而“易中天品三國”竟然得分1.6285465，躍居第一，這并不符合我們的搜尋預期。

通過分析上面的 QUERY_THEN_FETCH 流程，我們不難發現：由于分布式系統天然的割裂性質，每個 shard 無法看到全局的統計資訊，是以上述第 2 步中每個 Shard 的打分都是基于本地 Shard 内的 TF/IDF 統計資訊來完成的。

在大多數的生産環境中，由于資料量多且在每個 Shard 分布均勻，這種方式是沒有問題的。但是在極端情況下（如上例），3 個 shard 中的文檔數相差較大，那麼 IDF 在 3 個 Shard 中所起到的影響将截然不同，即單個 Shard 内打分彙總後的結果，與全局打分彙總的結果會有相當大的出入，造成我們在靠前的分頁，搜到原本應該排名靠後的文檔。

這也是分布式打分引入的實際問題，那麼如何才能解決這類問題呢？

我們曾在上一小節提到，Elasticsearch 的搜尋類型其實有兩種，除了上面介紹的 QUERY_THEN_FETCH 之外，還有一種是 DFS_QUERY_THEN_FETCH。

DFS 在這裡的意思是分布式頻率打分，其思想是提前向所有 Shard 進行全局的統計資訊搜集，然後再将這些統計資訊，随着查詢分發到各個 Shard，讓各個 Shard 在本地采用全局 TF/IDF 來打分，具體的流程如下：

預統計階段：

Elasticsearch 在收到用戶端搜尋請求後，會由協調節點進行一次預統計工作，即先向所有相關 Shard 搜集統計資訊

由協調節點整合所有統計資訊，将全局的統計資訊連同請求一起分發到對應索引的每個 Shard 上。
每個 Shard 的 Lucene 執行個體，基于全局的 TF/IDF 統計資訊，獨立完成 Shard 内的索引比對和打分（基于上述公式），并根據打分結果，完成單個 Shard 内的排序、分頁。

綜上可見，Elasticsearch 在分布式打分上做了權衡，如果要考慮絕對的精确性，那麼需要犧牲一些性能來換取全局的統計資訊。

讓我們來看下如何切換到 DFS_QUERY_THEN_FETCH，隻需在接口 URL 加上search_type=dfs_query_then_fetch

GET /my-index-000001/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "三國演義"
    }
  }
}

可以看到，通過這種方式傳回的結果又恢複了正常：

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 3.7694218,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 3.7694218,
        "_source" : {
          "title" : "三國演義",
          "date" : "2021-05-03",
          "content" : "三國時代，群雄逐鹿..."
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.1795839,
        "_source" : {
          "title" : "三國志",
          "date" : "2021-05-01",
          "content" : "國别體史書"
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.8715688,
        "_source" : {
          "title" : "易中天品三國",
          "date" : "2021-05-03",
          "content" : "草船借箭、空城計..."
        }
      }
    ]
  }
}

”三國演義“的文檔仍排在第一，分數（ _score ）變成了 3.7694218，其次是”三國志”，分數是1.1795839，最後是”易中天品三國“，分數是0.8715688，其餘沒有比對的文檔同樣沒有出現。

另外，根據傳回的 took 資料，可以看到耗時較 query_then_fetch

的方式有略為增加，是以這種方式對性能會有折損，在生産環境中建議謹慎使用。

檢視得分邏輯

為了在實際開發中了解得分邏輯，進而優化我們的查詢條件或索引工作，我們需要關注例如“易中天品三國”為什麼分數是 0.8715688，而不是 3.7694218。

我們可以通過在查詢中增加 explain 來檢視得分的說明資訊。

GET /my-index-000001/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "三國演義"
    }
  },
  "explain": true
}

通過增加 "explain": true，我們可以看到傳回的結果集裡增加了大量 _explanation 資訊：

{
  "took" : 21,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 3.7694218,
    "hits" : [
      {
        "_shard" : "[my-index-000001][0]",
        "_node" : "ydZx8i8HQBe69T4vbYm30g",
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 3.7694218,
        "_source" : {
          "title" : "三國演義",
          "date" : "2021-05-03",
          "content" : "三國時代，群雄逐鹿..."
        },
        "_explanation" : {
          "value" : 3.7694218,
          "description" : "max of:",
          "details" : [
            {
              "value" : 3.7694218,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.52763593,
                  "description" : "weight(title:三 in 0) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 0.52763593,
                      "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.5389965,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 3,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 5,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.4449649,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 1.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 4.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 3.8,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                
                {
                  "value" : 1.357075,
                  "description" : "weight(title:演 in 0) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 1.357075,
                      "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.3862944,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 5,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.4449649,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 1.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 4.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 3.8,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                ...
              ]
            },
            ...
          ]
        }
      },
     ...
    ]
  }
}

通過分析 description 和 details 中資訊的描述，我們可以進一步深挖 Elasticsearch 的打分邏輯和我們查詢出來的每個文檔的得分詳情。

創作人簡介：

趙震一，程式員，好奇技淫巧，關注大資料與分布式計算。

分布式打分—Elastic Stack 實戰手冊

什麼是打分

Elasticsearch 搜尋的打分機制

Lucene 打分機制

Elasticsearch 打分機制

分布式打分的權衡

檢視得分邏輯

繼續閱讀

Linxu常用指令技巧彙總

ERROR 1 (HY000): Can't create/write to file '/tmp/#sql_4188_1.MYI' (Errcode: 28)

艱難安裝LDAP,SSL認證

《Linux指令行與Shell腳本程式設計大全第2版.布盧姆》pdf

MySQL的4種隔離級别？出現問題

XX系統實施過程問題總結

無元件上傳圖檔到資料庫中，最完整解決方案

【MySQL資料庫】資料庫索引事務1.索引2.事務

neo4j之cypher使用文檔

Ambari介紹和架構原理

NOSQL安全攻擊

mybatis_入門程式Mybatis入門

面試題解析：你接口測試是怎麼做的？

登入plsql 報錯 the account is locked --使用者被鎖

SequoiaDB巨杉資料庫C++驅動概述

win10本地scala和spark安裝安裝scala安裝spark