Handling One-to-Many Relationships

Reposted from: https://blog.csdn.net/pony_maggie/article/details/105126342

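Contents

1. One-to-many relationships in MySQL
2. How ES handles one-to-many relationships

2.1 Plain inner objects
2.2 Nested documents

2.2.1 Querying nested documents

2.3 Parent-child documents

2.3.1 Querying parent-child documents

Unconditional query
has_child query
has_parent query
parent_id query

2.4 Summary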

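1. One-to-many relationships in MySQL

Tables in MySQL often have a one-to-many relationship, for example an order table and a product table: one order can contain several products. Their relationship is shown in the figure below.

Elasticsearch (ES for short) is not particularly good at handling this kind of relationship compared with a relational database, because ES, like most NoSQL databases, stores data in a flattened structure. An index is a collection of independent documents, and different indices generally have no relationship to each other.

That said, ES has reached version 7.x and now offers several ways to model one-to-many relationships efficiently.

The common options are plain inner objects, nested documents, and parent-child documents. The last two are the focus of this article.

The e-commerce data used below is the sample data that ships with Kibana, which makes the examples easy to try out.

2. How ES handles one-to-many relationships

2.1 Plain inner objects

The e-commerce sample data that ships with Kibana uses this approach. Let's look at its mapping.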

"kibana_sample_data_ecommerce" : {
    "mappings" : {
      "properties" : {
        "category" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword"
            }
          }
        },
        "currency" : {
          "type" : "keyword"
        },
        "customer_full_name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        // part omitted
       
        "products" : {
          "properties" : {
            "_id" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "base_price" : {
              "type" : "half_float"
            },
            "base_unit_price" : {
              "type" : "half_float"
            },
            "category" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              }
            },
            "created_on" : {
              "type" : "date"
            },
            "discount_amount" : {
              "type" : "half_float"
            },
            "discount_percentage" : {
              "type" : "half_float"
            },
            "manufacturer" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              }
            },
            "min_price" : {
              "type" : "half_float"
            },
            "price" : {
              "type" : "half_float"
            },
            "product_id" : {
              "type" : "long"
            },
            "product_name" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              },
              "analyzer" : "english"
            },
            "quantity" : {
              "type" : "integer"
            },
            "sku" : {
              "type" : "keyword"
            },
            "tax_amount" : {
              "type" : "half_float"
            },
            "taxful_price" : {
              "type" : "half_float"
            },
            "taxless_price" : {
              "type" : "half_float"
            },
            "unit_discount_amount" : {
              "type" : "half_float"
            }
          }
        },
        "sku" : {
          "type" : "keyword"
        },
        "taxful_total_price" : {
          "type" : "half_float"
        },
        // part omitted

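We can see that the e-commerce order index contains a products field. It is an object type with its own inner fields. This is effectively a containment relationship: one order can hold information on several products. Let's run a query and look at the result.

Query: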

POST kibana_sample_data_ecommerce/_search
{
  "query": {
    "match_all": {}
  }
}

Result (some content removed for readability):

"hits" : [
      {
        "_index" : "kibana_sample_data_ecommerce",
        "_type" : "_doc",
        "_id" : "VJz1f28BdseAsPClo7bC",
        "_score" : 1.0,
        "_source" : {
          "customer_first_name" : "Eddie",
          "customer_full_name" : "Eddie Underwood",
          "order_date" : "2020-01-27T09:28:48+00:00",
          "order_id" : 584677,
          "products" : [
            {
              "base_price" : 11.99,
              "discount_percentage" : 0,
              "quantity" : 1,
              "sku" : "ZO0549605496",
              "manufacturer" : "Elitelligence",
              "tax_amount" : 0,
              "product_id" : 6283
            },
            {
              "base_price" : 24.99,
              "discount_percentage" : 0,
              "quantity" : 1,
              "sku" : "ZO0299602996",
              "manufacturer" : "Oceanavigations",
              "tax_amount" : 0,
              "product_id" : 19400
            }
          ],
          "taxful_total_price" : 36.98,
          "taxless_total_price" : 36.98,
          "total_quantity" : 2,
          "total_unique_products" : 2,
          "type" : "order",
          "user" : "eddie",
          "geoip" : {
            "region_name" : "Cairo Governorate",
            "continent_name" : "Africa",
            "city_name" : "Cairo"
          }
        }
      },

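As you can see, the returned products is actually a list containing two objects, which expresses a one-to-many relationship.

The advantage of this approach is obvious: since all the information lives in a single document, ES does not need to join other documents at query time, so queries are efficient. Does it have any drawbacks?

Of course. Sticking with the example above, consider the following query: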

GET kibana_sample_data_ecommerce/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "products.base_price": 24.99 }},
        { "match": { "products.sku":"ZO0549605496"}},
        { "match": { "order_id": "584677"}}
      ]
    }
  }
}

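The search has three conditions: order_id, the product price, and the sku. In fact nothing satisfies all three at once (the product with sku=ZO0549605496 costs 11.99), yet the query returns a document. Why?

It turns out that ES flattens arrays of JSON objects. The document above is stored roughly like this: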

{
  "order_id":            [ 584677 ],
  "products.base_price": [ 11.99, 24.99, ... ],
  "products.sku":        [ "ZO0549605496", "ZO0299602996" ],
  ...
}

Clearly, this structure loses the association between each product's price and its sku.
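The false match can be reproduced without a cluster. The following is only a minimal in-memory sketch of the flattening idea, not how ES is implemented: `flatten` and `matches_all` are hypothetical helpers, and the order document is trimmed to the fields used in the query.

```python
# Trimmed copy of the order document from the example (only the queried fields).
order = {
    "order_id": 584677,
    "products": [
        {"base_price": 11.99, "sku": "ZO0549605496"},
        {"base_price": 24.99, "sku": "ZO0299602996"},
    ],
}


def flatten(doc):
    """Collapse arrays of inner objects into multi-valued 'parent.child' fields."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, list) and all(isinstance(v, dict) for v in value):
            for inner in value:
                for inner_key, inner_value in inner.items():
                    flat.setdefault(f"{key}.{inner_key}", []).append(inner_value)
        else:
            flat[key] = [value]
    return flat


def matches_all(flat, conditions):
    """True when every field contains its wanted value (bool/must semantics)."""
    return all(wanted in flat.get(field, []) for field, wanted in conditions.items())


flat_order = flatten(order)
conditions = {
    "products.base_price": 24.99,
    "products.sku": "ZO0549605496",
    "order_id": 584677,
}
print(flat_order)
# True, even though no single product has both this price and this sku.
print(matches_all(flat_order, conditions))
```

Once the values sit in two independent lists, there is no way left to tell which price belonged to which sku.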

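If your use case is not sensitive to this problem, this approach is a reasonable choice: it is simple and more efficient than the two approaches below.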

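2.2 Nested documents

Clearly the object-array approach above does not preserve the boundaries of the inner objects: ES forcibly stores the JSON array of objects as a flattened list of key-value pairs. To solve this, ES provides nested documents, which the official documentation describes as follows:

The nested type is a specialised version of the object datatype that allows arrays of objects to be indexed in a way that they can be queried independently of each other.

So nested documents are really a complement to plain inner objects. The mapping of the e-commerce example above is too long, so let's switch to a simpler example that is enough to illustrate the point.

First, define a mapping for the index: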

PUT test_index
{
  "mappings": {
    "properties": {
      "user": {
        "type": "nested" 
      }
    }
  }
}

The user field has type nested, which marks it as a nested document. The other fields are not declared here; ES maps them dynamically.

Insert two documents:

PUT test_index/_doc/1
{
  "group" : "root",
  "user" : [
    {
      "name" : "John",
      "age" :  30
    },
    {
      "name" : "Alice",
      "age" :  28
    }
  ]
}

PUT test_index/_doc/2
{
  "group" : "wheel",
  "user" : [
    {
      "name" : "Tom",
      "age" :  33
    },
    {
      "name" : "Jack",
      "age" :  25
    }
  ]
}

2.2.1 Querying nested documents

A query looks like this:

GET test_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.name": "Alice" }},
            { "match": { "user.age":  28 }} 
          ]
        }
      }
    }
  }
}

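Note that querying nested documents has its own syntax: you must specify the nested keyword and the path. Here is a more representative example, with conditions on both the root document and the nested documents: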

GET test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": { "group": "root" }
        },
        {
          "nested": {
            "path": "user",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": { "user.name": "Alice" }
                  },
                  {
                    "match": { "user.age": 28 }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
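To make the per-object semantics concrete, here is a small in-memory sketch of what the combined query above asks for. It is an illustration only: `nested_match` and `search` are hypothetical helpers, and exact equality stands in for the match query.

```python
# The two documents indexed into test_index above.
docs = [
    {"group": "root", "user": [{"name": "John", "age": 30}, {"name": "Alice", "age": 28}]},
    {"group": "wheel", "user": [{"name": "Tom", "age": 33}, {"name": "Jack", "age": 25}]},
]


def nested_match(doc, path, conditions):
    """True when at least one object under `path` satisfies every condition."""
    return any(
        all(obj.get(field) == wanted for field, wanted in conditions.items())
        for obj in doc.get(path, [])
    )


def search(group, user_conditions):
    """Combine a root-level condition with a nested condition (bool/must)."""
    return [
        d for d in docs
        if d["group"] == group and nested_match(d, "user", user_conditions)
    ]


hits = search("root", {"name": "Alice", "age": 28})
# Alice's name combined with John's age: no single user object has both.
cross = search("root", {"name": "Alice", "age": 30})
print(len(hits), len(cross))  # 1 0
```

Because every condition is checked against one object at a time, mixing values from two different users no longer produces a hit.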

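So far nested documents look quite convenient.

They do not suffer from the lost object boundaries of the previous approach, and they do not seem complicated to use. Do they have drawbacks? Of course. Let's run an experiment first.

Check the current document count of the index: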

GET _cat/indices?v

Result:

health status index      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   test_index FJsEIFf_QZW4Q4SlZBsqJg   1   1          6            0     17.7kb          8.8kb

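You may have noticed that I did not check the document count with the following command:

GET test_index/_count

Instead I looked at the index information directly, because _cat/indices shows the real number of underlying documents.

It may seem odd that the count is 6 rather than 2. The reason is that each nested object is stored internally as a separate Lucene document: every root document here contributes itself plus two user objects, so 2 × 3 = 6. At query time ES joins them internally, so the result looks like a single document.

As you would expect, under the same conditions this performs worse than the plain inner object approach. Whether to use it depends on your actual requirements.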
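The figure of 6 can be checked with a little arithmetic. The sketch below assumes the counting rule described above (one Lucene document per root document plus one per nested object); `lucene_doc_count` is a hypothetical helper, not an ES API.

```python
def lucene_doc_count(docs, nested_paths):
    """Count one Lucene document per root document plus one per nested object."""
    total = 0
    for doc in docs:
        total += 1  # the root document itself
        for path in nested_paths:
            total += len(doc.get(path, []))  # each nested object is a hidden document
    return total


# The two documents indexed into test_index above.
docs = [
    {"group": "root", "user": [{"name": "John", "age": 30}, {"name": "Alice", "age": 28}]},
    {"group": "wheel", "user": [{"name": "Tom", "age": 33}, {"name": "Jack", "age": 25}]},
]
print(lucene_doc_count(docs, ["user"]))  # 6
```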

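2.3 Parent-child documents

Consider the example above again. If I need to update the group field of a document, the whole document has to be reindexed. The nested user objects do not need to change, yet they are reindexed together with the root document.

In addition, when one table has one-to-many relationships with several other tables, that is, when a child document needs to relate to more than one root document, nested cannot model it, because each nested object lives inside exactly one root document.

Parent-child documents, built on the join field type, keep parents and children as independent documents that can be indexed and updated separately. Let's look at an example.

First define the mapping: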

PUT my_index
{
  "mappings": {
    "properties": {
      "my_id": {
        "type": "keyword"
      },
      "my_join_field": { 
        "type": "join",
        "relations": {
          "question": "answer" 
        }
      }
    }
  }
}

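my_join_field is the name of the field that holds the parent-child relation. The name is arbitrary, and the field is used in several places below. The join type marks it as a parent-child relation, and relations declares question as the parent and answer as the child.

Insert two parent documents: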

PUT my_index/_doc/1
{
  "my_id": "1",
  "text": "This is a question",
  "my_join_field": {
    "name": "question" 
  }
}


PUT my_index/_doc/2
{
  "my_id": "2",
  "text": "This is another question",
  "my_join_field": {
    "name": "question"
  }
}

"name": "question" indicates that the document being inserted is a parent.

Then insert two child documents:

PUT my_index/_doc/3?routing=1
{
  "my_id": "3",
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer", 
    "parent": "1" 
  }
}

PUT my_index/_doc/4?routing=1
{
  "my_id": "4",
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}

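There is more to explain about the child documents:

First, the document IDs show that child documents are independent documents (unlike nested).
Second, the routing parameter specifies the routing value, here the parent ID 1, which matches the ID given in parent below.
It is worth stressing that routing is mandatory when indexing a child document, because the child must be placed on the same shard as its parent.
The name key set to answer marks the document as a child.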
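Why routing by the parent ID keeps a family together can be sketched as follows. This is a simplified model under stated assumptions: the shard count of 3 is made up, `zlib.crc32` merely stands in for the hash function ES uses internally, and `shard_for` and `route` are hypothetical helpers.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # assumed value, for illustration only


def shard_for(routing, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard from the routing value (crc32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def route(doc_id, routing=None):
    """Without an explicit routing value, the document ID is used for routing."""
    return shard_for(routing if routing is not None else doc_id)


parent_shard = route("1")  # parent document 1, routed by its own ID
child_shards = [route("3", routing="1"), route("4", routing="1")]
print(parent_shard, child_shards)
```

Since the shard depends only on the routing value, passing the parent ID as routing sends documents 3 and 4 to the shard that holds document 1, whatever their own IDs are.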

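my_index now holds four independent documents. Let's see what searching parent-child documents looks like.

2.3.1 Querying parent-child documents

Unconditional query

Start with an unconditional query: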

GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": ["my_id"]
}

Result (partial):

{
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : null,
        "_routing" : "1",
        "_source" : {
          "my_id" : "3",
          "text" : "This is an answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        },

The result includes my_join_field, which tells you whether a document is a parent or a child.

has_child query

A has_child query returns parent documents:

POST my_index/_search
{
  "query": {
    "has_child": {
      "type": "answer",
      "query" : {
         "match": { "text" : "answer" }
      }
    }
  }
}

Result (partial):

"hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "my_id" : "1",
          "text" : "This is a question",
          "my_join_field" : {
            "name" : "question"
          }
        }
      }
    ]
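The logic of has_child can be mimicked in memory to see why only document 1 comes back. This is a rough sketch, not ES internals: `has_child` here is a hypothetical Python function, and a whole-word check stands in for the match query.

```python
# The four documents indexed into my_index above (my_id omitted for brevity).
docs = [
    {"_id": "1", "text": "This is a question", "my_join_field": {"name": "question"}},
    {"_id": "2", "text": "This is another question", "my_join_field": {"name": "question"}},
    {"_id": "3", "text": "This is an answer", "my_join_field": {"name": "answer", "parent": "1"}},
    {"_id": "4", "text": "This is another answer", "my_join_field": {"name": "answer", "parent": "1"}},
]


def has_child(docs, child_type, predicate):
    """Return parents that have at least one matching child of child_type."""
    parent_ids = {
        d["my_join_field"]["parent"]
        for d in docs
        if d["my_join_field"]["name"] == child_type and predicate(d)
    }
    return [d for d in docs if d["_id"] in parent_ids]


parents = has_child(docs, "answer", lambda d: "answer" in d["text"].lower().split())
print([d["_id"] for d in parents])  # ['1']
```

Document 2 is also a question, but it has no answer children, so it is not returned.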

has_parent query

A has_parent query returns the matching child documents:

POST my_index/_search
{
  "query": {
    "has_parent": {
      "parent_type": "question",
      "query" : {
          "match": { "text" : "question"}
      }
    }
  }
}

Result (partial):

"hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_routing" : "1",
        "_source" : {
          "my_id" : "3",
          "text" : "This is an answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.0,
        "_routing" : "1",
        "_source" : {
          "my_id" : "4",
          "text" : "This is another answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        }
      }
    ]

parent_id query

A parent_id query looks up child documents by parent ID:

POST my_index/_search
{
  "query": {
    "parent_id": { 
      "type": "answer",
      "id": "1"
    }
  }
}

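The result is essentially the same as above. The difference is that the parent_id query uses relevance scoring by default, whereas has_parent does not score by default.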
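The two lookups reach the same children by different routes, which a small in-memory sketch can show. The functions `has_parent` and `parent_id` below are hypothetical Python stand-ins for the queries, with a whole-word check in place of the match query; scoring is not modelled.

```python
# The four documents indexed into my_index above (my_id omitted for brevity).
docs = [
    {"_id": "1", "text": "This is a question", "my_join_field": {"name": "question"}},
    {"_id": "2", "text": "This is another question", "my_join_field": {"name": "question"}},
    {"_id": "3", "text": "This is an answer", "my_join_field": {"name": "answer", "parent": "1"}},
    {"_id": "4", "text": "This is another answer", "my_join_field": {"name": "answer", "parent": "1"}},
]


def has_parent(docs, parent_type, predicate):
    """Return children whose parent is of parent_type and satisfies predicate."""
    parent_ids = {
        d["_id"]
        for d in docs
        if d["my_join_field"]["name"] == parent_type and predicate(d)
    }
    return [d for d in docs if d["my_join_field"].get("parent") in parent_ids]


def parent_id(docs, child_type, pid):
    """Return children of child_type attached to parent pid, without reading the parent."""
    return [
        d for d in docs
        if d["my_join_field"]["name"] == child_type
        and d["my_join_field"].get("parent") == pid
    ]


by_parent_query = has_parent(docs, "question", lambda d: "question" in d["text"].lower().split())
by_parent_id = parent_id(docs, "answer", "1")
print([d["_id"] for d in by_parent_query])  # ['3', '4']
print([d["_id"] for d in by_parent_id])     # ['3', '4']
```

In this sketch has_parent first has to find the matching parents and then collect their children, while parent_id filters the children directly on the stored parent ID.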

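A few points need special attention when using parent-child documents:

Each index can define only one join field.
Parent and child documents must live on the same shard, which means that fetching or updating a child document must include routing.
New relations can be added to an existing join field.

2.4 Summary

Overall, nested objects improve query performance through redundant data and suit read-heavy, write-light workloads.

The plain inner object approach implements one-to-many relationships but loses the boundaries of the inner objects, so the association between the fields of each object is lost.
Nested documents solve that problem but have two drawbacks:

First, updating the root document reindexes everything.
Second, they do not support a child document belonging to multiple root documents.

Parent-child documents address the problems above, but they suit write-heavy, read-light workloads.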
