
Elasticsearch analyzers: character filter, tokenizer, token filter

Author: 孫龍程式員

An analyzer involves:

  • Normalization
  • Character filters
  • Tokenizer
  • Token filters

Whether built-in or custom, every analyzer is made up of three kinds of building blocks: character filters, tokenizers, and token filters.

The built-in analyzers pre-package these building blocks into analyzers suited to different languages and text types.

Character filters

A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters.

For example, a character filter can be used to convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789).

An analyzer may have zero or more character filters, and they are applied in order.

(PS: This is similar to filters or interceptors in the Servlet world; picture a filter chain.)
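As a concrete illustration of the digit example above, the _analyze API accepts inline character filter definitions, so the conversion can be tried without creating an index. This is a minimal sketch; the mapping entries shown are just one illustrative way to do the conversion, not a built-in filter:

GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0", "١ => 1", "٢ => 2", "٣ => 3", "٤ => 4",
        "٥ => 5", "٦ => 6", "٧ => 7", "٨ => 8", "٩ => 9"
      ]
    }
  ],
  "text": "٠١٢٣٤٥٦٧٨٩"
}

The single keyword token that comes back should be 0123456789, because the mapping character filter rewrites each digit before the tokenizer runs.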

Tokenizer

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For example, the whitespace tokenizer splits text into tokens whenever it sees whitespace: it converts the text "Quick brown fox!" into [Quick, brown, fox!].

(PS: The tokenizer is responsible for splitting the text into individual tokens; a token here is simply a word. A piece of text gets cut into several parts, much like String.split in Java.)

The tokenizer is also responsible for recording the order or position of each term, as well as the start and end character offsets of the original word that the term represents. (PS: The output of tokenizing a text is an array of terms.)

An analyzer must have exactly one tokenizer.
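A quick way to see the positions and offsets described above is to call _analyze with just a tokenizer (a minimal sketch using the built-in whitespace tokenizer):

GET _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}

The response lists the tokens Quick, brown, and fox! together with their position (0, 1, 2) and their start_offset/end_offset back into the original string.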

Token filters

A token filter receives the token stream and may add, remove, or change tokens.

For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) such as "the", and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, and they are applied in order.
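The ordering matters: each filter is applied in turn to the token stream coming out of the tokenizer. A minimal sketch chaining two built-in filters:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The QUICK brown fox"
}

lowercase runs first and turns The into the, then the stop filter (which defaults to the English stop word list) drops it, leaving [quick, brown, fox].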

Summary & review

An analyzer is a package made up of three parts: character filters, a tokenizer, and token filters.

  • An analyzer may have 0 or more character filters.
  • An analyzer has exactly one tokenizer.
  • An analyzer may have 0 or more token filters.
  • A character filter does character conversion: it receives a character stream and outputs a character stream.
  • A tokenizer does the splitting: it receives a character stream and outputs a token stream (the text is cut into individual words, and these words are called tokens).
  • A token filter filters tokens: it receives a token stream and outputs a token stream.

So the whole job of an analyzer is to turn a text into individual words: text ----> characters ----> tokens.


1 Normalization: document normalization to improve recall

  • Stop words
  • Tense conversion
  • Upper/lower case
  • Synonyms
  • Filler words
#normalization
GET _analyze
{
  "text": "Mr. Ma is an excellent teacher",
  "analyzer": "english"
}           

2 Character filter: preprocessing before tokenization, filtering out useless characters

  • HTML Strip
  • Mapping
  • Pattern Replace

HTML Strip

##HTML Strip Character Filter
###my_char_filter below is the name of the custom character filter
###Test data: <p>I'm so <a>happy</a>!</p>
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter(自定義的分析器名字)":{
          "type":"html_strip",
          "escaped_tags":["a"]
        }
      },
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["my_char_filter(自定義的分析器名字)"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <a>happy</a>!</p>"
}           

Mapping

##Mapping Character Filter 
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter":{
          "type":"mapping",
          "mappings":[
            "滾 => *",
            "垃 => *",
            "圾 => *"
            ]
        }
      },
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "你就是個垃圾!滾"
}           
Pattern Replace           
##Pattern Replace Character Filter 
#17611001200
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter":{
          "type":"pattern_replace",
          "pattern":"(\\d{3})\\d{4}(\\d{4})",
          "replacement":"$1****$2"
        }
      },
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "您的手機号是17611001200"
}           

3 Token filter

Stop words, tense conversion, case conversion, synonym conversion, filler-word handling, and so on. For example: has=>have, him=>he, apples=>apple, the/oh/a=>dropped.

  • Upper/lower case
  • Tense
  • Stop words
  • Synonyms
  • Filler words
#token filter
DELETE test_index
PUT /test_index
{
  "settings": {
      "analysis": {
        "filter": {
          "my_synonym": {
            "type": "synonym_graph",
            "synonyms_path": "analysis/synonym.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "ik_max_word",
            "filter": [ "my_synonym" ]
          }
        }
      }
  }
}
GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["蒙丢丢,大G,霸道,daG"]
}
GET test_index/_analyze
{
  "analyzer": "ik_max_word",
  "text": ["奔馳G級"]
}           

Synonym matching

DELETE test_index
PUT /test_index
{
  "settings": {
      "analysis": {
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms": ["趙,錢,孫,李=>吳","周=>王"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [ "my_synonym" ]
          }
        }
      }
  }
}
GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["趙,錢,孫,李","周"]
}           

Upper/lower case

#case conversion
GET test_index/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"], 
  "text": ["AASD ASDA SDASD ASDASD"]
}
GET test_index/_analyze
{
  "tokenizer": "standard",
  "filter": ["uppercase"], 
  "text": ["asdasd asd asg dsfg gfhjsdf asfdg g"]
}
#uppercase tokens shorter than 5 characters
GET test_index/_analyze
{
  "tokenizer": "standard",
  "filter": {
    "type": "condition",
    "filter":"uppercase",
    "script": {
      "source": "token.getTerm().length() < 5"
    }
  }, 
  "text": ["asdasd asd asg dsfg gfhjsdf asfdg g"]
}           
(Screenshots of the _analyze results: lowercase conversion, uppercase conversion, and tokens shorter than 5 characters converted to uppercase.)

Stop words

https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-stop-tokenfilter.html

#stop words
DELETE test_index
PUT /test_index
{
  "settings": {
      "analysis": {
        "analyzer": {
          "my_analyzer自定義名字": {
            "type": "standard",
            "stopwords":["me","you"]
          }
        }
      }
  }
}
GET test_index/_analyze
{
  "analyzer": "my_analyzer自定義名字", 
  "text": ["Teacher me and you in the china"]
}

#####Returns: teacher, and, in, the, china

Official example:


Officially supported token filters

https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-stop-tokenfilter.html


4 Tokenizer: splitting text into tokens

  • Default tokenizer: standard (grammar-based splitting using Unicode text segmentation; for English this roughly means splitting on whitespace and punctuation)
  • Chinese tokenizer: ik

https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-whitespace-tokenizer.html
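To see why a dedicated Chinese tokenizer is needed, compare how the default standard tokenizer handles Chinese text. A minimal sketch (the sample sentence is arbitrary):

GET _analyze
{
  "tokenizer": "standard",
  "text": "中華人民共和國國歌"
}

standard falls back to one token per Chinese character (中, 華, 人, ...), which is why the ik tokenizer introduced in section 7 is usually used for Chinese.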

Configuring built-in analyzers

Built-in analyzers can be used directly without any configuration. Their default configuration can also be changed; for example, the standard analyzer can be configured with a stop word list:

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": { 
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": {
          "type":     "text",
          "analyzer": "standard", 
          "fields": {
            "english": {
              "type":     "text",
              "analyzer": "std_english" 
            }
          }
        }
      }
    }
  }
}
'           

In this example, we define a std_english analyzer based on the standard analyzer and configure it to remove the predefined list of English stop words. In the mapping that follows, the my_text field uses the standard analyzer and my_text.english uses std_english. As a result, the two requests below are analyzed as follows:

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text", 
  "text": "The old brown cow"
}
'
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text.english", 
  "text": "The old brown cow"
}
'           

The first, using the standard analyzer, is tokenized as: [ the, old, brown, cow ]

The second, analyzed with std_english, gives: [ old, brown, cow ]

--------------------------Standard Analyzer (default)---------------------------

If not otherwise specified, standard is the default analyzer. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

For example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'           

In the example above, the text produces the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]           

-------------------Example 3---------------------

The standard analyzer accepts the following parameters:

  • max_token_length: maximum token length; default 255
  • stopwords: a predefined stop word list such as _english_, or an array of stop words; default is _none_
  • stopwords_path: path to a file containing stop words
curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'           

The above outputs the following terms:

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]           

---------------------Definition--------------------

The standard analyzer consists of the following two parts:

Tokenizer

  • Standard Tokenizer

Token Filters

  • Standard Token Filter
  • Lower Case Token Filter
  • Stop Token Filter (disabled by default)

You can also rebuild it as a custom analyzer:

curl -X PUT "localhost:9200/standard_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"       
          ]
        }
      }
    }
  }
}
'           

-------------------- Simple Analyzer---------------------------

The simple analyzer breaks the text into terms whenever it encounters a character that is not a letter, and all terms are lowercased. For example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'           

The output is as follows:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]           

5 Common tokenizers and analyzers:

  • standard analyzer: the default analyzer; its Chinese support is poor because it splits Chinese text character by character.
  • keyword tokenizer: does no processing of the input text at all; the entire input is emitted as a single token (see the sketch after this list).
  • pattern tokenizer: splits the text into terms using a regular expression that matches the separators.
  • simple_pattern tokenizer: uses a regular expression to match the terms themselves, and is faster than the pattern tokenizer.
  • whitespace analyzer: splits on whitespace characters, e.g. "Tim_cookie" stays a single token.
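A small side-by-side sketch of the keyword and standard tokenizers on the same text (the sample string is arbitrary):

GET _analyze
{
  "tokenizer": "keyword",
  "text": "Elastic Stack 7.10"
}
GET _analyze
{
  "tokenizer": "standard",
  "text": "Elastic Stack 7.10"
}

keyword returns the whole string "Elastic Stack 7.10" as one token, while standard splits it into [Elastic, Stack, 7.10].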

6 Custom analyzers: custom analyzer

  • char_filter: built-in or custom character filters.
  • token filter (filter): built-in or custom token filters.
  • tokenizer: a built-in or custom tokenizer.

An analyzer is composed of zero or more character filters, exactly one tokenizer, and zero or more token filters.

Put plainly, a piece of text goes in and, after processing, comes out as individual words.

PUT custom_analysis
{
  "settings":{
    "analysis":{
      
    }
  }
}           
#Custom analyzer
#Step 1: the character filters receive the original text and can transform it by adding, removing or changing characters (here a mapping filter rewrites & => and and | => or, plus an html_strip filter that keeps <a> tags).
#Step 2: the tokenizer splits the text into individual tokens and outputs a token stream (here a pattern tokenizer splits on space, comma, period, exclamation mark and question mark).
#Step 3: the token filters receive the token stream and may add, remove or change tokens (here a stop filter drops common words and lowercase converts every token to lowercase).
DELETE custom_analysis
PUT custom_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        },
        "html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "filter": {
        "my_stopword": {
          "type": "stop",
          "stopwords": [
            "is",
            "in",
            "the",
            "a",
            "at",
            "for"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,.!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter", "html_strip_char_filter"],
          "filter": ["my_stopword", "lowercase"],
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}

GET custom_analysis/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["What is ,<a>as.df</a>  ss<p> in ? &</p> | is ! in the a at for "]
}           

------------------------------Custom example 2---------------------------------------------

curl -X PUT "localhost:9200/simple_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [         
          ]
        }
      }
    }
  }
}
'           

Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'           

The output is as follows:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]           

------------------------------Stop Analyzer-----------------

The stop analyzer is very similar to the simple analyzer; the only difference is that the stop analyzer adds support for removing stop words. The default stop word list is _english_.

(PS: Suppose the sentence is "this is a apple" and suppose "this" and "is" are stop words. simple would output [ this, is, a, apple ], while stop would output [ a, apple ]. That is the difference: stop does not emit stop words, i.e. it does not treat a stop word as a term.)

(PS: Stop words are common, low-information words that are dropped from the token stream rather than indexed.)

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'           

Output:

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]           

The stop analyzer accepts the following parameters:

  • stopwords: a predefined stop word list (for example, _english_) or an array containing stop words; default is _english_
  • stopwords_path: path to a file containing stop words, relative to the Elasticsearch config directory
curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}
'           

The above configures a stop analyzer with two stop words: the and over.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'           
Based on the configuration above, the output of this request is:

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]           

Pattern Analyzer

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'           

Since it splits on non-word characters by default, the output is:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]           

The pattern analyzer accepts the following parameters:

  • pattern: a Java regular expression; default \W+
  • flags: Java regular expression flags, such as CASE_INSENSITIVE or COMMENTS
  • lowercase: whether to lowercase all terms; default true
  • stopwords: a predefined stop word list, or an array containing stop words; default is _none_
  • stopwords_path: path to a stop words file
curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", 
          "lowercase": true
        }
      }
    }
  }
}
'           

The example above is configured to split on non-word characters or underscores, and all output terms are lowercased.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_email_analyzer",
  "text": "[email protected]"
}
'           

So, based on the configuration above, this example outputs:

[ john, smith, foo, bar, com ]           

Language Analyzers

These analyzers support text analysis for specific languages. The built-in (predefined) languages are: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
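For example, the built-in english analyzer combines lowercasing, English stop word removal and stemming; a minimal sketch:

GET _analyze
{
  "analyzer": "english",
  "text": "The quick brown foxes jumped over the lazy dogs"
}

The stop words are dropped and the remaining terms are stemmed, so foxes, jumped and dogs come back roughly as fox, jump and dog.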

7 Chinese analysis: the ik plugin

  1. Installing and deploying ik
  • Download: https://github.com/medcl/elasticsearch-analysis-ik
  • GitHub accelerator: https://github.com/fhefh2015/Fast-GitHub
  • Create the plugin folder: cd your-es-root/plugins/ && mkdir ik
  • Unpack the plugin into your-es-root/plugins/ik
  • Restart Elasticsearch
  2. IK files (IKAnalyzer.cfg.xml is the IK analysis configuration file)
  • Main dictionary: main.dic
  • English stop words: stopword.dic (these are not written into the inverted index)
  • Special dictionaries: quantifier.dic (units of measure), suffix.dic (administrative divisions), surname.dic (Chinese surnames), preposition.dic (function/filler words)
  • Custom dictionaries: internet slang, trending words, coined words, and so on
  3. The two analyzers provided by ik (see the sketch after this list):
  • ik_max_word splits the text at the finest granularity; for example, "中華人民共和國國歌" is split into "中華人民共和國, 中華人民, 中華, 華人, 人民共和國, 人民, 人, 民, 共和國, 共和, 和, 國國, 國歌", exhausting the possible combinations. Suitable for term queries.
  • ik_smart does the coarsest-grained split; for example, "中華人民共和國國歌" is split into "中華人民共和國, 國歌". Suitable for phrase queries.
  4. Hot updates via a remote dictionary file
  • Pros: easy to get started.
  • Cons: managing the dictionary is inconvenient (you edit files on disk directly and lookups are awkward), file reads/writes are not specially optimized, and there is an extra layer of interface calls and network transfer.
  • ik can also load dictionaries from MySQL; check driver version compatibility:
    https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-versions.html
    https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-versions.html
    Driver download: https://mvnrepository.com/artifact/mysql/mysql-connector-java
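Once the plugin is installed, the two ik analyzers can be compared directly with _analyze. A minimal sketch (it assumes the ik plugin from step 1 is installed on the node):

#requires the elasticsearch-analysis-ik plugin
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ["中華人民共和國國歌"]
}
GET _analyze
{
  "analyzer": "ik_smart",
  "text": ["中華人民共和國國歌"]
}

ik_max_word returns the exhaustive fine-grained terms listed above, while ik_smart returns only the coarse split (中華人民共和國, 國歌).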

Demo of downloading and installing:


Extending the dictionary:


Takes effect after restarting Elasticsearch =>
