认识 ElasticSearch Analyzer 分析器

做全文搜索就需要对文档分析、建索引。从文档中提取词元（Token）的算法称为分词器（Tokenizer），在分词前预处理的算法称为字符过滤器（Character Filter），进一步处理词元的算法称为词元过滤器（Token Filter），最后得到词（Term）。这整个分析算法称为分析器（Analyzer）。

文档包含词的数量称为词频（Frequency）。搜索引擎会建立词与文档的索引，称为倒排索引（Inverted Index）。

Analyzer 按顺序做三件事：

使用 CharacterFilter 过滤字符
使用 Tokenizer 分词
使用 TokenFilter 过滤词

每一部分都可以指定多个组件。

Elasticsearch 默认提供了多种 CharacterFilter、Tokenizer、TokenFilter、Analyzer，你也可以下载第三方的 Analyzer 等组件。

Analyzer 一般会提供一些设置。如 standard Analyzer 提供了

stop_words

停用词过滤配置。

以下样例构造了名为

standard

的 standard Analyzer 类型的带停用词列表的分析器：

PUT /my-index/_settings

{
  "index": {
    "analysis": {
      "analyzer": {
        "standard": {
          "type": "standard",
          "stop_words": [ "it", "is", "a" ]
        }
      }
    }
  }
}

你也可以通过 Setting API 构造组合自定义的 Analyzer。如：

PUT /my-index/_settings

{
  "index": {
    "analysis": {
      "analyzer": {
        "custom": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "snowball" ]
        }
      }
    }
  }
}

这构造了名为

custom

的 Analyzer，它完成以下工作：

使用 html_strip 字符过滤器，移除 html 标签
使用 standard 分词器，分词
使用 lowercase 词过滤器，转为小写单词
使用 stop 词过滤器，过滤停用词
使用 snowball 词过滤器，用 snowball 雪球算法提取词干

使用 Analyze API 分析给定文档，通过这种方式可以检查配置的行为是正确的。如：

POST /my-index/_analyze?analyzer=standard

quick brown

{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

在给目标索引建映射时，指定待分析的字段的分析器来使用我们构造的分析器。如：

PUT /my-index/_mapping/my-type

{
  "my-type": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "custom"
      }
    }
  }
}

如果希望使用多种分析器得到不同的分词，可以使用 multi-fields 特性，指定多个产生字段：

PUT /my-index/_mapping/my-type

{
  "my-type": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "standard",
        "fields": {
          "custom1": {
            "type": "string",
            "analyzer": "custom1"
          },
          "custom2": {
            "type": "string",
            "analyzer": "custom2"
          }
        }
      }
    }
  }
}

这样你可以通过

name

、

name.custom1

、

name.custom2

来使用不同的分析器得到的分词。

查询时也可以指定分析器。如：

POST /my-index/my-type/_search

{
  "query": {
    "match": {
      "name": {
        "query": "it's brown",
        "analyzer": "standard"
      }
    }
  }
}

或者在映射中分别指定他们。如：

PUT /my-index/_mapping/my-type

{
  "my-type": {
    "properties": {
      "name": {
        "type": "string",
        "index_analyzer": "custom",
        "search_analyzer": "standard" 
      }
    }
  }
}

然后索引一些文档，使用简单的 match 查询检查一下，如果发现问题，使用 Validate API 检查一下。如：

POST /my-index/my-type/_validate/query?explain

{
  "query": {
    "match": {
      "name": "it's brown"
    }
  }
}

现在，试着组合不同的分析器实现你的需求！

原文链接：https://blog.csdn.net/weixin_33826268/article/details/91918801

认识 ElasticSearch Analyzer 分析器

继续阅读

Small tricks

libsvm for python 安装

学习软件测试基础测试第七天

浅谈企业活动中进行数据分析的重要性

Zeppelin 配置访问 REST APIApache Zeppelin Configuration REST API

【Torch】最简洁logging使用指南

27. Remove Element(列表)题目代码

Ambari介绍和架构原理

Cloud Studio初体验

使用 ctypes 进行 Python 和 C 的混合编程

【python】【数据处理】画多维数据分布图

NOSQL安全攻击

【python】netconf协议对接管理设备

「Python 网络自动化」NETCONF —— Python 使用 NETCONF 管理配置 H3C 网络设备

win10本地scala和spark安装安装scala安装spark

在python中创建excel并写入