Elasticsearch中内置了几种分词器：

Standard Analyzer - 默认分词器，按词切分，小写处理
Simple Analyzer - 按照非字母切分(符号被过滤), 小写处理
Stop Analyzer - 小写处理，停用词过滤(the,a,is)
Whitespace Analyzer - 按照空格切分，不转小写
Keyword Analyzer - 不分词，直接将输入当作输出
Patter Analyzer - 正则表达式，默认\W+(非字符分割)
Language - 提供了30多种常见语言的分词器，不支持中文
Customer Analyzer 自定义分词器

而倒排索引的过程就是将文档通过分词器分成一个个的 Term，每个单词指向包含这个 Term 的文档集合。

尽管 ES 提供了这么多的分词器，但是对于中文的支持却不尽人意，默认的分词器会将每一个字看成一个词。

IKAnalyzer 介绍

IKAnalyzer 是免费开源的java分词器,目前比较流行的中文分词器之一,简单,稳定,想要特别好的效果,需要自行维护词库,支持自定义词典。

安装：将下载好的 ik 压缩包下载并解压至 elasticsearch 目录下的 plugins 目录中，启动es即可

分词算法

IK 分词器提供了两种分词算法

ik_smart：最少切分
ik_max_word：最细粒度划分

同样输入一段话 “我是程序员”，以下是两种算法的不同表现

GET _analyze
{
  "analyzer":"ik_smart",
  "text" : "我是程序员"
}

ik_smart

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "程序员",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

ik_max_word

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "程序员",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "程序",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "员",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 4
    }
  ]
}

区别

ik_smart 将 “我是程序员” 分割成 [“我”, “是”, “程序员”] 三个词

ik_max_word 在 ik_smart 的基础上将 “程序员” 继续划分成 [“程序”, “员”]

自定义词典

创建字典文件

这里在${ik}/config/custom 目录下创建自定义的字典 custom.dic

1
2

钢铁侠
燕双鹰

修改配置文件，引入自己创建的 custom.dic

${ik}/config/IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<!--用户可以在这里配置自己的扩展字典 -->
	<entry key="ext_dict">custom/custom.dic</entry>
	 <!--用户可以在这里配置自己的扩展停止词字典-->
	<entry key="ext_stopwords"></entry>
</properties>

重启ES

测试：

输入：

GET _analyze
{
  "analyzer":"ik_smart",
  "text" : "我是钢铁侠"
}

修改前：

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "钢铁",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "侠",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    }
  ]
}

修改后

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "钢铁侠",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}