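Copyright notice

This article is the copyright of the Elastic open-source community, which vouches for its independence and originality. Without authorization and permission, no organization or individual may distribute, copy, or share it in any form; if you repost it, include a link to the original. Violations will be pursued legally.

Producing this material takes real effort, so please respect other people's work. Distributing, copying, or appropriating others' work or article content for commercial or profit-making purposes is strictly prohibited!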

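1、Analyzer fundamentals

1.1 Basic concepts

The official documentation calls this component a text analyzer. As the name suggests, it is a means of analyzing and processing text: following predefined rules, it splits the original document into smaller terms, and the granularity of those terms depends on the analyzer's rules.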


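1.2 When analysis happens

Analysis takes place at two points: Index Time and Search Time.

  • Index Time: when a document is written and the inverted index is built. The analysis logic is determined by the analyzer mapping parameter.
  • Search Time: when a search runs. Analysis applies only to the search terms.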

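1.3 Components of an analyzer

  • Tokenizer: defines the logic for splitting text into tokens
  • Token Filter: defines how each individual term is processed after tokenization
  • Character Filter: processes individual characters before the text is tokenized

Note

  • An analyzer never changes the source data; analysis only affects the inverted index or the search terms.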
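
To make the three components concrete, here is a minimal Python sketch of the processing order: character filter, then tokenizer, then token filter. It is a toy model, not how Elasticsearch is implemented; the function names (`strip_html`, `whitespace_tokenizer`, `lowercase_filter`, `analyze`) are made up for illustration, and a plain whitespace split stands in for a real tokenizer.

```python
import re

def strip_html(text):
    # Character filter: runs on the raw string before tokenization.
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text):
    # Tokenizer: splits the filtered string into terms.
    return text.split()

def lowercase_filter(tokens):
    # Token filter: transforms each term after tokenization.
    return [t.lower() for t in tokens]

def analyze(text):
    # Hypothetical pipeline: char filter -> tokenizer -> token filter.
    return lowercase_filter(whitespace_tokenizer(strip_html(text)))

source = "<p>Hello Elastic World</p>"
terms = analyze(source)
print(terms)   # ['hello', 'elastic', 'world']
print(source)  # <p>Hello Elastic World</p>
```

The source string is untouched after the call, which mirrors the note above: only the derived terms change, never the original data.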

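2、Document normalization: Normalization

2.1 Processors

  • Case unification
  • Tense conversion
  • Stop words: words such as interjections and prepositions that carry no search value in most scenarios

Note: normalization is not limited to the items above; what actually happens depends on how the analyzer is defined.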


2.2 Purpose

  • Increases recall
  • Reduces the number of match operations, which in turn improves query performance
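
The effect on recall can be shown with a small Python sketch. It assumes a toy matcher in which a document matches when every query term appears among the document's terms; `STOP_WORDS`, `normalize`, and `matches` are hypothetical names, and the stop word list is deliberately tiny.

```python
STOP_WORDS = {"the", "is", "a"}  # illustrative list, not the Elasticsearch default

def normalize(text):
    # Lowercase every term, then drop stop words.
    terms = [t.lower() for t in text.split()]
    return [t for t in terms if t not in STOP_WORDS]

def matches(doc_terms, query_terms):
    # Toy matcher: every query term must be present in the document terms.
    return bool(query_terms) and all(q in doc_terms for q in query_terms)

doc = "The Quick Fox"
query = "quick fox"

raw_hit = matches(doc.split(), query.split())
normalized_hit = matches(normalize(doc), normalize(query))

print(raw_hit)             # False: "quick" does not equal "Quick"
print(normalized_hit)      # True: both sides were lowercased
print(len(doc.split()), len(normalize(doc)))  # 3 2
```

Without normalization the query misses the document purely because of case, and the normalized document also carries fewer terms to compare against.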

2.3 _analyze API

The _analyze API shows the tokens that a given analyzer produces.

The syntax is as follows:

GET _analyze
{
  "text": ["What are you doing!"],
  "analyzer": "english"
}

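3、Tokenizer

The tokenizer is one of the core components of an analyzer. Its job is tokenization: splitting the original text into fine-grained pieces. Each piece is called a Term.

You can think of a tokenizer as a predefined set of splitting rules.

Elasticsearch ships with many built-in tokenizers; the default is standard.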

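4、Token Filter

4.1 Overview

A token filter processes the terms produced by the tokenizer, for example converting case, removing stop words, or handling synonyms.

Elasticsearch also ships with many token filters, which cover most day-to-day development needs. Third parties can develop their own as well.

4.2 Examples

The examples below demonstrate basic usage of several token filters.

4.2.1 Lowercase and Uppercase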

GET _analyze
{
  "filter" : ["lowercase"],
  "text" : "WWW ELASTIC ORG CN"
}

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["uppercase"],
  "text" : ["www.elastic.org.cn","www elastic org cn"]
}

4.2.2 Stop words

Stop words are terms that are removed after tokenization. The stop word list can be customized.

English stop words (english): a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.

CJK stop words (cjk): a, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, s, such, t, that, the, their, then, there, these, they, this, to, was, will, with, www.

GET _analyze
{
  "tokenizer": "standard", 
  "filter": ["stop"],
  "text": ["What are you doing"]
}

### Custom filter
DELETE test_token_filter_stop
PUT test_token_filter_stop
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": [
            "www"
          ],
          "ignore_case": true
        }
      }
    }
  }
}
GET test_token_filter_stop/_analyze
{
  "tokenizer": "standard", 
  "filter": ["my_filter"], 
  "text": ["What www WWW are you doing"]
}
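
The role of ignore_case in the custom filter above can be sketched in Python. This is only an illustration of the matching rule, with a whitespace split standing in for the standard tokenizer; `stop_filter` is a hypothetical helper, not an Elasticsearch API.

```python
def stop_filter(tokens, stopwords, ignore_case=False):
    # Remove tokens that appear in the stop word list.
    if ignore_case:
        blocked = {w.lower() for w in stopwords}
        return [t for t in tokens if t.lower() not in blocked]
    blocked = set(stopwords)
    return [t for t in tokens if t not in blocked]

tokens = "What www WWW are you doing".split()

print(stop_filter(tokens, ["www"], ignore_case=True))
# ['What', 'are', 'you', 'doing']
print(stop_filter(tokens, ["www"]))
# ['What', 'WWW', 'are', 'you', 'doing']
```

With case-sensitive matching only the exact spelling "www" is removed, so "WWW" survives; with ignore_case both spellings are dropped while the remaining tokens keep their original case.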

4.2.3 Synonyms

Synonym rule formats

  • a, b, c => d: a, b, and c are replaced by d.
  • a, b, c, d: a, b, c, and d are treated as equivalent.
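
The difference between the two rule formats can be illustrated with a short Python sketch. It assumes a simplified model in which an equivalence rule expands each term to the full group; `parse_rule` and `expand` are hypothetical helpers, and real synonym filters handle multi-word terms and other options that are ignored here.

```python
def parse_rule(rule):
    """Turn one synonym rule into a {term: [replacement terms]} mapping."""
    if "=>" in rule:
        # Explicit mapping: every term on the left is replaced by the right side.
        left, right = rule.split("=>")
        sources = [t.strip() for t in left.split(",")]
        targets = [t.strip() for t in right.split(",")]
        return {s: targets for s in sources}
    # Equivalence: every term stands for the whole group.
    terms = [t.strip() for t in rule.split(",")]
    return {t: terms for t in terms}

def expand(token, mapping):
    # Tokens without a rule pass through unchanged.
    return mapping.get(token, [token])

replace_rule = parse_rule("good, nice => excellent")
equal_rule = parse_rule("good, nice, excellent")

print(expand("good", replace_rule))  # ['excellent']
print(expand("nice", equal_rule))    # ['good', 'nice', 'excellent']
print(expand("bad", equal_rule))     # ['bad']
```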

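Ways to define synonyms

  • Inline: declare the rules directly inside the synonym filter
  • File: define the rules in a file; the path is relative to the ES config directory.

Code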

PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": [ "good, nice => excellent" ] //good, nice, excellent
        }
      }
    }
  }
}
GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard", 
  "filter": ["my_synonym"], 
  "text": ["good"]
}

DELETE test_token_filter_synonym
PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}
GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard", 
  "text": ["a"], // a b c d s; q w e r ss
  "filter": ["my_synonym"]
}

5、Character Filter

5.1 Basic concepts

A character filter preprocesses text before tokenization, filtering out unwanted characters.

5.2 Basic usage

5.2.1 ****

PUT <index_name>
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "<char_filter_type>"
        }
      }
    }
  }
}

5.2.2 ****

  • type:************************,************

    • html_strip
    • mapping
    • pattern_replace

5.3 ************** Char Filter

5.3.1 HTML **********:HTML Strip Character Filter

**************** HTML ********** HTML ****,** 、&

PUT test_char_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip", // html_strip ******** HTML **********
          "escaped_tags": [			// ********** a ****
            "a"
          ]
        }
      }
    }
  }
}
GET test_char_filter/_analyze
{
  "tokenizer": "standard", 
  "char_filter": ["my_char_filter"],
  "text": ["<p>I&apos;m so <a>happy</a>!</p>"]
}

****:

  • escaped_tags:********** html ****

5.3.2 **************:Mapping Character Filter

********************,************************

PUT test_html_strip_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",	// mapping **********************
          "mappings": [				// ****************************** => **********
            "** => *",
            "** => *",
            "** => *"
          ]
        }
      }
    }
  }
}
GET test_html_strip_filter/_analyze
{
  //"tokenizer": "standard", 
  "char_filter": ["my_char_filter"],
  "text": "************!**"
}

5.3.3 **************:Pattern Replace Character Filter

PUT text_pattern_replace_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",	// pattern_replace **********************			
          "pattern": """(\d{3})\d{4}(\d{4})""",	// **********
          "replacement": "$1****$2"
        }
      }
    }
  }
}
GET text_pattern_replace_filter/_analyze
{
  "char_filter": ["my_char_filter"],
  "text": "************18868686688"
}
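
As a cross-check of what the pattern and replacement above do, the same regular expression can be run in Python. This sketch only mirrors the regex step and says nothing about the rest of the analysis chain; `mask_phone` is a hypothetical helper, and note that the Java-style group reference `$1` is written `\1` in Python's `re` module.

```python
import re

# Same pattern as in the char filter: keep the first 3 and last 4 digits.
PATTERN = re.compile(r"(\d{3})\d{4}(\d{4})")

def mask_phone(text):
    # "$1****$2" in the filter corresponds to r"\1****\2" here.
    return PATTERN.sub(r"\1****\2", text)

print(mask_phone("18868686688"))           # 188****6688
print(mask_phone("call 18868686688 now"))  # call 188****6688 now
```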

6、**********:

  • Standard ★:**********,****************,**********。********:standard
  • Pattern:****************,********************。********:pattern
  • Simple:******************,**************,********:simple
  • Whitespace ★:************,**************,********:whitespace
  • Keyword ★:******************************,************************,********:keyword
  • Stop:********** Simple Analyzer ****,************************。********:stop
  • Language Analyzer:************************。
  • Fingerprint:******************,******

7、************:Custom Analyzer

7.1 **********************

**** ES **********************,****************、**********、******************************************。****************************************:

  • Tokenizer:**********************************,**************************。
  • Token Filter:********************,************************
  • Char Filter:********************,************************

7.2 type ****

PUT <index_name>
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {	 // ****************
          "type": "<value>", 		 // **** ES **********,************ custom ****************
          ...
        }
      }
    }
  }
}
  • type:********** 6 ****************,**********custom**************

7.3 ****

PUT test_analyzer
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,.!?]"
        }
      },
      "char_filter": {
        "html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "a"
          ]
        },
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "** => *",
            "** => *",
            "** => *"
          ]
        }
      },
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": [
            "www"
          ],
          "ignore_case": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter",
            "html_strip_char_filter"
          ],
          "filter": [
            "my_filter",
            "uppercase"
          ],
          "tokenizer": "my_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

GET test_analyzer/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["asd****a**sd,<a>www</a>.elastic!org?<p>cnelasticsearch</p><b><span>"]
}
PUT test_analyzer/_doc/1
{
  "title":"asd****a**sd,<a>www</a>.elastic!org?<p>cnelasticsearch</p><b><span>"
}

8、analyzer ** search_analyzer

8.1 ********

  • analyzer:******************,****************,******************,******source data
  • search_analyzer:************,**********************,**********,****************。
  • ** search_analyzer ********,********** analyzer,** analyzer ******,search_analyzer ** analyzer ********standard

8.2 ********

DELETE test_analyzer
PUT test_analyzer
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,.!?]"
        },
        "my_search_tokenizer": {
          "type": "pattern",
          "pattern": "[<>(){}]"
        }
      },
      "char_filter": {
        "html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "a"
          ]
        },
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "** => *",
            "** => *",
            "** => *"
          ]
        }
      },
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": [
            "www"
          ],
          "ignore_case": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter",
            "html_strip_char_filter"
          ],
          "filter": [
            "my_filter"
          ],
          "tokenizer": "my_tokenizer"
        },
        "my_search_analyzer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter",
            "html_strip_char_filter"
          ],
          "filter": [
            "my_filter"
          ],
          "tokenizer": "my_search_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}

GET test_analyzer/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["asd****a**sd,<a>www</a>.elastic!org?<p>cnelasticsearch</p><b><span>"]
}
GET test_analyzer/_analyze
{
  "analyzer": "my_search_analyzer",
  "text": ["ASD****a**sd<<a>www</a>>elastic)org(<p>cnELASTICSEARCH</p><b><span>"]
}

PUT test_analyzer/_doc/1
{
  "title":"asd****a**sd,<a>www</a>.elastic!org?<p>cnelasticsearch</p><b><span>"
}
GET test_analyzer/_search
{
  "query": {
    "match": {
      "title": "ASD****a**sd<<a>www</a>>)(<p>cnELASTICSEARCH</p><b><span>"
    }
  }
}

9、************:Normalizers

9.1 ****

normalizer ** analyzer **********,******************,****************normalizer******************,******** normalizer **** tokenizer

**** normalizer ******** keyword ************,**************** keyword **************************,************************** normalizer

9.2 ********

  • normalizer ******** keyword ****
  • normalizer ********************

9.3 ****

PUT test_normalizer
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "filter": [
            "lowercase"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "uppercase"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      },
      "content": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

PUT test_normalizer/_doc/1
{
  "title":"ELASTIC Org cn",
  "content":"ELASTIC Org cn"
}

GET test_normalizer/_search
{
  "query": {
    "match": {
      "title": "ELASTIC"
    }
  }
}
GET test_normalizer/_search
{
  "query": {
    "match": {
      "content": "ELASTIC"
    }
  }
}
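
The two searches above behave differently, and a small Python model can make the reason visible. It assumes that a normalizer keeps the whole value as one term and that a match needs at least one shared term; `normalizer`, `analyzer`, and `match` are hypothetical stand-ins, with a whitespace split in place of the standard tokenizer.

```python
def normalizer(value):
    # No tokenizer: the whole value stays one term; only lowercase is applied.
    return [value.lower()]

def analyzer(value):
    # Whitespace split stands in for the standard tokenizer, then uppercase.
    return [t.upper() for t in value.split()]

def match(index_terms, query_terms):
    # Toy matcher: a hit needs at least one query term in the index terms.
    return any(q in index_terms for q in query_terms)

doc = "ELASTIC Org cn"

title_hit = match(normalizer(doc), normalizer("ELASTIC"))
content_hit = match(analyzer(doc), analyzer("ELASTIC"))
full_title_hit = match(normalizer(doc), normalizer("elastic ORG CN"))

print(normalizer(doc))   # ['elastic org cn']
print(analyzer(doc))     # ['ELASTIC', 'ORG', 'CN']
print(title_hit, content_hit, full_title_hit)  # False True True
```

Under these assumptions the keyword field only matches when the whole value is supplied (in any case), while the text field matches on a single word.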

10、**********

10.1 **********

10.1.1 ****

10.1.2 ****

  • **************:cd {es-root-path}/plugins/ && mkdir ik
  • ********************:{es-root-path}/plugins/ik
  • ******** ES ****

10.2 ********

10.2.1 ************

  • IKAnalyzer.cfg.xml:IK************

  • ******:main.dic

  • **********:stopword.dic,********************

  • ********:

    • quantifier.dic:********:**********
    • suffix.dic:********:********
    • surname.dic:********:******
    • preposition:********:******
  • **********:********、******、**************。

10.2.2 ik ********** analyzer:

  • ik_max_word:************************,********“******************”******“**************,********,****,****,**********,****,**,**,******,****,**,****,****”,********************,**** Term Query;
  • ik_smart:******************,********“******************”******“**************,****”,**** Phrase ****。

10.2.3 ************

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ["******************"]
}

GET _analyze
{
  "analyzer": "ik_smart",
  "text": ["******************"]
}

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ["******************************G"]
}

PUT test_ik
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

PUT test_ik/_doc/1
{
  "title":"******************,**************"
}

GET test_ik/_search
{
  "query": {
    "match": {
      "title": "****"
    }
  }
}

10.3 ****************

10.3.1 ********

********:IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer ********</comment>
	<!--******************************** -->
	<entry key="ext_dict">custom/es_extend.dic;custom/es_buzzword.dic</entry>
	 <!--**************************************-->
	<entry key="ext_stopwords"></entry>
	<!--****************************** -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!--************************************-->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
  • ****************** ik/config
  • ********************;****。

10.3.2 ******

  • ****:********,************
  • ****:************
  • ****:****************************

10.4 ******************

10.4.1 ********

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer ********</comment>
	<!--******************************** -->
	<entry key="ext_dict">custom/es_extend.dic;custom/es_buzzword.dic</entry>
	 <!--**************************************-->
	<entry key="ext_stopwords"></entry>
	<!--****************************** -->
	<entry key="remote_ext_dict">http://localhost:9081/api/hotWord?wordlib=1</entry>
	<!--************************************-->
	<entry key="remote_ext_stopwords">http://localhost:9081/api/hotWord?wordlib=0</entry>
</properties>

10.4.2 Java ****

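// Servlet imports assume Spring Boot 2.x (javax.servlet); Spring Boot 3.x uses jakarta.servlet.
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.servlet.http.HttpServletResponse;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping(value = "/api")
public class ApiController {
    // wordlib=1 serves the hot-word dictionary; any other value serves the stop-word dictionary.
    @RequestMapping(value = "hotWord")
    public void msbHotword(HttpServletResponse response, Integer wordlib) throws IOException {
        String path = Integer.valueOf(1).equals(wordlib)
                ? "/Users/jiuchuan/Desktop/es_buzzword.dic"
                : "/Users/jiuchuan/Desktop/es_stopwords.dic";
        // Read the whole file in one call. The original read loop never advanced
        // `offset` and would spin forever on an empty file.
        byte[] buffer = Files.readAllBytes(Paths.get(path));
        response.setContentType("text/plain;charset=utf-8");
        // The byte length doubles as the change marker in both headers, so an
        // edit that keeps the file length the same leaves both headers unchanged.
        response.setHeader("Last-Modified", String.valueOf(buffer.length));
        response.setHeader("ETag", String.valueOf(buffer.length));
        OutputStream out = response.getOutputStream();
        out.write(buffer);
        out.flush();
    }
}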

10.4.3 ******

  • ****:

    • ********
    • **********
    • ******
  • ****:

    • ****************,**********************,************
    • ********************************
    • ************************

10.5 **** MySQL ************

10.5.1 ********

********:******

10.5.2 ************

1、**************

2、************

**************:

java.sql.SQLNonTransientConnectionException: Could not create connection to database server.
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:526) ~[?:?]
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:513) ~[?:?]
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:505) ~[?:?]
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:479) ~[?:?]
	at com.mysql.cj.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:1779) ~[?:?]
	at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:1596) ~[?:?]
	at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:633) ~[?:?]
	at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:347) ~[?:?]
	at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:219) ~[?:?]
	at java.sql.DriverManager.getConnection(DriverManager.java:683) ~[java.sql:?]
	at java.sql.DriverManager.getConnection(DriverManager.java:230) ~[java.sql:?]
	at org.wltea.analyzer.dic.Dictionary.loadMySQLExtDict(Dictionary.java:468) ~[?:?]
	at org.wltea.analyzer.dic.Dictionary.loadMainDict(Dictionary.java:407) ~[?:?]
	at org.wltea.analyzer.dic.Dictionary.reLoadMainDict(Dictionary.java:659) ~[?:?]
	at org.wltea.analyzer.dic.HotDict.run(HotDict.java:6) ~[?:?]
	at java.lang.Thread.run(Thread.java:1589) ~[?:?]
Caused by: java.security.AccessControlException: access denied ("java.net.SocketPermission" "127.0.0.1:3306" "connect,resolve")
	at java.security.AccessControlContext.checkPermission(AccessControlContext.java:485) ~[?:?]
	at java.security.AccessController.checkPermission(AccessController.java:1068) ~[?:?]
	at java.lang.SecurityManager.checkPermission(SecurityManager.java:411) ~[?:?]
	at java.lang.SecurityManager.checkConnect(SecurityManager.java:914) ~[?:?]
	at java.net.Socket.connect(Socket.java:661) ~[?:?]
	at com.mysql.cj.core.io.StandardSocketFactory.connect(StandardSocketFactory.java:202) ~[?:?]
	at com.mysql.cj.mysqla.io.MysqlaSocketConnection.connect(MysqlaSocketConnection.java:57) ~[?:?]
	at com.mysql.cj.mysqla.MysqlaSession.connect(MysqlaSession.java:122) ~[?:?]
	at com.mysql.cj.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:1726) ~[?:?]

********:********************

SELECT User, Host FROM mysql.user;


**********:

**** jdk **********,** jdk****/conf/security/java.policy**************:

grant {
    permission java.lang.RuntimePermission "getClassLoader";
    permission java.lang.RuntimePermission "createClassLoader";
    permission java.lang.RuntimePermission "setContextClassLoader";
    permission java.net.SocketPermission "127.0.0.1:3306","connect,resolve";
}

****************************,**********************************。

11、********

11.1 ****

  • char filter ** token filter ********************,********************

11.2 ********

//****
DELETE test_token_filter_synonym
PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["my_synonym"]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

//****
POST test_token_filter_synonym/_bulk
{"index":{}}
{"title":"a"}
{"index":{}}
{"title":"b"} 
{"index":{}}
{"title":"c"} 
{"index":{}}
{"title":"d"} 
{"index":{}}
{"title":"s"} 
{"index":{}}
{"title":"q"} 
{"index":{}}
{"title":"w"} 
{"index":{}}
{"title":"e"} 
{"index":{}}
{"title":"r"} 
{"index":{}}
{"title":"ss"}

GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_synonym"],
  "text": ["a"]
}

GET test_token_filter_synonym/_search
GET test_token_filter_synonym/_search
{
  "query": {
    "match": {
      "title": "q"
    }
  }
}
