A tokenizer receives a stream of characters, breaks it up into individual
tokens (usually individual words), and outputs a stream of tokens. For
instance, a whitespace tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
"Quick brown fox!" into the terms [Quick, brown, fox!].
The tokenizer is also responsible for recording the following:
- The order or position of each term (used for phrase and word proximity queries).
- The start and end character offsets of the original word which the term represents (used for highlighting search snippets).
- The token type, a classification of each term produced, such as `<ALPHANUM>`, `<HANGUL>`, or `<NUM>`. Simpler analyzers only produce the `word` token type.
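These attributes can be inspected with the `_analyze` API. A minimal sketch using the `whitespace` tokenizer on the example text above (the response shape may vary slightly between Elasticsearch versions):

```console
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}
```

Each token in the response carries `position`, `start_offset`, `end_offset`, and `type` fields, corresponding to the three attributes listed above.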
Elasticsearch has a number of built-in tokenizers which can be used to build custom analyzers.
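For example, a tokenizer can be wired together with token filters in an index's analysis settings. The sketch below combines the `standard` tokenizer with a `lowercase` filter; the index and analyzer names (`my-index`, `my_custom_analyzer`) are placeholders:

```console
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
```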
Word Oriented Tokenizers
The following tokenizers are usually used for tokenizing full text into individual words:
- Standard Tokenizer: The `standard` tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.
- Letter Tokenizer: The `letter` tokenizer divides text into terms whenever it encounters a character which is not a letter.
- Lowercase Tokenizer: The `lowercase` tokenizer, like the `letter` tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.
- Whitespace Tokenizer: The `whitespace` tokenizer divides text into terms whenever it encounters any whitespace character.
- UAX URL Email Tokenizer: The `uax_url_email` tokenizer is like the `standard` tokenizer except that it recognises URLs and email addresses as single tokens (see the sketch after this list).
- Classic Tokenizer: The `classic` tokenizer is a grammar-based tokenizer for the English language.
- Thai Tokenizer: The `thai` tokenizer segments Thai text into words.
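To see how these differ, the following sketch runs the same illustrative text through the `standard` and the `uax_url_email` tokenizers:

```console
POST _analyze
{
  "tokenizer": "standard",
  "text": "Email admin@example.com"
}

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email admin@example.com"
}
```

The `standard` tokenizer should split the address into separate terms (roughly [Email, admin, example.com]), while `uax_url_email` should keep `admin@example.com` as a single token.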
Partial Word Tokenizers
These tokenizers break up text or words into small fragments, for partial word matching:
- N-Gram Tokenizer: The `ngram` tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word: a sliding window of contiguous letters, e.g. `quick` → `[qu, ui, ic, ck]`.
- Edge N-Gram Tokenizer: The `edge_ngram` tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word which are anchored to the start of the word, e.g. `quick` → `[q, qu, qui, quic, quick]` (both are sketched after this list).
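Both examples above can be reproduced by passing an inline tokenizer definition to `_analyze`; the gram lengths below are chosen to match the examples rather than the defaults:

```console
POST _analyze
{
  "tokenizer": { "type": "ngram", "min_gram": 2, "max_gram": 2 },
  "text": "quick"
}

POST _analyze
{
  "tokenizer": { "type": "edge_ngram", "min_gram": 1, "max_gram": 5 },
  "text": "quick"
}
```

The first request should return [qu, ui, ic, ck] and the second [q, qu, qui, quic, quick].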
Structured Text Tokenizers
The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text:
- Keyword Tokenizer: The `keyword` tokenizer is a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like `lowercase` to normalise the analysed terms.
- Pattern Tokenizer: The `pattern` tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.
- Simple Pattern Tokenizer: The `simple_pattern` tokenizer uses a regular expression to capture matching text as terms. It uses a restricted subset of regular expression features and is generally faster than the `pattern` tokenizer.
- Char Group Tokenizer: The `char_group` tokenizer is configurable through sets of characters to split on, which is usually less expensive than running regular expressions.
- Simple Pattern Split Tokenizer: The `simple_pattern_split` tokenizer uses the same restricted regular expression subset as the `simple_pattern` tokenizer, but splits the input at matches rather than returning the matches as terms.
- Path Tokenizer: The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree, e.g. `/foo/bar/baz` → `[/foo, /foo/bar, /foo/bar/baz]` (reproduced in the sketch after this list).
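The path example can be checked directly with the `_analyze` API; a minimal sketch:

```console
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/foo/bar/baz"
}
```

The response should contain the three terms `/foo`, `/foo/bar`, and `/foo/bar/baz`, one per level of the hierarchy.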