Learning Elastic Stack 7.0(Second Edition)

Tokenizer

An analyzer has exactly one tokenizer. The responsibility of a tokenizer is to receive a stream of characters and generate a stream of tokens, which are then used to build an inverted index. A token is roughly equivalent to a word. In addition to breaking the character stream into words or tokens, a tokenizer also records, in its output, the start and end offset of each token within the input stream.
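To make the idea concrete, here is a minimal Python sketch of what a tokenizer produces. This is an illustration only, not Elasticsearch's actual implementation: it splits on word characters and records each token's start and end offset, mirroring the shape of the output described above.

```python
import re

def simple_tokenize(text):
    """Illustrative tokenizer: emit one entry per word, with the
    start and end offset of that word in the original input."""
    return [
        {
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
        }
        for match in re.finditer(r"\w+", text)
    ]

tokens = simple_tokenize("Learning Elastic Stack")
# Each token carries its position in the original character stream,
# e.g. "Elastic" starts at offset 9 and ends at offset 16.
```

The offsets are what allow features such as highlighting to map a matched token back to its exact location in the original text.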

Elasticsearch ships with a number of tokenizers that can be used to compose a custom analyzer; these tokenizers are also used by Elasticsearch itself to compose its built-in analyzers.

You can find a list of available built-in tokenizers here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html.

The standard tokenizer is one of the most popular tokenizers, as it is suitable for most languages. Let's look at what the standard tokenizer does.
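You can try any tokenizer directly with the `_analyze` API. For example, the following request (the sample text is arbitrary) asks Elasticsearch to run only the standard tokenizer, without any other analyzer components:

```json
POST _analyze
{
  "tokenizer": "standard",
  "text": "Tokenizer docs demo"
}
```

The response lists each token together with its `start_offset`, `end_offset`, and position, which makes `_analyze` a convenient way to inspect how a tokenizer will break up your text before you build a custom analyzer around it.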