
Tokenizer
An analyzer has exactly one tokenizer. A tokenizer receives a stream of characters and generates a stream of tokens, which are then used to build the inverted index. A token is roughly equivalent to a word. In addition to breaking the character stream into tokens, the tokenizer also records the start and end offset of each token in the input stream.
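To make the idea concrete, here is a minimal sketch of the token-plus-offsets output a tokenizer produces. This toy function simply splits on whitespace (a real Elasticsearch tokenizer does much more); the function name and tuple format are illustrative assumptions, not Elasticsearch's API:

```python
import re

def whitespace_tokenize(text):
    # Toy tokenizer: treat each run of non-whitespace characters
    # as a token, and record its start/end character offsets,
    # analogous to what an Elasticsearch tokenizer emits.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]

tokens = whitespace_tokenize("quick brown fox")
# [('quick', 0, 5), ('brown', 6, 11), ('fox', 12, 15)]
```

The offsets are what later allow features such as highlighting to map a matched token back to its exact position in the original text.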
Elasticsearch ships with a number of tokenizers that can be used to compose custom analyzers; Elasticsearch itself uses these same tokenizers to compose its built-in analyzers.
You can find a list of available built-in tokenizers here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html.
The standard tokenizer is one of the most popular tokenizers, as it is suitable for most languages. Let's look at what the standard tokenizer does.
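The standard tokenizer splits text on word boundaries as defined by the Unicode Text Segmentation algorithm (UAX #29) and discards most punctuation. As a rough sketch only, the behavior on simple English text can be approximated with a word-character regex (this regex is an assumption for illustration and is not the real algorithm):

```python
import re

def standard_like_tokenize(text):
    # Rough approximation: the real standard tokenizer implements
    # Unicode Text Segmentation (UAX #29). Here we simply treat
    # runs of word characters as tokens, which drops punctuation
    # and splits hyphenated words, much like the real tokenizer.
    return [m.group() for m in re.finditer(r"\w+", text)]

result = standard_like_tokenize("The 2 QUICK Brown-Foxes jumped!")
# ['The', '2', 'QUICK', 'Brown', 'Foxes', 'jumped']
```

Note that the tokenizer alone does not lowercase anything; lowercasing is the job of a token filter applied later in the analyzer chain.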