The whole analysis process is shown in the following figure:
As the first part of the figure shows, an analyzer consists of a tokenizer and one or more token filters.
ES lets users define custom analyzers in the elasticsearch.yml configuration file, as shown below:
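A minimal sketch of what such an entry might look like (this style of configuration works in older Elasticsearch releases; since 5.x, analyzers are instead defined per index in the index settings):

```yaml
# elasticsearch.yml -- registers a custom analyzer named myAnalyzer
index:
  analysis:
    analyzer:
      myAnalyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop]
```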
The configuration above registers a custom analyzer named myAnalyzer; once registered, it can be used directly when indexing or querying. Its behavior is similar to the standard analyzer: tokenizer: standard selects the standard tokenizer, while filter: [standard, lowercase, stop] chains the standard, lowercase, and stop-word token filters.
By default, Elasticsearch's standard analyzer splits Chinese text into individual characters, so the results often fail to meet expectations: after analyzing Chinese text, what should be a single word becomes a run of isolated characters. That is why we switch to the IK Chinese analyzer plugin (es-ik, i.e. elasticsearch-analysis-ik), which segments Chinese far better.
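You can see this behavior with the _analyze API. A quick check (assuming a local node on port 9200; the sample text is chosen purely for illustration) returns one token per character:

```bash
curl -XGET 'http://localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
  "analyzer": "standard",
  "text": "中华人民共和国"
}'
# The standard analyzer emits one token per Chinese character:
# 中 / 华 / 人 / 民 / 共 / 和 / 国
```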
IK ships with two analyzers:

- ik_max_word: splits the text at the finest granularity, emitting every plausible word it can find (many, often overlapping tokens).
- ik_smart: splits the text at the coarsest granularity, keeping only the most likely segmentation (fewer, longer tokens).

The difference, in short: ik_max_word favors recall, while ik_smart favors precision.
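To compare the two, run the same text through both with the _analyze API (a sketch, assuming the IK plugin is installed; the sample text is again just for illustration):

```bash
# Finest-grained: emits many overlapping candidate words
curl -XGET 'http://localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}'

# Coarsest-grained: typically emits the whole phrase as a single token
curl -XGET 'http://localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}'
```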
Now let's create an index that uses IK. Create an index named iktest, register an analyzer called ik whose tokenizer is ik_max_word, and define an article type with a subject field that is analyzed with ik_max_word.
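A sketch of that index-creation request (assuming an Elasticsearch version that still uses mapping types, e.g. 5.x):

```bash
# Create iktest: a custom analyzer "ik" built on the ik_max_word tokenizer,
# and an article type whose subject field is analyzed with ik_max_word
curl -XPUT 'http://localhost:9200/iktest?pretty' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik": {
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "subject": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}'
```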
Next, add a few documents in bulk. I specify the metadata _id explicitly so the results are easy to inspect; the subject values are the titles of a few news articles I found at random.
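A sketch of the bulk request; the _id values are explicit, and the headlines below are placeholders I made up for illustration (the originals were Chinese news titles):

```bash
# _bulk expects one action line followed by one source line per document,
# ending with a trailing newline
curl -XPOST 'http://localhost:9200/iktest/article/_bulk?pretty' -H 'Content-Type: application/x-ndjson' -d '
{ "index": { "_id": "1" } }
{ "subject": "“希拉里邮件门”调查再起波澜" }
{ "index": { "_id": "2" } }
{ "subject": "韩国举行大规模军事演习" }
{ "index": { "_id": "3" } }
{ "subject": "希拉里与特朗普举行第三场电视辩论" }
'
```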
Now query for "Hillary and South Korea":
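A sketch of that search, matching against the subject field and asking Elasticsearch to wrap each hit in a red font tag (the tag choice is mine; highlighting defaults to <em> tags otherwise):

```bash
curl -XPOST 'http://localhost:9200/iktest/article/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "subject": "希拉里和韩国" } },
  "highlight": {
    "pre_tags": ["<font color=red>"],
    "post_tags": ["</font>"],
    "fields": { "subject": {} }
  }
}'
```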
The highlight attribute is used here: the response can be rendered directly as HTML, with the matched words highlighted in red. If you want exact, filter-style matching instead, just change the match query to a term query.