Word breakers in ES
A full-text search engine analyzes the documents to be indexed and extracts tokens from them; the algorithm that does this splitting is called a tokenizer. The tokens are then further processed, for example lowercased; these processing steps are called token filters, and each processed result is called a term. The number of times a term appears in a document is its term frequency. The engine builds an inverted index from terms to the original documents, so that source documents can be found quickly by term. Before the tokenizer processes the text there may also be some preprocessing, such as stripping HTML tags; these steps are called character filters. The whole chain, character filters plus tokenizer plus token filters, is called an analyzer.

The whole analysis process runs in this order: character filters → tokenizer → token filters → terms.

As this pipeline shows, an analyzer is composed of a tokenizer together with its filters.
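The pipeline is easy to inspect with the _analyze API. A minimal sketch, assuming an ES node on localhost:9200 and the older (pre-5.x) query-string request form used throughout this article:

```
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty' -d 'The QUICK Brown Foxes'
```

The standard analyzer tokenizes on word boundaries and lowercases, so the response lists the terms the, quick, brown and foxes.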

ES allows users to customize Analyzer analyzer through the configuration file elasticsearch.yml, as shown below:
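A minimal sketch of such a configuration, assuming a pre-5.0 ES version (later versions only accept analyzer definitions in index settings, not in elasticsearch.yml):

```
index:
  analysis:
    analyzer:
      myAnalyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop]
```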

The configuration above registers an analyzer named myAnalyzer. Once registered, it can be used directly when indexing or querying. Its behavior is similar to the built-in standard analyzer: tokenizer: standard selects the standard tokenizer, and filter: [standard, lowercase, stop] applies the standard, lowercase and stop-word token filters in turn.

By default, the standard tokenizer that ElasticSearch uses splits Chinese text into individual characters, so the results often fail to meet expectations: after analyzing Chinese text, what should be a word ends up as isolated characters. For better results we use the Chinese word segmentation plugin es-ik (elasticsearch-analysis-ik).
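You can see this character-by-character behavior directly; a sketch, again assuming a local node:

```
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty' -d '中华人民共和国'
```

Each character comes back as its own term: 中, 华, 人, 民, 共, 和, 国.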

Ik provides two analyzers:

- ik_max_word: splits the text at the finest granularity, exhausting every word combination it can find.
- ik_smart: performs the coarsest-grained segmentation, producing fewer, longer terms.

Difference: ik_max_word favors recall and is typically used when indexing, while ik_smart favors precision and is often used to analyze query strings, as the comparison below illustrates.
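A quick comparison, assuming the elasticsearch-analysis-ik plugin is installed on a local node:

```
curl -XGET 'http://localhost:9200/_analyze?analyzer=ik_max_word&pretty' -d '中华人民共和国'
curl -XGET 'http://localhost:9200/_analyze?analyzer=ik_smart&pretty' -d '中华人民共和国'
```

ik_max_word returns many overlapping terms (中华人民共和国, 中华人民, 中华, 华人, 人民共和国, and so on), while ik_smart returns just the single term 中华人民共和国.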

Let's create an index that uses ik. We create an index named iktest, define an analyzer called ik that uses the ik_max_word tokenizer, and add an article type with a subject field that is also analyzed with ik_max_word.
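A minimal sketch of that request, assuming a pre-5.x node (where the string field type and mapping types were still current):

```
curl -XPUT 'http://localhost:9200/iktest?pretty' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik": {
          "type": "custom",
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "subject": {
          "type": "string",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}'
```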

Add a few documents in bulk. Here I specified the metadata _id explicitly for easy viewing; the subject values are the titles of some news articles I found at random.
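A sketch of the bulk request; the headlines below are placeholders invented for illustration, chosen only so that the query in the next step has something to match:

```
curl -XPOST 'http://localhost:9200/iktest/article/_bulk?pretty' -d '
{ "index": { "_id": "1" } }
{ "subject": "希拉里在竞选集会上发表讲话" }
{ "index": { "_id": "2" } }
{ "subject": "韩国举行大规模军事演习" }
{ "index": { "_id": "3" } }
{ "subject": "希拉里与韩国总统举行会谈" }
'
```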

Query "Hillary and South Korea"

The highlight property is used here: the result can be rendered directly as HTML, with the matching words highlighted in red. If you want an exact lookup that bypasses analysis, just change the match query to a term query.
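For example, a sketch of the term variant, which matches the literal term in the inverted index without analyzing the query string:

```
curl -XPOST 'http://localhost:9200/iktest/article/_search?pretty' -d '{
  "query": { "term": { "subject": "韩国" } }
}'
```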