Elasticsearch: how to use an ngram analyzer with multi_match

I have an ngram_analyzer:

  "analysis": {
    "tokenizer": {
      "ngram_tokenizer": {
        "type": "ngram",
        "min_gram": 2,
        "max_gram": 10,
        "token_chars": []
      }
    },
    "analyzer": {
      "ngram_analyzer": {
        "type": "custom",
        "tokenizer": "ngram_tokenizer",
        "filter": [
          "lowercase"
        ]
      }
    }
  }
and I am trying to search across all fields:

  "query": {
   "multi_match" : {
      "query":      "jan teach",
      "analyzer": "ngram_analyzer", 
      "operator":   "and",
      "type":       "cross_fields",
      "fields":     [ "name", "occupation", "surname", ... ]
    }
  }
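Unfortunately, this does not return any results.

I expect it to match a document with name = "Jane" and occupation = "teacher".

Or is there a better way to achieve this?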

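First, you do not need the ngram tokenizer but the edge n-gram tokenizer: the ngram tokenizer creates many more tokens and therefore costs more index space, and you are only doing prefix searches on your tokens (jan in Jane and teach in teacher).

Second, at search time you should use the standard analyzer, because the tokens (jan and teach) already exist in the index.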
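
To make the two points above concrete, here is a rough Python sketch of the difference between the two tokenizers. It only imitates the gram generation on a single lowercase word; the helper names `ngrams` and `edge_ngrams` are made up for illustration and are not Elasticsearch APIs.

```python
def ngrams(text, min_gram, max_gram):
    """All substrings whose length is in [min_gram, max_gram] (ngram-style)."""
    return [text[i:i + n]
            for i in range(len(text))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(text)]


def edge_ngrams(text, min_gram, max_gram):
    """Only prefixes whose length is in [min_gram, max_gram] (edge n-gram-style)."""
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]


word = "teacher"
all_grams = ngrams(word, 2, 10)
prefix_grams = edge_ngrams(word, 2, 10)

# Far fewer tokens are produced when only prefixes are kept.
print(len(all_grams), len(prefix_grams))

# The plain query term "teach" is already one of the indexed prefixes,
# so the query side does not need to be cut into grams again.
print("teach" in prefix_grams)
```

Under these assumptions, a standard-analyzed query term such as jan or teach can be looked up directly among the prefixes stored at index time.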

Working example:

Index definition

{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "edgengram_analyzer": {
                        "type": "custom",
                        "filter": [
                            "lowercase"
                        ],
                        "tokenizer": "edgeNGramTokenizer"
                    }
                },
                "tokenizer": {
                    "edgeNGramTokenizer": {
                        "token_chars": [
                            "letter",
                            "digit"
                        ],
                        "min_gram": "2",
                        "type": "edge_ngram",
                        "max_gram": "10"
                    }
                }
            },
            "max_ngram_diff": "10"
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer" : "edgengram_analyzer",
                "search_analyzer" : "standard"
            },
            "occupation" :{
                "type" : "text",
                "analyzer" : "edgengram_analyzer",
                "search_analyzer" : "standard"
            }
        }
    }
}
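
The `max_ngram_diff` value in the settings above relates to the gap between `max_gram` and `min_gram`. The following is only an arithmetic sketch: it assumes the default limit of 1 that is quoted in the error message reported for the question's ngram tokenizer on ES v7.6.2, and the function name is hypothetical. Whether the limit is also enforced for the edge n-gram tokenizer is not confirmed here.

```python
DEFAULT_MAX_NGRAM_DIFF = 1  # assumption: default limit, as quoted in the reported error


def gram_diff(min_gram, max_gram):
    """Gap between the largest and smallest gram size."""
    return max_gram - min_gram


diff = gram_diff(2, 10)                    # the question's tokenizer settings
print(diff)                                # gap that has to be allowed
print(diff <= DEFAULT_MAX_NGRAM_DIFF)      # whether it fits the assumed default
print(diff <= 10)                          # whether it fits the value set above
```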
Index a sample document

{
    "name" : "Jane",
    "occupation" : "teacher"
}
Tokens generated for Jane

POST yourindexname/_analyze

{
    "text" : "Jane",
    "analyzer": "edgengram_analyzer"
}

    {
        "tokens": [
            {
                "token": "ja",
                "start_offset": 0,
                "end_offset": 2,
                "type": "word",
                "position": 0
            },
            {
                "token": "jan",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 1
            },
            {
                "token": "jane",
                "start_offset": 0,
                "end_offset": 4,
                "type": "word",
                "position": 2
            }
        ]
    }
The search query is the same as yours (but without the analyzer)
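
For reference, the request body would then look roughly like the sketch below, written as a Python dict and printed as JSON. The field list is an assumption: it is shortened to the two fields that are mapped in the index definition above, whereas the question's own list is longer and partly elided.

```python
import json

# Same multi_match request as in the question, minus the "analyzer" override,
# so each field falls back to its mapped search_analyzer ("standard").
search_body = {
    "query": {
        "multi_match": {
            "query": "jan teach",
            "operator": "and",
            "type": "cross_fields",
            # Assumption: only the two fields mapped in the example index.
            "fields": ["name", "occupation"],
        }
    }
}

print(json.dumps(search_body, indent=2))
```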

And the search result

"hits": [
            {
                "_index": "ngram",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.5753642,
                "_source": {
                    "name": "Jane",
                    "occupation": "teacher"
                }
            }
        ]

I get an error: "The difference between max_gram and min_gram in NGram Tokenizer must be less than or equal to: [1] but was [8]. This limit can be set by changing the [index.max_ngram_diff] index level setting." when running this on ES v7.6.2. Maybe that is the root cause?

The analyzer works even with this error. You may need to set index.max_ngram_diff to 10.

How do you analyze the fields name and occupation at index time? Do they get the ngram_analyzer?

They are not analyzed at index time. I do that at search time.

You need to analyze these fields with ngram_analyzer at index time. If you do not declare them explicitly in "properties", they are mapped as keyword + text (standard analyzer), so there is no token "jan" to match. FWIW, cross_fields only groups fields together when they share the same analyzer.
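
The last comment can be illustrated with a simplified Python model of the question's setup. It assumes the fields were indexed with the standard analyzer (approximated here by lowercasing and splitting on whitespace), that an empty `token_chars` list keeps the whole query string as one token before grams are cut, and that `"operator": "and"` requires every query term to be found. The helper names are invented for this sketch.

```python
def ngram_tokens(text, min_gram=2, max_gram=10):
    # Query side: lowercase, then cut every substring of length 2..10,
    # spaces included because no token_chars are configured.
    text = text.lower()
    return {text[i:i + n]
            for i in range(len(text))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(text)}


def standard_tokens(text):
    # Index side: rough stand-in for the standard analyzer on simple words.
    return set(text.lower().split())


# Terms available across the name and occupation fields of the sample document.
indexed_terms = standard_tokens("Jane") | standard_tokens("teacher")

query_terms = ngram_tokens("jan teach")
matching = query_terms & indexed_terms

print(sorted(indexed_terms))
print("jan" in query_terms, "jan" in indexed_terms)
print(matching)  # terms shared by the query and the index
```

In this model no query gram equals a whole indexed word, so a query that requires all of its terms cannot match, which is consistent with the empty result described in the question.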