Node.js: match_phrase does not work well with the synonym token filter (genre expansion) in Elasticsearch

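Update:

After reading Richa's explanation and suggestions, the problem seems to be solved, but I need more testing to confirm it.

First, the synonym format should be changed as Richa suggested:

[green => khaki,green, pet => cat,pet]

Then I had to specify both an index analyzer and a search analyzer in the index mapping: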

  "mappings": {
    "properties": {
      "phone_case": {
        "type": "text",
        "norms": false,
        "analyzer": "standard",
        "search_analyzer": "lowercaseWhiteSpaceAnalyzer"
      }
    }
  }
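After adding these two properties to the mapping, I no longer need to pass analyzer in the query.

With these changes, genre expansion seems to work as expected in both term and match_phrase queries.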

Elasticsearch 7.2

Synonym data:

[khaki => khaki,green, cat => cat,pet]

Index mapping:

{
    settings: {
        "analysis": {
            "char_filter": {
                "same_word": {
                    "type": "mapping",
                    "mappings": ["-=>", "&=>and"]
                },
            },
            "filter": {
                "my_stopwords": {
                    "type": "stop",
                    "stopwords": STOPWORD_FILE
                },
                "my_synonym": {
                    "type": "synonym",
                    "synonyms": [ "khaki => khaki,green", "cat => cat,pet"],
                    "tokenizer": "whitespace"
                },
            },
            "analyzer": {
                "lowercaseWhiteSpaceAnalyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip", "same_word"],
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_stopwords",
                        "my_synonym",
                    ]
                },
            }
        }
    }
}
Field mapping:

"phone_case":{"type":"text","norms":false,"analyzer":"lowercaseWhiteSpaceAnalyzer"}
Sample documents:

 [
  {
      id: "1",
      phone_case: "khaki,brushed and polished",
  },
  {
      id: "2",
      phone_case: "green,brushed",
  },
  {
      id: "3",
      phone_case: "black,matte"
  }
]
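The phone_case field is a text field.

When I search for khaki, I want to find only documents that contain khaki, excluding any that contain green. When I search for green, on the other hand, I want to get documents that contain green or khaki. That is what genre expansion is supposed to do.

A term-level query works well for this purpose: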

{
  "sort": [
    {
      "updated": {
        "order": "desc"
      }
    }
  ],
  "size": 10,
  "from": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "phone_case": "khaki"
        }
      }
    }
  }
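}
It returns only the documents that contain khaki.

But with match_phrase, it returns documents with khaki or green. That is not what I expect; I want only the documents that contain khaki, not green.

Can anyone tell me what is wrong, and why the match query cannot exclude results containing green? I want to let users search a text field for terms in an exact order, but match and match_phrase do not work with genre-expansion synonyms.

According to the documentation, when a synonym is defined as a => b,c, it is interpreted as follows: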

# Explicit mappings match any token sequence on the LHS of "=>"
# and replace with all alternatives on the RHS.  These types of mappings
# ignore the expand parameter in the schema.
So in your case, with khaki => khaki,green, the term khaki is replaced by both khaki and green. You can verify this with the Analyze API:

GET stack-57703209/_analyze
{
  "text": "khaki",
  "analyzer": "lowercaseWhiteSpaceAnalyzer"
}
This returns two tokens, khaki and green.

If you run the same check for green, you get only one token, green.
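
To make the one-directional nature of an explicit mapping concrete, here is a minimal stand-alone JavaScript sketch. It does not call Elasticsearch and is not how the synonym filter is implemented; `parseRules` and `expandToken` are hypothetical helper names, and only single-word rules are handled.

```javascript
// Toy model of explicit synonym rules of the form "lhs => rhs1,rhs2".
// Assumption: single-word rules only; this is not the Elasticsearch code.
function parseRules(lines) {
  const rules = new Map();
  for (const line of lines) {
    const [lhs, rhs] = line.split("=>").map((s) => s.trim());
    rules.set(lhs, rhs.split(",").map((s) => s.trim()));
  }
  return rules;
}

// A token found on the left-hand side is replaced by all right-hand
// alternatives; every other token passes through unchanged.
function expandToken(token, rules) {
  return rules.has(token) ? rules.get(token) : [token];
}

const rules = parseRules(["khaki => khaki,green", "cat => cat,pet"]);
console.log(expandToken("khaki", rules)); // [ 'khaki', 'green' ]
console.log(expandToken("green", rules)); // [ 'green' ]
```

The rule only fires for the token on its left-hand side, which is why khaki yields two tokens while green yields one.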

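In addition, you are applying this analyzer at index time as well. So when a document is indexed, the term khaki is replaced by the tokens khaki and green, as shown above with the Analyze API.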

When you run a term query, it searches for the exact term:

{
  "sort": [
    {
      "updated": {
        "order": "desc"
      }
    }
  ],
  "size": 10,
  "from": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "phone_case": "khaki"
        }
      }
    }
  }
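}
If you search for khaki, you get only the first document, because a term query does not apply any search analyzer and matches the exact term. It looks for khaki; the second document, phone_case: "green,brushed", has no khaki token (you can check this with the Analyze API), so it is not returned.

match_phrase, however, analyzes the query text, by default with the same analyzer that was used at index time, which in your case is lowercaseWhiteSpaceAnalyzer. The query khaki is therefore expanded to khaki and green, so both documents are returned.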
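
The difference between the two query types can be reproduced with a toy inverted index. This is only a rough sketch under simplifying assumptions: `analyze`, `termQuery` and `matchQuery` are hypothetical stand-ins, stop words and character filters are ignored, and only single-word queries are modelled.

```javascript
// Index-time synonyms, as in the original settings (khaki => khaki,green).
const synonyms = new Map([
  ["khaki", ["khaki", "green"]],
  ["cat", ["cat", "pet"]],
]);

// Rough stand-in for lowercaseWhiteSpaceAnalyzer: lowercase, split on
// non-alphanumeric characters, then expand synonyms.
function analyze(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean)
    .flatMap((t) => synonyms.get(t) || [t]);
}

const docs = [
  { id: "1", phone_case: "khaki,brushed and polished" },
  { id: "2", phone_case: "green,brushed" },
  { id: "3", phone_case: "black,matte" },
];

// Index time: every document goes through the analyzer.
const index = new Map();
for (const doc of docs) {
  for (const token of analyze(doc.phone_case)) {
    if (!index.has(token)) index.set(token, new Set());
    index.get(token).add(doc.id);
  }
}

// term query: the input is looked up as-is, without analysis.
function termQuery(term) {
  return [...(index.get(term) || [])].sort();
}

// Single-word match / match_phrase: the input is analyzed first, and any
// of the resulting tokens (they share one position) may match.
function matchQuery(text) {
  const ids = new Set();
  for (const token of analyze(text)) {
    for (const id of index.get(token) || []) ids.add(id);
  }
  return [...ids].sort();
}

console.log(termQuery("khaki"));  // [ '1' ]
console.log(matchQuery("khaki")); // [ '1', '2' ]
```

Because the same synonym rule runs on the query text, the analyzed query for khaki also contains green, which is what pulls in document 2.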

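So for your requirement, the synonyms should be applied by a search analyzer rather than by the index analyzer. You can change the index settings to: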

{
  "settings": {
    "analysis": {
      "char_filter": {
        "same_word": {
          "type": "mapping",
          "mappings": [
            "-=>",
            "&=>and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": "a, an"
        },
        "my_synonym": {
          "type": "synonym",
          "synonyms": [
            "green => khaki,green",      //NOTE THIS
            "cat => cat,pet"
          ],
          "tokenizer": "whitespace"
        }
      },
      "analyzer": {
        "lowercaseWhiteSpaceAnalyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "same_word"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
            ]
        },
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "my_synonym"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "phone_case": {
        "type": "text",
        "norms": false,
        "analyzer": "lowercaseWhiteSpaceAnalyzer"
      }
    }
  }
} 
Then specify the search analyzer in the query, like this:

{
    "query": {
        "match_phrase": {
            "phone_case" : {
                "query" : "green",
                "analyzer" : "synonym_analyzer"  // NOTE THIS

            }
        }
    }
}
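
As a rough illustration of why moving the synonyms to search time gives the desired genre expansion, the following stand-alone JavaScript sketch indexes the sample documents without synonyms and expands only the query. The names `tokenize`, `searchAnalyze` and `search` are hypothetical, and the tokenizer (lowercase, split on non-alphanumerics) is a simplification of the analyzers defined above.

```javascript
// Search-time synonyms, as in the revised settings (green => khaki,green).
const searchSynonyms = new Map([
  ["green", ["khaki", "green"]],
  ["cat", ["cat", "pet"]],
]);

// Simplified tokenizer shared by both analyzers (an assumption).
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Only the search analyzer expands synonyms; indexing uses plain tokens.
function searchAnalyze(text) {
  return tokenize(text).flatMap((t) => searchSynonyms.get(t) || [t]);
}

const docs = [
  { id: "1", phone_case: "khaki,brushed and polished" },
  { id: "2", phone_case: "green,brushed" },
  { id: "3", phone_case: "black,matte" },
];

// Index time: no synonym expansion.
const index = new Map();
for (const doc of docs) {
  for (const token of tokenize(doc.phone_case)) {
    if (!index.has(token)) index.set(token, new Set());
    index.get(token).add(doc.id);
  }
}

// Single-word search: expand the query, then collect matching documents.
function search(text) {
  const ids = new Set();
  for (const token of searchAnalyze(text)) {
    for (const id of index.get(token) || []) ids.add(id);
  }
  return [...ids].sort();
}

console.log(search("green")); // [ '1', '2' ]
console.log(search("khaki")); // [ '1' ]
```

Searching for the broader term green reaches khaki documents, while searching for khaki stays narrow, which is the behaviour asked for in the question.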
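A blog post explains this in more detail.
Hope this helps.

Please share the index mapping and a sample document.

@Richa Hi, I have just included the mapping and sample documents.

In the description: "When I search for ceramic, I want to find documents with ceramic results only. When searching for ceramic, I want to get documents with ceramic or Cerachrom." I think the second one is cerachrom.

@Richa, thanks for pointing out the mistake. I have just changed the example to make it easier to understand and read.

Thanks for your detailed explanation. It helps a lot. But to make genre expansion work as expected, I had to specify index_analyzer and search_analyzer in the index mapping. With those added, I do not need to use analyzer in the match_phrase query. Please take a look at my updated post to see if I have missed anything.

@RedGiant Yes, the search analyzer can also be specified in the settings.

For multi-word synonyms such as vintage inspired => vintage inspired, reissue, retro, I had to use synonym_graph instead of synonym to make it work in match queries. I wonder whether that is necessary, because I found an example about sea-biscit (sea-biscit => sea-biscit) in the documentation of the synonym token filter.

For multi-word synonyms, synonym_graph should be used.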