
Lucene: exact document matching with Elasticsearch


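I need to query a set of "short documents" precisely. For example:

Documents:

  1. {"name": "John Doe", "alt": "John W Doe"}
  2. {"name": "My friend John Doe", "alt": "John Doe"}
  3. {"name": "John", "alt": "Susy"}
  4. {"name": "Jack", "alt": "John Doe"}

Expected results:

  • If I search for "John Doe", I want document 1 to score much higher than documents 2 and 4.
  • If I search for "John Doé", same as above.
  • If I search for "John", I want to get document 3 (an exact match is better than the name appearing in both name and alt).

Is this possible? How can I achieve it? I tried boosting "name", but I could not find how to match a document field exactly rather than searching within it.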

    I think that if you map the field as a multi-field and boost the non-analyzed sub-field, you will get what you need:

     "name": {
                "type": "multi_field",
                "fields": {
                    "untouched": {
                        "type": "string",
                        "index": "not_analyzed",
                        "boost": "1.1"
                    },
                    "name": {
                        "include_in_all": true,
                        "type": "string",
                        "index": "analyzed",
                        "search_analyzer": "someanalyzer",
                        "index_analyzer": "someanalyzer"
                    }
                }
            }
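
To see why the non-analyzed sub-field helps, here is a minimal Python sketch of the difference between the two sub-fields. It is only an illustration: the analyze and not_analyzed functions are hypothetical stand-ins, not Elasticsearch code, and the analyzer is assumed to do nothing more than split on word boundaries and lowercase.

```python
import re

def analyze(text):
    """Rough stand-in for an analyzed field: split into words and lowercase them."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def not_analyzed(text):
    """A not_analyzed field keeps the whole value as a single term."""
    return [text]

# The "name" values of the four documents from the question.
names = {1: "John Doe", 2: "My friend John Doe", 3: "John", 4: "Jack"}

def exact_matches(query):
    """Ids whose untouched name is exactly the query string."""
    return [doc_id for doc_id, name in names.items() if not_analyzed(name) == [query]]

def analyzed_matches(query):
    """Ids whose analyzed name contains every analyzed query term."""
    terms = analyze(query)
    return [doc_id for doc_id, name in names.items()
            if all(term in analyze(name) for term in terms)]

print(analyzed_matches("John"))  # every name that contains the word
print(exact_matches("John"))     # only the name that is exactly "John"
```

Note that the untouched term is case-sensitive in this sketch, so a search for "john" would not hit it; that is one reason to combine it with the analyzed sub-field rather than use it alone.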
    
    If you need more flexibility, you can also boost at query time instead of index time, using the "^" notation in a query_string query:

    {
        "query_string" : {
            "fields" : ["name", "name.untouched^5"],
            "query" : "this AND that OR thus"
        }
    }
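
If the boosts are decided by the application, the request body can be assembled programmatically. The following is a small sketch in Python; boosted_query_string is a hypothetical helper name, and the field names are the ones from the mapping above.

```python
import json

def boosted_query_string(query, field_boosts):
    """Build a query_string search body; a boost other than 1 is written as field^boost."""
    fields = [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in field_boosts.items()
    ]
    return {"query": {"query_string": {"fields": fields, "query": query}}}

# Boost the untouched sub-field five times over the analyzed one.
body = boosted_query_string("John Doe", {"name": 1, "name.untouched": 5})
print(json.dumps(body, indent=2))
```

Each field is a separate list element, so the boost applies only to the field it is attached to.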
    

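    What you describe is exactly how a search engine works by default. A search for "John Doe" becomes a search for the terms "john" and "doe". For each term, it finds the documents that contain the term, then assigns each document a _score based on:

    • how common the term is across all documents (more common == less relevant)
    • how common the term is within the document's field (more common == more relevant)
    • how long the document's field is (longer == less relevant)

    The reason you are not seeing clear results is that Elasticsearch is distributed and you are testing with a small amount of data. By default an index has 5 primary shards, and your documents are indexed on different shards. Each shard keeps its own document frequency counts, so the scores are distorted.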
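
The three factors and the shard effect can be sketched in a few lines of Python. This is a simplified model in the style of Lucene's classic TF/IDF similarity (square-root term frequency, idf = 1 + ln(N / (df + 1)), length norm = 1 / sqrt(field length)); it leaves out query normalization, coordination and the lossy norm encoding, so the numbers will not match the real _score values. The shard layout is an assumption chosen to show the distortion.

```python
import math

def idf(doc_freq, num_docs):
    """Rarer terms get a higher weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def doc_freqs(docs):
    """Count, per term, how many documents in this collection contain it."""
    freqs = {}
    for terms in docs.values():
        for term in set(terms):
            freqs[term] = freqs.get(term, 0) + 1
    return freqs

def score(query_terms, field_terms, docs):
    """Sum tf * idf^2 * length norm, using statistics from docs only."""
    freqs = doc_freqs(docs)
    norm = 1.0 / math.sqrt(len(field_terms))
    total = 0.0
    for term in query_terms:
        tf = math.sqrt(field_terms.count(term))
        if tf:
            total += tf * idf(freqs[term], len(docs)) ** 2 * norm
    return total

# Analyzed "name" fields of the four example documents.
names = {
    1: ["john", "doe"],
    2: ["my", "friend", "john", "doe"],
    3: ["john"],
    4: ["jack"],
}
query = ["john", "doe"]

# Global statistics: what dfs_query_then_fetch or a single shard would use.
global_scores = {i: score(query, terms, names) for i, terms in names.items()}

# Assumed layout: documents 1 and 3 on one shard, 2 and 4 on another.
shards = [{1: names[1], 3: names[3]}, {2: names[2], 4: names[4]}]
local_scores = {}
for shard in shards:
    for i, terms in shard.items():
        local_scores[i] = score(query, terms, shard)

print(sorted(global_scores, key=global_scores.get, reverse=True))
print(sorted(local_scores, key=local_scores.get, reverse=True))
```

With global statistics document 1 ranks first. With the assumed shard layout, "john" looks common on the first shard and rare on the second, so document 2 overtakes document 1 even though its field is longer.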

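    Once you add real-world amounts of data, the frequencies even out across the shards, but for testing with small amounts of data you need to do one of two things:

    • create the index with only one primary shard, or
    • specify search_type=dfs_query_then_fetch, which first fetches the frequencies from each shard and then runs the query using the global frequencies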
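
As a rough sketch of both options, the Python below builds the settings body for a single-shard index and the search URL with the search_type parameter. It only constructs the values and sends nothing; the helper names are hypothetical, and the host, index and type are the ones used in the curl examples in this answer.

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:9200"  # same host as the curl examples

def single_shard_settings():
    """Index-creation body asking for one primary shard (option 1)."""
    return {"settings": {"number_of_shards": 1}}

def search_url(index, doc_type, search_type="dfs_query_then_fetch"):
    """Search URL that requests global term frequencies (option 2)."""
    params = urlencode({"pretty": 1, "search_type": search_type})
    return f"{BASE_URL}/{index}/{doc_type}/_search?{params}"

print(single_shard_settings())
print(search_url("test", "test"))
```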
    To demonstrate, first index your data:

    curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
    {
       "alt" : "John W Doe",
       "name" : "John Doe"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '
    {
       "alt" : "John A Doe",
       "name" : "My friend John Doe"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
    {
       "alt" : "Susy",
       "name" : "John"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
    {
       "alt" : "John Doe",
       "name" : "Jack"
    }
    '
    
    Now search for "john doe", remembering to specify dfs_query_then_fetch:

    curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
    {
       "query" : {
          "match" : {
             "name" : "john doe"
          }
       }
    }
    '
    
    Document 1 comes first in the results:

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "alt" : "John W Doe",
    #                "name" : "John Doe"
    #             },
    #             "_score" : 1.0189849,
    #             "_index" : "test",
    #             "_id" : "1",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "John A Doe",
    #                "name" : "My friend John Doe"
    #             },
    #             "_score" : 0.81518793,
    #             "_index" : "test",
    #             "_id" : "2",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "Susy",
    #                "name" : "John"
    #             },
    #             "_score" : 0.3066778,
    #             "_index" : "test",
    #             "_id" : "3",
    #             "_type" : "test"
    #          }
    #       ],
    #       "max_score" : 1.0189849,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 8
    # }
    
    When you search for just "john" (the same request with only the query text changed), document 3 comes first:

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "alt" : "Susy",
    #                "name" : "John"
    #             },
    #             "_score" : 1,
    #             "_index" : "test",
    #             "_id" : "3",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "John W Doe",
    #                "name" : "John Doe"
    #             },
    #             "_score" : 0.625,
    #             "_index" : "test",
    #             "_id" : "1",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "John A Doe",
    #                "name" : "My friend John Doe"
    #             },
    #             "_score" : 0.5,
    #             "_index" : "test",
    #             "_id" : "2",
    #             "_type" : "test"
    #          }
    #       ],
    #       "max_score" : 1,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 5
    # }
    
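    Ignoring accents

    The second issue is matching "John Doé". This is an analysis issue. To make full text more searchable, we analyze it into separate terms, or tokens, which are what is stored in the index. So that a user who searches for john also matches, for example, john, John and JOHN, each term/token is passed through a number of token filters that put it into a standard form.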

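    When we do a full-text search, the search terms go through exactly the same process. So if we have a document containing John, it is indexed as john, and if the user searches for JOHN, we actually search for john.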
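
The point that indexing and searching share one analysis step can be illustrated with a toy inverted index in Python. This is a sketch only: analyze, build_index and search are hypothetical names, and the analyzer is assumed to do nothing more than split on word boundaries and lowercase.

```python
import re
from collections import defaultdict

def analyze(text):
    """Split into word tokens and lowercase them (the standard form)."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Analyze the query the same way, then collect documents matching any term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)

index = build_index({1: "John Doe", 2: "My friend John Doe", 3: "John", 4: "Jack"})
print(search(index, "JOHN"))  # matches despite the different case
print(search(index, "Doé"))   # no match: lowercasing alone keeps the accent
```

Because the query passes through the same analyze function as the documents, case differences disappear, but an accented term still misses since nothing in this analyzer removes the accent.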

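    To make Doé match doe, we need a token filter that removes accents, and we need to apply it both to the text being indexed and to the search terms. The easiest way is to use the asciifolding token filter.

    We can define a custom analyzer when we create the index, and we can specify in the mapping that a particular field should use that analyzer at both index time and search time.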
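
For intuition about what such an analyzer produces, here is an approximation in Python using only the standard library. It is an assumption-laden sketch rather than the filter itself: accents are stripped by Unicode decomposition, whereas asciifolding works from its own mapping table and also handles characters that do not decompose (for example ø or ß).

```python
import re
import unicodedata

def fold_accents(token):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def no_accents(text):
    """Approximate the analyzer chain: tokenize, lowercase, fold accents."""
    return [fold_accents(token.lower()) for token in re.findall(r"\w+", text)]

# Indexed text and search text end up as the same terms.
print(no_accents("John Doe"))
print(no_accents("John Doé"))
```

Since both calls yield the same terms, a query for "John Doé" finds documents indexed as "John Doe", provided the same chain runs at index time and at search time.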

    First, delete the old index:

    curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1' 
    
    Then create the index, specifying the custom analyzer and the mapping:

    curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
    {
       "settings" : {
          "analysis" : {
             "analyzer" : {
                "no_accents" : {
                   "filter" : [
                      "standard",
                      "lowercase",
                      "asciifolding"
                   ],
                   "type" : "custom",
                   "tokenizer" : "standard"
                }
             }
          }
       },
       "mappings" : {
          "test" : {
             "properties" : {
                "name" : {
                   "type" : "string",
                   "analyzer" : "no_accents"
                }
             }
          }
       }
    }
    '
    
    Re-index the data:

    curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
    {
       "alt" : "John W Doe",
       "name" : "John Doe"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '
    {
       "alt" : "John A Doe",
       "name" : "My friend John Doe"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
    {
       "alt" : "Susy",
       "name" : "John"
    }
    '
    curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
    {
       "alt" : "John Doe",
       "name" : "Jack"
    }
    '
    
    Now, test the search:

    curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
    {
       "query" : {
          "match" : {
             "name" : "john doé"
          }
       }
    }
    '
    
    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "alt" : "John W Doe",
    #                "name" : "John Doe"
    #             },
    #             "_score" : 1.0189849,
    #             "_index" : "test",
    #             "_id" : "1",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "John A Doe",
    #                "name" : "My friend John Doe"
    #             },
    #             "_score" : 0.81518793,
    #             "_index" : "test",
    #             "_id" : "2",
    #             "_type" : "test"
    #          },
    #          {
    #             "_source" : {
    #                "alt" : "Susy",
    #                "name" : "John"
    #             },
    #             "_score" : 0.3066778,
    #             "_index" : "test",
    #             "_id" : "3",
    #             "_type" : "test"
    #          }
    #       ],
    #       "max_score" : 1.0189849,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 6
    # }
    