
Python: Elasticsearch runtime error on a query using sparse vectors


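Elasticsearch recently implemented vector-based queries. This means that each document contains a vector as a field, and we can use a new vector to find matches in our corpus.

The Elasticsearch team has published an explanation of how this should work and even provides an example query: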

{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'])",
        "params": {
          "queryVector": {"2": 0.5, "10" : 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
        }
      }
    }
  }
}
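To make concrete what the script above is expected to score, here is a minimal pure-Python sketch of cosine similarity between two sparse vectors stored as dicts that map a dimension key to a value. It only illustrates the formula and is not Elasticsearch's implementation; the function name cosine_similarity_sparse is hypothetical.

```python
import math


def cosine_similarity_sparse(a, b):
    """Cosine similarity of two sparse vectors given as {dimension: value} dicts."""
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: empty or all-zero vectors score 0.
        return 0.0
    return dot / (norm_a * norm_b)


query_vector = {"2": 0.5, "10": 111.3, "50": -1.3, "113": 14.8, "4545": 156.0}
print(cosine_similarity_sparse(query_vector, query_vector))
print(cosine_similarity_sparse({"1": 1.0}, {"2": 1.0}))
```

The result lies between -1 and 1, so a score built on it can be negative when the vectors contain negative values.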
I have installed the latest Elasticsearch version; specifically,
curl -XGET 'http://localhost:9200'
gives me the following information:

"version" : {
"number" : "7.3.0",
"build_flavor" : "default",
"build_type" : "deb",
"build_hash" : "de777fa",
"build_date" : "2019-07-24T18:30:11.767338Z",
"build_snapshot" : false,
"lucene_version" : "8.1.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
}
I am using the Python library elasticsearch (and elasticsearch_dsl, though not yet for these queries). I can set up my Elasticsearch index, load documents, and run queries. For example, this works:

query_body = {
  "query": {
    "query_string": {
      "query": "Some text",
      "default_field": "some_field"
    }
  }
}

es.search(index=my_index, body=query_body)
However, when I try to use almost the same query code as the official example, it does not work.

My query:

query_body = {
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilaritySparse(params.queryVector, doc['my_embedding_field_name'])",
        "params": {
          "queryVector": {"1703": 0.0261, "1698": 0.0261, "2283": 0.0459, "2263": 0.0523, "3741": 0.0349}
        }
      }
    }
  }
}
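Note that the sparse vector in the query is an example I made up, making sure that its keys appear in the embedding vector of at least one of my documents (I am not sure whether that would matter, but just in case).

The error: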

elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')
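This error message does not help me make much progress, and because this is a really new feature I could not find other help online.

Update: below is the more complete error message produced when running the query with curl.

The core of the error is: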

"type" : "illegal_argument_exception",
"reason" : "Variable [embedding] is not defined."
The full message is:

"error" : {
"root_cause" : [
  {
    "type" : "script_exception",
    "reason" : "compile error",
    "script_stack" : [
      "... (params.queryVector, doc[embedding])",
      "                             ^---- HERE"
    ],
    "script" : "cosineSimilaritySparse(params.queryVector, doc[embedding])",
    "lang" : "painless"
  },
  {
    "type" : "script_exception",
    "reason" : "compile error",
    "script_stack" : [
      "... (params.queryVector, doc[embedding])",
      "                             ^---- HERE"
    ],
    "script" : "cosineSimilaritySparse(params.queryVector, doc[embedding])",
    "lang" : "painless"
  }
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
  {
    "shard" : 0,
    "index" : "test-index",
    "node" : "216BQPYoQ-SIzcrV1jzMOQ",
    "reason" : {
      "type" : "query_shard_exception",
      "reason" : "script_score: the script could not be loaded",
      "index_uuid" : "e1kpygbHRai9UL8_0Lbsdw",
      "index" : "test-index",
      "caused_by" : {
        "type" : "script_exception",
        "reason" : "compile error",
        "script_stack" : [
          "... (params.queryVector, doc[embedding])",
          "                             ^---- HERE"
        ],
        "script" : "cosineSimilaritySparse(params.queryVector, doc[embedding])",
        "lang" : "painless",
        "caused_by" : {
          "type" : "illegal_argument_exception",
          "reason" : "Variable [embedding] is not defined."
        }
      }
    }
  },
  {
    "shard" : 0,
    "index" : "tutorial",
    "node" : "216BQPYoQ-SIzcrV1jzMOQ",
    "reason" : {
      "type" : "query_shard_exception",
      "reason" : "script_score: the script could not be loaded",
      "index_uuid" : "n2FNFgAFRiyB_efJKfsGPA",
      "index" : "tutorial",
      "caused_by" : {
        "type" : "script_exception",
        "reason" : "compile error",
        "script_stack" : [
          "... (params.queryVector, doc[embedding])",
          "                             ^---- HERE"
        ],
        "script" : "cosineSimilaritySparse(params.queryVector, doc[embedding])",
        "lang" : "painless",
        "caused_by" : {
          "type" : "illegal_argument_exception",
          "reason" : "Variable [embedding] is not defined."
        }
      }
    }
  }
],
"caused_by" : {
  "type" : "script_exception",
  "reason" : "compile error",
  "script_stack" : [
    "... (params.queryVector, doc[embedding])",
    "                             ^---- HERE"
  ],
  "script" : "cosineSimilaritySparse(params.queryVector, doc[embedding])",
  "lang" : "painless",
  "caused_by" : {
    "type" : "illegal_argument_exception",
    "reason" : "Variable [embedding] is not defined."
  }
} }, "status" : 400}
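The useful part of the response above sits at the bottom of a chain of nested "caused_by" objects. As a small sketch, the chain can be walked in Python to surface the innermost cause. The helper name innermost_cause is hypothetical, and the dict below is a trimmed copy of the response. With the elasticsearch Python client 7.x, I believe the same dict is available as the info attribute of the raised RequestError, but treat that as an assumption to verify for your version.

```python
def innermost_cause(error):
    """Follow nested 'caused_by' entries and return the deepest one."""
    cause = error
    while isinstance(cause.get("caused_by"), dict):
        cause = cause["caused_by"]
    return cause


# Trimmed copy of the error response shown above.
response = {
    "error": {
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "caused_by": {
            "type": "script_exception",
            "reason": "compile error",
            "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "Variable [embedding] is not defined.",
            },
        },
    },
    "status": 400,
}

cause = innermost_cause(response["error"])
print(cause["type"], "-", cause["reason"])
```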
Update 2: my documents have the following structure:

{"name": "doc_name", "field_1": "doc_id", "field_2": "a_keyword", "text": "a rather long text", "embedding": {"4655": 0.040158602078116556, "4640": 0.040158602078116556}}
Update 3: I pass a mapping after creating the index, with:

"properties": {
    "name": {
        "type": "keyword"
    },
    "field_1": {
        "type": "keyword"
    },
    "field_2": {
        "type": "keyword"
    },
    "text": {
        "type": "text"
    },
    "embedding": {
        "type": "sparse_vector"
    }
}

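This removed an error that complained about too many fields (each key in the embedding was being treated as a field). But the query error is the same.

To solve this problem, we need to make sure that Elasticsearch understands that the vector field (in my case, "embedding") is actually a sparse vector. To do that, use the following mapping: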

"properties": {
    "name": {
        "type": "keyword"
    },
    "reference": {
        "type": "keyword"
    },
    "jurisdiction": {
        "type": "keyword"
    },
    "text": {
        "type": "text"
    },
    "embedding": {
        "type": "sparse_vector"
    }
}
For more details, see the Elasticsearch documentation.
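As a rough sketch of how such a mapping could be supplied from Python, the snippet below builds the request body as a plain dict. The helper sparse_vector_index_body and the index name "test-index" are illustrative assumptions, and the commented-out client call assumes the elasticsearch Python client 7.x against a running 7.3 cluster like the one in the question.

```python
def sparse_vector_index_body(vector_field, keyword_fields=(), text_fields=()):
    """Build an index-creation body that declares vector_field as sparse_vector."""
    properties = {name: {"type": "keyword"} for name in keyword_fields}
    properties.update({name: {"type": "text"} for name in text_fields})
    # Without this entry, each key of the vector is mapped as its own field.
    properties[vector_field] = {"type": "sparse_vector"}
    return {"mappings": {"properties": properties}}


index_body = sparse_vector_index_body(
    "embedding",
    keyword_fields=("name", "reference", "jurisdiction"),
    text_fields=("text",),
)
print(index_body)

# Assumed usage (requires a running cluster, so it is left commented out):
# from elasticsearch import Elasticsearch
# es = Elasticsearch()
# es.indices.create(index="test-index", body=index_body)
```

Passing the mapping when the index is created means it is in place before any document is indexed.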

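There are two important things to note:

  • The quotes around the field name in the query are required.
  • Add 1.0 to the cosine similarity to avoid negative scores:

    "source": "cosineSimilaritySparse(params.queryVector, doc['my_embedding_field_name']) + 1.0"

Credit for these last points goes to jimczi of the Elastic team (thanks!).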
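Both notes can be combined in a small query builder. This is a sketch under the assumptions of the question (Elasticsearch 7.3, a sparse_vector field named "embedding"); the function name sparse_cosine_query is hypothetical.

```python
def sparse_cosine_query(field_name, query_vector):
    """Build a script_score query body for cosineSimilaritySparse."""
    # The field name must appear as a quoted string inside the Painless script:
    # doc[embedding] without quotes is parsed as an undefined variable.
    # Adding 1.0 shifts the similarity from [-1, 1] to [0, 2] so the score
    # is never negative.
    source = (
        "cosineSimilaritySparse(params.queryVector, doc['{}']) + 1.0"
    ).format(field_name)
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": source,
                    "params": {"queryVector": query_vector},
                },
            }
        }
    }


query_body = sparse_cosine_query(
    "embedding", {"1703": 0.0261, "1698": 0.0261, "2283": 0.0459}
)
print(query_body["query"]["script_score"]["script"]["source"])

# The body is then passed to the client as in the question:
# es.search(index=my_index, body=query_body)
```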

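Comments:

  • If you run the query in Kibana Dev Tools you will probably see the full error message. The Python one is not very useful. Can you share it?
  • Sorry, I am not using Kibana for this project. If that really is the only way, I can look into it, but I am inexperienced with all of this, so I bet it would take me a while to set up.
  • In that case, is there any chance you can check the ES server logs?
  • Nothing is trivial for me, but that sounds doable. Give me a few minutes and I will get back to you.
  • Using curl, I was able to get a more detailed error message, which I have added to the question. Is this what you were looking for? I have also added an example of the JSON structure of my documents.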