Elasticsearch - a document matching more terms scores lower than a document matching fewer terms
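I have a query that should return profiles with similar interests. The problem is that documents matching more terms get a lower score.
In the bool query I have a should clause with interests = ["games", "music", "sport"].
The document with interests ['games'] scores 0.14981213.
The document with interests ['games', 'music'] scores 0.11516824.
Why? I am using AWS Elasticsearch, v. 2.3.2.
The query looks like this: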
{
    "explain": true,
    "from": 0,
    "query": {
        "bool": {
            "filter": [
                {
                    "bool": {
                        "must_not": [
                            {
                                "term": {
                                    "id": 3918
                                }
                            }
                        ]
                    }
                }
            ],
            "should": [
                {
                    "terms": {
                        "interests": [
                            "games",
                            "music",
                            "sport"
                        ]
                    }
                }
            ]
        }
    },
    "size": 10
}
And this is the result I get:
{
"_shards": {
"failed": 0,
"successful": 5,
"total": 5
},
"hits": {
"hits": [
{
"_explanation": {
"description": "sum of:",
"details": [
{
"description": "match on required clause, product of:",
"details": [
{
"description": "# clause",
"details": [],
"value": 0.0
},
{
"description": "-id:`\b\u0000\u0000\u001eN #*:*, product of:",
"details": [
{
"description": "boost",
"details": [],
"value": 1.0
},
{
"description": "queryNorm",
"details": [],
"value": 0.4494364
}
],
"value": 0.4494364
}
],
"value": 0.0
},
{
"description": "product of:",
"details": [
{
"description": "sum of:",
"details": [
{
"description": "weight(interests:games in 1) [PerFieldSimilarity], result of:",
"details": [
{
"description": "score(doc=1,freq=1.0), product of:",
"details": [
{
"description": "queryWeight, product of:",
"details": [
{
"description": "idf(docFreq=2, maxDocs=3)",
"details": [],
"value": 1.0
},
{
"description": "queryNorm",
"details": [],
"value": 0.4494364
}
],
"value": 0.4494364
},
{
"description": "fieldWeight in 1, product of:",
"details": [
{
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"description": "termFreq=1.0",
"details": [],
"value": 1.0
}
],
"value": 1.0
},
{
"description": "idf(docFreq=2, maxDocs=3)",
"details": [],
"value": 1.0
},
{
"description": "fieldNorm(doc=1)",
"details": [],
"value": 1.0
}
],
"value": 1.0
}
],
"value": 0.4494364
}
],
"value": 0.4494364
}
],
"value": 0.4494364
},
{
"description": "coord(1/3)",
"details": [],
"value": 0.33333334
}
],
"value": 0.14981213
}
],
"value": 0.14981213
},
"_id": "3917",
"_index": "test_44024988_profiles",
"_node": "urWXg5KhREyffYielaa6Rw",
"_score": 0.14981213,
"_shard": 2,
"_source": {
"full_name": "Bob Doe",
"id": 3916,
"interests": [
"games"
],
"user_id": 3917
},
"_type": "profile_document"
},
{
"_explanation": {
"description": "sum of:",
"details": [
{
"description": "match on required clause, product of:",
"details": [
{
"description": "# clause",
"details": [],
"value": 0.0
},
{
"description": "-id:`\b\u0000\u0000\u001eN #*:*, product of:",
"details": [
{
"description": "boost",
"details": [],
"value": 1.0
},
{
"description": "queryNorm",
"details": [],
"value": 0.9173473
}
],
"value": 0.9173473
}
],
"value": 0.0
},
{
"description": "product of:",
"details": [
{
"description": "sum of:",
"details": [
{
"description": "weight(interests:games in 0) [PerFieldSimilarity], result of:",
"details": [
{
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"description": "queryWeight, product of:",
"details": [
{
"description": "idf(docFreq=1, maxDocs=1)",
"details": [],
"value": 0.30685282
},
{
"description": "queryNorm",
"details": [],
"value": 0.9173473
}
],
"value": 0.2814906
},
{
"description": "fieldWeight in 0, product of:",
"details": [
{
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"description": "termFreq=1.0",
"details": [],
"value": 1.0
}
],
"value": 1.0
},
{
"description": "idf(docFreq=1, maxDocs=1)",
"details": [],
"value": 0.30685282
},
{
"description": "fieldNorm(doc=0)",
"details": [],
"value": 1.0
}
],
"value": 0.30685282
}
],
"value": 0.08637618
}
],
"value": 0.08637618
},
{
"description": "weight(interests:music in 0) [PerFieldSimilarity], result of:",
"details": [
{
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"description": "queryWeight, product of:",
"details": [
{
"description": "idf(docFreq=1, maxDocs=1)",
"details": [],
"value": 0.30685282
},
{
"description": "queryNorm",
"details": [],
"value": 0.9173473
}
],
"value": 0.2814906
},
{
"description": "fieldWeight in 0, product of:",
"details": [
{
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"description": "termFreq=1.0",
"details": [],
"value": 1.0
}
],
"value": 1.0
},
{
"description": "idf(docFreq=1, maxDocs=1)",
"details": [],
"value": 0.30685282
},
{
"description": "fieldNorm(doc=0)",
"details": [],
"value": 1.0
}
],
"value": 0.30685282
}
],
"value": 0.08637618
}
],
"value": 0.08637618
}
],
"value": 0.17275237
},
{
"description": "coord(2/3)",
"details": [],
"value": 0.6666667
}
],
"value": 0.11516824
}
],
"value": 0.11516824
},
"_id": "3918",
"_index": "test_44024988_profiles",
"_node": "urWXg5KhREyffYielaa6Rw",
"_score": 0.11516824,
"_shard": 4,
"_source": {
"full_name": "Alex Test",
"id": 3917,
"interests": [
"games",
"music"
],
"user_id": 3918
},
"_type": "profile_document"
},
... # not interesting doc
],
"max_score": 0.14981213,
"total": 3
},
"timed_out": false,
"took": 3
}
My input data:
[{
    "full_name": "Bob Doe",
    "id": 3916,
    "interests": [
        "games"
    ],
    "user_id": 3917
}, {
    "full_name": "Alex Test",
    "id": 3917,
    "interests": [
        "games",
        "music"
    ],
    "user_id": 3918
}, {
    "full_name": "Joe Test",
    "id": 3918,
    "user_id": 3919
}]
Let's look at the scoring formula in Elasticsearch:
score(q,d) =
queryNorm(q)
· coord(q,d)
· ∑ (
tf(t in d)
· idf(t)²
· t.getBoost()
· norm(t,d)
) (t in q)
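The reference for this formula, in case you have not seen it, describes each of its terms. The explanation for your case is quite simple, though: the score is just this formula at work, a combination of all these factors (tf, idf, queryNorm, and so on). Also, if the index is a dummy one that contains only a couple of documents, the values can look very odd.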
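To make the formula concrete, here is a minimal sketch that recomputes both scores from the numbers in the "explain" output above. The helper names (term_score, doc_score) are made up for illustration; only the numeric values come from the question.

```python
# Recompute the two scores using values copied from the "explain" output.
# term_score and doc_score are hypothetical helpers, not Elasticsearch APIs.

def term_score(idf, query_norm, tf=1.0, field_norm=1.0, boost=1.0):
    """Score of one matching term: queryWeight * fieldWeight."""
    query_weight = idf * boost * query_norm   # "queryWeight, product of:"
    field_weight = tf * idf * field_norm      # "fieldWeight in N, product of:"
    return query_weight * field_weight

def doc_score(term_scores, total_query_terms):
    """Sum of the matching term scores, scaled by coord(matched/total)."""
    coord = len(term_scores) / total_query_terms
    return coord * sum(term_scores)

# Bob Doe: matches "games" only; idf=1.0, queryNorm=0.4494364, coord(1/3)
bob_score = doc_score([term_score(1.0, 0.4494364)], 3)

# Alex Test: matches "games" and "music"; idf=0.30685282,
# queryNorm=0.9173473 for both terms, coord(2/3)
alex_score = doc_score([term_score(0.30685282, 0.9173473)] * 2, 3)

print(round(bob_score, 6), round(alex_score, 6))
```

Note that coord does favour the document with two matches (2/3 versus 1/3). The lower total comes from the much smaller idf that enters each term score twice (once in queryWeight, once in fieldWeight) for the second document.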
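I could explain it in depth, but it mainly comes down to the scoring formula. If you want to fix this, that is a different question, and you can solve it by writing a different query!

Thanks for the reply. I understand the formula, but now the question is: is the formula wrong, or are my expectations? I thought that filter should not affect the score, and that should, as a query, should move a document up.

Yes, you are right: filter does not affect the score, and that is exactly your case. You only get a score from the terms query. We could calculate tf-idf by hand and check whether it matches the formula exactly, and trust me, it will. tf-idf is tricky because it takes the rarity of terms into account; I am not telling you that the score differs from what the formula gives.

Given the formula, we agree that the score is correct, but considering what users commonly expect, I just wonder whether it is right. Maybe that is just me. Another thing is that it seems unstable. For more context: this is the result I get when running unit tests on a CI server, while on my local machine the scores are "correct" (they match my expectations). That is with the same Elasticsearch, only a different index name.

Can you provide your full sample data?

I have added it at the bottom of the original question.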
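On the instability mentioned above: in the "explain" output the two hits come from different shards (_shard 2 and _shard 4), and each reports its own docFreq and maxDocs. A minimal sketch, assuming the classic Lucene TF/IDF similarity that Elasticsearch 2.x uses by default, reproduces both idf values from those per-shard statistics. The function name classic_idf is made up for illustration.

```python
import math

def classic_idf(doc_freq, max_docs):
    """Classic Lucene TF/IDF inverse document frequency (assumed similarity)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

# Shard 2 (Bob Doe): idf(docFreq=2, maxDocs=3) in the explain output
idf_shard_2 = classic_idf(2, 3)

# Shard 4 (Alex Test): idf(docFreq=1, maxDocs=1) in the explain output
idf_shard_4 = classic_idf(1, 1)

print(idf_shard_2, idf_shard_4)
```

Because idf is computed from each shard's local statistics by default, the same document can score differently depending on which shard it lands on and what else that shard holds, which may explain the difference between the CI server and the local machine. Two commonly suggested remedies, offered here as suggestions rather than something confirmed in this thread, are running the search with search_type=dfs_query_then_fetch or creating the test index with a single shard.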