Java: score of wildcard matches does not match the score of exact matches
I search for
recording:luve OR recording:luve*
Below is the explain output of the search:
DocNo:0:1.4196585:11111111-1cf0-4d1f-aca7-2a6f89e34b36
1.4196585 = (MATCH) max plus 0.1 times others of:
  0.3763506 = (MATCH) ConstantScore(recording:luve*), product of:
    1.0 = boost
    0.3763506 = queryNorm
  1.3820235 = (MATCH) weight(recording:luve in 0), product of:
    0.7211972 = queryWeight(recording:luve), product of:
      1.9162908 = idf(docFreq=1, maxDocs=5)
      0.3763506 = queryNorm
    1.9162908 = (MATCH) fieldWeight(recording:luve in 0), product of:
      1.0 = tf(termFreq(recording:luve)=1)
      1.9162908 = idf(docFreq=1, maxDocs=5)
      1.0 = fieldNorm(field=recording, doc=0)
DocNo:1:0.3763506:22222222-1cf0-4d1f-aca7-2a6f89e34b36
0.3763506 = (MATCH) max plus 0.1 times others of:
  0.3763506 = (MATCH) ConstantScore(recording:luve*), product of:
    1.0 = boost
    0.3763506 = queryNorm
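In my test I have 5 documents: one contains an exact match, another contains a wildcard match, and the other three do not match at all. The exact match scores 1.4, whereas the wildcard match scores 0.37, which is almost a factor of 4. With a larger index, the exact match on a rare term would score even higher compared with the wildcard match.
The whole difference comes from the different scoring mechanisms used for exact and wildcard matches: wildcards do not take tf/idf or lengthNorm into account, you just get a constant score for every match. I am not worried about tf or lengthNorm in my data domain, as they make little difference, but the idf score is a real killer. Because the matching term is found once in 5 documents, its idf contribution is idf squared, i.e. about 3.67 (1.9162908 squared).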
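The numbers in the explain output can be re-derived with plain arithmetic, which makes it easier to see where the factor of almost 4 comes from. The following is a minimal sketch without any Lucene dependency. The class and method names are made up for illustration, queryNorm is simply copied from the explain output rather than computed, and the idf formula 1 + ln(numDocs / (docFreq + 1)) is the classic TF-IDF one, which reproduces the 1.9162908 shown above.

```java
/** Re-derives the explain output above with plain arithmetic (hypothetical helper, no Lucene needed). */
final class ExplainArithmetic {

    /** Classic Lucene TF-IDF idf: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Score of the exact term clause: queryWeight * fieldWeight, so idf enters twice. */
    static double termScore(double idf, double tf, double fieldNorm, double queryNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    /** Score of a constant-score clause: boost * queryNorm, no tf/idf/lengthNorm. */
    static double constantScore(double boost, double queryNorm) {
        return boost * queryNorm;
    }

    /** "max plus tieBreaker times others": best clause plus a fraction of the rest. */
    static double disMax(double tieBreaker, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        double queryNorm = 0.3763506;          // copied from the explain output
        double idf = idf(1, 5);                // docFreq=1, maxDocs=5
        double exact = termScore(idf, 1.0, 1.0, queryNorm);
        double wildcard = constantScore(1.0, queryNorm);

        // Doc 0 matches both clauses, doc 1 only the wildcard clause.
        System.out.println("doc 0: " + disMax(0.1, wildcard, exact));
        System.out.println("doc 1: " + disMax(0.1, wildcard));
        System.out.println("ratio: " + disMax(0.1, wildcard, exact) / disMax(0.1, wildcard));
    }
}
```

The exact clause carries idf twice (once in queryWeight, once in fieldWeight), while the constant-score clause carries it not at all, so the gap grows as the searched term gets rarer.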
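I know this constant score is quicker than calculating tf*idf*lengthNorm for every wildcard match, but it does not make sense to me that idf contributes so much to the score. I also know I could change the rewrite method, but this has two problems
- Doc 0:0:1.692
- Doc 1:0:1.419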
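// Imports belong at the top of the enclosing file.
// Package names assume the Lucene 3.x layout; adjust them for other versions.
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopTermsRewrite;

// Nested inside an outer class. The type parameter Q is not used by the class body.
public static class MultiTermUseIdfOfSearchTerm<Q extends Query> extends TopTermsRewrite<BooleanQuery> {

    private final Similarity similarity;

    /**
     * Create a MultiTermUseIdfOfSearchTerm rewrite for
     * at most <code>size</code> terms.
     * <p>
     * NOTE: if {@link BooleanQuery#getMaxClauseCount} is smaller than
     * <code>size</code>, then it will be used instead.
     */
    public MultiTermUseIdfOfSearchTerm(int size) {
        super(size);
        this.similarity = new DefaultSimilarity();
    }

    @Override
    protected int getMaxSize() {
        return BooleanQuery.getMaxClauseCount();
    }

    @Override
    protected BooleanQuery getTopLevelQuery() {
        // true disables the coord factor for the expanded clauses
        return new BooleanQuery(true);
    }

    @Override
    protected void addClause(BooleanQuery topLevel, Term term, float boost) {
        // Each expanded term gets a constant score; its own tf/idf is ignored.
        final Query tq = new ConstantScoreQuery(new TermQuery(term));
        tq.setBoost(boost);
        topLevel.add(tq, BooleanClause.Occur.SHOULD);
    }

    /**
     * Returns the squared idf of the typed prefix, looked up as an exact term.
     * Falls back to 1 if the query is not a PrefixQuery or the prefix is not
     * itself a term in the index.
     */
    protected float getQueryBoost(final IndexReader reader, final MultiTermQuery query)
            throws IOException {
        float idfSquared = 1f;
        if (query instanceof PrefixQuery) {
            PrefixQuery fq = (PrefixQuery) query;
            int df = reader.docFreq(fq.getPrefix());
            if (df >= 1) {
                idfSquared = (float) Math.pow(similarity.idf(df, reader.numDocs()), 2);
            }
        }
        return idfSquared;
    }

    @Override
    public BooleanQuery rewrite(final IndexReader reader, final MultiTermQuery query) throws IOException {
        BooleanQuery bq = (BooleanQuery) super.rewrite(reader, query);
        float idfBoost = getQueryBoost(reader, query);
        // Scale every expanded clause by the idf of the term that was searched for.
        for (BooleanClause clause : bq) {
            Query q = clause.getQuery();
            q.setBoost(q.getBoost() * idfBoost);
        }
        return bq;
    }
}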
OK, setting this class as the rewrite method for prefix queries seems to work.
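To make the effect of the MultiTermUseIdfOfSearchTerm rewrite concrete, here is a small sketch in plain Java, without Lucene, of the boost that getQueryBoost computes and how it is applied to every expanded term. Everything in it is hypothetical: the class name, the docFreqs map standing in for the index, and the document frequency given to luvey. It only mirrors the logic, assuming each expanded clause starts with a boost of 1.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java mirror of the boost logic in the rewrite class (hypothetical, for illustration only). */
final class SearchTermIdfBoost {

    /** Classic Lucene TF-IDF idf: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** idf squared of the typed prefix when it exists as a term, otherwise 1. */
    static double queryBoost(String prefix, Map<String, Integer> docFreqs, int numDocs) {
        int df = docFreqs.getOrDefault(prefix, 0);
        if (df < 1) {
            return 1.0;
        }
        double idf = idf(df, numDocs);
        return idf * idf;
    }

    /** Every term starting with the prefix gets the same boost, whatever its own rarity. */
    static Map<String, Double> clauseBoosts(String prefix, Map<String, Integer> docFreqs, int numDocs) {
        double boost = queryBoost(prefix, docFreqs, numDocs);
        Map<String, Double> boosts = new TreeMap<>();
        for (String term : docFreqs.keySet()) {
            if (term.startsWith(prefix)) {
                boosts.put(term, 1.0 * boost); // constant score 1.0 scaled by the search term's idf^2
            }
        }
        return boosts;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("luve", 1);   // exact term, found in one of five documents
        docFreqs.put("luvey", 1);  // hypothetical prefix match
        docFreqs.put("other", 3);
        System.out.println(clauseBoosts("luve", docFreqs, 5));
    }
}
```

Because the boost depends only on the term the user typed, a rarer expanded term such as luvey cannot overtake luve through its own idf, which is the behaviour the class name describes.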
Do you really need wildcards? Uwe (a Lucene committer) says: "There is a simple Lucene rule: whenever you need wildcards, think about your analysis, you are probably doing something wrong."
When I say wildcard, it is actually a prefix query (so the wildcard is only needed at the end). But yes, I do need them: the user types "luve" in the UI, but they expect to also find words starting with what they typed.
No, unfortunately: see problem 2 above. If luvey is a rarer term than luve, documents containing luvey may score higher than those containing luve, even though we are searching for luve.
You could still use a boolean query with a high-boost exact clause and a low-boost wildcard clause:
recording:luve^10 recording:luve*
This does not solve the problem consistently: when the idf of the two terms is similar, the exact match scores ten times the wildcard match, but when the wildcard match has a better idf the results are closer. So it works well in some cases but not in others. Other people have run into this problem and tried solutions; I am just struggling to resolve it.
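The inconsistency of a fixed ^10 boost can be sketched numerically. This is a rough illustration only, assuming the wildcard clause is rewritten with a scoring rewrite (so each expanded term is scored with its own idf), and that tf, fieldNorm and queryNorm are equal for both documents so they cancel out of the ratio. The class name and the document frequencies are invented for the example.

```java
/** Ratio between a boosted exact match and a scored prefix match (hypothetical numbers). */
final class BoostRatio {

    /** Classic Lucene TF-IDF idf: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /**
     * Exact clause contributes boost * idf(exact)^2, the prefix match idf(match)^2.
     * A ratio above 1 means the exact match ranks first.
     */
    static double exactToPrefixRatio(double exactBoost, int exactDf, int prefixMatchDf, int numDocs) {
        double exactIdf = idf(exactDf, numDocs);
        double matchIdf = idf(prefixMatchDf, numDocs);
        return exactBoost * exactIdf * exactIdf / (matchIdf * matchIdf);
    }

    public static void main(String[] args) {
        // Both terms equally common: the exact match wins by the full boost.
        System.out.println(exactToPrefixRatio(10.0, 50, 50, 1000));
        // Common exact term, very rare prefix match: the advantage shrinks or even flips.
        System.out.println(exactToPrefixRatio(10.0, 500, 1, 1000));
    }
}
```

With equal document frequencies the ratio is exactly the boost; as the expanded term becomes rarer than the typed term, the ratio falls, and no single boost value fixes that for every pair of terms.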