Java Lucene BM25 scoring

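I'm trying to compute the similarity of many documents with Lucene. For the similarity computation I'm using BM25 and VSM.

Besides Lucene I'm using GATE, an open-source framework for language-processing tasks.

When I compute the similarity between the documents (15 of them), I run into strange behaviour.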

With VSM, my results look like this:

Post-processing links before ranking
Ranking all links by similarities
3/54 links above similarity 0.15 threshold
54/54 top-most 1.0 similar links
Post-processing links after ranking
Traced 3 link(s) in 9x6 space:
Link = [12695.xml(0,58320)@Bug[15009] | 12713.xml(0,18247)@Feature[1974]]@[1.6188]
Link = [5822.xml(0,10098)@Bug[1434] | 12713.xml(0,18247)@Feature[1974]]@[1.5119]
Link = [12694.xml(0,1504)@Bug[188] | 12713.xml(0,18247)@Feature[1974]]@[0.2702]
Clearing previous runtime results...

Score breakdown:
6.860396E-7 = (MATCH) max of:
  0.0 = (MATCH) MatchAllDocsQuery, product of:
    0.0 = boost
    0.0032560423 = queryNorm
  6.860396E-7 = (MATCH) product of:
    0.0034322562 = (MATCH) sum of:
      0.0017054792 = (MATCH) weight(TERM:http in 1) [DefaultSimilarity], result of:
        0.0017054792 = score(doc=1,freq=2.0), product of:
          0.0045762537 = queryWeight, product of:
            1.4054651 = idf(docFreq=3, maxDocs=6)
            0.0032560423 = queryNorm
          0.37268022 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.4054651 = idf(docFreq=3, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
      8.6338853E-4 = (MATCH) weight(TERM:use in 1) [DefaultSimilarity], result of:
        8.6338853E-4 = score(doc=1,freq=2.0), product of:
          0.0032560423 = queryWeight, product of:
            1.0 = idf(docFreq=5, maxDocs=6)
            0.0032560423 = queryNorm
          0.26516503 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.0 = idf(docFreq=5, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
      8.6338853E-4 = (MATCH) weight(TERM:use in 1) [DefaultSimilarity], result of:
        8.6338853E-4 = score(doc=1,freq=2.0), product of:
          0.0032560423 = queryWeight, product of:
            1.0 = idf(docFreq=5, maxDocs=6)
            0.0032560423 = queryNorm
          0.26516503 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.0 = idf(docFreq=5, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
    1.9988007E-4 = coord(3/15009)
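
The breakdown above can be reproduced by hand, which makes it easier to see where the very small final value comes from. The following is a minimal sketch in plain Java (no Lucene dependency), assuming the classic TF-IDF formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The class and method names are hypothetical; all inputs are copied from the explain output.

```java
// Sketch only: recomputes the DefaultSimilarity explain values by hand.
// ClassicScoreSketch and its methods are hypothetical names, not Lucene API.
final class ClassicScoreSketch {
    // Assumed classic term-frequency factor: sqrt(freq).
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Assumed classic idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Per-term score as shown in the breakdown: queryWeight * fieldWeight.
    static double termScore(double freq, int docFreq, int maxDocs,
                            double queryNorm, double fieldNorm) {
        double idf = idf(docFreq, maxDocs);
        double queryWeight = idf * queryNorm;            // idf * queryNorm
        double fieldWeight = tf(freq) * idf * fieldNorm; // tf * idf * fieldNorm
        return queryWeight * fieldWeight;
    }

    // Coordination factor: matched query terms / total query terms.
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    public static void main(String[] args) {
        double queryNorm = 0.0032560423; // from the explain output
        double fieldNorm = 0.1875;       // from the explain output
        double http = termScore(2.0, 3, 6, queryNorm, fieldNorm);
        double use = termScore(2.0, 5, 6, queryNorm, fieldNorm);
        // The breakdown lists the "use" term twice, so it is summed twice.
        double sum = http + use + use;
        double total = sum * coord(3, 15009);
        System.out.printf("http=%.10f use=%.10f sum=%.10f total=%.4e%n",
                http, use, sum, total);
    }
}
```

Under these assumptions, every term weight is already scaled down by queryNorm, and the sum is scaled down again by coord(3/15009), which is how the score ends up near 6.86E-7.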
With BM25, I get some strange behaviour:

Post-processing links before ranking
Ranking all links by similarities
40/54 links above similarity 0.15 threshold
54/54 top-most 1.0 similar links
Post-processing links after ranking
Traced 40 link(s) in 9x6 space:
Link = [12695.xml(0,58320)@Bug[15009] | 12713.xml(0,18247)@Feature[1974]]@[10768.2471]
Link = [5822.xml(0,10098)@Bug[1434] | 12713.xml(0,18247)@Feature[1974]]@[1798.1300]
Link = [12695.xml(0,58320)@Bug[15009] | 13091.xml(0,1721)@Feature[216]]@[965.0315]
Link = [5822.xml(0,10098)@Bug[1434] | 13091.xml(0,1721)@Feature[216]]@[372.0819]
Link = [12694.xml(0,1504)@Bug[188] | 12713.xml(0,18247)@Feature[1974]]@[174.2649]
Link = [12695.xml(0,58320)@Bug[15009] | 12700.xml(0,410)@Feature[36]]@[97.6378]
Link = [5822.xml(0,10098)@Bug[1434] | 1910.xml(0,237)@Feature[21]]@[46.4066]
Link = [12694.xml(0,1504)@Bug[188] | 13091.xml(0,1721)@Feature[216]]@[35.8532]
Link = [5822.xml(0,10098)@Bug[1434] | 12701.xml(0,137)@Feature[14]]@[29.6364]
Link = [12698.xml(0,362)@Bug[56] | 12713.xml(0,18247)@Feature[1974]]@[22.4652]
Link = [132.xml(0,409)@Bug[33] | 12713.xml(0,18247)@Feature[1974]]@[21.1697]
Link = [5822.xml(0,10098)@Bug[1434] | 12700.xml(0,410)@Feature[36]]@[16.7317]
Link = [132.xml(0,409)@Bug[33] | 13091.xml(0,1721)@Feature[216]]@[15.8749]
Link = [12697.xml(0,257)@Bug[34] | 12713.xml(0,18247)@Feature[1974]]@[15.5943]
Link = [12696.xml(0,272)@Bug[40] | 12713.xml(0,18247)@Feature[1974]]@[14.8670]
Link = [5822.xml(0,10098)@Bug[1434] | 12702.xml(0,88)@Feature[9]]@[14.8045]
Link = [12694.xml(0,1504)@Bug[188] | 1910.xml(0,237)@Feature[21]]@[13.8415]
Link = [12694.xml(0,1504)@Bug[188] | 12700.xml(0,410)@Feature[36]]@[11.7942]
Link = [12703.xml(0,331)@Bug[43] | 12713.xml(0,18247)@Feature[1974]]@[11.2949]
Link = [12699.xml(0,616)@Bug[67] | 12713.xml(0,18247)@Feature[1974]]@[9.4193]
Link = [12695.xml(0,58320)@Bug[15009] | 12701.xml(0,137)@Feature[14]]@[8.6146]
Link = [12699.xml(0,616)@Bug[67] | 13091.xml(0,1721)@Feature[216]]@[7.1386]
Link = [12695.xml(0,58320)@Bug[15009] | 1910.xml(0,237)@Feature[21]]@[5.9274]
Link = [12698.xml(0,362)@Bug[56] | 13091.xml(0,1721)@Feature[216]]@[4.4054]
Link = [12699.xml(0,616)@Bug[67] | 12700.xml(0,410)@Feature[36]]@[4.0292]
Link = [12703.xml(0,331)@Bug[43] | 13091.xml(0,1721)@Feature[216]]@[3.3257]
Link = [12696.xml(0,272)@Bug[40] | 13091.xml(0,1721)@Feature[216]]@[2.5366]
Link = [12695.xml(0,58320)@Bug[15009] | 12702.xml(0,88)@Feature[9]]@[2.2157]
Link = [12699.xml(0,616)@Bug[67] | 1910.xml(0,237)@Feature[21]]@[2.0420]
Link = [12697.xml(0,257)@Bug[34] | 13091.xml(0,1721)@Feature[216]]@[0.9461]
Link = [12694.xml(0,1504)@Bug[188] | 12702.xml(0,88)@Feature[9]]@[0.9092]
Link = [12694.xml(0,1504)@Bug[188] | 12701.xml(0,137)@Feature[14]]@[0.8928]
Link = [12697.xml(0,257)@Bug[34] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12696.xml(0,272)@Bug[40] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12703.xml(0,331)@Bug[43] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12698.xml(0,362)@Bug[56] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12703.xml(0,331)@Bug[43] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12698.xml(0,362)@Bug[56] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12696.xml(0,272)@Bug[40] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12697.xml(0,257)@Bug[34] | 12701.xml(0,137)@Feature[14]]@[0.8178]
BM25 links everything because of its "good" (that is, high) results. The explanation looks like this:

Score breakdown:
2.2157059 = (MATCH) max of:
  0.0 = (MATCH) MatchAllDocsQuery, product of:
    0.0 = boost
    1.0 = queryNorm
  2.2157059 = (MATCH) sum of:
    1.3065486 = (MATCH) weight(TERM:http in 1) [BM25Similarity], result of:
      1.3065486 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
        0.6931472 = idf(docFreq=3, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
    0.4545787 = (MATCH) weight(TERM:use in 1) [BM25Similarity], result of:
      0.4545787 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
        0.24116206 = idf(docFreq=5, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
    0.4545787 = (MATCH) weight(TERM:use in 1) [BM25Similarity], result of:
      0.4545787 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
        0.24116206 = idf(docFreq=5, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
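
The BM25 breakdown can be recomputed the same way. This is a minimal sketch in plain Java, assuming idf = ln(1 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5)) and tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)). The class and method names are hypothetical; the inputs are copied from the explain output. Note that the breakdown contains no coord factor, and the only queryNorm shown is 1.0.

```java
// Sketch only: recomputes the BM25Similarity explain values by hand.
// Bm25ScoreSketch and its methods are hypothetical names, not Lucene API.
final class Bm25ScoreSketch {
    // Assumed BM25 idf: ln(1 + (N - n + 0.5) / (n + 0.5)).
    static double idf(int docFreq, int maxDocs) {
        return Math.log(1.0 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Assumed BM25 term-frequency normalization with length penalty.
    static double tfNorm(double freq, double k1, double b,
                         double fieldLength, double avgFieldLength) {
        double lengthPenalty = 1.0 - b + b * fieldLength / avgFieldLength;
        return freq * (k1 + 1.0) / (freq + k1 * lengthPenalty);
    }

    // Per-term score as shown in the breakdown: idf * tfNorm.
    static double termScore(double freq, int docFreq, int maxDocs, double k1,
                            double b, double fieldLength, double avgFieldLength) {
        return idf(docFreq, maxDocs)
                * tfNorm(freq, k1, b, fieldLength, avgFieldLength);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75;     // parameters from the explain output
        double fieldLength = 28.444445; // from the explain output
        double avgFieldLength = 746.8333;
        double http = termScore(2.0, 3, 6, k1, b, fieldLength, avgFieldLength);
        double use = termScore(2.0, 5, 6, k1, b, fieldLength, avgFieldLength);
        // The breakdown lists the "use" term twice, so it is summed twice.
        double sum = http + use + use;
        System.out.printf("http=%.7f use=%.7f sum=%.7f%n", http, use, sum);
    }
}
```

The per-term values are of order 1 and are simply summed, so a query with many matching terms can produce a very large raw score. As a side observation, the fieldLength 28.444445 is numerically equal to 1 / 0.1875², the inverse square of the fieldNorm reported in the VSM breakdown.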
For debugging, I disabled term boosts and other things to see the real results. Normally, any value above 1 or below 0 is normalized (clamped) to 1 or 0.
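
The clamping described above can be sketched as follows. This is only an illustration in plain Java; the class and method names are hypothetical, and the sample scores are taken from the link lists in this post.

```java
// Sketch only: clamps a raw similarity score into the range [0, 1].
// ScoreClampSketch is a hypothetical name, not part of Lucene or GATE.
final class ScoreClampSketch {
    static double clamp01(double score) {
        return Math.max(0.0, Math.min(1.0, score));
    }

    public static void main(String[] args) {
        // Raw BM25 score of the top link, and a VSM score below 1.
        System.out.println(clamp01(10768.2471));
        System.out.println(clamp01(0.2702));
    }
}
```

With clamping of this kind, every raw score above 1 collapses to the same value, so the unbounded BM25 scores would no longer be distinguishable from each other above that limit.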

I'm using Lucene 5.0.0. The documents are just ordinary tickets that contain references to other tickets.

The similarities are created like this:

new BM25Similarity(k1, b); where k1 = 1.2 and b = 0.75 (defaults). (BM25)
new DefaultSimilarity() (VSM)
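How can the scores be so different? As far as I can see, everything VSM computes is much smaller.

Has anyone run into this strange behaviour?

I appreciate any help.

--EDIT

I'm also wondering why queryNorm equals 1.0 for every query with BM25, while with VSM it is different for each query.

According to this:

queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.

It should always be the same, right?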
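
The quoted claim that queryNorm does not affect ranking can be checked with a small experiment. The sketch below is plain Java and assumes the classic TF-IDF definition queryNorm = 1 / sqrt(sumOfSquaredWeights); the class name, method names, and sample scores are hypothetical. The three idf values are the ones visible in the VSM breakdown, so the resulting norm will not match 0.0032560423, which was computed over the full query.

```java
// Sketch only: shows that multiplying all scores by one queryNorm
// leaves the ranking unchanged. QueryNormSketch is a hypothetical name.
final class QueryNormSketch {
    // Assumed classic definition: 1 / sqrt(sum of squared term weights).
    static double classicQueryNorm(double[] termWeights) {
        double sumOfSquares = 0.0;
        for (double w : termWeights) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // Multiplies every score by the same factor.
    static double[] scale(double[] scores, double factor) {
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] * factor;
        }
        return out;
    }

    // Returns document indices ordered by descending score.
    static Integer[] rank(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        java.util.Arrays.sort(order,
                (a, b) -> Double.compare(scores[b], scores[a]));
        return order;
    }

    public static void main(String[] args) {
        double norm = classicQueryNorm(new double[] {1.4054651, 1.0, 1.0});
        double[] raw = {3.0, 1.0, 2.0}; // made-up raw scores for three docs
        double[] scaled = scale(raw, norm);
        System.out.println(java.util.Arrays.toString(rank(raw)));
        System.out.println(java.util.Arrays.toString(rank(scaled)));
    }
}
```

Because the factor depends only on the query, it differs from query to query under this definition but is identical for all documents within one query. A similarity that does not normalize at all would simply report a constant 1.0, which is consistent with the BM25 explain output shown earlier.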