MongoDB full-text search score: what does the score mean?


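I'm working on a MongoDB project for school. I have a collection of sentences, and I run an ordinary text search to find the most similar sentence in the collection, based on the score.

I run this query

When I query with a sentence, I get results like these: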

"that kicking a dog causes it pain"
----Matched With
"that kicking a dog causes it pain – is not very controversial."
----Give a Result of:
*score: 2.4*


"This sentence have nothing to do with any other"
----Matched With
"Who is the “He” in this sentence?"
----Give a result of:
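*Score: 1.0*

What is the score value? What does it mean? What should I do if I only want to show results with a similarity of 70% or higher?

How can I interpret the score so that I can display a similarity percentage? I'm using C# for this, but don't worry about the implementation; a pseudocode solution is fine.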

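Text search assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.

For each indexed field in the document, MongoDB multiplies the number of matches by the weight and sums the results. Using this sum, MongoDB then calculates the score for the document.

The default weight of an indexed field is 1.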
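To see the score MongoDB assigned to each match, the query can project `$meta: "textScore"`. The sketch below only builds the filter, projection and sort documents as plain dictionaries. `build_text_search` and the collection name in the comment are hypothetical; only `$text`, `$search` and `$meta: "textScore"` are MongoDB operators. It is not necessarily the query used in the question.

```python
def build_text_search(search_text):
    """Return (filter, projection, sort) documents for a $text query.

    Hypothetical helper: it only assembles the documents, it does not
    talk to a database.
    """
    query_filter = {"$text": {"$search": search_text}}
    # Ask the server to include the relevance score in each result.
    projection = {"score": {"$meta": "textScore"}}
    # Sort by that same score so the best match comes first.
    sort = [("score", {"$meta": "textScore"})]
    return query_filter, projection, sort


query_filter, projection, sort = build_text_search(
    "that kicking a dog causes it pain"
)
# With pymongo and a text index on the searched field, these could be used as:
#   cursor = db.sentences.find(query_filter, projection).sort(sort)
# where `db.sentences` is an assumed collection name.
```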


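When you use a MongoDB text index, it generates a score for every matching document. This score indicates how strongly the search string matches the document. The higher the score, the more likely the document is similar to the searched text. The score is calculated as follows: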

Step 1: Let the search text = S
Step 2: Break S into tokens (if you are not doing a phrase search), say T1, T2, ..., Tn. Apply stemming to each token
Step 3: For every search token, calculate the score per indexed field of the text index as follows:

score = (weight * data.freq * coeff * adjustment);

Where:
weight = user-defined weight for the field. Default is 1 when no weight is specified
data.freq = how frequently the search token appears in the text
coeff = (0.5 * data.count / numTokens) + 0.5
data.count = number of matching tokens
numTokens = total number of tokens in the text
adjustment = 1 by default. If the search token is exactly equal to the document field, then adjustment = 1.1
Step 4: The final score of the document is calculated by adding the scores of all tokens per text index field
Total Score = score(T1) + score(T2) + ... + score(Tn)
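The steps above can be sketched in a few lines of Python. This is a minimal illustration of the formula as described here, not MongoDB's actual implementation. It assumes both token lists have already had stop words removed and stemming applied, and that a repeated query token is scored only once. `field_score` and its parameters are hypothetical names.

```python
from collections import Counter


def field_score(query_tokens, field_tokens, raw_field="", weight=1.0):
    """Score one indexed field against a query, following the steps above.

    Assumes both token lists are already stop-word-filtered and stemmed.
    """
    num_tokens = len(field_tokens)
    if num_tokens == 0:
        return 0.0
    counts = Counter(field_tokens)  # data.count for every term in the field

    total = 0.0
    for term in set(query_tokens):  # assumption: duplicates count once
        count = counts.get(term, 0)
        if count == 0:
            continue  # token does not occur in the field, contributes 0
        # data.freq: the i-th occurrence (i = 0, 1, ...) contributes 1 / 2**i
        freq = sum(1.0 / 2**i for i in range(count))
        coeff = 0.5 * count / num_tokens + 0.5
        # Small bonus when the token is the entire field.
        adjustment = 1.1 if raw_field == term else 1.0
        total += weight * freq * coeff * adjustment
    return total


# Second pair from the question: only "sentence" survives in the document.
print(field_score(["sentence", "nothing"], ["sentence"]))  # 1.0
```

Because the coefficient depends on `numTokens`, the same matching words score lower in a longer field than in a shorter one.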
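So, as described above, the score is affected by the following factors:

  • The number of terms that match the actual search text (the more matches, the higher the score)
  • The number of tokens in the document field
  • Whether the searched text exactly matches the document field

Here is the derivation for one of your documents: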

    Search String = This sentence have nothing to do with any other
    Document = Who is the “He” in this sentence?
    
    Score Calculation:
    Step 1: Tokenize the search string. Apply stemming and remove stop words.
        Token 1: "sentence"
        Token 2: "nothing"
    Step 2: For every search token obtained in Step 1, do steps 3-11:
            
          Step 3: Take Sample Document and Remove Stop Words
                Input Document:  Who is the “He” in this sentence?
                Document after stop word removal: "sentence"
          Step 4: Apply Stemming 
            Document in Step 3: "sentence"
            After Stemming : "sentence"
          Step 5: Calculate data.count per search token
                  data.count(sentence) = 1
                  data.count(nothing) = 0
          Step 6: Calculate the total number of tokens in the document
                  numTokens = 1
          Step 7: Calculate the coefficient per search token
                  coeff = (0.5 * data.count / numTokens) + 0.5
                  coeff(sentence) = 0.5*(1/1) + 0.5 = 1.0
                  coeff(nothing) = 0.5*(0/1) + 0.5 = 0.5
          Step 8: Calculate the adjustment per search token (adjustment is 1 by default; it is 1.1 only if the search text exactly matches the raw document)
                  adjustment(sentence) = 1
                  adjustment(nothing) = 1
          Step 9: Weight of the field (1 is the default weight)
                  weight = 1
          Step 10: Calculate the frequency of the search token in the document (data.freq)
               For every ith occurrence (counting from i = 0), the data frequency = 1/(2^i). All occurrences are summed.
                a. data.freq(sentence) = 1/(2^0) = 1
                b. data.freq(nothing) = 0
          Step 11: Calculate the score per search token per field:
             score = (weight * data.freq * coeff * adjustment);
             score(sentence) = (1 * 1 * 1.0 * 1.0) = 1.0
             score(nothing) = (1 * 0 * 0.5 * 1.0) = 0
    Step 12: Add the individual scores of every search-string token to get the total score
    Total score = score(sentence) + score(nothing) = 1.0 + 0.0 = 1.0
    
    In the same way, you can derive the other one.
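As a sketch of that second derivation, take the first pair from the question. Assume that "that", "a", "it", "is", "not" and "very" are dropped as stop words (an assumption about the English stop-word list, not something stated above). The document then keeps five tokens and the query keeps four, each occurring once in the document. Stemming changes the token spellings but not these counts.

```latex
\begin{aligned}
\text{numTokens} &= 5 \quad (\text{kicking, dog, causes, pain, controversial}) \\
\text{coeff} &= 0.5 \cdot \tfrac{1}{5} + 0.5 = 0.6 \\
\text{score}(T_i) &= 1 \cdot 1 \cdot 0.6 \cdot 1 = 0.6 \quad \text{for each of kicking, dog, causes, pain} \\
\text{Total score} &= 4 \cdot 0.6 = 2.4
\end{aligned}
```

Under these assumptions the result agrees with the 2.4 reported in the question.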

    For a more detailed analysis of MongoDB, check:

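What does 70% similarity mean? What kind of score do you want to use to measure similarity?

I'm actually trying to build plagiarism-detection software: you upload your document, and each sentence is compared against a collection of sentences. When the highest-scoring sentence is 70% similar or more, plagiarism is likely.

@NasriYatim did you find it?

Hi Nasri, I'm also new to MongoDB. I need to search for the name "Raja Sekar" in the name field, which I have indexed. But my condition is that the search term should match records with 75% similarity. Can you help me with this?

Rather than copying, it would be helpful to explain with an example.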