Python 用权重标准化排名分数

Python 用权重标准化排名分数,python,nlp,nltk,normalize,cosine-similarity,Python,Nlp,Nltk,Normalize,Cosine Similarity,我正在处理一个文档搜索问题,给定一组文档和一个搜索查询,我希望找到最接近查询的文档。我使用的模型基于scikit中的TfidfVectorizer。我使用4种不同类型的标记化器为所有文档创建了4种不同的tf_idf向量。每个标记器将字符串拆分为n克,其中n在范围1。。。4 . 例如: doc_1 = "Singularity is still a confusing phenomenon in physics" doc_2 = "Quantum theory still wins over S

我正在处理一个文档搜索问题,给定一组文档和一个搜索查询,我希望找到最接近查询的文档。我使用的模型基于scikit中的TfidfVectorizer。我使用4种不同类型的标记化器为所有文档创建了4种不同的tf_idf向量。每个标记器将字符串拆分为n克,其中n在范围1。。。4 .

例如:

doc_1 = "Singularity is still a confusing phenomenon in physics"
doc_2 = "Quantum theory still wins over String theory"
1-gram similarity = 0.4370303325246957
2-gram similarity = 0.36617374546988996
3-gram similarity = 0.29519246156322099
4-gram similarity = 0.2902998188509896
因此,模型1将使用1克标记器,模型2将使用2克标记器

接下来,对于给定的搜索查询,我使用这4个模型计算搜索词和所有其他文档之间的余弦相似性

例如,搜索查询:量子物理中的奇点。 搜索查询被分解为n-gram,并根据相应的n-gram模型计算tf_idf值

因此,基于所使用的n-gram模型,对于每个查询文档对,我有4个相似性值。 例如:

doc_1 = "Singularity is still a confusing phenomenon in physics"
doc_2 = "Quantum theory still wins over String theory"
1-gram similarity = 0.4370303325246957
2-gram similarity = 0.36617374546988996
3-gram similarity = 0.29519246156322099
4-gram similarity = 0.2902998188509896
所有这些相似性分数都在0到1的范围内标准化。现在我想计算一个聚合的规范化分数,这样对于任何查询文档对,较高的n-gram相似度会得到一个非常高的权重。基本上,ngram相似度越高,对总分的影响越大


有人能提出一个解决方案吗?

有很多方法可以解决这些问题:

>>> onegram_sim = 0.43
>>> twogram_sim = 0.36
>>> threegram_sim = 0.29
>>> fourgram_sim = 0.29
# Sum(x) / len(list)
>>> all_sim = sum([onegram_sim, twogram_sim, threegram_sim, fourgram_sim]) / 4
>>> all_sim
0.3425
# Sum(x*x) / len(list)
>>> all_sim = sum(map(lambda x: x**2, [onegram_sim, twogram_sim, threegram_sim, fourgram_sim])) / 4
>>> all_sim
0.120675
# Product(x)
>>> from operator import mul
>>> onetofour_sim = [onegram_sim, twogram_sim, threegram_sim, fourgram_sim]
>>> reduce(mul, onetofour_sim, 1)
0.013018679999999998
最终,任何能让你在最终任务中获得更准确分数的东西都是最好的解决方案


除了你的问题之外:

要计算文档相似性,有一个长期运行的SemEval任务调用语义文本相似性

常用策略包括(并非详尽):

  • 使用带有相似性分数的标注语料库,提取一些特征,训练回归器,并输出相似性分数

  • 使用某种向量空间语义(强烈推荐阅读:),然后做一些向量相似性评分(看看)

    一,。向量空间语义行话的一个子集将派上用场(有时称为单词嵌入),有时人们用主题模型/神经网络/深度学习(其他相关的流行词)来训练向量空间,请参阅

    二,。您还可以使用更传统的词向量包,使用TF-IDF或任何其他“潜在”降维压缩空间,然后使用一些向量相似度函数来获得相似度

    iii.创建一个奇特的向量相似性函数(例如,
    cosmul
    ,请参见),然后调整该函数,并在不同空间对其进行评估

  • 使用一些包含概念本体的词汇资源(例如WordNet、Cyc等),然后通过遍历概念图来比较相似度(请参阅)。例如


  • 以上述内容为背景,在没有注释的情况下,让我们尝试破解一些向量空间示例:

    首先,让我们尝试使用简单二进制向量的普通NGRAM:

    import numpy as np
    from nltk import ngrams
    
    doc1 = "Singularity is still a confusing phenomenon in physics".split()
    doc2 = "Quantum theory still wins over String theory".split()
    _vec1 = list(ngrams(doc1, 3))
    _vec2 = list(ngrams(doc2, 3))
    # Create a full dictionary of all possible ngrams.
    vec_dict = list(set(_vec1).union(_vec2))
    print 'Vector Dict:', vec_dict
    # Now vectorize the documents
    vec1 = [1 if ng in _vec1 else 0 for ng in vec_dict]
    vec2 = [1 if ng in _vec2 else 0 for ng in vec_dict]
    print 'Vectorzied:', vec1, vec2
    print 'Similarity:', np.dot(vec1, vec2)
    
    [out]:

    Vector Dict: [('still', 'a', 'confusing'), ('confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('is', 'still', 'a'), ('over', 'String', 'theory'), ('a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity', 'is', 'still'), ('still', 'wins', 'over'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still')] 
    
    Vectorzied: [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0] [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1] 
    
    Similarity: 0 
    
    Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')] 
    
    Vectorzied: [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0] [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1] 
    
    Similarity: 1 
    
    现在,让我们尝试从1gram到NGram(其中
    n=len(sent)
    ),并使用二进制NGram将所有内容放入向量字典中:

    import numpy as np
    from nltk import ngrams
    
    def everygrams(sequence):
        """
        This function returns all possible ngrams for n 
        ranging from 1 to len(sequence).
        >>> list(everygrams('a b c'.split()))
        [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
        """
        for n in range(1, len(sequence)+1):
            for ng in ngrams(sequence, n):
                yield ng
    
    doc1 = "Singularity is still a confusing phenomenon in physics".split()
    doc2 = "Quantum theory still wins over String theory".split()
    _vec1 = list(everygrams(doc1))
    _vec2 = list(everygrams(doc2))
    # Create a full dictionary of all possible ngrams.
    vec_dict = list(set(_vec1).union(_vec2))
    print 'Vector Dict:', vec_dict, '\n'
    # Now vectorize the documents
    vec1 = [1 if ng in _vec1 else 0 for ng in vec_dict]
    vec2 = [1 if ng in _vec2 else 0 for ng in vec_dict]
    print 'Vectorzied:', vec1, vec2, '\n'
    print 'Similarity:', np.dot(vec1, vec2), '\n'
    
    [out]:

    Vector Dict: [('still', 'a', 'confusing'), ('confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('is', 'still', 'a'), ('over', 'String', 'theory'), ('a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity', 'is', 'still'), ('still', 'wins', 'over'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still')] 
    
    Vectorzied: [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0] [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1] 
    
    Similarity: 0 
    
    Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')] 
    
    Vectorzied: [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0] [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1] 
    
    Similarity: 1 
    
    现在,让我们尝试通过可能的NGRAM数量进行规范化:

    import numpy as np
    from nltk import ngrams
    
    def everygrams(sequence):
        """
        This function returns all possible ngrams for n 
        ranging from 1 to len(sequence).
        >>> list(everygrams('a b c'.split()))
        [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
        """
        for n in range(1, len(sequence)+1):
            for ng in ngrams(sequence, n):
                yield ng
    
    doc1 = "Singularity is still a confusing phenomenon in physics".split()
    doc2 = "Quantum theory still wins over String theory".split()
    _vec1 = list(everygrams(doc1))
    _vec2 = list(everygrams(doc2))
    # Create a full dictionary of all possible ngrams.
    vec_dict = list(set(_vec1).union(_vec2))
    print 'Vector Dict:', vec_dict, '\n'
    # Now vectorize the documents
    vec1 = [1/float(len(_vec1)) if ng in _vec1 else 0 for ng in vec_dict]
    vec2 = [1/float(len(_vec2)) if ng in _vec2 else 0 for ng in vec_dict]
    print 'Vectorzied:', vec1, vec2, '\n'
    print 'Similarity:', np.dot(vec1, vec2), '\n'
    
    看起来好多了,出去吧:

    Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')] 
    
    Vectorzied: [0.027777777777777776, 0, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0] [0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571] 
    
    Similarity: 0.000992063492063 
    
    现在,让我们尝试计算ngram,而不是使用
    1/len(\u-vec)
    ,即
    \u-vec.count(ng)/len(\u-vec)

    毫不奇怪,由于计数均为1,因此相似性分数相同:

    Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')] 
    
    Vectorzied: [0.027777777777777776, 0, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0] [0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.07142857142857142, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571] 
    
    Similarity: 0.000992063492063 
    

    除了ngrams,你也可以试试skipgrams:

    把每个分数平方,然后取总数怎么样?或者,如果你想给更好的分数甚至更多的权重,那就使用更高的幂。除以4将标准化为[0..1]。或者,如果已标记黄金数据,则可以使用四个分数作为特征来训练分类器(例如SVM)。在预测时,1级的概率可以被视为综合分数。感谢@lenz的提示。我尝试使用exp(n-gramsize,对应的_分数)。但是按4下潜并不能使数据正常化。我还想过创建gold数据集并训练我自己的分类器。但就解决这个问题所需的时间和人力而言,这似乎是一种过火的行为。如果我能提高n克分数,我想那就足够了。你的想法?啊,我从你在alvas的回答中的评论中了解到,你想提高n-gram较高的分数,而不是较高的分数。你的最后一段不是很清楚。。。嗯,是的,那么用一个权重乘以每个分数是一件相当简单的事情,所有权重加起来等于1。谢谢你详细的回答。我刚刚从你的回答中学到了一些规范化的方法。用矢量长度来预测ngram的数量是聪明的。然而,我不明白ngram的长度是如何影响分数的。我希望ngram分数越长,对总分数的贡献就越大。例如,不同n克对th总分的贡献/影响可以是1克:10%,2克:20%,3克:30%,4克:40%。