Python: Are the words generated by Text.similar() and ContextIndex.similar_words() in NLTK sorted by frequency?


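I am using these two functions to find similar words, and they return different lists. Do these functions sort their results from most frequent to least frequent association?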

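ContextIndex.similar_words() calculates the similarity score for each word as the sum of the products of the frequencies in each context. Text.similar() simply counts the number of unique contexts the words share.

similar_words() seems to contain a bug in NLTK 2.0. See its definition in the NLTK source: the returned word list should be sorted in descending order of similarity score. Replace the return statement with: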

return sorted(scores, key=scores.get)[::-1][:n]
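To see what the replacement return statement does, here is a minimal sketch on a made-up scores dictionary. The words and numbers are hypothetical, not NLTK output, and the snippet is written for Python 3, unlike the Python 2 code quoted from NLTK 2.0.

```python
# Hypothetical similarity scores; the values are invented for illustration.
scores = {"man": 417, "girl": 54, "boy": 27, "writer": 36}
n = 3

# Sort the keys by ascending score, reverse the list, then keep the top n.
top_n = sorted(scores, key=scores.get)[::-1][:n]
print(top_n)  # ['man', 'girl', 'writer']

# When all scores are distinct, sorting in descending order gives the same list.
top_n_reverse = sorted(scores, key=scores.get, reverse=True)[:n]
print(top_n_reverse)  # ['man', 'girl', 'writer']
```

The two forms can order tied scores differently, because reversing an ascending stable sort also reverses the order of ties.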

In similar(), the call to similar_words() is commented out, possibly because of this bug:

def similar(self, word, num=20):
    if '_word_context_index' not in self.__dict__:
        print 'Building word-context index...'
        self._word_context_index = ContextIndex(self.tokens,
                                                filter=lambda x:x.isalpha(),
                                                key=lambda s:s.lower())

#   words = self._word_context_index.similar_words(word, num)

    word = word.lower()
    wci = self._word_context_index._word_to_contexts
    if word in wci.conditions():
        contexts = set(wci[word])
        fd = FreqDist(w for w in wci.conditions() for c in wci[w]
                      if c in contexts and not w == word)
        words = fd.keys()[:num]
        print tokenwrap(words)
    else:
        print "No matches"

Note: in NLTK 2.0's FreqDist, unlike dict, keys() returns a list sorted by decreasing frequency.
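The code above relies on that ordering when it takes fd.keys()[:num]. As far as I know, FreqDist in NLTK 3 subclasses collections.Counter and keys() no longer guarantees frequency order there, so a rough Python 3 equivalent would use most_common(). The sketch below uses a plain Counter as a stand-in for FreqDist, with invented samples.

```python
from collections import Counter

# Stand-in for FreqDist (an assumption for this sketch): tally made-up samples.
fd = Counter(["man", "man", "man", "day", "day", "car"])

# most_common(num) returns (sample, count) pairs, highest count first.
num = 2
words = [w for w, _ in fd.most_common(num)]
print(words)  # ['man', 'day']
```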

Example:

import nltk

text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')

similar_words = text._word_context_index.similar_words('woman')
print ' '.join(similar_words)

Output:

man day time year car moment world family house boy child country
job state girl place war way case question   # Text.similar()

#man ('a', 'who') 9 39   # output from similar_words(); see following explanation
#girl ('a', 'who') 9 6
#[...]

man number time world fact end year state house way day use part
kind boy matter problem result girl group   # ContextIndex.similar_words()

fd, the frequency distribution in similar(), is a tally of the number of contexts each word shares with the target word:

fd = [('man', 52), ('day', 30), ('time', 30), ('year', 28), ('car', 24), ('moment', 24), ('world', 23) ...]

For each word in each context, similar_words() calculates the sum of the products of the frequencies:

man ('a', 'who') 9 39  # 'a man who' occurs 39 times in text;
                       # 'a woman who' occurs 9 times
                       # Similarity score for the context is the product:
                       #     score['man'] = 9 * 39
girl ('a', 'who') 9 6
writer ('a', 'who') 9 4
boy ('a', 'who') 9 3
child ('a', 'who') 9 2
dealer ('a', 'who') 9 2
...
man ('a', 'and') 6 11  # score += 6 * 11
...
man ('a', 'he') 4 6    # score += 4 * 6
...
[49 more occurrences of 'man']
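The difference between the two scores can be sketched on a toy index in Python 3. The dictionary word_to_contexts and both helper functions are hypothetical names for this illustration, not NLTK APIs. The counts for 'a ... who', 'a ... and' and 'a ... he' follow the trace above, the ('the', 'was') context is invented, and the real corpus contains many more contexts.

```python
from collections import Counter

# Hypothetical index: word -> Counter of (left, right) contexts.
word_to_contexts = {
    "woman": Counter({("a", "who"): 9, ("a", "and"): 6, ("a", "he"): 4}),
    "man": Counter({("a", "who"): 39, ("a", "and"): 11, ("a", "he"): 6}),
    "girl": Counter({("a", "who"): 6, ("the", "was"): 5}),
}

def shared_context_count(target, other):
    """Text.similar()-style score: number of distinct contexts both words occur in."""
    return len(set(word_to_contexts[target]) & set(word_to_contexts[other]))

def sum_of_products(target, other):
    """similar_words()-style score: sum over shared contexts of the frequency products."""
    t, o = word_to_contexts[target], word_to_contexts[other]
    return sum(t[c] * o[c] for c in t if c in o)

for w in ("man", "girl"):
    print(w, shared_context_count("woman", w), sum_of_products("woman", w))
# man 3 441   (9*39 + 6*11 + 4*6)
# girl 1 54   (9*6)
```

On this toy data both methods rank 'man' above 'girl', but the two scores weight contexts differently, which is why the full lists returned by the two functions can differ.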

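What is the difference between the similarity score and the value produced by distributional similarity?

@mac389 Please see my expanded answer.

Thanks for your work. At the risk of hijacking this question, what do you mean by "tally of contexts"? I understand that similar() is a sum and similar_words() is a product. A sum, and a product of sums, of what? I take it they both find all the words that share the same contexts, so why do the arithmetic this way? I am obviously missing something.

In this example, 'man' and 'woman' (the target word) share 52 distinct contexts. similar() gives 'man' a score of 52. similar_words() counts the number of instances of each context, e.g. 39 x 'a man who', 11 x 'a man and', etc., does the same for 'woman' (9 x 'a woman who', 6 x 'a woman and', etc.), and then calculates the similarity score as the sum of products: 39x9 + 11x6 + 6x4 + ...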