Python: bigrams of words and their ranks

I use this code to get the frequencies of bigrams:

import nltk
from collections import defaultdict

text1 = 'the cat jumped over the dog in the dog house'
text = text1.split()

# Count how often each pair of adjacent words occurs.
counts = defaultdict(int)
for pair in nltk.bigrams(text):
    counts[pair] += 1

for pair, c in counts.items():
    print(pair, c)

The output is:

('the', 'cat') 1
('dog', 'in') 1
('cat', 'jumped') 1
('jumped', 'over') 1
('in', 'the') 1
('over', 'the') 1
('dog', 'house') 1
('the', 'dog') 2

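What I need is the list of bigrams, but with the rank of each word printed instead of the word itself. By "rank" I mean that the word with the highest frequency has rank 1, the second has rank 2, and so on. Here the ranks are: 1. the 2. dog, and words with the same frequency are assigned ranks in descending order. 3. cat 4. jumped 5. over etc.

For example, I want the two ranks and the count printed instead of: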

('the', 'cat') 1

I believe that to do this I need a dictionary of words and their ranks, but I am stuck and don't know how to continue. What I have so far is:

from nltk import FreqDist

fd = FreqDist()
ranks = []
rank = 0
for word in text:
    fd[word] += 1  # FreqDist.inc() was removed in NLTK 3
for rank, word in enumerate(fd):
    ranks.append(rank+1)

word_rank = {}
for word in text:
    word_rank[word] = ranks

print(ranks)

Assuming counts has already been created, the following should give you the result:

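from collections import defaultdict

# Count how often each individual word occurs.
freq = defaultdict(int)
for word in text:
    freq[word] += 1

# Sort by descending frequency; ties go to the word that appears first in text.
ranks = sorted(freq.keys(), key=lambda k: (-freq[k], text.index(k)))
# Map each word to its 1-based rank.
ranks = dict(zip(ranks, range(1, len(ranks)+1)))

for (a, b), count in counts.items():
    print(ranks[a], ranks[b], count)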

Here are some intermediate values that help explain how it works:

>>> dict(freq)
{'house': 1, 'jumped': 1, 'over': 1, 'dog': 2, 'cat': 1, 'in': 1, 'the': 3}
>>> sorted(freq.keys(), key=lambda k: (-freq[k], text.index(k)))
['the', 'dog', 'cat', 'jumped', 'over', 'in', 'house']
>>> dict(zip(ranks, range(1, len(ranks)+1)))
{'house': 7, 'jumped': 4, 'over': 5, 'dog': 2, 'cat': 3, 'in': 6, 'the': 1}
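
To make the tie-breaking rule easier to see, here is a minimal, self-contained sketch of the same idea. It is not the answer's exact code: it assumes Python 3, builds the bigrams with zip instead of nltk.bigrams, and the helper names rank_words and ranked_bigrams are made up for illustration.

```python
from collections import Counter


def rank_words(tokens):
    """Map each word to its rank, where rank 1 is the most frequent word.

    Ties are broken by first appearance in tokens, which mirrors the
    (-freq[k], text.index(k)) sort key used above.
    """
    freq = Counter(tokens)
    first_seen = {}
    for position, word in enumerate(tokens):
        first_seen.setdefault(word, position)  # keep only the first position
    ordered = sorted(freq, key=lambda w: (-freq[w], first_seen[w]))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def ranked_bigrams(tokens):
    """Return one (rank_a, rank_b, count) tuple per distinct bigram."""
    ranks = rank_words(tokens)
    counts = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs
    return [(ranks[a], ranks[b], n) for (a, b), n in counts.items()]


tokens = 'the cat jumped over the dog in the dog house'.split()
print(rank_words(tokens))
for row in ranked_bigrams(tokens):
    print(*row)

# Equal frequencies: the word that appears first gets the better rank.
print(rank_words('dog the dog the'.split()))
```

With the example sentence, cat gets rank 3 because the (frequency 3) and dog (frequency 2) come first; cat is then the earliest of the words that occur once. In the last call both words occur twice, so dog, which appears first, gets rank 1.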

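Why does ('the', 'cat') 1 become 1 3 1? Why is cat 3? Shouldn't it be 2? (cat is the second word in the text.)

When I say "rank", I mean that the word with the highest frequency has rank 1, the second has rank 2, and so on. Here the ranks are: 1. the 2. dog, and words with the same frequency are assigned ranks in descending order. 3. cat 4. jumped 5. over etc.

If you had "dog the dog the dog the dog the dog the dog", would dog be ranked before the? Because the first "dog" comes before the first "the".

Follow-up question: how can I store the resulting matrix in a file? Thanks a lot.

There are several questions above; if you are still stuck, feel free to ask another question.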

from collections import Counter

text1 = 'the cat jumped over the dog in the dog house'.split(' ')

# Position (1-based) at which each word first appears.
word_to_rank = {}
for i, word in enumerate(text1):
    if word not in word_to_rank:
        word_to_rank[word] = i+1

word_to_frequency = Counter(text1)

# Sort key per word: higher frequency first, then earlier first appearance.
word_to_tuple = {}
for word in word_to_rank:
    word_to_tuple[word] = (-word_to_frequency[word], word_to_rank[word])

# The keys are unique because first-appearance positions are unique.
tuple_to_word = dict(zip(word_to_tuple.values(), word_to_tuple.keys()))

sorted_by_conditions = sorted(tuple_to_word.keys())

word_to_true_rank = {}
for i, _tuple in enumerate(sorted_by_conditions):
    word_to_true_rank[tuple_to_word[_tuple]] = i+1

def fix(pair, c):
    # Replace both words of a bigram with their ranks; keep the count.
    return word_to_true_rank[pair[0]], word_to_true_rank[pair[1]], c

pair = ('the', 'cat')
c = 1
print(fix(pair, c))

pair = ('the', 'dog')
c = 2
print(fix(pair, c))


>>>
(1, 3, 1)
(1, 2, 2)