How can I omit the less frequent words from a dictionary in Python?
I have a dictionary of word counts. I want to remove the words whose count is 1 from it. How can I do that? I would also like to extract the bigram model of the words. How can I do that?
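import codecs

# Read the whole file as UTF-8 text.
file = codecs.open("Pezeshki339.txt", 'r', 'utf8')
txt = file.read()
txt = txt[1:]  # drop the first character
token = txt.split()

# Count how many times each word occurs.
count = {}
for word in token:
    if word not in count:
        count[word] = 1
    else:
        count[word] += 1

for k, v in count.items():
    print(k, v)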
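I was able to edit my code as shown below. One question remains: how do I create the bigram matrix and smooth it with the add-one method? I would appreciate any suggestions that fit my code.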
import nltk
from collections import Counter
import codecs

with codecs.open("Pezeshki339.txt", 'r', 'utf8') as file:
    for line in file:
        # note: token is overwritten on each pass, so after the loop
        # it holds only the words of the last line
        token = line.split()

# 80/20 split into training and test tokens
spl = 80 * len(token) / 100
train = token[:int(spl)]
test = token[int(spl):]
print(len(test))
print(len(train))

cn = Counter(train)
# removes the rare words and puts the rest in a list
known_words = [word for word, v in cn.items() if v > 1]
print(known_words)
print(len(known_words))

bigram = nltk.bigrams(known_words)
frequency = nltk.FreqDist(bigram)
for f in frequency:
    print(f, frequency[f])
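Use a Counter dict to count the words, then filter the items, dropping the keys whose value is 1:
from collections import Counter
import codecs

with codecs.open("Pezeshki339.txt", 'r', 'utf8') as f:
    # count every whitespace-separated word, line by line
    cn = Counter(word for line in f for word in line.split())

# keep only the words seen more than once
print(dict((word, v) for word, v in cn.items() if v > 1))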
If you just want the words, use a list comprehension:
print([word for word,v in cn.items() if v > 1 ])
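You don't need to call read; you can split each line as you go. If you also want to remove punctuation, you need to strip it from each word:
from string import punctuation
# f is the open file handle from the with block above
cn = Counter(word.strip(punctuation) for line in f for word in line.split())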
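To try the counting and filtering without the input file, here is a minimal sketch on an in-memory sample. The sample text and the names `sample` and `frequent` are made up for illustration. Note that `string.punctuation` only covers ASCII punctuation, so any non-ASCII marks in the real file would have to be added to the strip set by hand.

```python
from collections import Counter
from io import StringIO
from string import punctuation

# Hypothetical sample standing in for the real file.
sample = StringIO("the cat sat.\nthe dog sat, the end!\n")

# Count words line by line, stripping ASCII punctuation from both ends.
cn = Counter(word.strip(punctuation) for line in sample for word in line.split())

# Keep only the words seen more than once.
frequent = {word: v for word, v in cn.items() if v > 1}
print(frequent)
```

Here "sat." and "sat," are both counted as "sat", which is why stripping matters before filtering on the count.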
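Padraic's solution works very well. But here is a solution that can go right below your code, instead of rewriting it completely:
newdictionary = {}
for k, v in count.items():
    # copy over every word whose count is not 1
    if v != 1:
        newdictionary[k] = v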
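The same loop can also be written as a dict comprehension. A minimal sketch, with a made-up `count` dict standing in for the one built in the question:

```python
# Hypothetical counts; in the question this dict is built from the file.
count = {"a": 3, "b": 1, "c": 2}

# Keep every entry whose count is not 1.
newdictionary = {k: v for k, v in count.items() if v != 1}
print(newdictionary)
```

Either way, building a new dict sidesteps the `RuntimeError` that Python 3 raises when entries are deleted from `count` while iterating over it.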
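:-) Fastest gun in the West.
@AmiTavory, sometimes ;)
@marysd, I'll take a look in the a.m.; my brain has shut down for tonight.
Thanks, everyone, and thanks, Padraic. Yours is the best. It's the code I needed.
Padraic's are indeed usually the best. And the fastest.
@marysd, no problem, you're welcome.
Ami, could you write it out for me?
Actually, I have another question. I edited the question above. Can you help me solve it? Thanks in advance for your help.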
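On the follow-up about bigrams and add-one smoothing, which the thread leaves unanswered, here is a minimal sketch using only the standard library. It assumes the usual add-one (Laplace) estimate P(w2 | w1) = (count(w1, w2) + 1) / (N(w1) + V), where N(w1) is the number of bigrams that start with w1 and V is the vocabulary size. The sample tokens, the helper name `smoothed_prob`, and the choice to drop rare words before pairing are illustrative assumptions, not something taken from the thread.

```python
from collections import Counter

# Hypothetical token list; in the question it comes from the file.
tokens = "a b a b c a b".split()

# Drop words that occur only once, keeping the original word order.
cn = Counter(tokens)
kept = [w for w in tokens if cn[w] > 1]

# Consecutive pairs, the same pairs nltk.bigrams(kept) would yield.
pairs = list(zip(kept, kept[1:]))
bigram_counts = Counter(pairs)
# How often each word appears as the first element of a bigram.
first_counts = Counter(w1 for w1, _ in pairs)

vocab = sorted(set(kept))
V = len(vocab)


def smoothed_prob(w1, w2):
    """Add-one estimate of P(w2 | w1)."""
    return (bigram_counts[(w1, w2)] + 1) / (first_counts[w1] + V)


# Bigram "matrix" as nested dicts: matrix[w1][w2] = P(w2 | w1).
matrix = {w1: {w2: smoothed_prob(w1, w2) for w2 in vocab} for w1 in vocab}
for w1 in vocab:
    print(w1, matrix[w1])
```

The pairs are built from the token sequence rather than from a list of unique words, so they reflect which words actually follow each other. Dropping rare words does make their neighbours adjacent; replacing them with a placeholder token instead is a common alternative.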