Create a Python dictionary from a text file and retrieve the count of each word


I'm trying to create a dictionary of words from a text file, then count the instances of each word, and be able to search the dictionary for a word and get its count, but I'm stuck. The biggest trouble I'm having is making the words from the text file lowercase and removing their punctuation, because otherwise my counts will be off. Any suggestions?

import string

f = open("C:\Users\Mark\Desktop\jefferson.txt", "r")
wc = {}
words = f.read().split()
count = 0
i = 0
for line in f:
    count += len(line.split())
for w in words:
    if i < count:
        words[i].translate(None, string.punctuation).lower()
        i += 1
    else:
        i += 1
        print words
for w in words:
    if w not in wc:
        wc[w] = 1
    else:
        wc[w] += 1
print wc['states']
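As an aside, a minimal sketch of the behavior being asked for, written in Python 3 syntax (the sample sentence is made up for illustration), would be:

```python
import string

def word_counts(text):
    """Lowercase the text, strip punctuation from each word, and count."""
    counts = {}
    for word in text.lower().split():
        word = word.strip(string.punctuation)  # remove leading/trailing punctuation
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts

counts = word_counts("The states, the STATES. And the people!")
print(counts["states"])  # → 2
```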
A few points:

In Python, always use the following construct to read files:

 with open('ls;df', 'r') as f:
     # rest of the statements
If you use f.read().split(), it reads all the way to the end of the file. After that, you need to go back to the beginning:

f.seek(0)
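A self-contained demonstration of why the rewind is needed (Python 3 syntax; the temp file here is created just for the example):

```python
import os
import tempfile

# Write a small sample file to illustrate.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("one two\nthree four\n")

with open(path) as f:
    words = f.read().split()            # consumes the whole file
    first_pass = [line for line in f]   # file pointer is at EOF: nothing left
    f.seek(0)                           # rewind to the beginning
    second_pass = [line for line in f]  # now iteration works again

print(len(words), len(first_pass), len(second_pass))  # → 4 0 2
```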
Third, in the part where you do:

for w in words: 
    if i < count: 
        words[i].translate(None, string.punctuation).lower() 
        i += 1 
    else: 
        i += 1 
        print words
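the result of translate()/lower() is thrown away: strings are immutable, so these calls return a new string rather than modifying words[i] in place, and the cleaned value has to be assigned back. A sketch in Python 3 syntax (where translate takes a table built with str.maketrans instead of the Python 2 translate(None, ...) form):

```python
import string

words = ["Hello,", "World!"]
table = str.maketrans("", "", string.punctuation)  # table that deletes punctuation

for i in range(len(words)):
    words[i] = words[i].translate(table).lower()   # assign the result back

print(words)  # → ['hello', 'world']
```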

Finally, if you only want to count 'states' rather than build a full dictionary of items, consider using filter...

print len(filter( lambda m: m == 'states', words ))
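Note that this len(filter(...)) idiom is Python 2 only; in Python 3, filter() returns a lazy iterator, which has no len(). An equivalent there is to sum over a generator:

```python
words = "the states and the states united".split()

# Python 3: count matches with a generator expression instead of len(filter(...))
print(sum(1 for w in words if w == 'states'))  # → 2
```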
One final thing:

If the file is large, it isn't advisable to keep every word in memory at once. Consider updating the wc dictionary line by line. You could do something like:

for line in f: 
    words = line.split()
    # rest of your code
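Putting that together, a line-by-line version of the counting loop might look like this (a sketch in Python 3 syntax; the cleaning step uses the same strip-and-lowercase idea discussed above):

```python
import string

def count_words(lines):
    """Update a count dict one line at a time instead of reading the whole file."""
    wc = {}
    for line in lines:
        for word in line.split():
            word = word.strip(string.punctuation).lower()
            if word:
                wc[word] = wc.get(word, 0) + 1
    return wc

# Works on any iterable of lines, so an open file object can be passed directly.
wc = count_words(["We the people,\n", "of the united states.\n"])
print(wc["the"])  # → 2
```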

This sounds like a job for collections.Counter:

import collections

with open('gettysburg.txt') as f:
    c = collections.Counter(f.read().split())

print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)
Result:

$ python foo.py
'Four' appears 1 times
'the' appears 9 times
There are 267 total words
The 5 most common words are [('that', 10), ('the', 9), ('to', 8), ('we', 8), ('a', 7)]

Of course, this counts tokens like 'Liberty,' as words, punctuation and all. It also treats 'The' and 'the' as different words. And processing the file as a whole can be a problem with very large files.

Here is a version that ignores punctuation and case, and is more memory-efficient with large files:

import collections
import re

with open('gettysburg.txt') as f:
    c = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line))

print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)
Result:

$ python foo.py
'Four' appears 0 times
'the' appears 11 times
There are 271 total words
The 5 most common words are [('that', 13), ('the', 11), ('we', 10), ('to', 8), ('here', 8)]
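The pattern r'\b[^\W\d_]+\b' used above matches runs of letters only: [^\W\d_] means "a word character that is not a digit or underscore". A quick illustration (the sample line is made up):

```python
import re

line = "Four score and 7 years ago, our_fathers..."

# Digits are excluded, and underscore-joined tokens fail the \b boundaries,
# so only pure-letter words are returned.
print(re.findall(r'\b[^\W\d_]+\b', line))  # → ['Four', 'score', 'and', 'years', 'ago']
```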

What problems are you running into?
File_Name = 'file.txt'
counterDict={}

with open(File_Name,'r') as fh:
    for line in fh:
        # removing their punctuation
        words = line.replace('.','').replace('\'','').replace(',','').lower().split()
        for word in words:
            if word not in counterDict:
                counterDict[word] = 1
            else:
                counterDict[word] = counterDict[word] + 1

print('Count of the word > common< :: ',  counterDict.get('common',0))
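Chaining .replace() calls like this only covers the punctuation marks explicitly listed (. ' ,). A sketch of an alternative using str.maketrans with string.punctuation, which strips all ASCII punctuation in one pass:

```python
import string

table = str.maketrans("", "", string.punctuation)

def clean(line):
    # Remove every ASCII punctuation character, not just . ' and ,
    return line.translate(table).lower().split()

print(clean("Common-sense isn't so common; really?"))
# → ['commonsense', 'isnt', 'so', 'common', 'really']
```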