Can't get unique word/phrase counter to work - Python
I'm having a hard time getting anything written to the output file (word_count.txt).
I want the script to go through all 500 phrases in my phrases.txt document and output a list of all the words and how many times they occur.
from re import findall,sub
from os import listdir
from collections import Counter
# path to folder containg all the files
str_dir_folder = '../data'
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'
# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
if str_each_file.endswith('data'):
# open file and read
with open(str_dir_folder+str_each_file,'r') as file_r_data:
str_file_data = file_r_data.read()
# add data to list
list_file_data.append(str_file_data)
# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)
# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)
# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)
# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
str_word = str_each_item
int_freq = dict_word_count[str_each_item]
str_out_line = '"%s",%d' % (str_word,int_freq)
# populates output list
list_output_data.append(str_out_line)
# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
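Any help would be great (especially if I can actually get the 'single' words into my output list).
Thanks very much.

It would be very helpful to have more information, such as what you've tried and what kind of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got past those, there were a number of other logic errors to work through. I've made some assumptions:
- list_file_data is assigned to '../data/phrases.txt', but then there is a loop through all the files in a directory. Since nothing else in your code handles multiple files, I've removed that logic and referenced the file named in list_file_data (and added a little error checking). If you do want to go through a directory, I'd suggest using os.walk().
- You named your file 'phrases.txt' but then check whether the file name ends with 'data'. I've removed this logic.
- You placed the data set into a list, but findall works fine on a string and ignores the special characters you were removing by hand. Test it to make sure.
- Changed 'w+' to '\w+'. The pattern 'w+' only matches runs of the literal letter w, whereas '\w+' matches runs of word characters.
- There's no need to convert to a list outside the output loop: your dict_word_count is a Counter object, which has an 'iteritems' method (items() on Python 3) for stepping through each key and value. I also renamed the variable to 'counter_word_count' to be slightly more precise.
- Rather than building the CSV by hand, I imported csv and used the writerow method (and the quoting option).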
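To make the regex point in the list above concrete, here is a minimal sketch in Python 3 syntax. The sample string is made up for illustration; it is not from the original data.

```python
from re import findall

# Hypothetical sample text containing a tab and a newline.
sample = "wow, new words\twere\nwritten"

# 'w+' matches only runs of the literal letter "w".
literal_w = findall('w+', sample)

# r'\w+' matches runs of word characters (letters, digits, underscore),
# so tabs, newlines and punctuation simply act as separators.
words = findall(r'\w+', sample)

print(literal_w)
print(words)  # ['wow', 'new', 'words', 'were', 'written']
```

Since the tab and newline never end up inside a match, stripping them with sub() beforehand should not be necessary.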
import csv
import os
from collections import Counter
from re import findall

# name and location of output file
str_output_file = '../data/word_count.txt'
# the input file containing the phrases
list_file_data = '../data/phrases.txt'

if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))

with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()

# find all the words and put them into a list
list_all_words = findall(r'\w+', str_file_data)
# Counter mapping each word to the number of times it has been used
counter_word_count = Counter(list_all_words)

with open(str_output_file, 'w') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    # iteritems() is Python 2; on Python 3 use items() instead
    for key, value in counter_word_count.iteritems():
        output_row = [key, value]
        writer.writerow(output_row)
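If you do want to cover a whole directory, as the os.walk() suggestion hints, a rough Python 3 sketch might look like the following. The function names count_words_in_tree and write_counts, and the suffix parameter, are hypothetical helpers introduced here for illustration; they are not part of the original answer.

```python
import csv
import os
import re
from collections import Counter


def count_words_in_tree(root_dir, suffix='.txt'):
    """Tally words in every file under root_dir whose name ends with suffix."""
    tally = Counter()
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in file_names:
            if file_name.endswith(suffix):
                with open(os.path.join(dir_path, file_name), 'r') as handle:
                    # update() adds the new counts to the running tally
                    tally.update(re.findall(r'\w+', handle.read()))
    return tally


def write_counts(tally, output_path):
    """Write word/frequency rows as fully quoted CSV."""
    # newline='' is what the Python 3 csv documentation recommends for writers
    with open(output_path, 'w', newline='') as output_file:
        writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
        writer.writerow(['word', 'freq'])
        for word, freq in tally.items():
            writer.writerow([word, freq])
```

A call such as write_counts(count_words_in_tree('../data'), '../data/word_count.txt') would tie the two together. Note that with the default suffix the output file itself ends in .txt, so on a second run it would be counted too; writing the output elsewhere or choosing a different suffix avoids that.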
Something like this:
from collections import Counter
from glob import glob

def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()

# add up one Counter per line across every matching file
tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())

# most frequent words first
for k in sorted(tally, key=tally.get, reverse=True):
    print k, tally[k]  # Python 2 print statement; on Python 3 use print(k, tally[k])
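As a possible variation on the tally above, Counter also offers update() and most_common(), which avoid building a new Counter per line and sorting by hand. This is only a sketch in Python 3 syntax, and the sample lines are invented stand-ins for the file contents.

```python
from collections import Counter

# Hypothetical sample lines standing in for the contents of the data files.
lines = ["the cat sat", "the cat ran", "the end"]

tally = Counter()
for line in lines:
    # update() adds counts in place instead of summing many small Counters
    tally.update(line.strip().split())

# most_common(n) returns the n highest (word, count) pairs, largest first
for word, count in tally.most_common(2):
    print(word, count)
```

Calling most_common() with no argument returns every word in descending order of count, which matches what the sorted() loop does.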
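There are indentation problems in the code you pasted. Indent the lines inside the with statement and put them inside the loop.
Hey Simon, it looks like you may be new here. If you feel an answer solved the problem, please mark it as "accepted" by clicking the green check mark. This helps keep the focus on older SO questions that still don't have answers.
Thanks @robertrodkey, all done. Have a great weekend.
Thank you, Robert, that was a big help. The script works perfectly now.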