Can't get unique word/phrase counter to work - Python


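I'm having trouble getting anything written to the output file (word_count.txt).

I want the script to go through all 500 phrases in my phrases.txt document and output a list of all the words and how many times each one occurs.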

    from re import findall,sub
    from os import listdir
    from collections import Counter

    # path to folder containg all the files
    str_dir_folder = '../data'

    # name and location of output file
    str_output_file = '../data/word_count.txt'

    # the list where all the words will be placed
    list_file_data = '../data/phrases.txt'

    # loop through all the files in the directory
    for str_each_file in listdir(str_dir_folder):
        if str_each_file.endswith('data'):

    # open file and read
    with open(str_dir_folder+str_each_file,'r') as file_r_data:
        str_file_data = file_r_data.read()

    # add data to list
    list_file_data.append(str_file_data)

    # clean all the data so that we don't have all the nasty bits in it
    str_full_data = ' '.join(list_file_data)
    str_clean1 = sub('t','',str_full_data)
    str_clean_data = sub('n',' ',str_clean1)

    # find all the words and put them into a list
    list_all_words = findall('w+',str_clean_data)

    # dictionary with all the times a word has been used
    dict_word_count = Counter(list_all_words)

    # put data in a list, ready for output file
    list_output_data = []
    for str_each_item in dict_word_count:
        str_word = str_each_item
        int_freq = dict_word_count[str_each_item]

        str_out_line = '"%s",%d' % (str_word,int_freq)

        # populates output list
        list_output_data.append(str_out_line)

    # create output file, write data, close it
    file_w_output = open(str_output_file,'w')
    file_w_output.write('n'.join(list_output_data))
    file_w_output.close()
Any help would be great (especially if I could actually output "individual" words in the output list).

Many thanks.

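It would be really helpful to have more information, such as what you have tried and what kind of error messages you received. As kaveh noted in the comments, this code has some major indentation problems. Once I worked through those, there were a number of other logic errors to address. I made some assumptions: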

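  • list_file_data is assigned to '../data/phrases.txt', but then there is a loop through all the files in a directory. Since you have no handling for multiple files anywhere else, I removed that logic and referenced the file listed in list_file_data (and added a little error handling). If you do want to walk through a directory, I suggest using os.walk().
  • You named your file 'phrases.txt', but then check whether the file name ends with 'data'. I removed this logic.
  • You placed the data set into a list, when findall works just fine on a string and ignores the special characters you were removing by hand. Test it against your data to make sure.
  • Changed 'w+' to '\w+'. The pattern 'w+' matches only runs of the literal letter w, while '\w+' matches runs of word characters.
  • Converting to a list outside the output loop is not necessary. Your dict_word_count is a Counter object, which has an 'iteritems' method for walking through each key and value. I also changed the variable name to 'counter_word_count' to be slightly more precise.
  • Instead of generating the CSV by hand, I imported csv and used the writerow method (with the quoting option).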
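To illustrate the 'w+' versus '\w+' point from the list above, here is a minimal sketch. The sample string is made up for the illustration and is not taken from the asker's data.

```python
from re import findall

# Made-up sample text, used only for illustration.
sample = "wow, two  words\tand a\nnewline"

# 'w+' matches runs of the literal letter "w" and nothing else.
literal_w = findall('w+', sample)

# r'\w+' matches runs of word characters (letters, digits, underscore),
# so tabs, newlines, and punctuation are skipped without manual cleanup.
words = findall(r'\w+', sample)

print(literal_w)
print(words)
```

The raw-string prefix (r'...') keeps Python from interpreting the backslash before the regex engine sees it.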
The full code is below; I hope it helps:

    import csv
    import os

    from collections import Counter
    from re import findall


    # name and location of output file
    str_output_file = '../data/word_count.txt'
    # the file that holds all the phrases
    list_file_data = '../data/phrases.txt'

    # fail early with a clear message if the input file is missing
    if not os.path.exists(list_file_data):
        raise OSError('File {} does not exist.'.format(list_file_data))

    with open(list_file_data, 'r') as file_r_data:
        str_file_data = file_r_data.read()
        # find all the words and put them into a list
        list_all_words = findall(r'\w+', str_file_data)
        # Counter mapping each word to the number of times it was used
        counter_word_count = Counter(list_all_words)

        with open(str_output_file, 'w') as output_file:
            fieldnames = ['word', 'freq']
            writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
            writer.writerow(fieldnames)

            # iteritems() is Python 2; use items() on Python 3
            for key, value in counter_word_count.iteritems():
                output_row = [key, value]
                writer.writerow(output_row)
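The answer suggests os.walk() for the case where a whole directory should be processed. A rough sketch of that variant follows, written for Python 3 (so items() replaces iteritems()). The function names count_words_in_tree and write_counts and the suffix parameter are hypothetical, chosen only for this illustration.

```python
import csv
import os
from collections import Counter
from re import findall


def count_words_in_tree(root_dir, suffix='.txt'):
    """Count words in every file under root_dir whose name ends with suffix."""
    counts = Counter()
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in file_names:
            if not file_name.endswith(suffix):
                continue
            # os.path.join inserts the path separator that plain
            # string concatenation would leave out.
            with open(os.path.join(dir_path, file_name), 'r') as in_file:
                counts.update(findall(r'\w+', in_file.read()))
    return counts


def write_counts(counts, output_path):
    """Write word/frequency rows as fully quoted CSV."""
    # The csv module expects files opened with newline='' on Python 3.
    with open(output_path, 'w', newline='') as out_file:
        writer = csv.writer(out_file, quoting=csv.QUOTE_ALL)
        writer.writerow(['word', 'freq'])
        for word, freq in counts.items():
            writer.writerow([word, freq])
```

A call might look like write_counts(count_words_in_tree('../data'), '../word_count.csv'). Writing the output outside the walked directory, or with a different suffix, keeps a second run from counting the words in its own output file.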
Something like this:

    from collections import Counter
    from glob import glob

    def extract_words_from_line(s):
        # make this as complicated as you want for extracting words from a line
        return s.strip().split()

    # build one Counter per line and add them all together
    tally = sum(
        (Counter(extract_words_from_line(line))
            for infile in glob('../data/*.data')
                for line in open(infile)),
        Counter())

    # most frequent words first
    for k in sorted(tally, key=tally.get, reverse=True):
        # Python 2 print statement; use print(k, tally[k]) on Python 3
        print k, tally[k]
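For readers on Python 3, the same idea can be sketched as below. This is one possible rewrite rather than the answerer's code: the helper names tally_words and print_tally are hypothetical, and Counter.update with most_common stands in for the sum(...) and sorted(...) calls above.

```python
from collections import Counter
from glob import glob


def extract_words_from_line(line):
    # Same whitespace split as above; swap in something stricter if needed.
    return line.strip().split()


def tally_words(pattern):
    """Tally words across all files matching a glob pattern."""
    tally = Counter()
    for path in glob(pattern):
        # The with block closes each file once it has been read.
        with open(path) as in_file:
            for line in in_file:
                tally.update(extract_words_from_line(line))
    return tally


def print_tally(tally):
    # most_common() yields (word, count) pairs, highest count first.
    for word, count in tally.most_common():
        print(word, count)
```

Calling print_tally(tally_words('../data/*.data')) would mirror the original. Updating a single Counter in place avoids creating and adding a new Counter for every line.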

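There are indentation problems in the code you pasted. Indent the lines under the with statement so that they sit inside the loop.

Hey Simon, it looks like you might be new here. If you feel an answer solved the problem, please mark it as "accepted" by clicking the green check mark. This helps keep the focus on older SO questions that still have no answers.

Thanks @robertrodkey, all done. Have a nice weekend.

Thank you, Robert, that was a big help. The script is perfect now.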
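For anyone landing here with the same problem, the indentation advice in the first comment can be sketched roughly as follows. The function name read_matching_files is hypothetical, and the sketch assumes the goal of the question's loop was to collect the text of every matching file in a folder.

```python
import os


def read_matching_files(folder, suffix):
    """Return the contents of every file in folder whose name ends with suffix."""
    # Start from an empty list; a path string has no append() method.
    contents = []
    for file_name in os.listdir(folder):
        if file_name.endswith(suffix):
            # These lines are indented under the if, so they run once
            # per matching file rather than once after the loop.
            with open(os.path.join(folder, file_name), 'r') as in_file:
                contents.append(in_file.read())
    return contents
```

os.path.join is used here because concatenating '../data' and a file name directly would drop the path separator.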