Python: how do I count word frequency across multiple files without counting repeats within a file?

python python-3.x

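I am trying to find the frequency of words across multiple files in a folder. If a word is found in a file, I need to increase its count by 1, no matter how often it occurs in that file. For example, if "all's well that ends well" is read in file 1, the count of "well" must go up by 1, not 2; if "she is not well" is then read in file 2, the count of "well" becomes 2.

I need to increment the counter without including duplicates, but my program does not take that into account, so please help.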

import os
import re
import sys
sys.stdout=open('f1.txt','w')
from collections import Counter
from glob import glob

def removegarbage(text):
    text=re.sub(r'\W+',' ',text)
    text=text.lower()
    sorted(text)  # no effect: sorted() returns a new list, which is discarded here
    return text

def removeduplicates(l):
    return list(set(l))


folderpath='d:/articles-words'
counter=Counter()


filepaths = glob(os.path.join(folderpath,'*.txt'))

num_files = len(filepaths)

# Add all words to counter
for filepath in filepaths:
    with open(filepath,'r') as filehandle:
        lines = filehandle.read()
        words = removegarbage(lines).split()
        cwords=removeduplicates(words)
        counter.update(cwords)

# Display most common
for word, count in counter.most_common():

    # Break out if the frequency is less than 0.1 * the number of files
    if count < 0.1*num_files:
        break
    print('{}  {}'.format(word,count))

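I have already tried sorting and removing duplicates, but it still doesn't work.

If I understand your question correctly, you basically want to know in how many files each word appears (regardless of whether the same word occurs several times in the same file). To do that, I wrote the pattern below, which simulates a list of files (I only care about the process, not the files themselves, so you will probably have to replace "files" with the actual list you want to process).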

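# Simulated input: each "file" is a list of lines (swap in real file objects as needed)
files = [
    ["all's well that ends well"],
    ["she is not well"],
]

d = {}
for i, f in enumerate(files, start=1):
    for line in f:
        for word in line.split():
            # Record that the word occurs in file i; repeats overwrite the same key
            d.setdefault(word, {})[i] = 1

# Number of distinct files each word appears in
# (.items() replaces the Python 2 only .iteritems())
d2 = {word: sum(occurrences.values()) for word, occurrences in d.items()}
print(d2)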

The result will give you something like:

{'ends': 1, 'that': 1, 'is': 1, 'well': 2, 'she': 1, 'not': 1, "all's": 1}
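One detail worth noting about the result above: "all's" survives as a single word because line.split() only splits on whitespace, whereas the question's removegarbage() first turns every run of non-word characters into a space. The sketch below compares the two on one line; tokens_keep_apostrophes is a hypothetical third option that is not part of the question or the answer, shown only as one possible compromise.

```python
import re

def tokens_whitespace(text):
    # Split on whitespace only, as line.split() does in the answer above.
    return text.lower().split()

def tokens_removegarbage(text):
    # Same steps as the question's removegarbage(): non-word runs become spaces.
    return re.sub(r'\W+', ' ', text).lower().split()

def tokens_keep_apostrophes(text):
    # Hypothetical alternative: drop punctuation but keep inner apostrophes.
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

line = "All's well that ends well!"
print(tokens_whitespace(line))        # ["all's", 'well', 'that', 'ends', 'well!']
print(tokens_removegarbage(line))     # ['all', 's', 'well', 'that', 'ends', 'well']
print(tokens_keep_apostrophes(line))  # ["all's", 'well', 'that', 'ends', 'well']
```

Whichever tokenizer is used, it has to be applied consistently to every file, otherwise "well" and "well!" end up as different keys.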

I would do it in a completely different way, but the key is to use a set.

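from collections import Counter

frequency = Counter()
with open("file", "r") as f:
    for line in f:
        # set() of the split line, so a word repeated on one line counts once
        for word in set(line.split()):
            frequency[word] += 1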

I'm not sure whether it would be better to use .readline() or something else; I usually use for loops because they are very simple.

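Edit: I see what you're doing wrong. You use .read() to read the entire contents of the file, run removegarbage() on it, and then .split() the result. That gives you a single flat list and destroys the newlines: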

>>> "Hello world!\nFoo bar!".split()
['Hello', 'world!', 'Foo', 'bar!']
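Whether losing the newlines matters depends on the unit being counted. A minimal sketch, using a made-up two-line string rather than the asker's data, shows how de-duplicating per file and de-duplicating per line give different totals:

```python
from collections import Counter

text = "well well\nwell done"

# Per-file counting: one set for the whole text, so "well" counts once.
per_file = Counter(set(text.split()))

# Per-line counting: one set per line, so "well" counts once for each line it is on.
per_line = Counter()
for line in text.splitlines():
    per_line.update(set(line.split()))

print(per_file["well"], per_line["well"])  # 1 2
```

If the goal is "at most 1 per file", splitting the whole text and taking one set is enough; keeping the line structure only matters for "at most 1 per line".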

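I think your friend got there before you.

I would build a set of words for each file. When you reach EOF, update the counter dictionary by incrementing it once for every entry in the set, then start a new set for the next file.

What OS are you using?

Windows 7, and I'm using Python 3.3.

When you say "it's not working", what do you mean? What happened? What did you expect? Can you give some sample input (e.g., two short files) and the clearly incorrect output?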
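The per-file set suggested in the comments could look roughly like the sketch below. The function name document_frequency, the file names, and the UTF-8 encoding are assumptions made for illustration; the two sample sentences come from the question.

```python
import tempfile
from collections import Counter
from pathlib import Path

def document_frequency(folder, pattern="*.txt"):
    """Count, for each word, how many files in `folder` contain it."""
    counter = Counter()
    num_files = 0
    for path in sorted(Path(folder).glob(pattern)):
        words_in_file = set()  # fresh set for every file
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                words_in_file.update(line.lower().split())
        counter.update(words_in_file)  # each word adds 1 per file
        num_files += 1
    return counter, num_files

# Demo with two throwaway files holding the sentences from the question
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "file1.txt").write_text("all's well that ends well\n", encoding="utf-8")
    Path(folder, "file2.txt").write_text("she is not well\n", encoding="utf-8")
    counts, num_files = document_frequency(folder)

print(counts["well"], counts["ends"], num_files)  # 2 1 2
```

Passing a set to Counter.update() is what keeps the increment at 1 per file; passing the raw word list instead would count every repeat.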