Python: unique word frequency across multiple files


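I am new to Python. I have a folder with about 2000 text files. I need to output each word together with the number of times it occurs, without counting repeats within a file. For example, the sentence "I am what I am" must count "I" only once for that file.

I can do this for a single file, but how do I do it for multiple files?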

from collections import Counter
import re

def openfile(filename):
    # Read-only access is enough; the with-block closes the file for us
    with open(filename, "r") as fh:
        return fh.read()

def removegarbage(text):
    # Replace one or more non-word (non-alphanumeric) chars with a space
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

def getwordbins(words):
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    return cnt

def main(filename, topwords):
    txt = openfile(filename)
    txt = removegarbage(txt)
    words = txt.split()  # split() without arguments drops empty strings
    bins = getwordbins(words)
    for key, value in bins.most_common(topwords):
        print(key, value)

main('speech.txt', 500)

See os.listdir(), which gives you a list of all the entries in a directory.
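As a rough illustration of that suggestion, the sketch below collects the text files in a folder with os.listdir(). The helper name list_text_files and the ".txt" suffix filter are assumptions made for this example, not part of the original answer.

```python
import os


def list_text_files(folder, suffix=".txt"):
    """Return the sorted paths of regular files in `folder` ending in `suffix`."""
    paths = []
    for name in os.listdir(folder):        # os.listdir() yields bare names
        path = os.path.join(folder, name)  # so join each one back onto the folder
        if os.path.isfile(path) and name.endswith(suffix):
            paths.append(path)
    return sorted(paths)
```

Each returned path could then be passed to a per-file function such as the question's main().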

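If I understand correctly, you need to count, for each word, the number of files that contain it. Here is what you can do.

For each file, get the set of words in that file (that is, each word appears only once). Then, for each word, count the number of sets in which it can be found.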

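Here is my suggestion:

Loop over all the files in the target directory, using a directory-listing function for that purpose.

For each file, make a set of the words found in it: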

with open(filepath, 'r') as f:
    txt = removegarbage(f.read())
    words = set(txt.split())

Now that you have a set of words for each file, you can finally use Counter on those sets. It is best to use its update method. Here is a small demo:

>>> from collections import Counter
>>> a = set("hello Python world hello".split())
>>> a
{'Python', 'world', 'hello'}
>>> b = set("foobar hello world".split())
>>> b
{'foobar', 'hello', 'world'}
>>> c = Counter()
>>> c.update(a)
>>> c.update(b)
>>> c
Counter({'world': 2, 'hello': 2, 'Python': 1, 'foobar': 1})
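Putting the steps above together, a minimal sketch might look like the following. The function names words_in_file and count_files_per_word are hypothetical, the cleanup mirrors the question's removegarbage(), and os.listdir() is just one assumed way to enumerate the folder.

```python
import os
import re
from collections import Counter


def words_in_file(filepath):
    """Return the set of distinct lowercase words found in one file."""
    with open(filepath, "r") as f:
        text = re.sub(r"\W+", " ", f.read()).lower()
    return set(text.split())


def count_files_per_word(folder):
    """Map each word to the number of files in `folder` that contain it."""
    counts = Counter()
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            # A set holds each word once, so a file adds at most 1 per word
            counts.update(words_in_file(path))
    return counts
```

Calling count_files_per_word('path/to/folder').most_common(500) would then give the 500 words that appear in the most files.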

So you could do something like this:

from __future__ import print_function
# Your code here
# ...
#

if __name__ == '__main__':
    import sys

    top = 500

    if len(sys.argv) < 2:
        print("Must supply a list of files to operate on", file=sys.stderr)
        ## For Python versions older than 2.6 use print >> sys.stderr, "..."
        sys.exit(1)

    for each in sys.argv[1:]:
        main(each, top)

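There are many other ways you could handle arguments, hard-coded defaults, and so on. I will leave it to your imagination how to change "top" from a hard-coded value into a command-line argument. For extra credit, use one of the option/argument parsing modules to make it a command-line switch with a default value.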
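One possible take on the "extra credit" is sketched below with the standard argparse module. The choice of argparse, the --top switch name, and the parse_args helper are assumptions for illustration; the answer does not prescribe them.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means argparse reads sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Print the most common words in each file.")
    parser.add_argument("files", nargs="+", help="text files to process")
    parser.add_argument("--top", type=int, default=500,
                        help="how many words to print per file (default: 500)")
    return parser.parse_args(argv)
```

In the script's main section, args = parse_args() followed by a loop calling main(each, args.top) for each entry in args.files would replace the hard-coded value.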

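Note that the if __name__ == '__main__': business is a Python convention that encourages you to separate functionality from actions, which promotes good programming practice. All of your functionality can be defined above the if __name__ == '__main__': line, and all the actions the script performs (using that functionality) can be invoked after it. This allows your file to be used as a module by other programs while still being usable as a program in its own right. (This is an almost uniquely Pythonic feature, although Ruby implements a similar set of semantics with slightly different syntax.)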
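A small sketch of that separation, written for Python 3, is shown here. The names count_words and main and the sample sentence are placeholders chosen for this example.

```python
from collections import Counter


def count_words(text):
    """Functionality: importable and reusable from other programs."""
    return Counter(text.lower().split())


def main():
    """Actions: what the file does when it is run as a script."""
    sample = "I am what I am"  # stand-in for real input such as a file's text
    for word, count in count_words(sample).most_common():
        print(word, count)


if __name__ == "__main__":
    # True only when the file is executed directly, not when it is imported
    main()
```

Another program can import the file and call count_words() without triggering main().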

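The from __future__ import print_function line makes the Python 3 style print() function available under Python 2.x. This is useful for any eventual transition to Python 3, and for discussion of the language, so that we can phase out the old "print" statement and promote use of the "print()" function. If you don't care about the details, don't worry about it. Just be aware that the differences are pervasive and mostly minor. The vast majority of examples you will see use the old print semantics, and going forward you will want to use the newer, slightly incompatible semantics.

(Note: in my original post I had the from __future__ import inside the __main__ section. That does not work; in general, Python __future__ imports must come before any other code. I basically just wanted to get the idea across without getting bogged down in the differences between Python 2.x and Python 3 print semantics.)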
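
You can get a list of the files by using the glob() or iglob() function. I noticed that you are not using the Counter object effectively. It would be better to just call its update() method and pass it the list of words. Here is a streamlined version of your code that processes all the *.txt files in a specified folder:
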
from collections import Counter
from glob import iglob
import re
import os

def remove_garbage(text):
    """Replace non-word (non-alphanumeric) chars in text with spaces,
       then convert and return a lowercase version of the result.
    """
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

topwords = 100
folderpath = 'path/to/directory'
counter = Counter()
for filepath in iglob(os.path.join(folderpath, '*.txt')):
    with open(filepath) as file:
        counter.update(remove_garbage(file.read()).split())

for word, count in counter.most_common(topwords):
    print('{}: {}'.format(count, word))
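
Note that passing the full word list to update(), as the code above does, counts every occurrence of a word. If each word should count only once per file, as the question asks, wrapping the list in set() before the update is one option. The sketch below contrasts the two on made-up file contents; the files list is purely illustrative.

```python
from collections import Counter

files = ["I am what I am", "I think"]  # hypothetical file contents

occurrences = Counter()  # every occurrence of a word is counted
per_file = Counter()     # a word is counted at most once per file
for text in files:
    words = text.lower().split()
    occurrences.update(words)
    per_file.update(set(words))
```

Here "i" is counted three times in occurrences but only twice in per_file, once for each file that contains it.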

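From your example, it seems the output would always be 1?

And it seems to me that your code outputs 2 rather than the 1 from your example.

Create a global variable for the count in the script, cnt = Counter(), and update it in the respective functions. Use words = set(words) to remove the duplicates.

The from __future__ import is inside the if.

Whoa! Um, yes! About that. I have moved it to where it should be.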