Reading words from a txt file - Python

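I have written a program that reads the words from a txt file, "elquijote.txt" in my case, and then uses a dictionary {key: value} to show each word that appears and the number of times it appears.

For example, for a file "test1.txt" containing the following text:

hello hello hello good bye bye

the output of my program is:

 hello 3
 good  1
 bye   2

Another option of the program is to show only the words that appear more times than a number we pass in as an argument.

If we run the command "python readingwords.py text.txt 2" in the shell, the program displays the words contained in the file "test1.txt" that appear more times than the number we entered, in this case 2.

Output: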

hello 3

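Now we can introduce a third argument: a file of common words, such as determiners and conjunctions, which are so generic that we do not want to show them or add them to the dictionary.

My code works correctly. The problem is that with large files such as "elquijote.txt" it takes a long time to finish.

I have been thinking that this is because I use auxiliary lists to eliminate the words.

As a solution, I have thought of never adding to my list the words that appear in the txt file passed as the third argument, which contains the words to discard.

Here is my code: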

import sys

def contar(aux):
  counts = {}
  for palabra in aux:
    palabra = palabra.lower()
    if palabra not in counts:
      counts[palabra] = 0
    counts[palabra] += 1
  return counts

def main():

  characters = '!?¿-.:;-,><=*»¡'
  aux = []
  counts = {}

  with open(sys.argv[1],'r') as f:
    aux = ''.join(c for c in f.read() if c not in characters)
    aux = aux.split()

  if (len(sys.argv)>3):
    with open(sys.argv[3], 'r') as f:
      remove = "".join(c for c in f.read())
      remove = remove.split()

    #Borrar del archivo  
    for word in aux:  
      if word in remove:
        aux.remove(word) 

  counts = contar(aux)

  for word, count in counts.items():
    if count > int(sys.argv[2]):
      print word, count

if __name__ == '__main__':
    main()

There are a few inefficiencies here. I have rewritten your code to take advantage of some optimizations. The reason for each change is given in the comments/docstrings:
# -*- coding: utf-8 -*-
import sys
from collections import Counter


def contar(aux):
    """Here I replaced your hand made solution with the
    built-in Counter which is quite a bit faster.
    There's no real reason to keep this function, I left it to keep your code
    interface intact.
    """
    return Counter(aux)

def replace_special_chars(string, chars, replace_char=" "):
    """Replaces a set of characters by another character, a space by default
    """
    for c in chars:
        string = string.replace(c, replace_char)
    return string

def main():
    characters = '!?¿-.:;-,><=*»¡'
    aux = []
    counts = {}

    with open(sys.argv[1], "r") as f:
        # You were calling lower() once for every `word`. Now we only
        # call it once for the whole file:
        contents = f.read().strip().lower()
        contents = replace_special_chars(contents, characters)
        aux = contents.split()

    #Borrar del archivo
    if len(sys.argv) > 3:
        with open(sys.argv[3], "r") as f:
            # What you had here was very inefficient:
            # remove = "".join(c for c in f.read())
            # That walks over every character and then joins the characters
            # back into a string, which is identical to plain f.read():
            # "".join(c for c in f.read()) == f.read()
            # ignore_words is a `set` to allow very fast membership checks.
            ignore_words = set(f.read().strip().split())
            aux = (word for word in aux if word not in ignore_words)

    counts = contar(aux)

    for word, count in counts.items():
        if count > int(sys.argv[2]):
            print word, count


if __name__ == '__main__':
    main()
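
For readers on Python 3 (an assumption, since the code above uses the Python 2 `print` statement), the same idea can be sketched as a small function that works on strings instead of `sys.argv`. The name `count_words` and its parameters are hypothetical and chosen only for illustration. It shows the two changes that matter most: a `set` for the ignore list and `Counter` for the tally.

```python
from collections import Counter

# Same punctuation set as in the question.
CHARACTERS = '!?¿-.:;-,><=*»¡'


def count_words(text, ignore_words=(), characters=CHARACTERS):
    """Count the words in `text`, skipping anything in `ignore_words`."""
    text = text.lower()
    for c in characters:
        # Replace each unwanted character with a space.
        text = text.replace(c, " ")
    # A set gives constant-time membership tests on average,
    # whereas `word in some_list` scans the whole list each time.
    ignore = set(ignore_words)
    return Counter(word for word in text.split() if word not in ignore)


if __name__ == "__main__":
    counts = count_words("Hello hello, hello! good bye bye",
                         ignore_words=["good"])
    for word, count in counts.items():
        if count > 1:
            print(word, count)
```

Because unwanted words are filtered while the text is being counted, no list is ever modified during iteration, which is where the original loop spent most of its time.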

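Here are some optimizations:

Use collections.Counter() to count the items in contar()
Use string.translate() to remove the unwanted characters
Pop the items in the ignore-words list from the counts after counting, instead of removing them from the original text

This speeds things up a bit, but not by an order of magnitude.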
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import os
import collections  

def contar(aux):
    return collections.Counter(aux)

def main():

  characters = '!?¿-.:;-,><=*»¡'
  aux = []
  counts = {}

  with open(sys.argv[1],'r') as f:
    text = f.read().lower().translate(None, characters)
    aux = text.split()

  if (len(sys.argv)>3):
    with open(sys.argv[3], 'r') as f:
      remove = set(f.read().strip().split())
  else:
    remove = []

  counts = contar(aux)
  for r in remove:
    counts.pop(r, None)

  for word, count in counts.items():
    if count > int(sys.argv[2]):
      print word, count

if __name__ == '__main__':
    main()
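
The two-argument call `translate(None, characters)` above only exists on Python 2 byte strings. A minimal sketch of the same approach for Python 3 (an assumption about the reader's interpreter) builds a deletion table with `str.maketrans`. The function name `count_then_drop` is hypothetical.

```python
from collections import Counter

CHARACTERS = '!?¿-.:;-,><=*»¡'
# The third argument of str.maketrans lists the characters to delete.
DELETE_TABLE = str.maketrans('', '', CHARACTERS)


def count_then_drop(text, remove=()):
    """Count every word first, then discard the unwanted ones."""
    counts = Counter(text.lower().translate(DELETE_TABLE).split())
    for word in remove:
        # pop() with a default does nothing if the word never appeared.
        counts.pop(word, None)
    return counts


if __name__ == "__main__":
    print(count_then_drop("Hello, hello! good bye; bye", remove={"good"}))
```

Dropping words after counting means the cleanup loop runs once per ignored word rather than once per word in the book.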

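Some changes and reasoning:

Parse the command-line arguments under if __name__ == '__main__': doing so forces the code to be modular, because the command-line arguments are only read when you run this script itself, not when you import its functions from another script.

Use a regular expression to filter out words with unwanted characters: with a regular expression you can state either the characters you want or the ones you do not want, whichever is shorter. In this case, hard-coding every unwanted special character is quite tedious compared with declaring the wanted characters in a simple pattern. In the script below I use a pattern that accepts only letters and digits to filter out the words that are not alphanumeric.

Ask forgiveness, not permission: since the minimum-count command-line argument is optional, it will not always be present. We can therefore use a try/except block to try to set the minimum count from sys.argv[2], and catch the IndexError exception to default the minimum count to 0.

Python script: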
# sys
import sys
# regex
import re

def main(text_file, min_count):
    word_count = {}

    with open(text_file, 'r') as f:
        # Split on any whitespace (spaces, tabs, line breaks)
        # to get the list of words
        words = f.read().split()

        # Keep only the words made up entirely of ASCII letters
        # and digits
        pattern = re.compile(r'^[a-zA-Z0-9]+$')
        words = filter(pattern.search, words)

        # Iterate through words and collect
        # count
        for word in words:
            if word in word_count:
                word_count[word] = word_count[word] + 1
            else:
                word_count[word] = 1

    # Iterate for output
    for word, count in word_count.items():
        if count > min_count:
            print('%s %s' % (word, count))

if __name__ == '__main__':
    # Get text file name
    text_file = sys.argv[1]

    # Attempt to get minimum count
    # from command line.
    # Default to 0
    try:
        min_count = int(sys.argv[2])
    except IndexError:
        min_count = 0

    main(text_file, min_count)

Text file:
hello hello hello good bye goodbye !bye bye¶ b?e goodbye

Command:
python script.py text.txt

Output:
bye 1
good 1
hello 3
goodbye 2

With the minimum-count argument:
python script.py text.txt 2

Output:
hello 3
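
Note that the script above discards a token such as "!bye" entirely instead of counting it as "bye". Whether that is desirable depends on the data. The sketch below, which assumes Python 3 and uses hypothetical function names, contrasts that filtering behaviour with a different technique: extracting every alphanumeric run with `re.findall`. The ASCII-only character class is also an assumption. For accented text such as Spanish, `\w+` may be a better fit.

```python
import re
from collections import Counter

# Runs of ASCII letters and digits (an assumption about what a "word" is).
WORD_RE = re.compile(r'[a-zA-Z0-9]+')


def filter_tokens(text):
    """Keep only whitespace-separated tokens that are fully alphanumeric."""
    return Counter(t for t in text.split() if WORD_RE.fullmatch(t))


def extract_words(text):
    """Pull every alphanumeric run out of the text, ignoring punctuation."""
    return Counter(WORD_RE.findall(text))


if __name__ == "__main__":
    sample = "hello hello hello good bye goodbye !bye bye¶ b?e goodbye"
    print(filter_tokens(sample))
    print(extract_words(sample))
```

With the sample text, the first function counts "bye" once, while the second counts it three times and also reports the fragments "b" and "e" from "b?e".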

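Looks like we had very similar ideas, but you beat me to it.

Indeed, but I am glad: you inadvertently introduced me to a new method, translate(). I am not sure I would use it here (it depends on the data: bad punctuation or missing spacing around punctuation would break it), but I am sure I can find a place for it. Cheers.

In my example I used translate quite naively, to keep it simple, but you could create a translation table that swaps the listed characters for a space instead of deleting them, if that is the desired behaviour.

Why not do a normal word count with collections.Counter and then eliminate the unwanted words? Move the slow code to a loop over a smaller volume of data.

Do you have memory problems? "elquijote.txt" could be a very long file. If it is the whole book, it has 381,104 words, made up of 22,939 distinct words, and more than 2 million characters. Processing the book in batches should be a good idea.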
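
Two of the suggestions in these comments, a translation table that maps punctuation to spaces and processing the book in batches, can be combined. The following is a minimal sketch assuming Python 3. The name `count_lines` is hypothetical, and "batch" is interpreted here as one line at a time.

```python
import io
from collections import Counter

CHARACTERS = '!?¿-.:;-,><=*»¡'
# Map every listed character to a space instead of deleting it, so that
# "hello,world" becomes two words rather than "helloworld".
SPACE_TABLE = str.maketrans({c: ' ' for c in CHARACTERS})


def count_lines(lines, ignore_words=()):
    """Count words from an iterable of lines without loading them all."""
    counts = Counter()
    for line in lines:
        # Counter.update adds the new occurrences to the running totals.
        counts.update(line.lower().translate(SPACE_TABLE).split())
    for word in ignore_words:
        counts.pop(word, None)
    return counts


if __name__ == "__main__":
    # An open file object can be passed in the same way as this StringIO.
    sample = io.StringIO("Hello,hello\nbye;bye good\n")
    print(count_lines(sample, ignore_words={"good"}))
```

Iterating over a file object yields one line at a time, so memory use depends on the number of distinct words rather than on the size of the file.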