Python project on a TXT file: how to read it, count words and lines, and sort

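I have a .txt file. I need to remove all characters that are not letters from the text, then print how many lines, words and characters it contains. Then I need to count how many times each word appears in the file and put that information into a dictionary. Finally, I need to print the 3 most frequent words and the 1000 most frequent words.

I have written this code, but it doesn't work. What is the problem?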

def word_count(path):
    raws = 0
    file = open(path, 'r')
    while file.readline():
        raws += 1
    print ('There are', raws, 'raws in the TXT file')

    file = open(path, 'r')
    nchars = 0
    nwords = 0
    words = file.read().strip()
    words = words.translate(str.maketrans('', '', string.punctuation))
    for char in '#$%^&*-.:/()@\n1234567890;_':
        words = words.replace(char, ' ')
    words = words.lower()
    word_list = words.split()
    for word in word_list:
        nwords += 1
    for char in words:
        nchars += 1
    print ('There are', nwords, 'words in the TXT file')
    print ('There are', nchars, 'characters in the TXT file')


def word_frequency(path):
    dictionary = {}

    file = open(path, 'r')
    data = file.read()
    data = data.translate(str.maketrans('', '', string.punctuation))
    data = data.lower().split()

    for word in data:
        if not word[0] in '1234567890':
            if word in dictionary:
                dictionary[word] += 1
            else:
                dictionary[word] = 1


def most_appear_words(dictionary):
    new_d = collections._OrderedDictValuesView

     # new_d = sorted(dictionary)

    print 'The three most apppear word in the TXT file are:'
    for key in new_d:
        print (key, new_d[key])

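To be honest, your code has several problems. You call the built-in open three times, which means your code reads the whole file three times when once would be enough. And whenever you do file.read(), you try to read the entire file into memory. That works well for small files, but it results in a MemoryError if the file is too large to fit into memory.
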
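The single-pass idea can be sketched as follows. This is only an illustration, not the asker's code: the helper name count_lines_words_chars is made up, and it assumes that "characters" means the letters that remain in the words after punctuation and digits are stripped.

```python
import string

# Translation table that deletes punctuation and digits.
_STRIP = str.maketrans('', '', string.punctuation + string.digits)

def count_lines_words_chars(path):
    """Hypothetical helper: one pass over the file, one line in memory at a time."""
    n_lines = n_words = n_chars = 0
    with open(path) as in_file:       # the context manager closes the file
        for line in in_file:          # file objects yield lines lazily
            n_lines += 1
            words = line.translate(_STRIP).lower().split()
            n_words += len(words)
            n_chars += sum(len(w) for w in words)
    return n_lines, n_words, n_chars
```

Because the file object is iterated line by line, memory use depends on the longest line rather than on the size of the file.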
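Your functions do too much. They

open a file
parse the file's contents
print the computed statistics

As a general piece of advice, functions and objects should each have a single responsibility.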
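Currently, your code does not work at all, because in your function most_appear_words you call the print function without parentheses. In addition, you should never import anything whose name starts with an underscore, such as collections._OrderedDictValuesView. The underscore indicates that this view is meant for internal use only. You probably want collections.Counter here.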
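As a small illustration of what collections.Counter offers here, the sketch below counts a made-up word list (the sample data is an assumption, not taken from the asker's file) and asks for the most frequent entries with most_common.

```python
from collections import Counter

# Made-up sample data standing in for the words read from the file.
words = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'bird']

counts = Counter(words)            # maps each distinct word to its count
top_two = counts.most_common(2)    # (word, count) pairs, highest count first
print(top_two)                     # [('the', 3), ('and', 2)]

# A plain dict of counts, like the one word_frequency builds, can be wrapped too.
from_dict = Counter({'the': 3, 'and': 2, 'cat': 1})
print(from_dict.most_common(1))    # [('the', 3)]
```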
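You do not provide a minimal, reproducible example, so it is unclear how you actually call the functions in your code sample. However, word_frequency seems to be missing a return statement. To make your code work as it is, you would have to do something like the following:
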
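import collections

def word_frequency(path):
    dictionary = {}
    # ...
    return dictionary

def most_appear_words(dictionary):
    new_d = collections.Counter(dictionary)
    # ...

if __name__ == '__main__':
    # ...
    # feed the return value of word_frequency to most_appear_words:
    d = word_frequency(your_path)
    most_appear_words(d)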

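I hope this helps you get your code working.

Note, however, that I would suggest a different approach:

Have one function responsible for opening and processing the file (word_iterator).
Have one function responsible for the statistics, i.e. counting the words and letters (word_count).
Have one function that prints the results to the console (print_statistics).

My suggested solution for the task would be:
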
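from collections import Counter
import string

def word_iterator(fp):
    # translation table that deletes punctuation and digits
    t = str.maketrans('', '', string.punctuation + string.digits)
    word_no = 0
    with open(fp) as in_file:
        for line_no, line in enumerate(in_file, start=1):
            line = line.translate(t)
            words = line.split()
            for w in words:
                word_no += 1
                yield line_no, word_no, w.lower()

def word_count(word_iter):
    words = Counter()
    line_no = 0
    word_no = 0
    n_chars = 0
    for line_no, word_no, word in word_iter:
        n_chars += len(word)
        words.update([word])
    result = {
        'n_lines': line_no,
        'n_words': word_no,
        'n_chars': n_chars,
        'words': words
    }
    return result

def print_statistics(wc, top_n1=3, top_n2=None):
    print(' Word Count '.center(20, '='))
    # fn is the module-level file name set in the __main__ block below
    print(f'File {fn} consists of')
    print(f' {wc["n_lines"]:5} lines')
    print(f' {wc["n_words"]:5} words')
    print(f' {wc["n_chars"]:5} characters')
    print()
    print(' Word Frequency '.center(20, '='))
    print(f'The {top_n1} most frequent words are:')
    for word, count in wc['words'].most_common(top_n1):
        print(f' {word} ({count} times)')
    if top_n2:
        print()
        print(f'The {top_n2} most frequent words are:')
        top_words = [w for w, _ in wc['words'].most_common(top_n2)]
        print(', '.join(top_words))

if __name__ == '__main__':
    fn = 'text_file.txt'
    stat = word_count(word_iterator(fn))
    print_statistics(stat, top_n1=3, top_n2=1000)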

With the example output:
==== Word Count ====
File text_file.txt consists of
     7 lines
   104 words
   492 characters

== Word Frequency ==
The 3 most frequent words are:
 a (5 times)
 the (4 times)
 it (3 times)

The 1000 most frequent words are:
a, the, it, content, of, lorem, ipsum, and, is, that, will, by, readable, page, using, as, here, like, many, web, their, sometimes, long, established, fact, reader, be, distracted, when, looking, at, its, layout, point, has, moreorless, normal, distribution, letters, opposed, to, making, look, english, desktop, publishing, packages, editors, now, use, default, model, text, search, for, uncover, sites, still, in, infancy, various, versions, have, evolved, over, years, accident, on, purpose, injected, humour

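Are you calling word_count at the end? Can you show us the error output?

I have a problem with the word_frequency function and the most_appear_words function. I can't manage to put all the words into a list and sort the list so that I can print the 3 most common words. The code I have written so far doesn't work, and the third function doesn't work either.

Can you be more specific? Please provide a minimal example that reproduces the problem.

On another note, I would suggest using a context manager to handle the file object. A few things can be redone: for example, split() produces a list, and getting the length of a list is trivial (the same goes for strings). There is a built-in type called set, which is an unordered collection of unique elements. The built-in string type has a method called count, which counts the occurrences of a letter or substring. Your approach of storing each word's frequency in a dictionary is fine. The only thing left now (apart from refactoring how the file is opened) is to sort the dictionary by its values using sorted() together with a list comprehension.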
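The suggestions in the last comment can be sketched as below. The function name top_words and the sample text are made up for illustration; the sketch only assumes a plain dict that maps each word to its count, as in the question.

```python
def top_words(dictionary, n=3):
    """Hypothetical helper: return the n most frequent words of a count dict."""
    # sorted() orders the (word, count) pairs by count, largest first
    pairs = sorted(dictionary.items(), key=lambda item: item[1], reverse=True)
    # the list comprehension keeps only the words
    return [word for word, _ in pairs[:n]]

text = 'a b a c b a d'                 # made-up sample text
word_list = text.split()               # split() already returns a list

freq = {}
for w in word_list:
    freq[w] = freq.get(w, 0) + 1

print(len(word_list))                  # number of words: 7
print(len(set(word_list)))             # number of distinct words: 4
print(top_words(freq, 2))              # ['a', 'b']
```

With a real file, word_list would come from reading the file inside a with open(...) block, so that the file is closed automatically.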