Python: count the occurrences of specific words in a text file and print the 50 most frequent

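I want to count how often specific keywords (stored in a .txt file, one word per line) occur in a text file, and print the 50 most frequent ones. Here is what I did: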

from collections import Counter

with open("./Text_file.txt", "r", encoding='utf8') as logfile:
    word_counts = Counter(logfile.read().split())

with open("./key_words.txt", "r", encoding='utf8') as word:
    lines = word.readlines()
    for line in lines:
        count = [word_counts.get('line')]
lst = sorted (count)
print (lst[:50])

This returns the following, which doesn't tell me anything:

[20]

Any help?

Here is what you can do:

from collections import Counter

with open("./Text_file.txt", "r") as file,open("./key_words.txt", "r") as word:
    words1 = [w.strip() for w in file.read().split()] # Store words from text file into list
    words2 = [w.strip() for w in word.read().split()] # Store words from key file into list

s = [w1 for w1 in words1 if w1 in words2] # List all words from text file that are in key file

d = Counter(s) # Dictionary that maps each word in s to the number of times it occurs in s

lst = [w for k,w in sorted([(v,k) for k,v in d.items()],reverse=True)[:50]]

print(lst)
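
To make the sorting step above easier to follow, here is a minimal sketch of the same idea run on in-memory strings instead of files. The helper name `top_keywords` and the sample data are assumptions made for illustration; they are not part of the answer.

```python
from collections import Counter

def top_keywords(text, keywords, n=50):
    """Return up to n keywords, most frequent first (hypothetical helper)."""
    wanted = set(keywords.split())  # one keyword per line in the sample
    counts = Counter(w for w in text.split() if w in wanted)
    # Sort (count, word) pairs in descending order, then keep only the words.
    pairs = sorted(((v, k) for k, v in counts.items()), reverse=True)
    return [k for v, k in pairs[:n]]

# Sample data standing in for Text_file.txt and key_words.txt.
sample_text = "cat dog cat bird cat dog fish"
sample_keys = "dog\ncat\nhorse\n"

result = top_keywords(sample_text, sample_keys)
print(result)  # ['cat', 'dog']
```

Because the pairs are sorted as (count, word) tuples in reverse, words with equal counts come out in reverse alphabetical order, and the counts themselves are dropped from the result.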

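Here, with word_counts.get('line'), you look up the literal string 'line' on every iteration rather than the variable line, and count is overwritten each time, which is why the result list has only a single value. Below is your code, modified to give the top 50 keywords: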

from collections import Counter

with open("./Text_file.txt", "r", encoding='utf8') as logfile:
    word_counts = Counter(logfile.read().split())

wc = dict(word_counts)
kwc = {}    #keyword counter
with open("./key_words.txt", "r", encoding='utf8') as word:
    lines = word.readlines()
    for line in lines:
        line = line.strip() #assuming each word is in separate line, removes '\n' character from end of line
        if line in wc.keys():
            kwc.update({line:wc[line]}) # if keyword is found, adds that to kwc

lst = sorted (kwc, key = kwc.get, reverse = True)   #sorts in decreasing order on value of dict
print (lst[:50])
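
The difference between the two lookups can be reproduced without any files. The following is a small sketch with made-up sample data (the word list and keyword lines are assumptions); it contrasts the pattern from the question with the corrected lookup.

```python
from collections import Counter

# Hypothetical sample data standing in for the two files.
word_counts = Counter("cat dog cat line bird cat".split())
keyword_lines = ["cat\n", "dog\n", "fish\n"]

# Pattern from the question: looks up the literal string 'line'
# and replaces count on every iteration.
for line in keyword_lines:
    count = [word_counts.get('line')]
buggy = count

# Corrected pattern: strip the newline and look up the variable.
fixed = {}
for line in keyword_lines:
    key = line.strip()
    if key in word_counts:
        fixed[key] = word_counts[key]

print(buggy)  # [1]
print(fixed)  # {'cat': 3, 'dog': 1}
```

The buggy version reports how often the word "line" itself appears in the text, which explains a lone value such as the one shown in the question.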

One option:

from collections import Counter

# Read keywords
with open("./key_words.txt", "r", encoding='utf8') as keyfile:
  # Use set of keywords (@MisterMiyagi comment)
  keywords = set(keyfile.read().split('\n'))

# Process words
with open("./Text_file.txt", "r", encoding='utf8') as logfile:
  cnts = Counter()
  for line in logfile:
    if line:
      line = line.rstrip()
      # only count keywords
      cnts.update(word for word in line.split() if word in keywords)

# Use counter most_common to get most popular 50
print(cnts.most_common(50))
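
For readers unfamiliar with most_common, here is a short sketch of what it returns and how the result might be printed one keyword per line. The counts are invented for illustration.

```python
from collections import Counter

# Invented counts; in the answer above these come from the text file.
cnts = Counter({"error": 7, "warning": 3, "timeout": 2, "retry": 1})

# most_common(n) returns a list of (word, count) tuples, highest count first.
top = cnts.most_common(2)
print(top)  # [('error', 7), ('warning', 3)]

# Format the pairs as tab-separated rows.
rows = [f"{word}\t{n}" for word, n in top]
print("\n".join(rows))
```

If fewer than 50 distinct keywords are found, most_common(50) simply returns all of them.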

Alternative using Counter + regex
The regex is used to separate words from punctuation such as periods, quotation marks, commas, etc.

import re
from collections import Counter

with open("./key_words.txt", "r", encoding='utf8') as keyfile:
  keywords = keyfile.read().lower().split('\n')

with open("./Text_file.txt", "r", encoding='utf8') as logfile:
  cnts = Counter()
  for line in logfile:
    # use regex to separate words from punctuation
    # lowercase words
    words = map(lambda x:x.lower(), re.findall('[a-zA-Z]+', line, flags=re.A))
    cnts.update(word for word in words if word in keywords)

print(cnts.most_common(50))
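
To see why the regex matters, the sketch below compares plain whitespace splitting with the regex approach on a single made-up sentence. The sentence and the keyword set are assumptions chosen for illustration.

```python
import re
from collections import Counter

line = 'The cat sat. "Cat!" said the dog, twice.'
keywords = {"cat", "dog"}

# Whitespace splitting leaves punctuation attached: '"Cat!"' and 'dog,'
# do not match any keyword.
naive = Counter(w for w in line.split() if w in keywords)

# The regex keeps only runs of ASCII letters; lowercasing makes 'Cat' match.
words = [w.lower() for w in re.findall(r"[a-zA-Z]+", line)]
with_regex = Counter(w for w in words if w in keywords)

print(naive)       # Counter({'cat': 1})
print(with_regex)  # Counter({'cat': 2, 'dog': 1})
```

Note that the pattern `[a-zA-Z]+` only matches ASCII letters, so words containing accented characters, digits, or apostrophes would be split or skipped.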

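I modified your code. You were close, but a few things needed fixing:

First, you only stored a single count and never built up a list of words. I solved this by creating a new dictionary of words, containing only the keywords that were found.

Second, as others have said, you used the string literal 'line' rather than the variable line.

Third, you did not strip the newline from each line. When you use readlines(), the \n newline character is at the end of each line, so your words were not found in the Counter.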
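
The newline issue can be checked in isolation. This sketch uses io.StringIO as a stand-in for the keyword file (the contents are made up), and also shows that indexing a Counter with a missing key yields 0, whereas .get() yields None.

```python
import io
from collections import Counter

word_counts = Counter("cat dog cat cat".split())

# io.StringIO stands in for the keyword file (hypothetical contents).
keyfile = io.StringIO("cat\nfish\n")
raw_lines = keyfile.readlines()

# Lookups with the trailing newline never match a word in the Counter.
raw_hits = [word_counts[line] for line in raw_lines]
# After rstrip() the keyword 'cat' is found; 'fish' is absent, so 0.
clean_hits = [word_counts[line.rstrip()] for line in raw_lines]
# dict.get bypasses Counter's default of 0 and returns None instead.
missing_via_get = word_counts.get("fish")

print(raw_lines)        # ['cat\n', 'fish\n']
print(raw_hits)         # [0, 0]
print(clean_hits)       # [3, 0]
print(missing_via_get)  # None
```

Indexing with word_counts[key] therefore allows a simple "count > 0" check without guarding against None.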
Here is the code. It prints the keywords in descending order of count, top 50 only:

from collections import Counter

with open("./Text_file.txt", "r", encoding='utf8') as logfile:
    word_counts = Counter(logfile.read().split())

found_keywords = {}
with open("./key_words.txt", "r", encoding='utf8') as word:
    lines = word.readlines()
    for line in lines:
        line = line.rstrip()
        count = word_counts[line]
        if count > 0:
            found_keywords[line] = count

>>> print([(k, v) for k, v in sorted(found_keywords.items(), key=lambda item: item[1], reverse=True)][:50])
[('cat', 3), ('dog', 1)]

Note that the entire second block just queries the count of the word 'line' over and over.

Do: count.append(word_counts.get('line')), and initialize count to an empty list before the loop starts.

@Asocia That will also repeatedly look up the count of 'line'.

@MisterMiyagi Well, I just copy-pasted from the OP and didn't notice the quotes around 'line'. Yes, right, it should be line rather than 'line'.

You might want to look at collections.Counter, which is also used in the question. By the way, a set might be better suited than a list for words2.

Note that using a set for the keywords, rather than the implicit list, scales better for a large number of keywords.

@MisterMiyagi Ah, details, I always forget the details. You are right.
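
The set-versus-list remark in the comments can be illustrated as follows. This is only a sketch with made-up data: both containers give the same counts, but a membership test on a list scans its elements one by one, while a set uses hashing and takes roughly constant time on average.

```python
from collections import Counter

# Made-up data standing in for the text and keyword files.
text_words = "cat dog cat bird cat dog fish".split()
keyword_list = ["dog", "cat", "horse"]
keyword_set = set(keyword_list)

# Same filter, different container for the membership test.
from_list = Counter(w for w in text_words if w in keyword_list)
from_set = Counter(w for w in text_words if w in keyword_set)

print(from_list == from_set)    # True
print(from_set.most_common(50)) # [('cat', 3), ('dog', 2)]
```

For a handful of keywords the difference is negligible; it becomes noticeable when the keyword file holds many entries and the text is large.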