Python: count the occurrences of specific words in a text file and print the 50 most frequent
I want to count the occurrences of specific keywords (stored in a .txt file, one word per line) in a text file and print the 50 that occur most often. Here is what I did:
from collections import Counter
with open("./Text_file.txt", "r", encoding='utf8') as logfile:
    word_counts = Counter(logfile.read().split())
with open("./key_words.txt", "r", encoding='utf8') as word:
    lines = word.readlines()
    for line in lines:
        count = [word_counts.get('line')]
        lst = sorted (count)
print (lst[:50])
This is what I get back, which doesn't mean anything:
[20]
Any help?
Here is what you can do:
from collections import Counter
with open("./Text_file.txt", "r") as file, open("./key_words.txt", "r") as word:
    words1 = [w.strip() for w in file.read().split()]  # Store words from the text file in a list
    words2 = [w.strip() for w in word.read().split()]  # Store words from the key file in a list
s = [w1 for w1 in words1 if w1 in words2]  # List all words from the text file that are in the key file
d = Counter(s)  # Dictionary that maps each word in s to the number of times it occurs in s
lst = [w for k, w in sorted([(v, k) for k, v in d.items()], reverse=True)[:50]]
print(lst)
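As a possible simplification, Counter.most_common(n) returns the n highest counts directly, so the manual sort over (count, word) tuples may not be needed. A minimal sketch on made-up in-memory data (the sample words and keywords are assumptions, not taken from the question):

```python
from collections import Counter

# Hypothetical sample data standing in for the two files.
text_words = "cat dog cat bird cat dog fish".split()
key_words = ["cat", "dog", "fish"]

# Count only the words that are also keywords.
counts = Counter(w for w in text_words if w in key_words)

# most_common(n) returns (word, count) pairs, highest count first.
top = counts.most_common(2)
print(top)  # [('cat', 3), ('dog', 2)]

# Keep just the words, as the list comprehension above does.
top_words = [word for word, _ in top]
print(top_words)  # ['cat', 'dog']
```

One difference to be aware of: with equal counts, most_common keeps the order in which the words were first seen, whereas the reversed tuple sort breaks ties in reverse alphabetical order.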
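Here, with word_counts.get('line'), you are only looking up the literal string 'line' on every iteration, and count is overwritten each time, which is why the result list has only one value. Below is your code, modified to give the top 50 keywords: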
from collections import Counter
with open("./Text_file.txt", "r", encoding='utf8') as logfile:
    word_counts = Counter(logfile.read().split())
wc = dict(word_counts)
kwc = {}  # keyword counter
with open("./key_words.txt", "r", encoding='utf8') as word:
    lines = word.readlines()
    for line in lines:
        line = line.strip()  # assuming each word is on a separate line, removes the '\n' character from the end of the line
        if line in wc.keys():
            kwc.update({line: wc[line]})  # if the keyword is found, add it to kwc
lst = sorted(kwc, key=kwc.get, reverse=True)  # sorts in decreasing order of the dict's values
print(lst[:50])
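To make the difference between the quoted and unquoted lookup concrete, here is a small sketch on made-up data (the sample sentence and keyword list are assumptions for illustration only):

```python
from collections import Counter

# Hypothetical text and keywords, not from the question.
word_counts = Counter("the line is a line the cat".split())
keywords = ["cat", "the"]

# Quoted: every iteration looks up the literal word "line".
literal_lookups = [word_counts.get('line') for line in keywords]

# Unquoted: each iteration looks up the current keyword.
variable_lookups = [word_counts.get(line) for line in keywords]

print(literal_lookups)   # [2, 2]
print(variable_lookups)  # [1, 2]
```

The quoted version returns the count of the word "line" no matter which keyword the loop is on, which matches the single repeated value seen in the question.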
One option:
from collections import Counter
# Read keywords
with open("./key_words.txt", "r", encoding='utf8') as keyfile:
    # Use set of keywords (@MisterMiyagi comment)
    keywords = set(keyfile.read().split('\n'))
# Process words
with open("./Text_file.txt", "r", encoding='utf8') as logfile:
    cnts = Counter()
    for line in logfile:
        if line:
            line = line.rstrip()
            # only count keywords
            cnts.update(word for word in line.split() if word in keywords)
# Use counter most_common to get most popular 50
print(cnts.most_common(50))
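To try this approach without creating files, the same logic can be wrapped in a function that takes file-like objects. This is only a sketch: top_keywords is a hypothetical helper name, io.StringIO stands in for the two files, and the sample contents are made up.

```python
import io
from collections import Counter

def top_keywords(text_file, keyword_file, n=50):
    """Return the n most common keywords found in text_file (hypothetical helper)."""
    # splitlines() avoids the empty string that split('\n') leaves after a trailing newline.
    keywords = set(keyword_file.read().splitlines())
    counts = Counter()
    for line in text_file:
        # Only count words that are keywords.
        counts.update(word for word in line.split() if word in keywords)
    return counts.most_common(n)

# In-memory stand-ins for Text_file.txt and key_words.txt.
text = io.StringIO("the cat saw a dog\nthe cat ran\n")
keys = io.StringIO("cat\ndog\nbird\n")
result = top_keywords(text, keys)
print(result)  # [('cat', 2), ('dog', 1)]
```

Keywords that never occur in the text ("bird" here) simply do not appear in the result.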
An alternative using Counter + regex
The regex is used to separate words from punctuation such as periods, quotes, commas, etc.
import re
from collections import Counter
with open("./key_words.txt", "r", encoding='utf8') as keyfile:
    keywords = keyfile.read().lower().split('\n')
with open("./Text_file.txt", "r", encoding='utf8') as logfile:
    cnts = Counter()
    for line in logfile:
        # use regex to separate words from punctuation
        # lowercase words
        words = map(lambda x: x.lower(), re.findall('[a-zA-Z]+', line, flags=re.A))
        cnts.update(word for word in words if word in keywords)
print(cnts.most_common(50))
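To see what the regex changes compared with plain whitespace splitting, here is a short sketch on a made-up sentence (the sample line is an assumption for illustration):

```python
import re

# Hypothetical input line containing punctuation and mixed case.
line = 'The cat said: "Hello, dog!" Then the CAT left.'

# Plain whitespace splitting keeps punctuation attached to the words.
split_words = line.split()
print(split_words)
# ['The', 'cat', 'said:', '"Hello,', 'dog!"', 'Then', 'the', 'CAT', 'left.']

# The regex keeps only runs of ASCII letters; lower() folds case.
words = [w.lower() for w in re.findall(r'[a-zA-Z]+', line)]
print(words)
# ['the', 'cat', 'said', 'hello', 'dog', 'then', 'the', 'cat', 'left']
```

Note that this pattern also splits on apostrophes and digits, so a word like "don't" becomes "don" and "t". Whether that matters depends on the keywords being counted.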
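I modified your code. You were close, but a few things needed fixing:
- You only stored a single count and never built up a list of words. I solved this by building a new dict of word: count, but only for the keywords that were found.
- As others have said, you were using the string literal 'line' instead of the variable line.
- You weren't stripping the newline from each line. When you use readlines(), the \n newline character stays at the end of each line, so your words were not being found in the counter.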
from collections import Counter
with open("./Text_file.txt", "r", encoding='utf8') as logfile:
    word_counts = Counter(logfile.read().split())
found_keywords = {}
with open("./key_words.txt", "r", encoding='utf8') as word:
    lines = word.readlines()
    for line in lines:
        line = line.rstrip()
        count = word_counts[line]
        if count > 0:
            found_keywords[line] = count
>>> print([(k, v) for k, v in sorted(found_keywords.items(), key=lambda item: item[1], reverse=True)][:50])
[('cat', 3), ('dog', 1)]
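The newline point can be checked in isolation. The sketch below uses io.StringIO as a stand-in for the keyword file and made-up contents (both are assumptions for illustration):

```python
import io
from collections import Counter

# Hypothetical word counts for a tiny text.
word_counts = Counter("cat dog cat".split())

# io.StringIO stands in for the keyword file.
lines = io.StringIO("cat\ndog\nbird\n").readlines()
print(lines)  # ['cat\n', 'dog\n', 'bird\n']

# readlines() keeps the trailing newline, so the raw line never matches.
raw_count = word_counts[lines[0]]
stripped_count = word_counts[lines[0].rstrip()]
print(raw_count)       # 0
print(stripped_count)  # 2

# Indexing a Counter with a missing key returns 0 instead of raising KeyError.
missing_count = word_counts["bird"]
print(missing_count)   # 0
```

Because a missing key yields 0, the count > 0 check in the code above is enough to skip keywords that never occur in the text.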
Note that the entire second block only ever queries the count of the word "line", over and over.
Do: count.append(word_counts.get('line')), and initialize count to an empty list before the loop starts.
@Asocia That would also repeatedly look up the count of 'line'.
@MisterMiyagi Well, I just copy-pasted from the OP and didn't notice the quotes around 'line'. Yes, you're right, it should be line instead of 'line'.
You might want to look at collections.Counter, which is also used in the question. By the way, a set is probably a better fit than a list for words2.
Note that using a set for the keywords, instead of the implicit list, scales better for a large number of keywords.
@MisterMiyagi Ah, details, I always forget the details. You're right.
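The set suggestion can be sketched as follows. Both versions give the same counts; the difference is that a membership test on a list scans its elements one by one, while a set uses hashing, so each lookup takes roughly constant time on average. The sample data is made up for illustration.

```python
from collections import Counter

# Hypothetical keywords and text.
keyword_list = ["cat", "dog", "bird"]
keyword_set = set(keyword_list)
words = "cat dog cat fish".split()

# Same filter, different container for the membership test.
counts_list = Counter(w for w in words if w in keyword_list)
counts_set = Counter(w for w in words if w in keyword_set)

print(counts_list == counts_set)  # True
print(counts_set.most_common())   # [('cat', 2), ('dog', 1)]
```

For a handful of keywords the difference is negligible. It starts to matter when the keyword file has many entries and the text is large.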