Python: How do I find the frequencies of the ten most common words in a file?
I'm writing a function in Python that takes the name of a text file (as a string) as input. The function should first determine how many times each word appears in the file. Later, I'll make a bar chart showing the frequencies of the ten most common words in the file, with a second bar next to each one whose height is the frequency predicted by Zipf's law. I already have some of the graphing code, but I need help finding the most common words in a text file.
def zipf_graph(text_file):
    import string
    file = open(text_file, encoding = 'utf8')
    text = file.read()
    file.close()
    #the following strips and removes punctuation and makes the words lowercase
    punc = string.punctuation + '’”—⎬⎪“⎫'
    new_text = text
    for char in punc:
        new_text = new_text.replace(char,'')
    new_text = new_text.lower()
    text_split = new_text.split()
This is where I'm stuck. I'm trying to find the most common strings in the list, but I'm not sure where to go from here. This is what I tried:
words = text_split
most_common = max(words, key = words.count)
# print(most_common)
I'd also like to add the code below, since it was suggested that it would help:
# Sorting a list by frequency
# Assumes you have your elements as (word, frequency) tuples
# (Useful for the zipf function)
words = [('the', 1), ('and', 1), ('test',2)]
sorted(words, key = lambda x: x[1], reverse = True)
# "Sorting" a dictionary by frequency
# Assumes you have your elements as word:frequency
# (Useful for the zipf function)
words = dict()
words['the'] = 1
words['and'] = 1
words['test'] = 2
# This returns a list of just the most common words without their frequencies
most_common_words = sorted(words, key = words.get, reverse = True)
# print(most_common_words)
# We can go back to the dictionary to get the frequencies
for word in most_common_words:
    print(word, words[word])
zipf_graph('fortune.txt') #name of the file I chose to use
You can use the nltk library:
import nltk
words = ['words', 'in', 'the', 'file']
fd = nltk.FreqDist(words)
fd.most_common(10)
This gives the values in the following format (the order of words with equal counts may differ):
[('file', 1), ('words', 1), ('in', 1), ('the', 1)]
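If installing nltk is not an option, the same kind of result can be built with a plain dictionary plus the sorting hint from the question. The following is only a minimal sketch; count_words and top_n are illustrative names, not part of any library.

```python
def count_words(words):
    """Map each word to the number of times it appears in `words`."""
    counts = {}
    for word in words:
        # dict.get returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts


def top_n(counts, n=10):
    """Return the n (word, frequency) pairs with the highest frequency."""
    # sorted() is stable, so words with equal counts keep first-seen order
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:n]


words = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'bird']
print(top_n(count_words(words), 3))
```

Unlike max(words, key=words.count), which rescans the whole list for every word and returns only a single word, this passes over the list once and keeps every count.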
I suggest you use Counter from the collections module:
from collections import Counter
text_split = ["a", "b", "c", "a", "c", "d", "a", "d", "b"]
words_and_frequencies = Counter(text_split)
top = words_and_frequencies.most_common(2)
print(top)
Conveniently, this returns the result in the desired format:
[('a', 3), ('b', 2)]
+1, and I've put it to use.
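Putting the Counter suggestion together with the cleanup step from the question, a rough sketch of the counting part might look like the code below. The function name top_words_with_zipf is made up for illustration, and the Zipf prediction assumes the simplest form of the law, where the word at rank r is expected to occur about (count of the top word) / r times. Adjust that if your course defines it differently.

```python
import string
from collections import Counter


def top_words_with_zipf(text, n=10):
    """Return (word, observed, predicted) triples for the n most common words."""
    # Same cleanup idea as in the question: drop punctuation, lowercase, split
    punc = string.punctuation + '’”—“'
    cleaned = text.translate(str.maketrans('', '', punc)).lower()
    counts = Counter(cleaned.split())

    top = counts.most_common(n)
    if not top:
        return []
    # Assumption: Zipf's law as frequency(rank) = frequency(rank 1) / rank
    highest = top[0][1]
    return [(word, freq, highest / rank)
            for rank, (word, freq) in enumerate(top, start=1)]


sample = "The cat saw the dog. The dog saw the bird, and the bird sang."
for word, observed, predicted in top_words_with_zipf(sample, 3):
    print(word, observed, round(predicted, 2))
```

To use it on a file, read the text first (for example with open(text_file, encoding='utf8') inside a with block) and pass the resulting string in. The observed and predicted values can then be fed to the two sets of bars in the chart.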