10 most common words in a string (Python)

I need to display the 10 most frequent words in a text file, from the most frequent to the least, along with the number of times each is used. I can't use the dictionary or counter function. So far I have:

import urllib
cnt = 0
i=0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i<len(uniques):
        i+=1
        if word in uniques:
             cnt += 1
print cnt

Personally, though, I would implement collections.Counter myself. I'm assuming you know how that object works, but in case you don't, here's a summary:

import collections

text = "some words that are mostly different but are not all different not at all"

words = text.split()

resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We could of course use sorted's key keyword argument to sort it by frequency and return the first 10 items of that list. That won't help you much, though, since you haven't implemented Counter. I'll leave that part as an exercise for you, and show you how you might implement Counter as a function rather than an object:

def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
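The sorted/key step mentioned above (and left as an exercise) might be sketched like this, using a small made-up counts dict:

```python
# Sketch of the sorting step: sorted() with key=counts.get orders the
# words by their counts; reverse=True puts the most frequent first,
# and the slice keeps at most the top 10.
counts = {"the": 5, "cat": 2, "sat": 1, "on": 3, "mat": 1}

top_ten = sorted(counts, key=counts.get, reverse=True)[:10]
print(top_ten)  # most frequent word first
```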
It's really not hard: examine each element of the iterable. If the element is not in d, add it to d with a value of 1. If it is in d, increment the value. This can be expressed more compactly as:

def counter(iterable):
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

Note that for your use case you will probably want to strip out the punctuation, and possibly casefold the whole thing (so that someword and Someword count as the same word rather than as two separate ones). I'll leave that to you as well, but I will point out that str.strip takes an argument specifying what to strip, and string.punctuation contains all the punctuation you're likely to need.
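As a rough sketch of that normalization step (the helper name normalize is made up here, not part of the original code):

```python
import string

# Strip surrounding punctuation and lowercase, so "Someword," and
# "someword" are counted as the same word.
def normalize(word):
    return word.strip(string.punctuation).lower()

print(normalize("Someword,"))  # -> someword
```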

You're on the right track. Note that this algorithm is quite slow, since for each unique word it iterates over all of the words. A much faster approach without hashing would involve building a

输出:

1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338


This method ensures that only alphanumerics and whitespace end up in the counter. No big deal.

You can also do it with a pandas DataFrame, and conveniently get the result as a table of "word, its freq.":

import pandas as pn

def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(pn.unique(words_list))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in pn.Series.tolist(words_df_unique.unique):
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i += 1
    res = words_df_unique.sort_values('count', ascending=False)
    return res

The above problem can be solved easily by using Python collections. Below is the solution:

from collections import Counter

data_set = "Welcome to the world of Geeks " \
"This portal has been created to provide well written well " \
"thought and well explained solutions for selected questions " \
"If you like Geeks for Geeks and would like to contribute " \
"here is your chance You can write article and mail your article " \
"to contribute at geeksforgeeks org See your article appearing on " \
"the Geeks for Geeks main page and help thousands of other Geeks."

# split() returns list of all the words in the string
split_it = data_set.split()

# Pass the split_it list to instance of Counter class.
Counters_found = Counter(split_it)
# print(Counters_found)

# most_common() produces k frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)

This sounds like a homework problem. @Greg Indeed it is. So let's not discriminate against homework; I don't see the problem here? What's wrong with your code? What doesn't work? What error message do you get? Or do you just want someone to write the code for you?

Maybe you meant to read the words from every line into a single list, words? Right now it reads each line's words, so when you iterate over it the second time you only get the words of the last line. @Adam Smith Sure, but please also ask the OP to disclose that this is a homework question.

Thanks for the help. How would I implement this? I'll handle the remaining details like sorting and stripping, I just need this part. @KevinKZ A file object is already an iterator over its lines. I would make a generator function that takes the lines, splits on whitespace, strips as needed, and passes the whole thing to the counter function. Something like words = (word.strip(punct) for line_obj in file for word in line_obj.split()); count = counter(words).

You don't want to start the counter at a default value of 1, but at 0. The expanded version is fine, but the version using setdefault should start at 0.

Thus you will have a DataFrame, and you can use df.head(10) to select the 10 most common words, or df.tail(10) for the 10 rarest.
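Put together, the generator approach from that comment might look like the following sketch (lines stands in for the file object, and counter is a dict-based tally function like the one in the answer above):

```python
import string

def counter(iterable):
    # Dict-based tally, as in the answer above.
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

# Stand-in for the file object: any iterable of lines works the same way.
lines = ["Alice was beginning,", "and Alice was very tired."]

# Split each line on whitespace and strip surrounding punctuation.
words = (word.strip(string.punctuation) for line in lines for word in line.split())
count = counter(words)
print(count["Alice"])  # -> 2
```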
import urllib
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile) # this with .readlines() replaces new lines with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace()) # removes everything that's not alphanumeric or spaces.

word_counter = {}
for word in txtFile.split(" "): # split in every space.
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter: # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1 # if 'word' already in word_counter, increment it by 1

for i,word in enumerate(sorted(word_counter,key=word_counter.get,reverse=True)[:10]):
    # sorts the dict by the values, from top to botton, takes the 10 top items,
    print "%s: %s - %s"%(i+1,word,word_counter[word])