10 most common words in a string (Python)

I need to display the 10 most frequent words in a text file, from the most frequent to the least, along with the number of times each is used. I can't use the dictionary or counter function. So far I have:

import urllib
cnt = 0
i=0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i<len(uniques):
        i+=1
        if word in uniques:
             cnt += 1
print cnt

Personally, though, I would implement collections.Counter myself. I'm assuming you know how that object works, but in case you don't, here's a summary:

import collections

text = "some words that are mostly different but are not all different not at all"

words = text.split()

resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We could of course use sorted's key keyword argument to sort it by frequency and return the first 10 items of that list. That won't help you much, though, since you haven't implemented Counter. I'll leave that part as an exercise for you, and show you how you might implement Counter as a function rather than an object:

def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
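The sorted/key step mentioned above (and left as an exercise) might be sketched like this, using a small made-up counts dict:

```python
# Sketch of the sorting step: sorted() with key=counts.get orders the
# words by their counts; reverse=True puts the most frequent first,
# and the slice keeps at most the top 10.
counts = {"the": 5, "cat": 2, "sat": 1, "on": 3, "mat": 1}

top_ten = sorted(counts, key=counts.get, reverse=True)[:10]
print(top_ten)  # most frequent word first
```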
It's really not hard: examine each element of the iterable. If the element is not in d, add it to d with a value of 1. If it is in d, increment the value. This can be expressed more compactly as:

def counter(iterable):
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

Note that for your use case you will probably want to strip out the punctuation, and possibly casefold the whole thing (so that someword and Someword count as the same word rather than as two separate ones). I'll leave that to you as well, but I will point out that str.strip takes an argument specifying what to strip, and string.punctuation contains all the punctuation you're likely to need.
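As a rough sketch of that normalization step (the helper name normalize is made up here, not part of the original code):

```python
import string

# Strip surrounding punctuation and lowercase, so "Someword," and
# "someword" are counted as the same word.
def normalize(word):
    return word.strip(string.punctuation).lower()

print(normalize("Someword,"))  # -> someword
```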

You're on the right track. Note that this algorithm is quite slow, since for each unique word it iterates over all of the words. A much faster approach without hashing would involve building a

输出:

1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338


This method ensures that only alphanumerics and whitespace end up in the counter. No big deal.

You can also do it with a pandas DataFrame, and conveniently get the result as a table of "word, its freq.":

import pandas as pn

def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(pn.unique(words_list))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in pn.Series.tolist(words_df_unique.unique):
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i += 1
    res = words_df_unique.sort_values('count', ascending=False)
    return res

The above problem can be solved easily by using Python collections. Below is the solution:

from collections import Counter

data_set = "Welcome to the world of Geeks " \
"This portal has been created to provide well written well " \
"thought and well explained solutions for selected questions " \
"If you like Geeks for Geeks and would like to contribute " \
"here is your chance You can write article and mail your article " \
"to contribute at geeksforgeeks org See your article appearing on " \
"the Geeks for Geeks main page and help thousands of other Geeks."

# split() returns list of all the words in the string
split_it = data_set.split()

# Pass the split_it list to instance of Counter class.
Counters_found = Counter(split_it)
# print(Counters_found)

# most_common() produces k frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)

This sounds like a homework problem. @Greg Indeed it is. So let's not discriminate against homework; I don't see the problem here? What's wrong with your code? What doesn't work? What error message do you get? Or do you just want someone to write the code for you?

Maybe you meant to read the words from every line into a single list, words? Right now it reads each line's words, so when you iterate over it the second time you only get the words of the last line. @Adam Smith Sure, but please also ask the OP to disclose that this is a homework question.

Thanks for the help. How would I implement this? I'll handle the remaining details like sorting and stripping, I just need this part. @KevinKZ A file object is already an iterator over its lines. I would make a generator function that takes the lines, splits on whitespace, strips as needed, and passes the whole thing to the counter function. Something like words = (word.strip(punct) for line_obj in file for word in line_obj.split()); count = counter(words).

You don't want to start the counter at a default value of 1, but at 0. The expanded version is fine, but the version using setdefault should start at 0.

Thus you will have a DataFrame, and you can use df.head(10) to select the 10 most common words, or df.tail(10) for the 10 rarest.
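Put together, the generator approach from that comment might look like the following sketch (lines stands in for the file object, and counter is a dict-based tally function like the one in the answer above):

```python
import string

def counter(iterable):
    # Dict-based tally, as in the answer above.
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d

# Stand-in for the file object: any iterable of lines works the same way.
lines = ["Alice was beginning,", "and Alice was very tired."]

# Split each line on whitespace and strip surrounding punctuation.
words = (word.strip(string.punctuation) for line in lines for word in line.split())
count = counter(words)
print(count["Alice"])  # -> 2
```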
import urllib
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile) # this with .readlines() replaces new lines with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace()) # removes everything that's not alphanumeric or spaces.

word_counter = {}
for word in txtFile.split(" "): # split in every space.
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter: # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1 # if 'word' already in word_counter, increment it by 1

for i,word in enumerate(sorted(word_counter,key=word_counter.get,reverse=True)[:10]):
    # sorts the dict by the values, from top to botton, takes the 10 top items,
    print "%s: %s - %s"%(i+1,word,word_counter[word])