Python: How do I get the most frequent words from a corpus?
I am working with a corpus and want to get the most and least used words and word classes from it. I have the beginning of some code, but I am getting errors that I don't know how to deal with. I want to find the most frequent words in the Brown corpus, and then the most and least used word classes. I have the following code:
import re
import nltk
import string
from collections import Counter
from nltk.corpus import stopwords
from collections import defaultdict, Counter
from nltk.corpus import brown
brown = nltk.corpus.brown
stoplist = stopwords.words('english')
from collections import defaultdict
def toptenwords(brown):
    words = brown.words()
    no_capitals = ([word.lower() for word in words])
    filtered = [word for word in no_capitals if word not in stoplist]
    translate_table = dict((ord(char), None) for char in string.punctuation)
    no_punct = [s.translate(translate_table) for s in filtered]
    wordcounter = defaultdict(int)
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = [(k, wordcounter[k]) for k in sorted(wordcounter, key=wordcounter.get, reverse=True)]
    return sorting

print(toptenwords(brown))
words_2 = [word[0] for word in brown.tagged_words(categories="news")]
# the most frequent words
print Counter(words_2).most_common(10)
words_2 = [word[1] for word in brown.tagged_words(categories="news")]
# the most frequent word class
print Counter(words_2).most_common(10)
# Keeps words and pos into a dictionary
# where the key is a word and
# the value is a counter of POS and counts
word_tags = defaultdict(Counter)
for word, pos in brown.tagged_words():
    word_tags[word][pos] += 1
# To access the POS counter.
print 'Red', word_tags['Red']
print 'Marlowe', word_tags['Marlowe']
print
# Greatest number of distinct tag.
word_with_most_distinct_pos = sorted(word_tags, key=lambda x: len(word_tags[x]), reverse=True)[0]
print word_with_most_distinct_pos
print word_tags[word_with_most_distinct_pos]
print len(word_tags[word_with_most_distinct_pos])
# which word has the greatest number of distinct tags
word_tags_2 = nltk.defaultdict(lambda: set())
for word, token in tagged_words:
    word_tags[word].add(token)
ambig_words = sorted([(k, len(v)) for (k, v) in word_tags.items()]),
                     key=itemgetter(1), reverse=True)[:50]
print [(word, numtoks, word_tags[word]) for (word, numtoks) in ambig_words]
When I run the code above, I get the following error:
File "Oblig2a.py", line 64
key=itemgetter(1), reverse=True)[:50]
^
SyntaxError: invalid syntax
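From this code, I want to get the most and least used words and word classes.
Any help is welcome.

There are 3 problems with the code:
1. The stop list has to be built from a language name: stoplist = stopwords.words('english')
2. defaultdict has to be imported before it is used: from collections import defaultdict
3. Use the dictionary get method to sort the dict correctly: [(k, wordcounter[k]) for k in sorted(wordcounter, key=wordcounter.get, reverse=True)]

For the word classes, note that brown.tagged_words() yields (word, pos) tuples: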
import string
from collections import Counter, defaultdict

import nltk
from nltk.corpus import stopwords

brown = nltk.corpus.brown
stoplist = stopwords.words('english')

def toptenwords(brown):
    words = brown.words()
    # Keep every token (a list, not a set) so that repeated words can be counted.
    no_capitals = [word.lower() for word in words]
    filtered = [word for word in no_capitals if word not in stoplist]
    # Strip punctuation characters and drop tokens that end up empty.
    translate_table = dict((ord(char), None) for char in string.punctuation)
    no_punct = [s.translate(translate_table) for s in filtered]
    no_punct = [s for s in no_punct if s]
    wordcounter = defaultdict(int)
    for word in no_punct:
        wordcounter[word] += 1
    # Sort the words by their counts, highest first.
    sorting = [(k, wordcounter[k]) for k in sorted(wordcounter, key=wordcounter.get, reverse=True)]
    return sorting

print(toptenwords(brown))

words_2 = [word[0] for word in brown.tagged_words(categories="news")]
# the most frequent words
print(Counter(words_2).most_common(10))

words_2 = [word[1] for word in brown.tagged_words(categories="news")]
# the most frequent word class
print(Counter(words_2).most_common(10))
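
The SyntaxError quoted in the question appears to come from the last block of the question's code: the sorted(...) call is closed (and followed by a comma) before key= and reverse= are passed, so the next line starts with arguments that belong to no call. That block also uses itemgetter without importing it, iterates over an undefined tagged_words, and adds tags to word_tags instead of word_tags_2. A minimal sketch of what that block seems to be aiming for is shown here; the function name most_ambiguous_words and the tiny sample list are hypothetical, and with NLTK installed the input would presumably be brown.tagged_words().

```python
from collections import defaultdict
from operator import itemgetter


def most_ambiguous_words(tagged_words, limit=50):
    """Return (word, number_of_distinct_tags, tags) triples, most tags first."""
    # Map each word to the set of distinct tags it appears with.
    word_tags = defaultdict(set)
    for word, tag in tagged_words:
        word_tags[word].add(tag)

    # key= and reverse= go inside the sorted(...) call; the slice comes after it.
    counts = sorted(
        ((word, len(tags)) for word, tags in word_tags.items()),
        key=itemgetter(1),
        reverse=True,
    )[:limit]
    return [(word, n, word_tags[word]) for word, n in counts]


# Hypothetical sample standing in for brown.tagged_words().
sample = [
    ("run", "VB"), ("run", "NN"), ("run", "VBN"),
    ("the", "AT"),
    ("red", "JJ"), ("red", "NN"),
]
ambiguous = most_ambiguous_words(sample, limit=2)
print(ambiguous)
```

The first triple answers "which word has the most distinct tags, and how many"; raising limit returns more of the ranking.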
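Look at the stack trace. The offending line is clearly stoplist = stopwords.words(brown). This method expects file IDs, not a sequence of tagged words (which is what you assigned to the variable brown).
How do I change it?
You should give the function a language name, e.g. stoplist = stopwords.words('english')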
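
As a rough illustration of the filtering step discussed here, the sketch below lowercases tokens and drops stop words and punctuation-only tokens. The helper name remove_stopwords and the three-word stop list are assumptions made so the example runs without any corpus data; with NLTK the stop list would come from stopwords.words('english').

```python
import string


def remove_stopwords(words, stoplist):
    """Lowercase the tokens and drop stop words and punctuation-only tokens."""
    stopset = set(stoplist)  # set lookup avoids scanning the whole list per token
    punctuation = set(string.punctuation)
    kept = []
    for word in words:
        lowered = word.lower()
        if lowered in stopset:
            continue
        if all(char in punctuation for char in lowered):
            continue  # tokens such as "," or "." (and empty strings)
        kept.append(lowered)
    return kept


# Hypothetical stand-ins for stopwords.words('english') and brown.words().
sample_stoplist = ["the", "a", "of"]
tokens = ["The", "jury", "said", ",", "a", "report", "of", "the", "jury", "."]
filtered = remove_stopwords(tokens, sample_stoplist)
print(filtered)  # ['jury', 'said', 'report', 'jury']
```

Repeated tokens are kept on purpose, since the next step is to count them.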
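It runs fine now, but I'm not sure how to print what I need from the output... I have tried many places and approaches, but nothing gets printed...
Vebjørn, look at the line where you define no_capitals, and think about what it does and how that affects your goal of counting words.
Thanks! But how do I get the least used words and word classes from this code?
Please take a look at this thread. @VebjørnBergaplass, to use nltk you need to be able to program a little. You need to narrow "I'm not getting the output I want" down to a programming problem.
Sorry. When I run the edited code (with print), I don't get any output. I tried running it with the print at the end, but got nothing...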
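
On the follow-up about the least used words and word classes: one possible approach, sketched here under the assumption that the items have already been collected into a list, is to rank everything with Counter.most_common() and read the ranking from both ends. The function name most_and_least_common and the sample tag list are hypothetical; with NLTK the input could be something like [tag for _, tag in brown.tagged_words(categories="news")].

```python
from collections import Counter


def most_and_least_common(items, n=10):
    """Return (most_common, least_common) lists of (item, count) pairs."""
    ranked = Counter(items).most_common()  # every item, highest count first
    most = ranked[:n]
    least = ranked[:-n - 1:-1]  # the last n entries, lowest count first
    return most, least


# Hypothetical stand-in for a list of part-of-speech tags.
tags = ["NN", "NN", "NN", "AT", "AT", "VB", "JJ", "JJ", "JJ", "JJ"]
most, least = most_and_least_common(tags, n=2)
print(most)   # [('JJ', 4), ('NN', 3)]
print(least)  # [('VB', 1), ('AT', 2)]
```

The same function works for words as well as tags, since it only relies on the items being hashable.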