
Python: how to get the most frequent words from a corpus?

Tags: python, python-2.7, nltk, counter, corpus

I am working with a corpus and want to get the most and least used words and word classes from it. I have the beginnings of some code, but I am running into errors I don't know how to handle. I want to find the most frequent words in the Brown corpus, and then the most and least used word classes. I have the following code:

import re
import nltk
import string
from collections import defaultdict, Counter
from nltk.corpus import stopwords, brown

stoplist = stopwords.words('english')

def toptenwords(brown):
    words = brown.words()
    no_capitals = ([word.lower() for word in words])
    filtered = [word for word in no_capitals if word not in stoplist]
    translate_table = dict((ord(char), None) for char in string.punctuation)
    no_punct = [s.translate(translate_table) for s in filtered]
    wordcounter = defaultdict(int)
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = [(k, wordcounter[k])for k in sorted(wordcounter, key = wordcounter.get, reverse = True)]
    return sorting

print(toptenwords(brown))

words_2 = [word[0] for word in brown.tagged_words(categories="news")]
# the most frequent words
print Counter(words_2).most_common(10)

words_2 = [word[1] for word in brown.tagged_words(categories="news")]
# the most frequent word class
print Counter(words_2).most_common(10)


# Keeps words and pos into a dictionary
# where the key is a word and
# the value is a counter of POS and counts
word_tags = defaultdict(Counter)
for word, pos in brown.tagged_words():
    word_tags[word][pos] += 1

# To access the POS counter.
print 'Red', word_tags['Red']
print 'Marlowe', word_tags['Marlowe']
print

# Greatest number of distinct tag.
word_with_most_distinct_pos = sorted(word_tags, key=lambda x: len(word_tags[x]), reverse=True)[0]

print word_with_most_distinct_pos
print word_tags[word_with_most_distinct_pos]
print len(word_tags[word_with_most_distinct_pos])

# which word has the greatest number of distinct tags
word_tags_2 = nltk.defaultdict(lambda: set())
for word, token in tagged_words:
    word_tags[word].add(token)
    ambig_words = sorted([(k, len(v)) for (k, v) in word_tags.items()]),
    key=itemgetter(1), reverse=True)[:50]
  print [(word, numtoks, word_tags[word]) for (word, numtoks) in ambig_words]
When I run the code above, I get the following error:

File "Oblig2a.py", line 64
    key=itemgetter(1), reverse=True)[:50]
                               ^
SyntaxError: invalid syntax
From this code I want to get:

  • the most frequent words
  • the most frequent word class
  • the least frequent word class
  • how many words occur with more than one word class
  • which word has the most distinct tags, and how many distinct tags it has
  • The last thing I need help with is a function that, for a specific word, prints how many times it occurs with each tag. I am trying to do all of the above, but I can't get it to work.
  • It is numbers 3, 4, 5 and 6 I need help with...
    Any help is appreciated.

    There are three problems with the code:

  • The error the interpreter reports: you should pass a language name to the stopwords function:
    stoplist = stopwords.words('english')
  • Use a defaultdict and the dictionary's get method to sort the dict correctly:
    [(k, wordcounter[k]) for k in sorted(wordcounter, key=wordcounter.get, reverse=True)]
  • Use a translate table for Unicode data.
  • Brown's tagged words come as tuples of the form (word, part-of-speech).
  • Full code:

    import re
    import nltk
    import string
    from collections import defaultdict, Counter
    from nltk.corpus import stopwords

    brown = nltk.corpus.brown
    stoplist = stopwords.words('english')
    
    def toptenwords(brown):
        words = brown.words()
        no_capitals = set([word.lower() for word in words])
        filtered = [word for word in no_capitals if word not in stoplist]
        translate_table = dict((ord(char), None) for char in string.punctuation)
        no_punct = [s.translate(translate_table) for s in filtered]
        wordcounter = defaultdict(int)
        for word in no_punct:
            if word in wordcounter:
                wordcounter[word] += 1
            else:
                wordcounter[word] = 1
        sorting = [(k, wordcounter[k])for k in sorted(wordcounter, key = wordcounter.get, reverse = True)]
        return sorting
    
    
    print(toptenwords(brown))
    
    words_2 = [word[0] for word in brown.tagged_words(categories="news")]
    # the most frequent words
    print Counter(words_2).most_common(10)
    
    words_2 = [word[1] for word in brown.tagged_words(categories="news")]
    # the most frequent word class
    print Counter(words_2).most_common(10)
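
The answer above covers items 1 and 2 from the question's list. For items 3 to 6, a minimal sketch along the same lines might look like this; the small `tagged` list is a made-up stand-in for `brown.tagged_words()` so the snippet runs without the corpus download — substitute the real call in practice:

```python
from collections import Counter, defaultdict

# Made-up stand-in for brown.tagged_words(): a list of (word, POS) tuples.
tagged = [('The', 'AT'), ('red', 'JJ'), ('fox', 'NN'),
          ('runs', 'VBZ'), ('red', 'JJ'), ('run', 'VB'),
          ('run', 'NN'), ('the', 'AT')]

# 3. Least frequent word classes: slice most_common() from the end.
pos_counts = Counter(pos for _, pos in tagged)
least_common_pos = pos_counts.most_common()[:-4:-1]  # three lowest-ranked tags

# 4. How many words occur with more than one word class.
word_tags = defaultdict(set)
for word, pos in tagged:
    word_tags[word].add(pos)
ambiguous = [w for w, tags in word_tags.items() if len(tags) > 1]

# 5. Which word has the most distinct tags, and how many.
most_tagged = max(word_tags, key=lambda w: len(word_tags[w]))

# 6. For one specific word, count how often it occurs with each tag.
def tag_counts(word, tagged_words):
    return Counter(pos for w, pos in tagged_words if w == word)

print(least_common_pos)
print(ambiguous)
print(most_tagged)
print(tag_counts('run', tagged))
```

Against the real corpus, `tagged = brown.tagged_words()` replaces the stand-in list, and `len(ambiguous)` answers item 4 directly.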
    

    Look at the stack trace. The offending line is clearly stoplist = stopwords.words(brown). This method expects file IDs, not the sequence of tagged words you assigned to the variable brown. How do you change it? Pass the function a language name, e.g. stoplist = stopwords.words('english').

    It runs now, but I'm not sure how to print what I want from the output... I have tried many placements and approaches, but I don't get anything printed. Vebjørn, look at the line where you define no_capitals, think about what it does, and how that affects your goal of counting words. Thanks! But how do I get the least used words and word classes from this code? Have a look at this topic. @VebjørnBergaplass, to use nltk you need to be able to program a little; you need to narrow "I'm not getting the output I want" down to a concrete programming question. Sorry. When I run the edited code (with the print), I get no output. I tried running it with the print at the end, but got nothing...
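
The hint about no_capitals in the comments points at a concrete bug in the code above: wrapping the lowercased words in set() collapses all duplicates, so every word that survives filtering is counted exactly once. A small self-contained demonstration (with a made-up word list):

```python
from collections import Counter

words = ['The', 'the', 'fox', 'the']

as_list = [w.lower() for w in words]  # keeps duplicates, so counts are meaningful
as_set = set(as_list)                 # collapses duplicates: every count becomes 1

print(Counter(as_list))  # Counter({'the': 3, 'fox': 1})
print(Counter(as_set))   # each word appears exactly once
```

Dropping the set() call in toptenwords (keeping the plain list comprehension) restores the real frequencies.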
    无大写字母的那一行,想想它的作用,以及这会如何影响你计算单词的目标。谢谢!但是我如何从这段代码中获得最少使用的单词和单词类呢?请查看这个主题@VebjørnBergaplass,要使用nltk,您需要能够编写一点程序。你需要把“我没有得到我想要的输出”缩小到一个编程问题。对不起。当我运行编辑过的代码(带打印)时,我不会得到输出。我试着运行它,并把打印在最后,但没有得到什么。。。