
Python: how to get the most frequent words from a corpus?

Tags: python, python-2.7, nltk, counter, corpus

I am working with a corpus and want to get the most and least used words and word classes from it. I have the beginnings of some code, but I am running into errors I don't know how to handle. I want to find the most frequent words in the Brown corpus, and then the most and least used word classes. I have the following code:

import re
import nltk
import string
from collections import defaultdict, Counter
from nltk.corpus import stopwords, brown

stoplist = stopwords.words('english')

def toptenwords(brown):
    words = brown.words()
    no_capitals = ([word.lower() for word in words])
    filtered = [word for word in no_capitals if word not in stoplist]
    translate_table = dict((ord(char), None) for char in string.punctuation)
    no_punct = [s.translate(translate_table) for s in filtered]
    wordcounter = defaultdict(int)
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = [(k, wordcounter[k])for k in sorted(wordcounter, key = wordcounter.get, reverse = True)]
    return sorting

print(toptenwords(brown))

words_2 = [word[0] for word in brown.tagged_words(categories="news")]
# the most frequent words
print Counter(words_2).most_common(10)

words_2 = [word[1] for word in brown.tagged_words(categories="news")]
# the most frequent word class
print Counter(words_2).most_common(10)


# Keeps words and pos into a dictionary
# where the key is a word and
# the value is a counter of POS and counts
word_tags = defaultdict(Counter)
for word, pos in brown.tagged_words():
    word_tags[word][pos] += 1

# To access the POS counter.
print 'Red', word_tags['Red']
print 'Marlowe', word_tags['Marlowe']
print

# Greatest number of distinct tag.
word_with_most_distinct_pos = sorted(word_tags, key=lambda x: len(word_tags[x]), reverse=True)[0]

print word_with_most_distinct_pos
print word_tags[word_with_most_distinct_pos]
print len(word_tags[word_with_most_distinct_pos])

# which word has the greatest number of distinct tags
word_tags_2 = nltk.defaultdict(lambda: set())
for word, token in tagged_words:
    word_tags[word].add(token)
    ambig_words = sorted([(k, len(v)) for (k, v) in word_tags.items()]),
    key=itemgetter(1), reverse=True)[:50]
  print [(word, numtoks, word_tags[word]) for (word, numtoks) in ambig_words]
When I run the code above, I get the following error:

File "Oblig2a.py", line 64
    key=itemgetter(1), reverse=True)[:50]
                               ^
SyntaxError: invalid syntax
From this code I want to get:

  • the most frequent words
  • the most frequent word class
  • the least frequent word class
  • how many words occur with more than one word class
  • which word has the most distinct tags, and how many distinct tags it has
  • The last thing I need help with is a function that, for a specific word, prints how many times it occurs with each tag. I am trying to do all of the above, but I can't get it to work.
  • It is numbers 3, 4, 5 and 6 I need help with...
    Any help is appreciated.

    There are three problems with the code:

  • The error the interpreter reports: you should pass a language name to the stopwords function:
    stoplist = stopwords.words('english')
  • Use a defaultdict and the dictionary's get method to sort the dict correctly:
    [(k, wordcounter[k]) for k in sorted(wordcounter, key=wordcounter.get, reverse=True)]
  • Use a translate table for Unicode data.
  • Brown's tagged words come as tuples of the form (word, part-of-speech).
  • Full code:

    import re
    import nltk
    import string
    from collections import defaultdict, Counter
    from nltk.corpus import stopwords

    brown = nltk.corpus.brown
    stoplist = stopwords.words('english')
    
    def toptenwords(brown):
        words = brown.words()
        no_capitals = set([word.lower() for word in words])
        filtered = [word for word in no_capitals if word not in stoplist]
        translate_table = dict((ord(char), None) for char in string.punctuation)
        no_punct = [s.translate(translate_table) for s in filtered]
        wordcounter = defaultdict(int)
        for word in no_punct:
            if word in wordcounter:
                wordcounter[word] += 1
            else:
                wordcounter[word] = 1
        sorting = [(k, wordcounter[k])for k in sorted(wordcounter, key = wordcounter.get, reverse = True)]
        return sorting
    
    
    print(toptenwords(brown))
    
    words_2 = [word[0] for word in brown.tagged_words(categories="news")]
    # the most frequent words
    print Counter(words_2).most_common(10)
    
    words_2 = [word[1] for word in brown.tagged_words(categories="news")]
    # the most frequent word class
    print Counter(words_2).most_common(10)
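
The answer above covers items 1 and 2 from the question's list. For items 3 to 6, a minimal sketch along the same lines might look like this; the small `tagged` list is a made-up stand-in for `brown.tagged_words()` so the snippet runs without the corpus download — substitute the real call in practice:

```python
from collections import Counter, defaultdict

# Made-up stand-in for brown.tagged_words(): a list of (word, POS) tuples.
tagged = [('The', 'AT'), ('red', 'JJ'), ('fox', 'NN'),
          ('runs', 'VBZ'), ('red', 'JJ'), ('run', 'VB'),
          ('run', 'NN'), ('the', 'AT')]

# 3. Least frequent word classes: slice most_common() from the end.
pos_counts = Counter(pos for _, pos in tagged)
least_common_pos = pos_counts.most_common()[:-4:-1]  # three lowest-ranked tags

# 4. How many words occur with more than one word class.
word_tags = defaultdict(set)
for word, pos in tagged:
    word_tags[word].add(pos)
ambiguous = [w for w, tags in word_tags.items() if len(tags) > 1]

# 5. Which word has the most distinct tags, and how many.
most_tagged = max(word_tags, key=lambda w: len(word_tags[w]))

# 6. For one specific word, count how often it occurs with each tag.
def tag_counts(word, tagged_words):
    return Counter(pos for w, pos in tagged_words if w == word)

print(least_common_pos)
print(ambiguous)
print(most_tagged)
print(tag_counts('run', tagged))
```

Against the real corpus, `tagged = brown.tagged_words()` replaces the stand-in list, and `len(ambiguous)` answers item 4 directly.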
    

    Look at the stack trace. The offending line is clearly stoplist = stopwords.words(brown). This method expects file IDs, not the sequence of tagged words you assigned to the variable brown. How do you change it? Pass the function a language name, e.g. stoplist = stopwords.words('english').

    It runs now, but I'm not sure how to print what I want from the output... I have tried many placements and approaches, but I don't get anything printed. Vebjørn, look at the line where you define no_capitals, think about what it does, and how that affects your goal of counting words. Thanks! But how do I get the least used words and word classes from this code? Have a look at this topic. @VebjørnBergaplass, to use nltk you need to be able to program a little; you need to narrow "I'm not getting the output I want" down to a concrete programming question. Sorry. When I run the edited code (with the print), I get no output. I tried running it with the print at the end, but got nothing...
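
The hint about no_capitals in the comments points at a concrete bug in the code above: wrapping the lowercased words in set() collapses all duplicates, so every word that survives filtering is counted exactly once. A small self-contained demonstration (with a made-up word list):

```python
from collections import Counter

words = ['The', 'the', 'fox', 'the']

as_list = [w.lower() for w in words]  # keeps duplicates, so counts are meaningful
as_set = set(as_list)                 # collapses duplicates: every count becomes 1

print(Counter(as_list))  # Counter({'the': 3, 'fox': 1})
print(Counter(as_set))   # each word appears exactly once
```

Dropping the set() call in toptenwords (keeping the plain list comprehension) restores the real frequencies.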
    无大写字母的那一行,想想它的作用,以及这会如何影响你计算单词的目标。谢谢!但是我如何从这段代码中获得最少使用的单词和单词类呢?请查看这个主题@VebjørnBergaplass,要使用nltk,您需要能够编写一点程序。你需要把“我没有得到我想要的输出”缩小到一个编程问题。对不起。当我运行编辑过的代码(带打印)时,我不会得到输出。我试着运行它,并把打印在最后,但没有得到什么。。。