Can't correctly count the frequency of word forms in NLTK, Python 3.7.3

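I'm a beginner in Python and I have a problem. I want to count the lemmas in a text (not just the words, but the lemmas!) and list them from most frequent to least frequent. I use the following code: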

with open(r'D:\YandexDisk\Mishas_project\Книги\E\E.txt', 'r', encoding='utf-8-sig') as inp:
    txt = inp.read()

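# Characters to strip. The check below runs per character, so multi-character
# entries such as '--' could never match; '-' already covers them.
points = {'.', ',', ';', ':', '!', '?', '"', '‘', '–', '—', '«', '»', '-', '…', '(', ')', '*', '_', '¬', '“', '”', '\\', '`', 'ù', '/', 'ï', 'ß'}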
out = "".join(c for c in txt if c not in points).lower().replace('\n', ' ').replace('&', 'and')

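import nltk
# Resources used below: the POS tagger, the tokenizer models for word_tokenize,
# and the WordNet data for WordNetLemmatizer (resource names may vary by NLTK version).
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import wordnet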

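def get_wordnet_pos(word):
    # Tag the single word on its own (no sentence context) and map the first
    # letter of its Penn Treebank tag to a WordNet POS constant.
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)  # default to noun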

from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()
st = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(out)]

# Count how often each lemma occurs.
dic = {}
for key in st:
    dic[key] = dic.get(key, 0) + 1
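
The counting loop above stops at a plain dictionary and does not yet produce the frequent-to-infrequent ordering. A minimal sketch of that last step, assuming `collections.Counter` from the standard library is acceptable; the function name `rank_by_frequency` and the list `sample_lemmas` are hypothetical stand-ins for the real lemma list `st`:

```python
from collections import Counter


def rank_by_frequency(items):
    """Return (item, count) pairs sorted from most to least frequent."""
    return Counter(items).most_common()


# Hypothetical stand-in for the lemma list `st` built above.
sample_lemmas = ["break", "vase", "break", "land", "break", "vase"]
ranking = rank_by_frequency(sample_lemmas)

for lemma, count in ranking:
    print(lemma, count)
```

With the real data, `rank_by_frequency(st)` would replace the manual dictionary loop.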

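In the lemmatization step, I found a large number of errors in what NLTK does. For example, the code cannot distinguish "land" as a noun from "land" as a verb, and it does not recognize some inflected forms of "break" as word forms of the lemma "break" (which is strange, because it does recognize other forms of "break" as word forms of the same lemma).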

So, to solve this problem, I tried the following code:

import nltk
from nltk import tokenize
str_text = "He was a boy. I am a boy. I break a vase. I broke a vase. I'm breaking a vase. I land succesfully. The land was here"
tokened_text = nltk.word_tokenize(str_text)
tagged_text = nltk.pos_tag(tokened_text)

It gave the following result:

>>> tagged_text
[('He', 'PRP'), ('was', 'VBD'), ('a', 'DT'), ('boy', 'NN'), ('.', '.'), ('I', 'PRP'), ('am', 'VBP'), ('a', 'DT'), ('boy', 'NN'), ('.', '.'), ('I', 'PRP'), ('break', 'VBP'), ('a', 'DT'), ('vase', 'NN'), ('.', '.'), ('I', 'PRP'), ('broke', 'VBD'), ('a', 'DT'), ('vase', 'NN'), ('.', '.'), ('I', 'PRP'), ("'m", 'VBP'), ('breaking', 'VBG'), ('a', 'DT'), ('vase', 'NN'), ('.', '.'), ('I', 'PRP'), ('land', 'VBP'), ('succesfully', 'RB'), ('.', '.'), ('The', 'DT'), ('land', 'NN'), ('was', 'VBD'), ('here', 'RB')]

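This code tags one "land" as a noun and the other "land" as a verb, and it tags the different forms of "break" as verb forms, so I thought I could use it to improve my main code. As far as I know, though, this approach cannot reduce word forms to their lemmas.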

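So, for my research, it seems I have to link the two approaches: the one based on WordNet and the one based on tokenize. But I think there must be a simpler way, because NLTK is a flexible and very practical toolkit.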

Can anyone help me?
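
For reference, a minimal sketch of how the two approaches described above might be linked: tag each sentence as a whole with `nltk.pos_tag`, then pass each token's in-context tag to `WordNetLemmatizer`. The helper names `penn_to_wordnet`, `count_lemmas` and `lemma_frequencies` are hypothetical. The sketch assumes the NLTK tokenizer, tagger and WordNet data are already downloaded (resource names differ between NLTK versions), and it is not necessarily the simplest route NLTK offers.

```python
from collections import Counter

# Penn Treebank tags start with J (adjective), N (noun), V (verb), R (adverb).
# WordNet's POS constants are the one-letter strings 'a', 'n', 'v', 'r'.
_PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS, defaulting to noun."""
    return _PENN_TO_WORDNET.get(tag[:1].upper(), "n")


def count_lemmas(tagged_tokens, lemmatize):
    """Count (lemma, pos) pairs from (token, penn_tag) pairs.

    `lemmatize` is any callable taking (word, pos); with NLTK this would be
    WordNetLemmatizer().lemmatize.
    """
    counts = Counter()
    for token, tag in tagged_tokens:
        if not any(ch.isalnum() for ch in token):
            continue  # skip pure punctuation tokens such as '.'
        pos = penn_to_wordnet(tag)
        counts[(lemmatize(token.lower(), pos), pos)] += 1
    return counts


def lemma_frequencies(text):
    """Tag whole sentences, then lemmatize each token with its in-context tag."""
    # Imported here (an assumption of this sketch) so that the helpers above
    # can be used and tested without NLTK or its data being installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    counts = Counter()
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        counts.update(count_lemmas(tagged, lemmatizer.lemmatize))
    return counts.most_common()  # most frequent first
```

Because the keys are `(lemma, pos)` pairs, a call such as `lemma_frequencies(str_text)` would keep "land" as a noun and "land" as a verb in separate entries, provided the tagger assigns them the tags shown in the output above.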