Python 3.x: lemmatizing words with spaCy and NLTK does not give the correct lemma

I want to get the lemmatized form of each word in the list below:

(for example)

When I use spaCy:

import spacy

nlp = spacy.load('en')
words = ['Funnier', 'Funniest', 'mightiest', 'tighter', 'biggify']
# Build a Doc directly from the vocab (no pipeline components are run,
# so no POS tagging happens before lemmatization)
doc = spacy.tokens.Doc(nlp.vocab, words=words)
for token in doc:
    print(token.lemma_)
I get lemmas like this:

Funnier
Funniest
mighty
tight 
When I use NLTK's WordNetLemmatizer:

from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' +  lemmatizer.lemmatize(token))
I get:

Funnier --> Funnier
Funniest --> Funniest
mightiest --> mightiest
tighter --> tighter
Can anyone help me with this?

Thanks.

The lemma you get depends entirely on the part-of-speech (POS) tag you use when looking up the lemma of a given word.

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best
The code above is a simple example of how to use the WordNet lemmatizer on words and sentences.

Notice that it doesn't do a great job: 'are' is not converted to 'be' and 'hanging' is not converted to 'hang' as you would expect. This can be corrected by supplying the correct 'part of speech' tag (POS tag) as the second argument to lemmatize().
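For example (a minimal sketch; the get_wordnet_pos helper below is an illustrative addition rather than part of the original answer, and the sentence-level output depends on which tagger nltk.pos_tag loads):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Supplying the verb POS tag fixes the two cases above
print(lemmatizer.lemmatize('are', wordnet.VERB))      #> be
print(lemmatizer.lemmatize('hanging', wordnet.VERB))  #> hang

# Hypothetical helper: map the Penn Treebank tag from nltk.pos_tag
# onto the WordNet POS constants that lemmatize() expects
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    return {'J': wordnet.ADJ, 'N': wordnet.NOUN,
            'V': wordnet.VERB, 'R': wordnet.ADV}.get(tag, wordnet.NOUN)

word_list = nltk.word_tokenize("The striped bats are hanging on their feet for best")
print(' '.join(lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in word_list))
#> e.g. 'The striped bat be hang on their foot for best' (tagger-dependent)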

Sometimes the same word can have multiple lemmas depending on its meaning/context:

print(lemmatizer.lemmatize("stripes", 'v'))  
#> strip

print(lemmatizer.lemmatize("stripes", 'n'))  
#> stripe
For the example above (the word list in the question), specify the corresponding POS tag:

from nltk.corpus import wordnet  # provides the POS constants (ADJ_SAT, etc.)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ['Funnier', 'Funniest', 'mightiest', 'tighter', 'biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token, wordnet.ADJ_SAT))
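Note that WordNet lookups are case-sensitive, so the capitalized 'Funnier' and 'Funniest' may still come back unchanged while their lowercase forms are resolved (the comment thread below shows lowercase 'funniest' --> 'funny'). A small follow-up sketch, assuming lowercasing is acceptable for your input:

for token in words:
    # Lowercase first: 'Funniest' misses the lookup that 'funniest' hits;
    # 'biggify' is not in WordNet at all, so it comes back unchanged
    print(token + ' --> ' + lemmatizer.lemmatize(token.lower(), wordnet.ADJ_SAT))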

Comments:

Can you post your code as well? Thanks.

It does not give the lemmas for the words in the list specified in the question.

In NLTK, the POS tag in WordNetLemmatizer defaults to noun, while a word like "mightiest" is an adjective, so you need to specify that when getting the lemma: >>> lemmatizer.lemmatize("mightiest", wordnet.ADJ_SAT) outputs "mighty". spaCy, by contrast, internally passes each word's POS tag to the lemma function.

Yes, I tried that. But "funnier" and "funniest" do not give "funny" as the lemma.

In [21]: lemmatizer.lemmatize("funniest", wordnet.ADJ_SAT) Out[21]: 'funny' -- that is what you get when you pass the POS tag to the WordNet lemmatizer in NLTK. spaCy, internally, uses a pretrained model to tag the input; its training data may not cover "funniest" and "funnier", so it tags them as nouns. You can inspect the tags spaCy assigns with the sketch below.
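To see which tags spaCy actually assigns (a minimal sketch, assuming spaCy 2.x where spacy.load('en') works; on spaCy 3.x you would load 'en_core_web_sm' instead), run the words through the full pipeline rather than constructing the Doc by hand:

import spacy

nlp = spacy.load('en')  # spaCy 2.x shortcut; 'en_core_web_sm' on newer versions
doc = nlp('Funnier Funniest mightiest tighter biggify')
for token in doc:
    # The lemmatizer uses the POS tag the tagger assigned to each token
    print(token.text, token.pos_, token.lemma_)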