Python 3.x: lemmatizing words with spaCy and NLTK does not give the correct lemma

I want to get the lemmatized form of each word in the list below:

(for example)

When I use spaCy:

import spacy

nlp = spacy.load('en')
words = ['Funnier', 'Funniest', 'mightiest', 'tighter', 'biggify']
# Build a Doc directly from the vocab (no pipeline components are run,
# so no POS tagging happens before lemmatization)
doc = spacy.tokens.Doc(nlp.vocab, words=words)
for token in doc:
    print(token.lemma_)
I get lemmas like this:

Funnier
Funniest
mighty
tight 
When I use NLTK's WordNetLemmatizer:

from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' +  lemmatizer.lemmatize(token))
I get:

Funnier --> Funnier
Funniest --> Funniest
mightiest --> mightiest
tighter --> tighter
Can anyone help me with this?

Thanks.

The lemma you get depends entirely on the part-of-speech (POS) tag you use when looking up the lemma of a given word.

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best
The code above is a simple example of how to use the WordNet lemmatizer on words and sentences.

Notice that it doesn't do a great job: 'are' is not converted to 'be' and 'hanging' is not converted to 'hang' as you would expect. This can be corrected by supplying the correct 'part of speech' tag (POS tag) as the second argument to lemmatize().
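For example (a minimal sketch; the get_wordnet_pos helper below is an illustrative addition rather than part of the original answer, and the sentence-level output depends on which tagger nltk.pos_tag loads):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Supplying the verb POS tag fixes the two cases above
print(lemmatizer.lemmatize('are', wordnet.VERB))      #> be
print(lemmatizer.lemmatize('hanging', wordnet.VERB))  #> hang

# Hypothetical helper: map the Penn Treebank tag from nltk.pos_tag
# onto the WordNet POS constants that lemmatize() expects
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    return {'J': wordnet.ADJ, 'N': wordnet.NOUN,
            'V': wordnet.VERB, 'R': wordnet.ADV}.get(tag, wordnet.NOUN)

word_list = nltk.word_tokenize("The striped bats are hanging on their feet for best")
print(' '.join(lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in word_list))
#> e.g. 'The striped bat be hang on their foot for best' (tagger-dependent)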

Sometimes the same word can have multiple lemmas depending on its meaning/context:

print(lemmatizer.lemmatize("stripes", 'v'))  
#> strip

print(lemmatizer.lemmatize("stripes", 'n'))  
#> stripe
For the example above (the word list in the question), specify the corresponding POS tag:

from nltk.corpus import wordnet  # provides the POS constants (ADJ_SAT, etc.)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ['Funnier', 'Funniest', 'mightiest', 'tighter', 'biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token, wordnet.ADJ_SAT))
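Note that WordNet lookups are case-sensitive, so the capitalized 'Funnier' and 'Funniest' may still come back unchanged while their lowercase forms are resolved (the comment thread below shows lowercase 'funniest' --> 'funny'). A small follow-up sketch, assuming lowercasing is acceptable for your input:

for token in words:
    # Lowercase first: 'Funniest' misses the lookup that 'funniest' hits;
    # 'biggify' is not in WordNet at all, so it comes back unchanged
    print(token + ' --> ' + lemmatizer.lemmatize(token.lower(), wordnet.ADJ_SAT))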

Comments:

Can you post your code as well? Thanks.

It does not give the lemmas for the words in the list specified in the question.

In NLTK, the POS tag in WordNetLemmatizer defaults to noun, while a word like "mightiest" is an adjective, so you need to specify that when getting the lemma: >>> lemmatizer.lemmatize("mightiest", wordnet.ADJ_SAT) outputs "mighty". spaCy, by contrast, internally passes each word's POS tag to the lemma function.

Yes, I tried that. But "funnier" and "funniest" do not give "funny" as the lemma.

In [21]: lemmatizer.lemmatize("funniest", wordnet.ADJ_SAT) Out[21]: 'funny' -- that is what you get when you pass the POS tag to the WordNet lemmatizer in NLTK. spaCy, internally, uses a pretrained model to tag the input; its training data may not cover "funniest" and "funnier", so it tags them as nouns. You can inspect the tags spaCy assigns with the sketch below.
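To see which tags spaCy actually assigns (a minimal sketch, assuming spaCy 2.x where spacy.load('en') works; on spaCy 3.x you would load 'en_core_web_sm' instead), run the words through the full pipeline rather than constructing the Doc by hand:

import spacy

nlp = spacy.load('en')  # spaCy 2.x shortcut; 'en_core_web_sm' on newer versions
doc = nlp('Funnier Funniest mightiest tighter biggify')
for token in doc:
    # The lemmatizer uses the POS tag the tagger assigned to each token
    print(token.text, token.pos_, token.lemma_)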