Python NLTK WordNet Lemmatizer: Shouldn't it lemmatize all inflections of a word?


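I am using the NLTK WordNet Lemmatizer for a part-of-speech tagging project: I first replace each word in the training corpus with its stem (modifying the corpus in place) and then train only on the new corpus. However, the lemmatizer does not behave as I expected.

For example, the word 'loves' is lemmatized to 'love', which is correct, but the word 'loving' remains 'loving' even after lemmatization. Here 'loving' is used as in the sentence "I'm loving it".

Isn't 'love' the stem of the inflected word 'loving'? Similarly, many other "ing" forms remain unchanged after lemmatization. Is this the correct behavior?

What other lemmatizers are accurate? (They need not be in NLTK.) Are there morphological analyzers or lemmatizers that also take a word's part-of-speech tag into account when deciding the stem? For example, 'killing' should have 'kill' as its stem when it is used as a verb, but 'killing' as its stem when it is used as a noun (as in "the killing was done by xyz").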

The WordNet lemmatizer does take the POS tag into account, but it does not magically determine it:

>>> import nltk
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving')
'loving'
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v')
'love'

Without a POS tag, it assumes that everything you feed it is a noun. So here it thinks you are passing it the noun "loving" (as in "sweet loving").
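To make the "defaults to noun" behavior concrete, here is a toy sketch of a POS-keyed lookup. It is not NLTK's implementation: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the table holds a handful of made-up entries purely for illustration.

```python
# Toy lookup table keyed by (surface form, WordNet POS letter).
# Hypothetical data for illustration only; this is NOT how NLTK stores WordNet.
TOY_LEMMAS = {
    ("loves", "n"): "love",
    ("loves", "v"): "love",
    ("loving", "v"): "love",
    ("loving", "a"): "loving",
    ("killing", "v"): "kill",
    ("killing", "n"): "killing",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); unknown pairs come back unchanged.

    The default pos="n" mirrors the noun default described above.
    """
    return TOY_LEMMAS.get((word, pos), word)
```

With no POS argument, `toy_lemmatize("loving")` looks up the noun reading, finds nothing, and returns the word unchanged. Passing `"v"` is what selects the verb entry.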

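The best way to troubleshoot this is to actually look the word up in WordNet. There is in fact an adjective "loving" in WordNet, and there is even an adverb "lovingly". Because the lemmatizer does not know which part of speech you actually want, it defaults to noun ('n' in WordNet). If you are using the Penn Treebank tag set, here are some convenience functions for converting Penn tags to WordNet tags: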

from nltk.corpus import wordnet as wn

def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']


def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']


def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']


def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']


def penn_to_wn(tag):
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return None

Hope this helps.
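One way such a tag converter might be wired into lemmatization is sketched below. This is an assumption about usage, not part of the answer: `lemmatize_tagged` is a hypothetical helper, `lemmatize` is expected to behave like `WordNetLemmatizer().lemmatize`, and `tag_to_pos` is any converter such as `penn_to_wn` that returns None (or an empty string) for tags without a WordNet equivalent.

```python
def lemmatize_tagged(tagged_tokens, lemmatize, tag_to_pos):
    """Lemmatize (word, penn_tag) pairs using a POS-aware lemmatize callable.

    Hypothetical helper: when tag_to_pos gives no WordNet POS for a tag,
    the lemmatizer is called without a POS so that its own default applies.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = tag_to_pos(tag)
        if pos:
            lemmas.append(lemmatize(word, pos))
        else:
            # Determiners, pronouns, etc. have no WordNet POS.
            lemmas.append(lemmatize(word))
    return lemmas


# Possible usage with NLTK (requires the tokenizer, tagger and WordNet data):
#   from nltk import pos_tag, word_tokenize
#   from nltk.stem import WordNetLemmatizer
#   tagged = pos_tag(word_tokenize("I am loving it"))
#   lemmatize_tagged(tagged, WordNetLemmatizer().lemmatize, penn_to_wn)
```

Passing the lemmatizer and the converter in as arguments keeps the helper independent of which mapping function is chosen.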


This is cleaner and more efficient than enumerating the tags:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to a WordNet POS constant by its first letter.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # No WordNet equivalent; the caller decides what to fall back to.
        return ''

def penn_to_wn(tag):
    return get_wordnet_pos(tag)
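The prefix test and the enumerated lists are not strictly equivalent. The sketch below compares the two on plain strings, using the letters 'a', 'n', 'r', 'v' (the values of the `wordnet.ADJ`, `NOUN`, `ADV` and `VERB` constants). The names `by_enumeration`, `by_prefix` and `disagreements` are hypothetical, and both variants return None for unmapped tags here, which simplifies the '' versus None difference between the answers.

```python
ENUMERATED = {
    "a": ["JJ", "JJR", "JJS"],
    "n": ["NN", "NNS", "NNP", "NNPS"],
    "r": ["RB", "RBR", "RBS"],
    "v": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
}
PREFIXES = {"J": "a", "N": "n", "R": "r", "V": "v"}


def by_enumeration(tag):
    """Map a Penn tag to a WordNet POS letter via explicit tag lists."""
    for pos, tags in ENUMERATED.items():
        if tag in tags:
            return pos
    return None


def by_prefix(tag):
    """Map a Penn tag to a WordNet POS letter via its first character."""
    return PREFIXES.get(tag[:1])


def disagreements(tags):
    """Return the tags on which the two mappings give different answers."""
    return [t for t in tags if by_enumeration(t) != by_prefix(t)]
```

For the Penn Treebank tag set, the particle tag `RP` also starts with 'R', so the prefix version treats it as an adverb while the enumerated version leaves it unmapped. Whether that matters depends on the corpus.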


As an extension to the accepted answer by @Fred Foo above:

from nltk import WordNetLemmatizer, pos_tag, word_tokenize
from nltk.corpus import wordnet  # needed for the adverb branch below

lem = WordNetLemmatizer()
word = input("Enter word:\t")

# Get the single character pos constant from pos_tag like this:
pos_label = (pos_tag(word_tokenize(word))[0][1][0]).lower()

# pos_refs = {'n': ['NN', 'NNS', 'NNP', 'NNPS'],
#            'v': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
#            'r': ['RB', 'RBR', 'RBS'],
#            'a': ['JJ', 'JJR', 'JJS']}

if pos_label == 'j': pos_label = 'a'    # 'j' <--> 'a' reassignment

if pos_label in ['r']:  # For adverbs it's a bit different
    print(wordnet.synset(word+'.r.1').lemmas()[0].pertainyms()[0].name())
elif pos_label in ['a', 's', 'v']: # For adjectives and verbs
    print(lem.lemmatize(word, pos=pos_label))
else:   # For nouns and everything else as it is the default kwarg
    print(lem.lemmatize(word))


wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['n', 'r', 'v'] else 'x'
One line is slightly better than 28 ;)

However, it should be
wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['n', 'r', 'v'] else 'n'
because the function's default is noun, not 'x' or None.

Thanks for your answer! Can you tell me what the tags are? n = noun, v = verb, ...?
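On the last question: the single-letter codes used by WordNet are n (noun), v (verb), a (adjective), r (adverb) and s (adjective satellite). Note also that the one-liner in the comments, as written, appears to send 'J' tags to the fallback, because 'j' is not in the membership list that guards the inner expression. A spelled-out sketch of what the comment seems to intend follows; `WN_POS_NAMES` and this `wnpos` are illustrative names, not NLTK API.

```python
# WordNet's single-letter part-of-speech codes and their meanings.
WN_POS_NAMES = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "r": "adverb",
    "s": "adjective satellite",
}


def wnpos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    first = tag[:1].lower()
    if first == "j":
        # Penn adjective tags start with 'J'; WordNet uses 'a'.
        return "a"
    # Anything that is not a noun, adverb or verb falls back to noun,
    # matching the lemmatizer's own default.
    return first if first in ("n", "r", "v") else "n"
```

Handling 'j' before the membership test is the only behavioral difference from the commented one-liner.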