Python WordNetLemmatizer error - every letter gets lemmatized


I am trying to lemmatize my dataset for sentiment analysis - what should I do to get the expected output rather than the current output? The input file is a csv stored as a DataFrame object:

dataset = pd.read_csv('xyz.csv')
Here is my code:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
list1_ = []
for file_ in dataset:
    result1 = dataset['Content'].apply(lambda x: [lemmatizer.lemmatize(y) for y in x])
    list1_.append(result1)
dataset = pd.concat(list1_, ignore_index=True)
Expected output:

>> lemmatizer.lemmatize('cats')
>> [cat]
Current output:

>> lemmatizer.lemmatize('cats')
>> [c,a,t,s]
TL;DR

result1 = dataset['Content'].apply(lambda x: [lemmatizer.lemmatize(y) for y in x.split()])
The lemmatizer accepts any string as input

If the
dataset['Content']
column contains strings, then iterating over a string iterates over its characters, not over "words" - which is why each single letter is being lemmatized.

So you must first tokenize your sentence string into words, e.g.:

>>> from nltk import word_tokenize
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> x = 'this is a foo bar sentence, that is of type str'
>>> [wnl.lemmatize(word) for word in x.split()]
['this', 'is', 'a', 'foo', 'bar', 'sentence,', 'that', 'is', 'of', 'type', 'str']
>>> [wnl.lemmatize(token) for token in word_tokenize(x)]
['this', 'is', 'a', 'foo', 'bar', 'sentence', ',', 'that', 'is', 'of', 'type', 'str']
Another example:

>>> from nltk import word_tokenize
>>> x = 'the geese ran through the parks'
>>> [wnl.lemmatize(word) for word in x.split()]
['the', 'goose', 'ran', 'through', 'the', 'park']
>>> [wnl.lemmatize(token) for token in word_tokenize(x)]
['the', 'goose', 'ran', 'through', 'the', 'park']
But to get more accurate lemmatization, you should tokenize the sentence and add part-of-speech tags as well: