KeyError when removing stop words from a tweets column in Python


python nlp


I have a dataframe of tweets and I'm trying to clean my "tweet" column by removing stop words and lemmatizing.

Here is my code:

stop_words = set(stopwords.words('english'))
lemmatizer= WordNetLemmatizer()

sentence = df['tweet'].apply(nltk.sent_tokenize)

 0 [ 'country year happy']
 1 [ 'wish happy year']
 2 [ 'live year together']

for i in range(len(sentence)): 
    words=nltk.word_tokenize(str(sentence[i]))
    words=[lemmatizer.lemmatize(word) for word in words if word not in 
          set(stopwords.words('english'))]
    sentence[i]=' '.join(words)

The code above gives me the following error (I've included the full traceback):

 KeyError  Traceback (most recent call last)
<ipython-input-384-f4bb836363e1> in <module>
  1 for i in range(len(sentence)):
----> 2     words=nltk.word_tokenize(str(sentence[i]))
  3     words=[lemmatizer.lemmatize(word) for word in words if word not in 
      set(stopwords.words('english'))]
  4     sentence[i]=' '.join(words)

~\anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
   869         key = com.apply_if_callable(key, self)
   870         try:
   --> 871     result = self.index.get_value(self, key)
   872 
   873             if not is_scalar(result):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self, 
  series, key)
  4403         k = self._convert_scalar_indexer(k, kind="getitem")
  4404         try:
  -> 4405             return self._engine.get_value(s, k, 
  tz=getattr(series.dtype, "tz", None))
  4406         except KeyError as e1:
  4407             if len(self) > 0 and (self.holds_integer() or 
  self.is_boolean()):

  pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()

  pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()

  pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

  pandas\_libs\hashtable_class_helper.pxi in 
  pandas._libs.hashtable.Int64HashTable.get_item()

  pandas\_libs\hashtable_class_helper.pxi in 
  pandas._libs.hashtable.Int64HashTable.get_item()

  KeyError: 34

How can I fix this error?

Also, how can I get the result back into the dataframe? Should I add another column for the result?

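Use sentence.iloc[i] instead of sentence[i].

Explanation

The KeyError means that the key (34 in your traceback) is not present in df.index. sentence is a Series; when you access sentence[i], pandas first tries label-based indexing (as with df.loc) and only falls back to position-based indexing (as with df.iloc) if the index is non-numeric. So this code may happen to work when your index is non-numeric, but in every other case it does not do what you expect. Using position-based indexing explicitly (.iloc) fixes the error.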

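For a self-contained example, compare the two snippets at the end of this page: the first one, which uses sentence[i], does not work; the second one, which uses sentence.iloc[i], works.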

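Tip: rather than manually looping over the rows of a dataframe, it is usually safer and more efficient to write your logic as a function and use df.apply.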
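The tip above could look roughly like the sketch below, which also writes the result into a new column. The names clean_tweet, STOP_WORDS and the clean_tweet column are hypothetical, and so are the sample tweets. To keep the sketch runnable without downloading NLTK data, it assumes a tiny hand-written stop word set, whitespace splitting, and an identity lemmatizer; in the question's setting you would presumably pass set(stopwords.words('english')) and lemmatizer.lemmatize instead.

```python
import pandas as pd

# Assumption: a tiny hand-written stop word set so the sketch runs without
# NLTK data; the question uses set(stopwords.words('english')).
STOP_WORDS = {"a", "the", "is", "in", "we", "this"}


def clean_tweet(text, stop_words=STOP_WORDS, lemmatize=lambda w: w):
    """Lower-case, split on whitespace, drop stop words, lemmatize."""
    words = text.lower().split()
    words = [lemmatize(w) for w in words if w not in stop_words]
    # Filter once more in case a lemma is itself a stop word.
    words = [w for w in words if w not in stop_words]
    return " ".join(words)


df = pd.DataFrame(
    {"tweet": ["This is a happy year", "We live in the country"]},
    index=[10, 34],  # labels that do not start at 0
)

# apply() calls clean_tweet once per row, so no manual indexing is needed
# and the index labels no longer matter.
df["clean_tweet"] = df["tweet"].apply(clean_tweet)
print(df)
```

Because apply never indexes the Series by hand, the label-versus-position problem behind the KeyError does not come up at all.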

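The error is resolved, but the loop I wrote doesn't seem to do what I want. There are still stop words in the sentence column.

I think the problem is in this part: words=[lemmatizer.lemmatize(word) for word in words if word not in stop_words]. It's possible that a word is not a stop word before lemmatization but becomes one afterwards. Try adding another words=[word for word in words if word not in stop_words] step.

Oh, and there is another problem: if you check nltk.word_tokenize(str(sentence.iloc[i])), it probably isn't doing what you want either, because str() turns the list ['hello world'] into a string that still contains the brackets and quote characters, so they end up among your tokens. What is in the lists in your tweet column? If each list only ever holds one sentence you can do nltk.word_tokenize(sentence.iloc[0][0]); otherwise you need to iterate over the list properly instead of converting it to a string.

Thanks, I added the correction for "words" (your first comment) and fixed the word_tokenize issue. It still doesn't solve the problem: the column still contains stop words and un-lemmatized words. Do you have any other suggestions?

Technically this is a separate question; can you (1) add your new code to your question and (2) show some input/output examples that demonstrate the problem?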
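The comment above about str(sentence.iloc[i]) can be illustrated without NLTK. In this minimal sketch str.split() stands in for nltk.word_tokenize (an assumption made so the snippet runs without NLTK data), and the variable names and sample sentences are made up for illustration.

```python
# One cell of the column after sent_tokenize: a list of sentence strings.
cell = ["hello world", "foo bar"]

# str() on a list keeps the list syntax (brackets, quotes, commas) as text.
as_text = str(cell)
print(as_text)

# Assumption: str.split() stands in for nltk.word_tokenize here.
# Tokenizing the stringified list drags the list punctuation into the tokens.
tokens_from_str = as_text.split()
print(tokens_from_str)

# Iterating over the list and tokenizing each sentence avoids that.
tokens_from_list = [word for sent in cell for word in sent.split()]
print(tokens_from_list)
```

With the real tokenizer the stray characters would be split off differently, but they would still show up as tokens, which is why neither the stop word filter nor the lemmatizer matches the words as expected.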

# Doesn't work: sentence[i] looks up the label i, but the index labels are 10 and 20
import pandas as pd
df = pd.DataFrame({'index': [10,20], 'tweets': [['hello world'],['foo bar']]}).set_index('index')
sentence = df['tweets']

for i in range(len(sentence)):
    print(sentence[i])  # raises KeyError: 0

# Works: .iloc always indexes by position, whatever the index labels are
import pandas as pd
df = pd.DataFrame({'index': [10,20], 'tweets': [['hello world'],['foo bar']]}).set_index('index')
sentence = df['tweets']

for i in range(len(sentence)):
    print(sentence.iloc[i])