Python: unable to remove English stop words from a DataFrame (tags: python, pandas, nltk, sentiment-analysis, stop-words)


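I have been trying to run sentiment analysis on a movie review dataset, and I am stuck: I cannot remove the English stop words from the data. What am I doing wrong?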

from nltk.corpus import stopwords
stop = stopwords.words("English")
list_ = []
for file_ in dataset:
    dataset['Content'] = dataset['Content'].apply(lambda x: [item for item in x.split(',') if item not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)

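Based on your comments, I don't think you need to loop over dataset. (Presumably dataset contains only a single column named Content.)
You can simply do the following: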
 dataset["Content"] = dataset["Content"].str.split(",").apply(lambda x: [item for item in x if item not in stop])

You are looping over dataset, but each iteration appends the whole frame and never uses file_. Try:
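from nltk.corpus import stopwords
# The NLTK stop word file is named "english" (lowercase); "English" can fail on case-sensitive file systems.
stop = stopwords.words("english")
# Keep only the comma-separated items that are not stop words, then join them back into one string.
dataset['Cleaned'] = dataset['Content'].apply(lambda x: ','.join([item for item in x.split(',') if item not in stop]))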

This gives a Series of comma-separated strings. If you want to flatten it into a single list of words:
flat_list = [item for cleaned in dataset['Cleaned'] for item in cleaned.split(',')]
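The two result shapes discussed in this answer (one list of words per row, or a single flat list) can be sketched side by side. This is only a minimal illustration: the small `stop` set is a stand-in for the NLTK list, and the two-row `dataset` is made up for the example.

```python
from itertools import chain

import pandas as pd

# Illustrative stand-in for stopwords.words("english"); the real list is much longer.
stop = {"i", "am", "the"}

# Hypothetical sample data with no spaces after the commas.
dataset = pd.DataFrame({"Content": ["i,am,the,computer,machine", "i,play,game"]})

# Option 1: one list of words per row.
per_row = dataset["Content"].str.split(",").apply(
    lambda words: [w for w in words if w not in stop]
)

# Option 2: a single flat list of words across all rows.
flat_list = list(chain.from_iterable(per_row))
```

Option 1 keeps the row structure, which is usually what you want if each review is scored separately; option 2 is convenient for corpus-wide word counts.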

Alternatively, try the earthy package:
>>> from earthy.wordlist import punctuations, stopwords
>>> from earthy.preprocessing import remove_stopwords
>>> result = dataset['Content'].apply(remove_stopwords)

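I think the code should work with the information given so far. My assumption is that the data has extra spaces around the words when it is split on commas. Here is a test run (hope it helps!):
Input with stop words: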
                          Content
0   i, am, the, computer, machine
1                   i, play, game

Output:
                Content
 0  [computer, machine]
 1         [play, game]
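The extra-space assumption above can be checked without NLTK. The sketch below uses a tiny hard-coded `STOP` set as a stand-in for the NLTK list, and `remove_stop_words` is a hypothetical helper name, not part of any library. It shows that splitting "i, am, the" on commas yields " am" and " the" (with a leading space), which never match the stop list unless the items are stripped first.

```python
# Illustrative stand-in for stopwords.words("english"); the real list is much longer.
STOP = {"i", "am", "the"}


def remove_stop_words(text, stop=STOP):
    """Split on commas, trim surrounding whitespace, and drop stop words."""
    words = (item.strip() for item in text.split(","))
    return [w for w in words if w and w.lower() not in stop]


sample = "i, am, the, computer, machine"

# Naive filter: items keep their leading space, so " am" and " the" slip through.
naive = [item for item in sample.split(",") if item not in STOP]

# Stripping each item first makes the comparison work as intended.
cleaned = remove_stop_words(sample)
```

With a DataFrame, the same helper could be passed to `dataset["Content"].apply(remove_stop_words)`.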

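What error are you getting?
@open-source There is no error; nothing happens when I run this code.
Is your content in a format like "i, am, the, computer, machine"? Can you post a row from which you want the stop words removed?
Maybe this is what you need =)
I get a TypeError: string indices must be integers.
I also get a TypeError: string indices must be integers. dataset is of type DataFrame, by the way.
Ah, OK, that wasn't clear. What form do you want the result in: a single list of words, or one list per row? I've updated my answer to give you both options. I'm assuming the elements of dataset['Content'] contain a comma-separated list of words; if not, please give an example of dataset and clarify.
You got those errors in both examples because iterating over a DataFrame actually iterates over its columns, not its rows. To iterate over rows you could use iterrows, but note that it returns tuples; in this case you can simply use apply as shown. If you really want to do something like your code, you could also iterate over the index of the dataset.
Yes, the dataset is a DataFrame of comma-separated movie reviews. Punctuation has already been removed from every row. Expected output: there are 3 rows of about 50 words each, and these rows contain 2, 5 and 7 stop words. The output should be a DataFrame of comma-separated rows with 48, 45 and 43 words ;P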
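The point about iteration in the comments can be seen in a small sketch. The two-row `dataset` here is made up for illustration; the exact wording of the TypeError message varies between Python versions, so the sketch only captures it rather than printing an expected string.

```python
import pandas as pd

# Hypothetical sample data for illustration.
dataset = pd.DataFrame({"Content": ["i, am, the, computer, machine", "i, play, game"]})

# Iterating over a DataFrame yields its column labels, not its rows.
labels = [label for label in dataset]

# A label is a plain str, so indexing it with another str raises the
# "string indices must be integers" TypeError mentioned in the comments.
try:
    labels[0]["Content"]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# iterrows() yields (index, row) tuples, so each item has to be unpacked.
row_pairs = [(idx, row["Content"]) for idx, row in dataset.iterrows()]
```

For a per-row transformation such as stop word removal, `apply` on the column avoids the explicit loop altogether.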