Python: how to remove stopwords from a csv file

I'm currently working on a project that analyzes Twitter data. I'm in the preprocessing stage and am struggling to get my application to remove stopwords from the dataset.

import pandas as pd
import json
import re
import nltk
from tkinter import filedialog  # needed for filedialog.askopenfilename below
from nltk.corpus import stopwords
nltk.download('stopwords')

self.file_name = filedialog.askopenfilename(initialdir='/Desktop',
                                                        title='Select file',
                                                        filetypes=(('csv file', '*.csv'),
                                                                   ('csv file', '*.csv')))

for tw in df["txt"]:

            column_list = ["txt"]
            clean_tw = []
            df = pd.read_csv(self.file_name, usecols=column_list)
            stop_words = set(stopwords.words('english'))

            tw = (re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split())
            if tw not in stop_words:
            filtered_tw = [w for w in tw if not w in stopwords.words('english')]
            clean_tw.append(filtered_tw)
                        
The error I'm currently getting is:

Exception in Tkinter callback
Traceback (most recent call last):
  File "...", line 1884, in __call__
    return self.func(*args)
  File "...", line 146, in clean_csv
    if tweet not in stop_words:
TypeError: unhashable type: 'list'


You're trying to check whether a list (the result of your regex) is in a set... that operation isn't possible. You need to loop through the list (or do some kind of set operation), e.g.

set(tw).difference(stop_words)

To make this clear:

>>> tw = re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", initial.lower()).split()
>>> tw
['this', 'is', 'an', 'example']
>>> set(tw).difference(stop_words)
{'example'}
Then you just append the difference to clean_tw :) Something like:

clean_tw = []
df = pd.read_csv(self.file_name, usecols=col_list)
stop_words = set(stopwords.words('english'))
tw = re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split()
clean_tw.append(set(tw).difference(stop_words))

Finally, you can define stop_words outside the loop; since it is always the same set, you gain a little performance :)
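
Putting both points together, a minimal sketch of the whole loop (keeping the txt column and self.file_name from the question; a list comprehension is used instead of set.difference() so that word order and duplicate words are preserved):

import re

import pandas as pd
from nltk.corpus import stopwords

# Define these once, outside the loop: neither changes between iterations.
stop_words = set(stopwords.words('english'))
df = pd.read_csv(self.file_name, usecols=["txt"])

clean_tw = []
for tw in df["txt"]:
    # Strip URLs/RT markers and non-alphanumeric characters, lowercase, split into tokens.
    tokens = re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split()
    # A list comprehension keeps word order and duplicates, unlike set.difference().
    clean_tw.append([w for w in tokens if w not in stop_words])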

Judging by the error message, it's likely that tweet is a list while stop_words is a set or a dict:

>>> tweet = ['a','b']
>>> stop_words = set('abcdefg')
>>> tweet not in stop_words
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
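
The membership test has to happen per element rather than on the list as a whole. Reusing the same toy data, a list comprehension works (here both tokens are stopwords, so nothing survives):

>>> tweet = ['a', 'b']
>>> stop_words = set('abcdefg')
>>> [w for w in tweet if w not in stop_words]
[]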


FYI, you shouldn't be removing stopwords with regex when there are such good packages for the job.

I suggest using nltk to tokenize and detokenize.

For each row in your csv:

import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')  # word_tokenize needs the punkt tokenizer models

# get your stopwords from nltk
stop_words = set(stopwords.words('english'))

# loop through your rows
for sent in sents:

    # tokenize
    tokenized_sent = nltk.word_tokenize(sent)

    # remove stops
    tokenized_sent_no_stops = [
        tok for tok in tokenized_sent 
        if tok not in stop_words
    ]

    # untokenize 
    untokenized_sent = TreebankWordDetokenizer().detokenize(
        tokenized_sent_no_stops
    )
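
A sketch of wiring this into the DataFrame from the question (the txt column name comes from the question; "tweets.csv" and the remove_stops helper are placeholders for illustration):

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordDetokenizer

nltk.download('stopwords')
nltk.download('punkt')  # word_tokenize needs the punkt models

stop_words = set(stopwords.words('english'))
detok = TreebankWordDetokenizer()

def remove_stops(sent):
    # tokenize -> filter out stopwords -> rebuild the sentence
    tokens = nltk.word_tokenize(sent)
    return detok.detokenize([tok for tok in tokens if tok not in stop_words])

df = pd.read_csv("tweets.csv", usecols=["txt"])  # placeholder file name
df["txt_clean"] = df["txt"].apply(remove_stops)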

Your function has a variable "tweet", but I think it should be "tw", as in the for loop.

@Matt I've changed it and the same error pops up.

Post the full error.

Can you add a sample of the text, and say which library you use to get the stopwords? At first glance I see: in line 3 there is a space between tw and .lower(); in line 5, stopwords is missing the underscore from the previous line, so are they two different sets? I ran a dummy version with stop_words = ["an"] and it works on my machine... Can you post the full stack trace?

@DeepakDixit I've added the full error output.

if stop_words.isdisjoint(tweet):
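
For reference, set.isdisjoint() only tells you whether the tweet shares no words with stop_words; it doesn't remove anything. A quick demo with toy data:

>>> stop_words = {"is", "an", "the"}
>>> tweet = ["this", "is", "an", "example"]
>>> stop_words.isdisjoint(tweet)
False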