Python: how to remove stopwords from a csv file
I'm currently working on a project analysing Twitter data. I'm at the preprocessing stage, and I'm struggling to get my application to remove stopwords from the dataset.
import pandas as pd
import json
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

self.file_name = filedialog.askopenfilename(initialdir='/Desktop',
                                            title='Select file',
                                            filetypes=(('csv file', '*.csv'),
                                                       ('csv file', '*.csv')))

column_list = ["txt"]
clean_tw = []
df = pd.read_csv(self.file_name, usecols=column_list)
stop_words = set(stopwords.words('english'))
for tw in df["txt"]:
    tw = (re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split())
    if tw not in stop_words:
        filtered_tw = [w for w in tw if not w in stopwords.words('english')]
        clean_tw.append(filtered_tw)
The error I'm currently getting is:
Exception in Tkinter callback
Traceback (most recent call last):
File "...", line 1884, in __call__
return self.func(*args)
File "...", line 146, in clean_csv
if tweet not in stop_words:
TypeError: unhashable type: 'list'
You are trying to check whether a list (the result of your regex) is in a set… that operation can't be done. You need to loop through the list (or perform some kind of set operation, e.g.
set(tw).difference(stop_words)
For clarity:
>>> tw = (re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", initial.lower()).split())
>>> tw
['this', 'is', 'an', 'example']
>>> set(tw).difference(stop_words)
{'example'}
Then just add the difference to clean_tw :) Something like:
clean_tw = []
df = pd.read_csv(self.file_name, usecols=col_list)
stop_words = set(stopwords.words('english'))
for tw in df["txt"]:
    tw = (re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+(RT))", "", tw.lower()).split())
    clean_tw.append(set(tw).difference(stop_words))
Finally, you can define stop_words outside the loop, since it will always be the same set, so you gain a little in performance :)

Based on the error message, it is very likely that tweet is a list, while stop_words is a set or a dict:
>>> tweet = ['a','b']
>>> stop_words = set('abcdefg')
>>> tweet not in stop_words
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
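Since the membership test has to be applied per word rather than to the whole list, a list comprehension over the tokens avoids this TypeError. A minimal sketch (the stopword set and tweet below are made-up toy data, not the OP's):

```python
# toy stand-ins for the real data, just to show the shape of the fix
stop_words = {"this", "is", "an"}
tweet = ["this", "is", "an", "example"]

# test each (hashable) string instead of the whole (unhashable) list
filtered = [w for w in tweet if w not in stop_words]
print(filtered)  # ['example']
```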
FYI, you shouldn't remove stopwords with a regex when there are such good packages for the job. I'd suggest using nltk to tokenize and detokenize each row in your csv:
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import stopwords
nltk.download('stopwords')

# get your stopwords from nltk
stop_words = set(stopwords.words('english'))

# loop through your rows
for sent in sents:
    # tokenize
    tokenized_sent = nltk.word_tokenize(sent)
    # remove stops
    tokenized_sent_no_stops = [
        tok for tok in tokenized_sent
        if tok not in stop_words
    ]
    # untokenize
    untokenized_sent = TreebankWordDetokenizer().detokenize(
        tokenized_sent_no_stops
    )
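To tie this back to the question's DataFrame, the same per-row cleaning can be applied to the txt column with pandas' apply. A minimal sketch, with a hand-rolled stopword set standing in for nltk's so it is self-contained (swap in set(stopwords.words('english')) in real code; the column name txt comes from the question):

```python
import pandas as pd

# stand-in for set(stopwords.words('english')), so the sketch needs no download
stop_words = {"this", "is", "an", "the", "a"}

df = pd.DataFrame({"txt": ["this is an example", "the quick fox"]})

def remove_stops(text):
    # split into words, drop stopwords, and rejoin, preserving word order
    return " ".join(w for w in text.lower().split() if w not in stop_words)

df["clean_txt"] = df["txt"].apply(remove_stops)
print(df["clean_txt"].tolist())  # ['example', 'quick fox']
```

Unlike set(tw).difference(stop_words), this keeps the original word order and duplicates, which usually matters for downstream text analysis.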
Your function has a variable 'tweet', but I think it should be 'tw', as in the for loop.

@Matt I've changed it; the same error pops up.

Post the full error. Can you add a sample of the text, and tell us which library you use to get the stopwords? At first glance I see: in line 3 there is a space between tw and .lower(); in line 5 stopwords doesn't have the underscore it had in the previous line? I suppose they are two different collections? Try it dummy-style with stop_words = ["an"] and it runs on my terminal… Can you post the full stacktrace?

@DeepakDixit I've added the full error output.
Alternatively, you can check whether a tweet contains no stopwords at all with:
if stop_words.isdisjoint(tweet):
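isdisjoint returns True only when the two collections share no elements, so it answers "does this tweet contain any stopword?" in a single call without raising the unhashable-type error. A quick illustration with made-up words:

```python
stop_words = {"is", "an", "the"}

# no shared words -> the collections are disjoint
print(stop_words.isdisjoint(["quick", "fox"]))      # True
# "is" appears in both -> not disjoint
print(stop_words.isdisjoint(["this", "is", "it"]))  # False
```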