Python: removing stop words from a DataFrame
I have the script below, and in the last line I am trying to remove stop words from the strings in the column named "response". The problem is that, instead of "a bit annoyed" becoming "bit annoyed", it actually drops individual letters as well: every "a" disappears, because "a" is a stop word. Can anyone advise?
import pandas as pd
from textblob import TextBlob
import numpy as np
import os
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
path = 'Desktop/fanbase2.csv'
df = pd.read_csv(path, delimiter=',', header='infer', encoding = "ISO-8859-1")
#remove punctuation
df['response'] = df.response.str.replace(r"[^\w\s]", "", regex=True)
#make it all lower case
df['response'] = df.response.apply(lambda x: x.lower())
#Handle strange character in source
df['response'] = df.response.str.replace("‰Ûª", "''")
df['response'] = df['response'].apply(lambda x: [item for item in x if item not in stop])
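In the list comprehension (the last line), you check each item against the stop words and keep it if it is not a stop word. But you are passing a string to it, and iterating over a string yields single characters, not words. You need to split the string into words for the list comprehension to work.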
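To see why letters go missing, it may help to compare iterating over a string with iterating over its split words. This is only a minimal sketch: it uses a small hand-written stop list, a hypothetical stand-in for NLTK's English list, so that it runs without a download.

```python
# Hypothetical mini stop list standing in for stopwords.words('english')
stop = ['a', 'is', 'of', 'this']

text = 'a bit annoyed'

# Iterating over a string yields single characters, so every 'a' is dropped
by_char = [item for item in text if item not in stop]

# Splitting first yields whole words, so only the word 'a' is dropped
by_word = [item for item in text.split() if item not in stop]

print(by_char)
print(by_word)
```

The first list contains individual characters with both occurrences of 'a' removed; the second contains only the words 'bit' and 'annoyed'.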
df = pd.DataFrame({'response':['This is one type of response!', 'Though i like this one more', 'and yet what is that?']})
df['response'] = df.response.str.replace(r"[^\w\s]", "", regex=True).str.lower()
df['response'] = df['response'].apply(lambda x: [item for item in x.split() if item not in stop])
0 [one, type, response]
1 [though, like, one]
2 [yet]
If you want the response back as a string rather than a list, change the last line to
df['response'] = df['response'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop]))
0 one type response
1 though like one
2 yet
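Thanks, this works great! Sorry for the silly question, but how does .split() know to split on spaces without that being explicitly defined?
The default separator for split is whitespace. If the string were separated by some other delimiter you would need to specify it, but that rarely happens in sentences :)
Thanks for your help! :)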
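The default behavior of split() mentioned in the comments can be checked with a quick sketch. The sample strings here are made up for illustration.

```python
sentence = '  one   type\tof response\n'

# No argument: split on runs of any whitespace (spaces, tabs, newlines),
# ignoring leading and trailing whitespace
default_split = sentence.split()

# Explicit separator: split on every occurrence of that exact string,
# so two adjacent commas produce an empty item
comma_split = 'one,type,,response'.split(',')

print(default_split)
print(comma_split)
```

With no argument, consecutive whitespace is treated as a single separator, which is why it works well for ordinary sentences.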