Python: removing stopwords from each tokenized row of a DataFrame
I am trying to remove stopwords from each row of a DataFrame and put the result into a new column S. I tried the code below, but it doesn't seem to work:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
df['S'] = df.apply(lambda row: (word for word in row['remarks_tokenized'] if word.lower() not in stopwords), axis=1)
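A likely reason the code above "doesn't work" is that the lambda body is a generator expression in parentheses, so every cell of S ends up holding a generator object instead of a list of words. Below is a minimal sketch of the fix. It substitutes a small hard-coded stopword set for `stopwords.words('english')` so it runs without downloading NLTK data, and assumes `remarks_tokenized` already holds lists of tokens:

```python
import pandas as pd

# Stand-in for stopwords.words('english'); a set also makes
# membership tests O(1) instead of scanning a list each time.
stop_words = {"the", "a", "of", "in", "is"}

df = pd.DataFrame({
    "remarks_tokenized": [["The", "gulf", "of", "Mexico"],
                          ["a", "mature", "reservoir"]]
})

# Square brackets build a list; the original parentheses built a generator.
# Series.apply is enough here, since only one column is involved.
df["S"] = df["remarks_tokenized"].apply(
    lambda tokens: [w for w in tokens if w.lower() not in stop_words]
)

print(df["S"].tolist())  # [['gulf', 'Mexico'], ['mature', 'reservoir']]
```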
I tried this approach on another corpus and it worked well:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def remove_stopwords(sentence):
    word_tokens = word_tokenize(sentence)
    clean_tokens = [w for w in word_tokens if w not in stop_words]
    return clean_tokens

df['S'] = df['remarks'].apply(remove_stopwords)
Output:
0 [microsoft, word, arma2011paper353, prediction...
1 [2504, 0478, matava, qxd, gulf, mexico, mature...
2 [lithospheric, structure, texas, gulf, mexico,...
4 [int, see, discussions, stats, author, profile...
5 [bltn9556, authors, thomas, r, taylor, shell, ...
7 [high, resolution, reservoir, characterization...
8 [untitled, journal, sedimentary, research, v, ...
9 [doi, j, epsl, www, elsevier, com, locate, eps...
10 [authors, dale, e, bird, department, geoscienc...
11 [spe, ms, spe, ms, taking, co2, enhanced, oil,...
I also like using texthero to preprocess corpora. If you haven't tried it before, I highly recommend it. Thanks for the edit. Got your message, thank you! It worked!