Python: removing stopwords from each tokenized row of a DataFrame
I am trying to remove stopwords from each row of a DataFrame and put the result into a new column S. I tried the code below, but it doesn't seem to work:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
df['S'] = df.apply(lambda row: (word for word in row['remarks_tokenized'] if word.lower() not in stopwords), axis=1)
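A likely reason the code above "doesn't work" is that the lambda body is a generator expression in parentheses, so every cell of S ends up holding a generator object instead of a list of words. Below is a minimal sketch of the fix. It substitutes a small hard-coded stopword set for `stopwords.words('english')` so it runs without downloading NLTK data, and assumes `remarks_tokenized` already holds lists of tokens:

```python
import pandas as pd

# Stand-in for stopwords.words('english'); a set also makes
# membership tests O(1) instead of scanning a list each time.
stop_words = {"the", "a", "of", "in", "is"}

df = pd.DataFrame({
    "remarks_tokenized": [["The", "gulf", "of", "Mexico"],
                          ["a", "mature", "reservoir"]]
})

# Square brackets build a list; the original parentheses built a generator.
# Series.apply is enough here, since only one column is involved.
df["S"] = df["remarks_tokenized"].apply(
    lambda tokens: [w for w in tokens if w.lower() not in stop_words]
)

print(df["S"].tolist())  # [['gulf', 'Mexico'], ['mature', 'reservoir']]
```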
I tried this approach on another corpus and it worked well:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def remove_stopwords(sentence):
    word_tokens = word_tokenize(sentence)
    clean_tokens = [w for w in word_tokens if w not in stop_words]
    return clean_tokens

df['S'] = df['remarks'].apply(remove_stopwords)
Output:
0 [microsoft, word, arma2011paper353, prediction...
1 [2504, 0478, matava, qxd, gulf, mexico, mature...
2 [lithospheric, structure, texas, gulf, mexico,...
4 [int, see, discussions, stats, author, profile...
5 [bltn9556, authors, thomas, r, taylor, shell, ...
7 [high, resolution, reservoir, characterization...
8 [untitled, journal, sedimentary, research, v, ...
9 [doi, j, epsl, www, elsevier, com, locate, eps...
10 [authors, dale, e, bird, department, geoscienc...
11 [spe, ms, spe, ms, taking, co2, enhanced, oil,...
I also like using texthero to preprocess corpora. If you haven't tried it before, I highly recommend it. Thanks for the edit. Got your message, thank you! It worked!