Python Keras: text preprocessing (stop word removal, etc.)


I am using Keras for a multi-label classification task (toxic comment text classification on Kaggle).

I am using the Tokenizer class for preprocessing, as follows:

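# Import paths are those of Keras 2.x; adjust them for your installed version.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# num_words caps the vocabulary at the most frequent words seen during fitting.
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(train_sentences)
train_sentences_tokenized = tokenizer.texts_to_sequences(train_sentences)
max_len = 250
X_train = pad_sequences(train_sentences_tokenized, maxlen=max_len)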
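
To make the last line concrete, here is a minimal pure-Python sketch of what pad_sequences does with its default settings, as far as I understand them: shorter sequences are padded on the left with 0, and longer ones lose items from the front. pad_pre is a hypothetical helper written only for illustration and is not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        # Truncate from the front, so the end of the sequence survives.
        tail = list(seq)[-maxlen:]
        # Fill the remaining positions on the left with the padding value.
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded


if __name__ == "__main__":
    print(pad_pre([[1, 2], [1, 2, 3, 4, 5]], maxlen=4))
```

With max_len = 250, this means a comment longer than 250 tokens keeps only its final 250.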

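This is a good start, but I have not removed stop words, applied stemming, and so on. To remove stop words, this is what I do before the step above: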

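from nltk.corpus import stopwords  # requires the NLTK "stopwords" corpus to be downloaded

def filter_stop_words(train_sentences, stop_words):
    # Rewrites each sentence in place, dropping every word found in stop_words.
    for i, sentence in enumerate(train_sentences):
        # Lowercase before the lookup, since the NLTK English stop words are lowercase.
        new_sent = [word for word in sentence.split() if word.lower() not in stop_words]
        train_sentences[i] = ' '.join(new_sent)
    return train_sentences

stop_words = set(stopwords.words("english"))
train_sentences = filter_stop_words(train_sentences, stop_words)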
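
One limitation of splitting on whitespace is that a token such as "the," keeps its comma and so never matches the stop list. A possible variant, sketched below, pulls out words with a regular expression and returns a new list instead of modifying the input. The names remove_stop_words, WORD_RE and DEMO_STOP_WORDS are hypothetical, and the tiny stop list is made up for the example; in practice a full list such as the NLTK one would be passed in.

```python
import re

# Hypothetical, deliberately tiny stop list used only for this illustration.
DEMO_STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "and", "of", "to"})

# Words are runs of letters, digits and apostrophes; other characters are dropped.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def remove_stop_words(sentences, stop_words=DEMO_STOP_WORDS):
    """Return new sentences with stop words removed, ignoring case and punctuation."""
    cleaned = []
    for sentence in sentences:
        words = WORD_RE.findall(sentence)
        kept = [w for w in words if w.lower() not in stop_words]
        cleaned.append(" ".join(kept))
    return cleaned


if __name__ == "__main__":
    print(remove_stop_words(["The comment is rude, and THE tone is worse."]))
```

Dropping punctuation here should be harmless for this pipeline, assuming the Tokenizer is left with its default filters, which strip most punctuation anyway.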

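Shouldn't there be an easier way to do this in Keras? I was hoping for stemming support as well, but the documentation does not indicate that there is any.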

Any help with best practices for stop word removal and stemming would be great.

Thanks

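No, Keras is not a natural language processing library. You have to handle any complex processing yourself. At this stage, it probably makes sense to use an actual NLP library with a Python interface.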

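Tokenizer is a small utility class for basic natural language tasks. You can extend it yourself up to a point, but an NLP library will offer much more functionality, including tokenization, part-of-speech tagging, and stemming.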
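
As a rough sketch of how a library's stemmer could be slotted in before the Tokenizer, the helper below applies any word-to-stem function to every word of every sentence. stem_sentences and toy_stem are hypothetical names. toy_stem is a crude suffix stripper that only stands in for a real stemmer so the example runs without extra dependencies; assuming NLTK is installed, PorterStemmer().stem from nltk.stem could be passed instead.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strip one common English suffix."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_sentences(sentences, stem=toy_stem):
    """Return new sentences in which every whitespace-separated word is stemmed."""
    return [" ".join(stem(word) for word in sentence.split()) for sentence in sentences]


if __name__ == "__main__":
    # With NLTK available, one could instead write (assumption, not run here):
    #   from nltk.stem import PorterStemmer
    #   stemmed = stem_sentences(train_sentences, PorterStemmer().stem)
    print(stem_sentences(["insulting comments posted"]))
```

The stemmed sentences can then be handed to tokenizer.fit_on_texts and texts_to_sequences in the same way as the raw ones.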