Python: How to remove stop phrases / stop ngrams (multi-word strings) using pandas/sklearn?


I want to prevent certain phrases from creeping into my model. For example, I want to stop "red roses" from entering my analysis. I understand how to add individual stop words by doing this:

from sklearn.feature_extraction import text

additional_stop_words = ['red', 'roses']
sw = text.ENGLISH_STOP_WORDS.union(additional_stop_words)
However, this also causes other ngrams such as "red tulips" or "blue roses" to go undetected.

I am building a TfidfVectorizer as part of my model, and I realize the processing I need may have to come in after this stage, but I am not sure how to do it.

My end goal is topic modeling on a piece of text. Here is the piece of code I am working on (borrowed almost verbatim):

import numpy as np

from sklearn import decomposition
from sklearn.feature_extraction import text
additional_stop_words = ['red', 'roses']

sw = text.ENGLISH_STOP_WORDS.union(additional_stop_words)
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=5
)

dtm = mod_vectorizer.fit_transform(df[col]).toarray()  # df[col] holds the raw documents
vocab = np.array(mod_vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn
num_topics = 5
num_top_words = 5
m_clf = decomposition.LatentDirichletAllocation(
    n_topics=num_topics,  # renamed to n_components in newer scikit-learn
    random_state=1
)

doctopic = m_clf.fit_transform(dtm)
topic_words = []

for topic in m_clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ','.join(topic_words[t][:5])))
EDIT

Sample DataFrame, df (I tried to include as many edge cases as possible); the sample documents are reproduced in the answers below.


With pandas, you want to use a list comprehension:

.apply(lambda x: [item for item in x if item not in stop])
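
A minimal sketch of how that .apply call fits a DataFrame (the content column name and the stop set here are illustrative assumptions):

import pandas as pd
from sklearn.feature_extraction import text

stop = text.ENGLISH_STOP_WORDS.union(['red', 'roses'])

df = pd.DataFrame({'content': ["I like red roses", "blue tulips are nice"]})

# Tokenize each row, then drop individual stop words with the list comprehension.
df['tokens'] = df['content'].str.lower().str.split()
df['tokens'] = df['tokens'].apply(lambda x: [item for item in x if item not in stop])

Note that this filters single tokens, so on its own it cannot tell "red roses" apart from "red tulips"; that is exactly the limitation the question describes.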

You can replace the tokenization by passing the keyword argument tokenizer to TfidfVectorizer.

The original tokenizer in scikit-learn's source looks like this:

def build_tokenizer(self):
    """Return a function that splits a string into a sequence of tokens"""
    if self.tokenizer is not None:
        return self.tokenizer
    token_pattern = re.compile(self.token_pattern)
    return lambda doc: token_pattern.findall(doc)
So let's make a function that removes all the word combinations you don't want. First, define the unwanted expressions:

unwanted_expressions = [('red','roses'), ('foo', 'bar')]
Then the function needs to look something like this:

import re

token_pattern_str = r"(?u)\b\w\w+\b"

def my_tokenizer(doc):
    """Split a string into a sequence of tokens
    and remove some word combinations along the way."""
    token_pattern = re.compile(token_pattern_str)
    tokens = token_pattern.findall(doc)
    for i in range(len(tokens)):
        for expr in unwanted_expressions:
            # Guard against matching past the end of the token list;
            # without this check tokens[i + j] can raise IndexError.
            if i + len(expr) > len(tokens):
                continue
            found = True
            for j, word in enumerate(expr):
                found = found and (tokens[i + j] == word)
            if found:
                tokens[i:i + len(expr)] = len(expr) * [None]
    tokens = [x for x in tokens if x is not None]
    return tokens
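
To use it, pass the function object itself to the vectorizer rather than calling it (the vectorizer invokes it on each document); a minimal sketch:

from sklearn.feature_extraction import text

# Pass the function itself, not my_tokenizer(...); TfidfVectorizer
# calls it per document during fit/transform.
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    tokenizer=my_tokenizer,
    norm='l2',
    min_df=5
)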
I haven't tried this exact function myself, but I have swapped out the tokenizer before. It works well.


Good luck :)

Before passing df to mod_vectorizer, you should use something like the following:

df=["I like red roses as much as I like blue tulips.",
"It would be quite unusual to see red tulips, but not RED ROSES",
"It is almost impossible to find blue roses",
"I like most red flowers, but roses are my favorite.",
"Could you buy me some red roses?",
"John loves the color red. Roses are Mary's favorite flowers."]

df = [i.lower() for i in df]
df = [i if 'red roses' not in i else i.replace('red roses', '') for i in df]
If you are checking for more stop phrases than just "red roses", replace the last line above with:

stop_phrases = ['red roses']

def filterPhrase(data, stop_phrases):
    for i in range(len(data)):
        for x in stop_phrases:
            if x in data[i]:
                data[i] = data[i].replace(x, '')
    return data

df = filterPhrase(df, stop_phrases)
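
One caveat: plain substring replacement also matches inside longer strings (for example, 'red roses' inside 'red rosesbud'). If that matters, a regex with word boundaries is a safer variant; filter_phrases_safely below is a hypothetical helper, not part of the original answer:

import re

def filter_phrases_safely(data, stop_phrases):
    # \b anchors keep a phrase from matching inside longer words;
    # re.escape guards against regex metacharacters in the phrases.
    patterns = [re.compile(r'\b' + re.escape(p) + r'\b', flags=re.IGNORECASE)
                for p in stop_phrases]
    out = []
    for doc in data:
        for pat in patterns:
            doc = pat.sub('', doc)
        out.append(doc)
    return out

df = filter_phrases_safely(df, stop_phrases)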

TfidfVectorizer allows a custom preprocessor. You can use this to make any adjustments you need.

For example, to remove all occurrences of consecutive "red" + "roses" tokens (case-insensitive) from your sample corpus, use:
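
A minimal sketch of that preprocessor (hard-coding the phrase; cases is assumed to be the list of sample documents, and the regex matches the parametrized version in the update below):

import re

import numpy as np
from sklearn.feature_extraction import text

def remove_stop_phrases(doc):
    # Strip 'red roses' (allowing optional whitespace/periods between the
    # two words), ignoring case, before tokenization ever sees the text.
    return re.sub(r"red(\s?\.?\s?)roses", "", doc, flags=re.IGNORECASE)

mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    stop_words=text.ENGLISH_STOP_WORDS,
    norm='l2',
    min_df=1,
    # NOTE: supplying a custom preprocessor replaces the default one,
    # so the default lowercasing step no longer runs.
    preprocessor=remove_stop_phrases
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names())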

Now vocab has all "red roses" references removed:

print(sorted(vocab))

['Could buy',
 'It impossible',
 'It impossible blue',
 'It quite',
 'It quite unusual',
 'John loves',
 'John loves color',
 'Mary favorite',
 'Mary favorite flowers',
 'blue roses',
 'blue tulips',
 'color Mary',
 'color Mary favorite',
 'favorite flowers',
 'flowers roses',
 'flowers roses favorite',
 'impossible blue',
 'impossible blue roses',
 'like blue',
 'like blue tulips',
 'like like',
 'like like blue',
 'like red',
 'like red flowers',
 'loves color',
 'loves color Mary',
 'quite unusual',
 'quite unusual red',
 'red flowers',
 'red flowers roses',
 'red tulips',
 'roses favorite',
 'unusual red',
 'unusual red tulips']
Update (per the comment thread):

To pass the desired stop phrases, along with custom stop words, to a wrapper function, use:

import re

import numpy as np
from sklearn.feature_extraction import text

desired_stop_phrases = [r"red(\s?\.?\s?)roses"]  # raw string so the regex escapes survive
desired_stop_words = ['Could', 'buy']

def wrapper(stop_words, stop_phrases):

    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2, 3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )

    # `cases` is the list of sample documents defined earlier.
    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names())

    return vocab

vocab = wrapper(desired_stop_words, desired_stop_phrases)

Comments:

Thank you, Philippe Starck. When I call TfidfVectorizer, do I basically just give the argument tokenizer=my_tokenizer(df['content'])? I have edited my post to provide a sample df with a content column.

Actually, you just give tokenizer=my_tokenizer. Don't call it; it is a function object, and the vectorizer will call it at the appropriate time. See my link to the original code for exactly what it does.

It runs fine on the test DataFrame, but with a different DataFrame I get an "IndexError: list index out of range" error. Also, what changes in the code for multiple stop phrases (in the stop_phrases list)?

Just add them to the stop_phrases list; the function loops over each phrase and removes it from the corpus. For example: ['red roses', 'blue tulips'].

Thanks! Is there a way to pass a stop-word list as an argument to the remove_stop_phrases() function? My use case requires me to do all of the above processing inside a larger function, feeding stop phrases into it as needed.

The preprocessor in TfidfVectorizer does not accept additional arguments. One option is to pass the custom stop-phrase list to a wrapper function and have remove_stop_phrases reference the wrapper's arguments; I have added an update to my answer to demonstrate.

It works very well, but I am facing two issues: 1) it seems to ignore the stop words that are passed in; the vocab it prints is identical to the sample shown above.