Python sklearn TfidfVectorizer: how to make some words appear only as part of bigrams in the features

I want the featurization of TfidfVectorizer to treat some predefined words, such as "script" and "rule", as usable only inside bigrams.

python scikit-learn

If I have the text

"Script include is a script that has rule which has a business rule"

and I use

tfidf = TfidfVectorizer(ngram_range=(1,2),stop_words='english')

I should get

['script include','business rule','include','business']

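TfidfVectorizer allows you to provide your own tokenizer, so you can do something like the following. However, you will lose the information about the other words in the vocabulary.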

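from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["Script include is a script that has rule which has a business rule"]

# The tokenizer ignores its input and always returns the two special words,
# so every other word in the document is lost.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), tokenizer=lambda doc: ["script", "rule"], stop_words='english')
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())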

Please comment your code to explain what it is doing.

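Basically, you want to customize the n-gram creation based on your special words (called interested_words in the function below). I have customized scikit-learn's default word n-gram routine for your purpose.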

import numpy as np


def custom_word_ngrams(tokens, stop_words=None, interested_words=None):
    """Turn tokens into unigrams plus bigrams after stop-word filtering.

    Words in interested_words are dropped as unigrams, so they can only
    appear inside bigrams.
    """
    stop_words = list(stop_words or [])
    interested_words = list(interested_words or [])

    original_tokens = tokens
    # positions of the stop words in the original token sequence
    stop_wrds_inds = np.where(np.isin(tokens, stop_words))[0]

    # unigrams: everything that is neither a stop word nor an interested word
    tokens = [w for w in tokens if w not in stop_words + interested_words]

    n_original_tokens = len(original_tokens)

    # bind method outside of loop to reduce overhead
    tokens_append = tokens.append
    space_join = " ".join

    # bigrams: adjacent pairs in which neither word is a stop word
    # (range replaces the Python 2 xrange)
    for i in range(n_original_tokens - 1):
        if not any(np.isin(stop_wrds_inds, [i, i + 1])):
            tokens_append(space_join(original_tokens[i: i + 2]))

    return tokens

Now we can plug this function into the usual TfidfVectorizer, as shown below.

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def analyzer():
    # reuse the default preprocessing and tokenization of CountVectorizer
    base_vect = CountVectorizer()
    stop_words = list(text.ENGLISH_STOP_WORDS)
    preprocess = base_vect.build_preprocessor()
    tokenize = base_vect.build_tokenizer()

    # feed your special words list here
    return lambda doc: custom_word_ngrams(
        tokenize(preprocess(base_vect.decode(doc))), stop_words, ['script', 'rule'])

vectorizer = TfidfVectorizer(analyzer=analyzer())
vectorizer.fit(["Script include is a script that has rule which has a business rule"])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

['business', 'business rule', 'include', 'script include']

Why is 'include script' not in the output?

Because in 'include is a script', 'is' and 'a' are stop words, and you are removing stop words. Can you clarify?

Not exactly what I wanted, but it gave me a direction. Thanks.
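The stop-word point in this exchange can be checked with a small sketch. It compares two ways of forming bigrams: removing stop words first and pairing what is left, versus pairing only words that are adjacent in the original text. The variable names are illustrative, and `build_preprocessor`, `build_tokenizer`, and `get_stop_words` are the public `CountVectorizer` helpers.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = "Script include is a script that has rule which has a business rule"

base = CountVectorizer(stop_words="english")
preprocess = base.build_preprocessor()   # lowercases the text
tokenize = base.build_tokenizer()        # default pattern keeps words of 2+ characters
stop_words = base.get_stop_words()       # frozenset of English stop words

tokens = tokenize(preprocess(doc))

# Variant 1: drop stop words first, then pair up whatever remains.
filtered = [t for t in tokens if t not in stop_words]
bigrams_after_filtering = [" ".join(pair) for pair in zip(filtered, filtered[1:])]

# Variant 2: pair words adjacent in the original text, skipping any pair
# that contains a stop word.
bigrams_by_position = [
    " ".join((a, b))
    for a, b in zip(tokens, tokens[1:])
    if a not in stop_words and b not in stop_words
]

print(bigrams_after_filtering)
print(bigrams_by_position)
```

The first variant yields "include script" because "is" is removed before pairing. The second yields only "script include" and "business rule", since "include" and "script" are never adjacent in the original sentence.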
