Adding a custom vocabulary in scikit-learn?

optimization scikit-learn


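In sklearn.feature_extraction.text.TfidfVectorizer, we can inject our own vocabulary through the model's vocabulary parameter. But in that case the model uses only the words I selected myself.

I want to use the automatically detected features together with my custom vocabulary.

One way to solve this is to:

create the model, fit it, and read the detected vocabulary with vocab = vectorizer.get_feature_names()

add my own list to it: vocab + vocabulary

build the model again with the combined vocabulary

Is there a way to do the whole process in one step?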
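For reference, the two-pass workaround described above can be sketched roughly as follows. The documents and the custom word list are made-up illustration data, not taken from the question. The sketch reads the learned vocabulary_ attribute instead of calling get_feature_names(), because newer scikit-learn releases replaced that method with get_feature_names_out(). It also merges through a set, since a vocabulary list with repeated terms is rejected by the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example data, assumed for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]
custom_vocabulary = ["unicorn", "cat"]

# Pass 1: fit once, only to let the vectorizer detect terms automatically.
discovery = TfidfVectorizer()
discovery.fit(docs)
detected = list(discovery.vocabulary_)  # keys of the learned term -> index dict

# Merge detected and custom terms; the set removes duplicates such as "cat".
merged_vocabulary = sorted(set(detected) | set(custom_vocabulary))

# Pass 2: fit again with the merged vocabulary fixed.
vectorizer = TfidfVectorizer(vocabulary=merged_vocabulary)
X = vectorizer.fit_transform(docs)

print(X.shape)
print(vectorizer.vocabulary_)
```

Custom terms that never occur in the documents, such as "unicorn" here, still get a column; it is simply all zeros for this corpus.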

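I don't think there is a simpler way to achieve what you want. One thing you could do is reuse the CountVectorizer code that creates the vocabulary. I went through the source code, and the relevant method is _count_vocab(self, raw_documents, fixed_vocab), which is called with fixed_vocab=False when the vocabulary is learned from the documents.

So I suggest adapting the following code to create the vocabulary before running TfidfVectorizer: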

def _count_vocab(self, raw_documents, fixed_vocab):
    """Create sparse feature matrix, and vocabulary where fixed_vocab=False
    """
    if fixed_vocab:
        vocabulary = self.vocabulary_
    else:
        # Add a new value when a new vocabulary item is seen
        vocabulary = defaultdict()
        vocabulary.default_factory = vocabulary.__len__

    analyze = self.build_analyzer()
    j_indices = _make_int_array()
    indptr = _make_int_array()
    indptr.append(0)
    for doc in raw_documents:
        for feature in analyze(doc):
            try:
                j_indices.append(vocabulary[feature])
            except KeyError:
                # Ignore out-of-vocabulary items for fixed_vocab=True
                continue
        indptr.append(len(j_indices))

    if not fixed_vocab:
        # disable defaultdict behaviour
        vocabulary = dict(vocabulary)
        if not vocabulary:
            raise ValueError("empty vocabulary; perhaps the documents only"
                             " contain stop words")

    j_indices = frombuffer_empty(j_indices, dtype=np.intc)
    indptr = np.frombuffer(indptr, dtype=np.intc)
    values = np.ones(len(j_indices))

    X = sp.csr_matrix((values, j_indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)),
                      dtype=self.dtype)
    X.sum_duplicates()
    return vocabulary, X
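One possible adaptation, as a minimal sketch rather than a definitive implementation: keep only the vocabulary-building part of the method above, seed the dictionary with the custom terms, and hand the result to TfidfVectorizer. The documents and the custom word list are hypothetical. Only public calls are used (build_analyzer and the vocabulary parameter), so the private helpers from the excerpt are not needed. The model is fitted once, although the documents are still tokenized twice.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example data, assumed for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]
custom_vocabulary = ["unicorn", "cat"]

# The analyzer applies the vectorizer's own preprocessing and tokenization.
analyze = TfidfVectorizer().build_analyzer()

# Same trick as in _count_vocab: looking up an unseen key assigns it the
# next free index, because the default factory returns the current length.
vocabulary = defaultdict()
vocabulary.default_factory = vocabulary.__len__

# Seed with the custom terms first, then add every term found in the documents.
for term in custom_vocabulary:
    vocabulary[term]  # the lookup alone registers the term
for doc in docs:
    for feature in analyze(doc):
        vocabulary[feature]

# Disable the defaultdict behaviour before passing the mapping on.
vocabulary = dict(vocabulary)

# Single fit with the combined, fixed vocabulary.
tfidf = TfidfVectorizer(vocabulary=vocabulary)
X = tfidf.fit_transform(docs)

print(vocabulary)
print(X.shape)
```

Two caveats with this approach: the custom terms are not passed through the analyzer, so they should already be in the form the analyzer produces (lowercase by default), and with a fixed vocabulary the pruning options min_df, max_df and max_features are not applied.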

Could you explain this a bit more?