Machine learning: a pipeline that uses CountVectorizer (max_df) before tf-idf


machine-learning scikit-learn nlp


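I'm not sure whether this question belongs on Stack Overflow or on a more theoretical statistics Q&A site, but I'm confused about the following.

I'm working on a binary text classification task. For this task I use a pipeline; here is some example code: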

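from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

# stopwordList: a custom stop-word list defined elsewhere (not shown)
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__stop_words': [None, stopwords.words('dutch'), stopwordList],
    'clf__C': [0.1, 1, 10, 100, 1000]
}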

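Nothing unusual so far, but I started playing with the parameter options and noticed that the following steps and parameters (listed in the second code block at the end of this post) give the highest score (F1 score).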

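So I'm glad I found out which parameter settings and methods give me the highest score, but I don't understand exactly what they mean. For example, in the "vectorizer" step, the max_df setting (ignore terms that occur in more than 20% of the documents) is applied before tf-idf, which seems odd (or somehow redundant).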

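In addition, it uses a max_features of 10,000. Which is applied first, max_df or max_features? How should I interpret setting max_features and then performing tf-idf afterwards? Is tf-idf then computed on only those 10,000 features?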
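The ordering question can be checked empirically on a toy corpus. The sketch below is not part of the original question, and the five sentences are made up for illustration. It relies on scikit-learn's documented behaviour that max_features keeps the top terms ordered by term frequency across the corpus. If max_features were applied first, the very frequent word "the" would take one of the two slots and then be dropped by max_df, leaving a single term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "the" occurs in 5/5 documents,
# "cat" and "sat" in 2/5 each, every other word in 1/5.
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the fish swam",
    "the cat ran",
]

# max_df=0.5 drops terms found in more than half of the documents;
# max_features=2 keeps the two most frequent terms among the survivors.
vect = CountVectorizer(max_df=0.5, max_features=2)
vect.fit(docs)

vocab = sorted(vect.vocabulary_)
print(vocab)
```

On this corpus two terms survive and "the" is not among them, which is consistent with the document-frequency filter (max_df) running before the max_features cut.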

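To me it seems strange to perform tf-idf after applying parameters such as max_df and max_features. Am I right? Why? Or should I simply do whatever gives the highest result?

I hope someone can point me in the right direction. Many thanks in advance.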
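One way to see what tf-idf actually receives is to inspect the matrix that leaves the vectorizer. The following is a minimal sketch with an invented four-sentence Dutch corpus, not code from the question. It shows that TfidfTransformer only reweights the columns CountVectorizer kept, so max_df and max_features decide which terms exist and tf-idf decides how much each remaining term counts.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical corpus: "de" and "was" occur in every document,
# "film" and "goed" in two documents, the remaining words in one.
docs = [
    "de film was goed",
    "de film was slecht",
    "de acteur was goed",
    "de muziek was mooi",
]

features = Pipeline([
    ("vect", CountVectorizer(max_df=0.5, max_features=2)),
    ("tfidf", TfidfTransformer()),
])
X = features.fit_transform(docs)

kept_terms = sorted(features.named_steps["vect"].vocabulary_)
n_idf_weights = len(features.named_steps["tfidf"].idf_)
print(kept_terms, X.shape, n_idf_weights)
```

The tf-idf matrix has exactly one column (and one idf weight) per retained term. Under this reading, the two steps are complementary rather than redundant: the pruning is a hard filter on the vocabulary, while tf-idf is a soft reweighting of whatever remains.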

Was the score you found computed on a single train/test split, or through cross-validation?

Cross-validation, since I use grid search.

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

parameters = {
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': [None],
    'vect__max_df': [0.2],  # ignore terms that occur in more than 20% of the documents
    'vect__max_features': [10000],
    'clf__C': [100]
}
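For context, here is a rough sketch of how a pipeline and parameter grid like these are typically scored with cross-validated grid search and the F1 metric. The tiny corpus, the labels, the reduced grid, and the choice of cv=3 are all assumptions made for illustration. The question does not show the actual GridSearchCV call or the data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical labelled corpus (1 = positive, 0 = negative).
texts = [
    "good film", "great film", "nice plot",
    "good acting", "great plot", "nice acting",
    "bad film", "awful film", "poor plot",
    "bad acting", "awful plot", "poor acting",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])

# A reduced grid; the step-name prefixes ("vect__", "clf__") route each
# parameter to the matching pipeline step.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [1, 100],
}

# Every candidate is refit on each training fold, so the vocabulary pruning
# and the idf weights are learned without seeing the held-out fold.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

Because the vectorizer sits inside the pipeline, settings such as max_df and max_features are re-estimated per fold, which is what makes the cross-validated F1 score a fair comparison between parameter combinations.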