How can using more n-gram orders decrease the accuracy of a multinomial Naive Bayes classifier in scikit-learn?

scikit-learn nlp nltk tf-idf tfidfvectorizer

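I am building a text classification model with nltk and `sklearn`, and training it on `sklearn`'s 20 newsgroups dataset (each document has roughly 130 words).

My preprocessing consists of removing stop words and lemmatizing the tokens.

Next, in my pipeline, I pass the result to `TfidfVectorizer()`, and I want to tune some of the vectorizer's input parameters to improve accuracy. I have read that n-grams (generally for small n) improve accuracy, but when I set `ngram_range=(1,2)` or `ngram_range=(1,3)` in the tf-idf vectorizer and classify its output with `MultinomialNB()`, the accuracy goes down. Can someone explain why?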
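To make the `ngram_range` setting concrete, here is a minimal pure-Python sketch of which features a range such as `(1, 2)` or `(1, 3)` produces for a single tokenized sentence. The helper `word_ngrams` and the example tokens are hypothetical, and plain whitespace splitting stands in for the vectorizer's real tokenizer, so this only illustrates the idea and is not how `TfidfVectorizer` is implemented.

```python
def word_ngrams(tokens, ngram_range):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Hypothetical, already-cleaned tokens (stop words removed).
tokens = "pens beat devils again".split()

unigrams = word_ngrams(tokens, (1, 1))
uni_bi = word_ngrams(tokens, (1, 2))
uni_bi_tri = word_ngrams(tokens, (1, 3))

print(uni_bi)
print(len(unigrams), len(uni_bi), len(uni_bi_tri))  # prints: 4 7 9
```

Every extra n-gram order adds features on top of the unigrams, and most of the longer ones occur in very few documents, which may be worth keeping in mind when comparing accuracies.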

Edit: as requested, here is a sample of the data, along with the code I used to fetch it and strip the headers:

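    from sklearn.datasets import fetch_20newsgroups

    # Fetch all posts (train and test) with the headers stripped;
    # `remove` is documented as a tuple of section names.
    news = fetch_20newsgroups(subset='all', remove=('headers',))
    # Example of the data text (no header)
    print(news.data[0])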

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final regular season game.          PENS RULE!!!

Here is my pipeline, and the code I run to train the model and print the accuracy:

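    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # clean() and train() are my own helpers (not shown here).
    test1_pipeline = Pipeline([('clean', clean()),
                               ('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
                               ('classifier', MultinomialNB())])

    train(test1_pipeline, news_group_train.data, news_group_train.target)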
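One way to start investigating is to fit the same pipeline with each `ngram_range` and look at how large the vocabulary becomes. The sketch below is only an illustration under stated assumptions: the six-document corpus and its labels are made up, the `clean` step is left out because its code is not shown, and `vocabulary_sizes` is a hypothetical helper name. On the real data, the same loop could be run over the 20 newsgroups training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up stand-in corpus; the real experiment uses the 20 newsgroups data.
docs = [
    "the pens beat the devils in the playoffs",
    "jagr scored twice for the pens",
    "the islanders lost the final game",
    "the new graphics card renders images fast",
    "install the driver for the graphics card",
    "the image viewer crashes on large files",
]
labels = [0, 0, 0, 1, 1, 1]


def vocabulary_sizes(ngram_ranges):
    """Fit one pipeline per ngram_range and record its number of features."""
    sizes = {}
    for ngram_range in ngram_ranges:
        pipeline = Pipeline([
            ("vectorizer", TfidfVectorizer(ngram_range=ngram_range)),
            ("classifier", MultinomialNB()),
        ])
        pipeline.fit(docs, labels)
        # vocabulary_ maps each learned n-gram to its column index.
        sizes[ngram_range] = len(pipeline.named_steps["vectorizer"].vocabulary_)
    return sizes


sizes = vocabulary_sizes([(1, 1), (1, 2), (1, 3)])
for ngram_range, n_features in sizes.items():
    print(ngram_range, n_features)
```

The number of documents stays fixed while the number of features grows with each added order, so comparing these counts alongside the accuracies may help narrow down what changes between runs.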

Sure! Added as an edit :-)

@Seralouka Actually, this is a very good question! What happens if you don't remove the stopwords? ;p

Please add the code for `clean()`.