Python CountVectorizer does not respect the regular expression


I am using the following code to build a document-term matrix:

from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english", ignore_stopwords=True)


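class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the stock analyzer (preprocessing, tokenizing, stop-word
        # filtering), then stem every token it yields.
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]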

stemmed_count_vect = StemmedCountVectorizer(stop_words='english', 
                                            ngram_range=(1,1), 
                                            token_pattern=r'\b\w+\b', 
                                            min_df=1, 
                                            max_df=0.6)

However, I still get items like the following:

How can I fix this?

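This pattern

token_pattern=r'\b\w+\b'

says that it expects one or more members of the \w character class between word boundaries. The Python documentation describes that character class as follows:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.

So it seems to me that you need a narrower character class (dropping the digits, for a start).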
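
As a rough sketch of what a narrower class could look like, the snippet below compares the original pattern with one that keeps only word characters that are neither digits nor the underscore. It uses just the standard `re` module, on the understanding that CountVectorizer's default tokenizer applies `token_pattern` with `findall`. The sample text and the variable names are made up for illustration, and `[^\W\d_]` is only one possible choice of class, not necessarily the one that fits your data.

```python
import re

# Hypothetical sample text, chosen only to illustrate the difference.
text = "the 2nd report from 2019 cost 100 dollars and item_7 failed"

# Pattern from the question: \w also matches digits and the underscore.
original_pattern = r"\b\w+\b"

# Narrower class (an assumption about what is wanted): word characters
# that are neither digits nor the underscore.
letters_only_pattern = r"\b[^\W\d_]+\b"

original_tokens = re.findall(original_pattern, text)
letters_only_tokens = re.findall(letters_only_pattern, text)

print(original_tokens)
print(letters_only_tokens)
```

If this behaves the way you want, the same string can be passed as token_pattern=r'\b[^\W\d_]+\b' when constructing StemmedCountVectorizer. Note that mixed tokens such as "2nd" or "item_7" are dropped entirely rather than trimmed, because \b still requires a word boundary on both sides of the match.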