Python CountVectorizer() does not use single-letter words

Suppose I have to apply CountVectorizer() to the following data:

corpus = [
     'A am is',
     'This the a',
     'the am is',
     'this a am',
]

I did the following:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())

It returns the following:

[[1 1 0 0]
 [0 0 1 1]
 [1 1 1 0]
 [1 0 0 1]]

For reference,

print(vectorizer.get_feature_names())

prints

['am', 'is', 'the', 'this']

Why is 'a' not picked up? Do single-letter words not count as words in CountVectorizer()?

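token_pattern

Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).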

The default tokenizer ignores all single-character tokens. That is why 'a' is missing.
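To see the effect of the pattern in isolation, here is a minimal sketch that uses only the standard re module. It assumes the default token_pattern is r"(?u)\b\w\w+\b", as documented for CountVectorizer; the second pattern and the variable names are hypothetical and only there for comparison.

```python
import re

# Default token_pattern as documented for CountVectorizer:
# a word boundary, then at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative that also accepts one-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

text = "this a am"

default_tokens = re.findall(DEFAULT_PATTERN, text)
single_char_tokens = re.findall(SINGLE_CHAR_PATTERN, text)

print(default_tokens)      # 'a' is dropped
print(single_char_tokens)  # 'a' is kept
```

A pattern like the second one could presumably be passed as token_pattern instead of supplying a tokenizer, which would keep the default punctuation handling.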

If you want single-character tokens to be included in the vocabulary, you have to use a custom tokenizer.
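A minimal sketch of such a tokenizer, built around the tokenizer=lambda txt: txt.split() expression discussed in this thread. The corpus is the one from the question; reading the feature names from vocabulary_ is a choice made here so that the snippet does not depend on which get_feature_names variant the installed scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'A am is',
    'This the a',
    'the am is',
    'this a am',
]

# Split on whitespace only, so one-character tokens survive.
# Lowercasing still happens first (lowercase=True is the default).
vectorizer = CountVectorizer(tokenizer=lambda txt: txt.split())
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each token to its column index; sorting the keys
# gives the tokens in column order.
features = sorted(vectorizer.vocabulary_)
print(features)
print(X.toarray())
```

Recent scikit-learn versions may warn that token_pattern is unused when a tokenizer is supplied; the result is unaffected.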

Output of the example code:

['a', 'am', 'is', 'the', 'this']

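How does "tokenizer=lambda txt: txt.split()" work? What is happening here?

lambda txt: txt.split()

is a function that takes a text and splits it on whitespace. Each resulting piece is then a token used by CountVectorizer.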
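The same function can be written with def and called directly to see what it returns. This is only an illustration; the name whitespace_tokenizer and the sample strings are made up for this sketch.

```python
def whitespace_tokenizer(txt):
    """Equivalent to: lambda txt: txt.split()"""
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading and trailing whitespace.
    return txt.split()

tokens = whitespace_tokenizer("this  a am")
print(tokens)

# Punctuation is not stripped, unlike with the default token_pattern.
tokens_with_punctuation = whitespace_tokenizer("Hello, a world")
print(tokens_with_punctuation)
```

One consequence of this design: because only whitespace separates tokens, "Hello," and "Hello" would end up as two different vocabulary entries.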