Python 2.7 sklearn CountVectorizer

I have a question about using vocabulary_.get; the code is below. In a machine learning exercise I used CountVectorizer to count the occurrences of particular words:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
s1 = 'KJ YOU WILL BE FINE'
s2 = 'ABHI IS MY BESTIE'
s3 = 'sam is my bestie'
frnd_list = [s1,s2,s3]
# Learn the vocabulary, then build the sparse document-term matrix
vectorizer.fit(frnd_list)
bag_of_words = vectorizer.transform(frnd_list)
print(bag_of_words)
# Look up the feature index of a word, for example:
print(vectorizer.vocabulary_.get('bestie'))
print(vectorizer.vocabulary_.get('BESTIE'))
Output:

Bag_of_words is :
(0, 1)  1
(0, 3)  1
(0, 5)  1
(0, 8)  1
(0, 9)  1
(1, 0)  1
(1, 2)  1
(1, 4)  1
(1, 6)  1
(2, 2)  1
(2, 4)  1
(2, 6)  1
(2, 7)  1

'bestie' has  feature number:
 2
'BESTIE' has feature number:
 None

So my question is why 'bestie' shows the correct feature number, 2, while 'BESTIE' shows None. Does vocabulary_.get not work well with uppercase words?

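CountVectorizer takes a lowercase parameter that defaults to True, as stated in the documentation. With the default, every token is lowercased before the vocabulary is built, so the vocabulary contains 'bestie' but never 'BESTIE'.

If you want to treat lowercase and uppercase differently, change it to False.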
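
A minimal sketch of the default behaviour, reusing the sentences from the question. The feature_index helper is hypothetical (it is not part of scikit-learn); it simply lowercases the query word whenever the vectorizer itself lowercases its input, so lookups match the keys stored in vocabulary_.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['KJ YOU WILL BE FINE', 'ABHI IS MY BESTIE', 'sam is my bestie']

# Default settings: lowercase=True, so text is lowercased before tokenizing.
vectorizer = CountVectorizer()
vectorizer.fit(docs)

# Every key in the learned vocabulary is therefore lowercase.
vocab_keys = sorted(vectorizer.vocabulary_)
print(vocab_keys)


def feature_index(word):
    """Hypothetical helper: normalise the query the same way the vectorizer does."""
    key = word.lower() if vectorizer.lowercase else word
    return vectorizer.vocabulary_.get(key)


# A raw lookup with an uppercase word misses, the normalised lookup does not.
print(vectorizer.vocabulary_.get('BESTIE'))
print(feature_index('BESTIE'))
print(feature_index('bestie'))
```

Normalising the query rather than changing the vectorizer keeps 'BESTIE' and 'bestie' counted as the same feature, which is usually what a bag-of-words model wants.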

CountVectorizer accepts a parameter named lowercase, whose value is True by default.

If we want to distinguish uppercase letters from lowercase letters, set lowercase=False.

For more information, see the CountVectorizer documentation:

lowercase : boolean, True by default
    Convert all characters to lowercase before tokenizing.
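
As a rough illustration of the lowercase=False setting, the sketch below fits a case-sensitive vectorizer on the same three sentences from the question. The variable names (cased, upper_index, lower_index) are chosen here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['KJ YOU WILL BE FINE', 'ABHI IS MY BESTIE', 'sam is my bestie']

# lowercase=False keeps the original casing, so 'BESTIE' and 'bestie'
# become two separate features.
cased = CountVectorizer(lowercase=False)
cased.fit(docs)

upper_index = cased.vocabulary_.get('BESTIE')
lower_index = cased.vocabulary_.get('bestie')

print(upper_index)
print(lower_index)
print(len(cased.vocabulary_))
```

Note that the vocabulary grows because words such as 'IS' and 'is' are no longer merged, so this setting only makes sense when case actually carries meaning.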