How does vectorizer fit_transform work in sklearn?

python machine-learning scikit-learn

I am trying to understand the following code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

X = vectorizer.fit_transform(corpus)

When I print X to see what is returned, I get the following result:

(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  2
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1

However, I don't understand what this result means.

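It converts the text into numbers, so that with other functions you can count how many times each word occurs in the given dataset. I'm new to programming, so there may be other areas where it can be used.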
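To make the "text into numbers" idea concrete, here is a minimal sketch that uses the corpus from the question with the default CountVectorizer settings. The variable names (dense, words, totals) are only illustrative. It turns the sparse result into an ordinary array and sums each column to get per-word totals.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# One row per document, one column per vocabulary word.
dense = X.toarray()

# Words listed in column order, recovered from the word -> column mapping.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Summing each column gives the total count of that word over the whole corpus.
totals = dict(zip(words, dense.sum(axis=0).tolist()))

print(dense)
print(totals)
```

Calling toarray() is fine for a tiny corpus like this one. For large corpora the sparse form is usually kept, because most entries are zero.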

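You can read each line as "(sentence_index, feature_index) count".

There are 4 sentences, so the sentence index starts at 0 and ends at 3.

The feature index is the word index, which you can get from vectorizer.vocabulary_

-> vocabulary_ is a dictionary {word: feature_index, ...}

So for the example (0, 1) 1: the 0 is the sentence index (the first sentence), the first 1 is the feature index (here the word "document", as found in vectorizer.vocabulary_), and the last 1 is the count, so "document" appears once in the first sentence.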
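The lookup described here can be sketched as follows, assuming the same corpus and default settings as in the question. The names index_to_word and decoded are made up for this illustration. It inverts vocabulary_ and walks over the stored entries, so that each "(sentence_index, feature_index) count" line is printed with the actual word.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> column index; invert it to map index -> word.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# COO format exposes the stored entries as parallel row / col / data arrays.
coo = X.tocoo()
decoded = [
    (int(r), index_to_word[int(c)], int(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
]

for row, word, count in decoded:
    print(f'sentence {row}: {word!r} appears {count} time(s)')
```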

If you use a TfidfVectorizer instead of the CountVectorizer, it will give you tf-idf values instead of counts.

I hope I made it clear.
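As a rough illustration of the tf-idf remark, the sketch below fits both vectorizers with their default settings on the corpus from the question. Under those defaults the two results should have the same shape and the same non-zero positions, and only the stored values differ (integer counts versus floating-point weights).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfVectorizer().fit_transform(corpus)

# Same shape and same number of stored entries in both matrices.
print(counts.shape, tfidf.shape)
print(counts.nnz, tfidf.nnz)

# Integer counts versus floating-point tf-idf weights.
print(counts.dtype, tfidf.dtype)

# Printing the tf-idf matrix shows "(sentence_index, feature_index) weight" lines.
print(tfidf)
```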

As @Himanshu wrote, this is "(sentence_index, feature_index) count".

Here, the count part is the number of times a word appears in a document.

For example,

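(0, 1)  1
(0, 2)  1
(0, 6)  1
(0, 3)  1
(0, 8)  1
(1, 5)  2    <- only here in this example: the count "2" means the word "second" appears twice in this document/sentence
(1, 1)  1
(1, 6)  1
(1, 3)  1
(1, 8)  1
(2, 4)  1
(2, 7)  1
(2, 0)  1
(2, 6)  1
(3, 1)  1
(3, 2)  1
(3, 6)  1
(3, 3)  1
(3, 8)  1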

Let's change the corpus in the code. Basically, I added "second" two more times to the second sentence of the corpus list:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

corpus = ['This is the first document.',
          'This is the second second second second document.',
          'And the third one.',
          'Is this the first document?']

X = vectorizer.fit_transform(corpus)
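To check the effect of this change without reading through the whole printout, one option is to look up the single cell for "second" in the second sentence. This is only a sketch with the default CountVectorizer settings, and col and count are illustrative names. Since no new words were added, the vocabulary should be unchanged and the entry that read (1, 5) 2 before should now hold a count of 4.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Column index assigned to the word 'second' in the fitted vocabulary.
col = vectorizer.vocabulary_['second']

# Indexing the sparse matrix with [row, column] returns the stored count.
count = X[1, col]

print(col, count)
```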