What is the difference between TfidfVectorizer.fit_transform and tfidf.transform?

python scikit-learn nlp

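In `TfidfVectorizer.fit_transform` we only pass `X` and never use `y` to fit the data set. Is that right? We generate the tf-idf matrix from the training-set features only, and `y_train` plays no part in fitting. So how do we make predictions for the test data set?

The linked explanation does a good job of showing why the methods are called `fit()`, `transform()`, and `fit_transform()`.

In summary: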

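`fit()`: Fits the vectorizer/model to the training data and keeps the learned state in the vectorizer object (returns the fitted `sklearn.feature_extraction.text.TfidfVectorizer`).

`transform()`: Uses the vectorizer fitted by `fit()` to transform validation/test data (returns a `scipy.sparse.csr.csr_matrix`).

`fit_transform()`: Sometimes you want to transform the training data directly, so you use `fit()` + `transform()` together, hence `fit_transform()` (returns a `scipy.sparse.csr.csr_matrix`).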
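The relationship between the three methods can be checked directly. The following is a minimal sketch, assuming a small made-up corpus (`train_docs` is hypothetical and is not the dataset used in this answer); it only illustrates that `fit()` returns the vectorizer, while `fit().transform()` and `fit_transform()` produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the dataset from the original answer.
train_docs = [
    "the quick brown fox",
    "the lazy dog sleeps",
    "a quick dog jumps",
]

# fit() learns the vocabulary and IDF weights and returns the vectorizer itself.
fitted = TfidfVectorizer().fit(train_docs)
print(type(fitted).__name__)

# transform() applies the learned state and returns a sparse matrix.
two_step = fitted.transform(train_docs)

# fit_transform() performs both steps in one call on a fresh vectorizer.
one_step = TfidfVectorizer().fit_transform(train_docs)

# Both routes should give the same values.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

In practice `fit_transform()` is the usual choice for the training set, and plain `transform()` is reserved for any data seen afterwards.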

For example:

[out]:

```
# `vectorizer` is a TfidfVectorizer and `dataset` holds four documents; both are created beforehand.

# Learns the vocabulary of vectorizer based on the initialized parameter.
>>> vectorizer = vectorizer.fit(dataset)

# Apply the vectorizer to new sentence.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."])
<1x15 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

# Output to array form.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."]).toarray()
array([[0.        , 0.31342551, 0.        , 0.38714286, 0.        ,
        0.        , 0.31342551, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.38714286, 0.51249178, 0.49104163]])

# When you don't need to save the vectorizer for re-using.
>>> vectorizer.fit_transform(dataset)
<4x15 sparse matrix of type '<class 'numpy.float64'>'
    with 28 stored elements in Compressed Sparse Row format>

>>> vectorizer.fit_transform(dataset).toarray()
array([[0.        , 0.49642852, 0.        , 0.30659399, 0.30659399,
        0.        , 0.24821426, 0.30659399, 0.        , 0.30659399,
        0.38887561, 0.        , 0.        , 0.40586285, 0.        ],
       [0.        , 0.32107915, 0.        , 0.        , 0.39659663,
        0.        , 0.32107915, 0.39659663, 0.50303254, 0.39659663,
        0.        , 0.        , 0.        , 0.26250325, 0.        ],
       [0.76012588, 0.24258925, 0.38006294, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.29964599, 0.29964599, 0.19833261, 0.        ],
       [0.        , 0.        , 0.        , 0.34049544, 0.        ,
        0.4318753 , 0.27566041, 0.        , 0.        , 0.        ,
        0.        , 0.34049544, 0.34049544, 0.45074089, 0.4318753 ]])

>>> type(vectorizer)
<class 'sklearn.feature_extraction.text.TfidfVectorizer'>

>>> type(vectorizer.fit_transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>

>>> type(vectorizer.transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>
```
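The session above shows a new sentence producing a 1x15 row, the same 15 columns as the fitted 4x15 matrix. A small sketch of why that happens, assuming a made-up two-document corpus (`train_docs` is hypothetical): `transform()` only counts words that were in the vocabulary learned by `fit()`, so unseen words such as "roses" are simply dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus, used only to illustrate out-of-vocabulary handling.
train_docs = ["the brown dog jumps", "the lazy dog sleeps"]
vectorizer = TfidfVectorizer().fit(train_docs)

# "roses" and "jump" never appeared during fit(), so they get no column.
new_row = vectorizer.transform(["the brown roses jump"])

known = set(vectorizer.vocabulary_)
print(new_row.shape[1] == len(known))  # column count matches the fitted vocabulary
print("roses" in known)                # whether the unseen word has a column
print(new_row.nnz)                     # number of stored (non-zero) entries
```

This is why the vectorizer should be fitted once on the training data and then reused: refitting on new text would change the columns and break compatibility with anything trained on the original matrix.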

TfidfVectorizer is not used for prediction, which is why `y_train` is not passed to it, either when fitting or when transforming.

Thanks for the explanation. You said fit_transform doesn't store the model, but the link you posted shows that it does.

Ah yes, sorry, I missed that detail. It doesn't return the model, but the vectorizer still stores it =)
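To tie this back to the original question about predicting on the test set, here is a minimal sketch. The labelled data and the choice of `LogisticRegression` are assumptions made for illustration; the thread does not name a classifier. The point is that `y_train` is only used by the classifier, while the vectorizer is fitted on the training text and then reused with `transform()` on the test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled data (1 = positive, 0 = negative).
X_train = [
    "good great film",
    "great acting good plot",
    "bad boring film",
    "boring plot bad acting",
]
y_train = [1, 1, 0, 0]
X_test = ["good great plot", "bad boring acting"]

vectorizer = TfidfVectorizer()

# fit_transform() needs only the text; no labels are involved here.
X_train_tfidf = vectorizer.fit_transform(X_train)

# The call returned a matrix, but the fitted vocabulary stays on the vectorizer.
print(sorted(vectorizer.vocabulary_))

# The labels enter only when the classifier is fitted.
clf = LogisticRegression()
clf.fit(X_train_tfidf, y_train)

# Test text is transformed with the already-fitted vectorizer, never refitted.
X_test_tfidf = vectorizer.transform(X_test)
predictions = clf.predict(X_test_tfidf)
print(predictions)
```

Wrapping both steps in `sklearn.pipeline.make_pipeline(TfidfVectorizer(), LogisticRegression())` would give the same flow with a single `fit(X_train, y_train)` and `predict(X_test)`.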