Python: What is the difference between TfidfVectorizer.fit_transform and tfidf.transform?
Tags: python, scikit-learn, nlp, tfidfvectorizer

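In TfidfVectorizer.fit_transform we pass only the parameter X and do not use y to fit the dataset. Is that right? We generate the tf-idf matrix from the training set only, and we do not use y_train to fit the model. So how do we make predictions on the test dataset?

The linked answer explains well why the methods are called fit(), transform() and fit_transform().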

In summary:

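  • fit(): fits the vectorizer/model to the training data and keeps the fitted vectorizer/model in a variable (returns a sklearn.feature_extraction.text.TfidfVectorizer).

  • transform(): uses the variable returned by fit() to transform validation/test data (returns a scipy.sparse.csr.csr_matrix).

  • fit_transform(): sometimes you want to transform the training data directly, so you use fit() + transform() together, hence fit_transform() (returns a scipy.sparse.csr.csr_matrix).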
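To make the relationship between the three methods concrete, here is a minimal sketch. The corpora train_docs and test_docs are made up for illustration, and default TfidfVectorizer settings are assumed. It checks that fit() followed by transform() gives the same training matrix as fit_transform(), and that test data is only ever transformed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, invented for this illustration.
train_docs = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps over the dog",
]
test_docs = ["a brown dog jumps"]

# Two steps: fit() learns vocabulary and idf weights, transform() applies them.
two_step = TfidfVectorizer().fit(train_docs)
X_two_step = two_step.transform(train_docs)

# One step: fit_transform() does both and returns only the matrix.
one_step = TfidfVectorizer()
X_one_step = one_step.fit_transform(train_docs)

# Both routes should give the same training matrix.
same_matrix = np.allclose(X_two_step.toarray(), X_one_step.toarray())

# Test data is transformed with the already fitted vectorizer, never re-fitted,
# so it ends up with the same columns as the training matrix.
X_test = one_step.transform(test_docs)
print(same_matrix, X_one_step.shape, X_test.shape)
```

Calling fit_transform() on the test set instead would learn a new vocabulary, so its columns would generally no longer line up with the training matrix.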


For example:

[out]:

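# vectorizer is a TfidfVectorizer and dataset is a list of four documents;
# both are assumed to be defined beforehand.
# Learn the vocabulary and idf weights from the dataset, using the
# parameters the vectorizer was initialized with.
>>> vectorizer = vectorizer.fit(dataset)

# Apply the fitted vectorizer to a new sentence.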
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."])
<1x15 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

# Output to array form.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."]).toarray()
array([[0.        , 0.31342551, 0.        , 0.38714286, 0.        ,
        0.        , 0.31342551, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.38714286, 0.51249178, 0.49104163]])

# Fit and transform the training data in one call. Only the matrix is
# returned, but the vectorizer still keeps its fitted state.
>>> vectorizer.fit_transform(dataset)
<4x15 sparse matrix of type '<class 'numpy.float64'>'
    with 28 stored elements in Compressed Sparse Row format>

>>> vectorizer.fit_transform(dataset).toarray()
array([[0.        , 0.49642852, 0.        , 0.30659399, 0.30659399,
        0.        , 0.24821426, 0.30659399, 0.        , 0.30659399,
        0.38887561, 0.        , 0.        , 0.40586285, 0.        ],
       [0.        , 0.32107915, 0.        , 0.        , 0.39659663,
        0.        , 0.32107915, 0.39659663, 0.50303254, 0.39659663,
        0.        , 0.        , 0.        , 0.26250325, 0.        ],
       [0.76012588, 0.24258925, 0.38006294, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.29964599, 0.29964599, 0.19833261, 0.        ],
       [0.        , 0.        , 0.        , 0.34049544, 0.        ,
        0.4318753 , 0.27566041, 0.        , 0.        , 0.        ,
        0.        , 0.34049544, 0.34049544, 0.45074089, 0.4318753 ]])


>>> type(vectorizer)
<class 'sklearn.feature_extraction.text.TfidfVectorizer'>

>>> type(vectorizer.fit_transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>

>>> type(vectorizer.transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>
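
As a rough illustration of where the fitted state ends up, the sketch below inspects the vectorizer after fit_transform(). The corpus docs is hypothetical; vocabulary_ and idf_ are the fitted attributes scikit-learn sets on a TfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus, invented for this illustration.
docs = ["red roses", "brown dog", "red dog runs"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # returns the matrix, not the vectorizer

# The fitted state lives on the vectorizer object itself.
vocabulary = vectorizer.vocabulary_  # dict: term -> column index
idf_weights = vectorizer.idf_        # one idf weight per column

# A new sentence is mapped onto the same columns; unseen words are dropped.
new_row = vectorizer.transform(["purple roses and a red cat"])
print(sorted(vocabulary), matrix.shape, new_row.shape, new_row.nnz)
```

Only "roses" and "red" from the new sentence are in the learned vocabulary, so only those two columns are non-zero in new_row.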

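=) TfidfVectorizer is not used for prediction, which is why we don't pass y_train to it, neither when fitting nor when transforming.

Thanks for the explanation. You said fit_transform does not store the model, but the link you posted shows that it does.

Ah yes, sorry, I missed that. It doesn't return the model, but the vectorizer still stores it =)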
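
To answer the original "how do we predict on the test set" part, here is a hedged sketch of one common arrangement: the vectorizer sees only the texts, and the labels go to a separate classifier. The data and the choice of LogisticRegression are assumptions made for illustration; the thread itself does not name a classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled data, invented for this illustration.
X_train = [
    "good great film",
    "great acting good plot",
    "bad boring film",
    "boring plot bad acting",
]
y_train = [1, 1, 0, 0]
X_test = ["good film", "boring acting"]

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)  # features only, no y involved
X_test_tfidf = vectorizer.transform(X_test)        # reuse the training vocabulary

# The labels enter only here, when the classifier is fitted.
classifier = LogisticRegression()
classifier.fit(X_train_tfidf, y_train)
predictions = classifier.predict(X_test_tfidf)
print(predictions)
```

The same two steps can be wrapped in sklearn.pipeline.make_pipeline so that fit() and predict() are called once on the raw texts.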