How to label data using a pickled classifier with CountVectorizer.fit_transform() (Python, scikit-learn, text classification)

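I trained a classifier on a set of short documents and pickled it after getting reasonable F1 and accuracy scores on a binary classification task.

During training, I used a scikit-learn CountVectorizer, cv: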

    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000) 

I then used the fit_transform() and transform() methods to obtain the transformed training and test sets:

    # Learn the vocabulary from the training text, then reuse it for the test text.
    transformed_feat_train = cv.fit_transform(trainingTextFeat).toarray()
    transformed_feat_test = cv.transform(testingTextFeat).toarray()
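
This all works well for training and testing the classifier. However, I don't know how to use fit_transform() and transform() together with the pickled version of the trained classifier to predict labels for unseen, unlabeled data.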

I extract features from the unlabeled data in exactly the same way as when training and testing the classifier:

    ## load the pickled classifier for labeling
    pickledClassifier = joblib.load(pickledClassifierFile)

    ## transform data
    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)
    cv.fit_transform(NOT_SURE)  # unclear what to fit on here

    transformed_feat_unlabeled = cv.transform(unlabeled_text_feat).toarray()

    ## predict label on unseen, unlabeled data
    l_predLabel = pickledClassifier.predict(transformed_feat_unlabeled)

Error message:

    Traceback (most recent call last):
      File "../clf.py", line 615, in <module>
        if __name__=="__main__": main()
      File "../clf.py", line 579, in main
        cv.fit_transform(pickledClassifierFile)
      File "../sklearn/feature_extraction/text.py", line 780, in fit_transform
        vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
      File "../sklearn/feature_extraction/text.py", line 727, in _count_vocab
        raise ValueError("empty vocabulary; perhaps the documents only"
    ValueError: empty vocabulary; perhaps the documents only contain stop words

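You should use the same vectorizer instance to transform both the training and the test data. You can do that by building a pipeline from the vectorizer and the classifier, training the pipeline on the training set, and pickling the whole pipeline. Later, load the pickled pipeline and call predict on it.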

See this related question:
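
The pipeline approach described in the answer could look roughly like the sketch below. The vectorizer settings are taken from the question; the toy texts, the labels, the choice of LogisticRegression, and the file name text_pipeline.joblib are all assumptions made for illustration, not details from the original post.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for trainingTextFeat and its labels.
train_texts = [
    "great product works well",
    "really great value",
    "terrible product broke fast",
    "really terrible value",
]
train_labels = [1, 1, 0, 0]

# Vectorizer settings copied from the question; the classifier is an assumption.
pipeline = Pipeline([
    ("cv", CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)),
    ("clf", LogisticRegression()),
])

# Fitting the pipeline fits the vectorizer and the classifier in one step.
pipeline.fit(train_texts, train_labels)

# Persist the fitted vectorizer and classifier together as one object.
model_path = os.path.join(tempfile.mkdtemp(), "text_pipeline.joblib")
joblib.dump(pipeline, model_path)

# Later (possibly in another process): load the pipeline and predict on raw text.
# No separate fit_transform() call is needed, because the loaded pipeline
# already carries the vocabulary learned during training.
loaded_pipeline = joblib.load(model_path)
unlabeled_texts = ["great value", "terrible product"]
predicted_labels = loaded_pipeline.predict(unlabeled_texts)
```

Because the fitted vectorizer travels with the classifier, the unlabeled text is always mapped onto the same feature columns the classifier was trained on.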

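Thank you, Olivier (@ogrisel). Your solution looks neat, and I will give it a try. However, I found a solution using a naive approach: 1) I pickled the training features, the training labels, and the classifier separately, 2) obtained a new vectorizer by running fit_transform() on the pickled training features, 3) transformed the features of the unseen, unlabeled data to the dimensions of the training data, 4) fit the pickled classifier with the transformed training features and the pickled labels, and 5) then predicted labels on the transformed unlabeled data features.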
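
The five steps in this comment can be sketched as follows. This is only an illustration under stated assumptions: the toy texts and labels, the MultinomialNB classifier, the make_vectorizer helper, and the file names are hypothetical, and the "pickled training features" are assumed to be the raw training texts, since fit_transform() needs text to build a vocabulary.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real training set.
training_texts = [
    "great product works well",
    "really great value",
    "terrible product broke fast",
    "really terrible value",
]
training_labels = [1, 1, 0, 0]


def make_vectorizer():
    """Build a vectorizer with the settings used in the question."""
    return CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)


# Training time: fit a vectorizer and a classifier (classifier type is an assumption).
train_cv = make_vectorizer()
clf = MultinomialNB().fit(train_cv.fit_transform(training_texts), training_labels)

# Step 1: pickle the training texts, the labels, and the classifier separately.
save_dir = tempfile.mkdtemp()
joblib.dump(training_texts, os.path.join(save_dir, "train_texts.joblib"))
joblib.dump(training_labels, os.path.join(save_dir, "train_labels.joblib"))
joblib.dump(clf, os.path.join(save_dir, "classifier.joblib"))

# Labeling time: load everything back.
saved_texts = joblib.load(os.path.join(save_dir, "train_texts.joblib"))
saved_labels = joblib.load(os.path.join(save_dir, "train_labels.joblib"))
saved_clf = joblib.load(os.path.join(save_dir, "classifier.joblib"))

# Step 2: fit a new vectorizer on the saved training texts.
cv = make_vectorizer()
train_matrix = cv.fit_transform(saved_texts)

# Step 3: map the unlabeled texts onto the training vocabulary.
unlabeled_texts = ["great value", "terrible product"]
unlabeled_matrix = cv.transform(unlabeled_texts)

# Step 4: refit the loaded classifier on the transformed training data.
saved_clf.fit(train_matrix, saved_labels)

# Step 5: predict labels for the unlabeled data.
predicted = saved_clf.predict(unlabeled_matrix)
```

Fitting CountVectorizer on the same texts with the same settings reproduces the same vocabulary, which is why the columns line up here. Pickling the fitted vectorizer itself, or the whole pipeline as the answer suggests, avoids having to store and reprocess the training texts.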