Python: training and dev data, ValueError: dimension mismatch

Tags: python, scikit-learn

I built the following classification model:

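from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def buildData(x):
    # Fits a new vectorizer and a new TF-IDF transformer on whatever data is passed in
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(x)
    tf_transformer = TfidfTransformer().fit(X_train_counts)
    X_train_tf = tf_transformer.transform(X_train_counts)
    return X_train_tf

# parseXml is the asker's own helper (not shown); presumably it returns texts and labels
x, y = parseXml('data/training.xml')
xDev, yDev = parseXml('data/dev.xml')

x = buildData(x)
clf = MultinomialNB().fit(x, y)
predicted = clf.predict(x)
print('Accuracy: ', accuracy_score(y, predicted))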
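
I fit the model on the training data x and evaluate it on that same x.

When I try to predict on xDev instead (predicted = clf.predict(xDev)), it raises an error.

I assumed this was because the data had not been prepared (as a TF-IDF matrix), so I passed xDev through the same function:

xDev = buildData(xDev)

Unfortunately, that produced the following error: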

Traceback (most recent call last):
  File "C:/Users/BG/Desktop/P2/E2.py", line 43, in <module>
    predicted = clf.predict(xDev)
  File "C:\Python35\lib\site-packages\sklearn\naive_bayes.py", line 66, in predict
    jll = self._joint_log_likelihood(X)
  File "C:\Python35\lib\site-packages\sklearn\naive_bayes.py", line 725, in _joint_log_likelihood
    return (safe_sparse_dot(X, self.feature_log_prob_.T) +
  File "C:\Python35\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
    ret = a * b
  File "C:\Python35\lib\site-packages\scipy\sparse\base.py", line 476, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch

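You need to keep the fitted transformers from the first (training) call and reuse them. The features depend on the vocabulary of the data you fit on: CountVectorizer learns one column per unique word, and tf_transformer learns its IDF weights for exactly those columns. In your case the corpus vocabulary (the set of all unique words across the documents) is likely to differ between x and xDev, which is common in text classification. For example, x might contain 1,000 distinct words while xDev contains 800 (different or overlapping), so the two matrices have different numbers of columns and clf.predict fails with a dimension mismatch.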
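
To make the vocabulary point concrete, here is a minimal sketch on made-up toy sentences (train_docs and dev_docs are hypothetical stand-ins for x and xDev). It compares the column count produced by a vectorizer refitted on the dev texts with the column count produced by reusing the vectorizer fitted on the training texts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the training and dev texts.
train_docs = ["the cat sat on the mat", "the dog barked at the cat"]
dev_docs = ["a bird sang", "the cat slept"]

# Fit one vectorizer on the training texts only.
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_docs)

# Refitting a fresh vectorizer on the dev texts learns a different vocabulary,
# so the number of columns no longer matches the training matrix.
dev_counts_refit = CountVectorizer().fit_transform(dev_docs)

# Reusing the training vectorizer maps dev texts onto the training columns;
# words never seen during training are simply ignored.
dev_counts_reused = count_vect.transform(dev_docs)

print("train columns:     ", train_counts.shape[1])
print("dev (refit) columns:", dev_counts_refit.shape[1])
print("dev (reused) columns:", dev_counts_reused.shape[1])
```

Only the reused variant has the same width as the training matrix, which is what a classifier fitted on that matrix expects.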

If you keep count_vect and tf_transformer from the first buildData() call and transform xDev with those same objects, instead of creating new ones inside buildData(), the error goes away.
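
One way to restructure the code along these lines is sketched below. The helper names fit_features and apply_features are made up for illustration, and the toy texts stand in for the output of parseXml, which is not shown in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB


def fit_features(train_texts):
    """Fit both transformers on the training texts; return them with the TF-IDF matrix."""
    count_vect = CountVectorizer()
    counts = count_vect.fit_transform(train_texts)
    tf_transformer = TfidfTransformer().fit(counts)
    return count_vect, tf_transformer, tf_transformer.transform(counts)


def apply_features(count_vect, tf_transformer, texts):
    """Transform new texts with the already-fitted transformers (no refitting)."""
    return tf_transformer.transform(count_vect.transform(texts))


# Hypothetical toy data in place of parseXml('data/training.xml') and parseXml('data/dev.xml').
x = ["good movie", "great film", "bad movie", "awful film"]
y = ["pos", "pos", "neg", "neg"]
xDev = ["good film", "bad plot"]

count_vect, tf_transformer, x_tf = fit_features(x)
clf = MultinomialNB().fit(x_tf, y)

# The dev matrix now has the same columns as the training matrix.
xDev_tf = apply_features(count_vect, tf_transformer, xDev)
predicted = clf.predict(xDev_tf)
print(predicted)
```

The key difference from the original buildData is that fit (or fit_transform) is called only on the training texts, while the dev texts only ever go through transform.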

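In other words, the tf_transformer should be created once, fitted on the training data, and reused. It should not be recreated for test or production data, which is what happens in your code when buildData is called again on the test data. Another approach is to build the NLP/classification workflow as a pipeline that reuses the transformers created once.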
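
A pipeline of this kind could look like the following sketch, which uses scikit-learn's Pipeline class. This is one possible reading of the approach, not necessarily the one the answer originally pointed to, and the toy texts are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data in place of the parsed XML files.
x = ["good movie", "great film", "bad movie", "awful film"]
y = ["pos", "pos", "neg", "neg"]
xDev = ["good film", "bad plot"]
yDev = ["pos", "neg"]

# The pipeline owns the vectorizer, the TF-IDF transformer and the classifier.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# fit() fits every step on the training texts only.
text_clf.fit(x, y)

# predict() pushes raw dev texts through the already-fitted steps.
predicted = text_clf.predict(xDev)
print("Accuracy:", accuracy_score(yDev, predicted))
```

Because the raw texts go straight into fit and predict, there is no separate feature-building function that could accidentally refit the transformers on the dev data.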

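TL;DR: fit the following once in your application workflow (along with the CountVectorizer that produces X_train_counts), not multiple times: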

tf_transformer = TfidfTransformer().fit(X_train_counts)