Python 3.x: Can a trained sentiment classification model be used to predict on other data?

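I'm new to scikit-learn, but I have written code that predicts sentiment with Naive Bayes, using a .csv file as the training data. Now I want to use the trained model on a new .csv file that contains new comments to classify. When I apply the same steps I used for the first file (the new data has to be prepared the same way, with tokenization and stemming), I get the following error: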

    ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-414670b5f455> in <module>
      5 
      6 #Prediccion muestras
----> 7 r = clf.predict_proba(tfidf_x1)
      8 data1['Prediccion']  = r

~\anaconda3\lib\site-packages\sklearn\naive_bayes.py in predict_proba(self, X)
    116             order, as they appear in the attribute :term:`classes_`.
    117         """
--> 118         return np.exp(self.predict_log_proba(X))
    119 
    120 

~\anaconda3\lib\site-packages\sklearn\naive_bayes.py in predict_log_proba(self, X)
     96         check_is_fitted(self)
     97         X = self._check_X(X)
---> 98         jll = self._joint_log_likelihood(X)
     99         # normalize by P(x) = P(f_1, ..., f_n)
    100         log_prob_x = logsumexp(jll, axis=1)

~\anaconda3\lib\site-packages\sklearn\naive_bayes.py in _joint_log_likelihood(self, X)
    775     def _joint_log_likelihood(self, X):
    776         """Calculate the posterior log probability of the samples X"""
--> 777         return (safe_sparse_dot(X, self.feature_log_prob_.T) +
    778                 self.class_log_prior_)
    779 

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~\anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    151             ret = np.dot(a, b)
    152     else:
--> 153         ret = a @ b
    154 
    155     if (sparse.issparse(a) and sparse.issparse(b)

~\anaconda3\lib\site-packages\scipy\sparse\base.py in __matmul__(self, other)
    558             raise ValueError("Scalar operands are not allowed, "
    559                              "use '*' instead")
--> 560         return self.__mul__(other)
    561 
    562     def __rmatmul__(self, other):

~\anaconda3\lib\site-packages\scipy\sparse\base.py in __mul__(self, other)
    514 
    515             if other.shape[0] != self.shape[1]:
--> 516                 raise ValueError('dimension mismatch')
    517 
    518             result = self._mul_multivector(np.asarray(other))

ValueError: dimension mismatch
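So the program runs fine until it reaches the prediction on `tfidf_x1`, where the error is raised. Do I need an additional step to make this work, or am I using the wrong function?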

I am using Python 3.7.

Thanks in advance.
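The `dimension mismatch` at the bottom of the traceback is consistent with the new matrix having a different number of columns than the model was trained on. The following is a minimal sketch that reproduces that situation, assuming the cause is a vectorizer that was fit again on the new data. The toy sentences and the names `train_texts`, `new_texts`, `n_trained` and `n_refit` are hypothetical and are not taken from the question's CSV files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the two CSV files
train_texts = ["buen servicio", "mal servicio", "muy buen producto", "muy mal producto"]
train_labels = ["pos", "neg", "pos", "neg"]
new_texts = ["producto excelente y buen precio", "servicio lento y mal trato"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)

# Number of features (vocabulary size) the classifier was trained on
n_trained = clf.feature_log_prob_.shape[1]

# Fitting again on the new data replaces the learned vocabulary
vectorizer.fit(new_texts)
X_bad = vectorizer.transform(new_texts)
n_refit = X_bad.shape[1]

# The column counts no longer agree, so prediction raises ValueError
try:
    clf.predict_proba(X_bad)
    error = None
except ValueError as exc:
    error = exc

print(n_trained, n_refit, type(error).__name__)
```

Comparing `tfidf_x1.shape[1]` with `clf.feature_log_prob_.shape[1]` in the original notebook would show whether the same thing is happening there. The exact wording of the error message depends on the scikit-learn version.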

# -*- coding: UTF-8 -*-
import pandas as pd, numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from nltk.stem import PorterStemmer
data = pd.read_csv(r"C:\Users\Propietario\Desktop\machine learning comentarios\Naives Bayes muestra 2_ corregido.csv", encoding = 'utf-8', sep = ';', header = 0)
data = data.dropna()
def limpiar_tokenizar(texto):
    # Convert all text to lowercase
    nuevo_texto = texto.lower()
    # Remove punctuation marks
    regex = '[\\!\\"\\#\\$\\%\\&\\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\@\\[\\\\\\]\\^_\\`\\{\\|\\}\\~]'
    nuevo_texto = re.sub(regex , ' ', nuevo_texto)
    # Remove digits
    nuevo_texto = re.sub(r"\d+", ' ', nuevo_texto)
    # Collapse runs of whitespace
    nuevo_texto = re.sub(r"\s+", ' ', nuevo_texto)
    # Tokenize into individual words
    nuevo_texto = nuevo_texto.split(sep = ' ')
    # Drop tokens shorter than 2 characters
    nuevo_texto = [token for token in nuevo_texto if len(token) > 1]
    # Apply stemming
    stemming = PorterStemmer()
    nuevo_texto = [stemming.stem(word) for word in nuevo_texto]
    return(nuevo_texto)
datos_X = data.Comentarios
datos_y = data.Tipo

stop_words = list(stopwords.words('spanish'))

X_train, X_test, y_train, y_test = train_test_split(datos_X, datos_y, test_size = 0.3, random_state = 120)
tfidf_vectorizador = TfidfVectorizer(tokenizer = limpiar_tokenizar, min_df = 3, stop_words = stop_words)
tfidf_vectorizador.fit(X_train)
tfidf_vectorizador.fit(X_test)
tfidf_train = tfidf_vectorizador.transform(X_train)
tfidf_test  = tfidf_vectorizador.transform(X_test)
clf = MultinomialNB(alpha = 0.6,fit_prior = False).fit(tfidf_train, y_train)
predicciones_test = clf.predict(X=tfidf_test)
data1 = pd.read_csv(r"C:\Users\Propietario\Desktop\machine learning comentarios\excels trabajados\Muestras\Muestra1.csv", encoding = 'utf-8', sep = ';', header = 0)
data1 = data1.dropna()
x1 = data1['Comments']
tfidf_vectorizador.fit(x1)
tfidf_x1 = tfidf_vectorizador.transform(x1)
data1['Prediccion']  = clf.predict_proba(tfidf_x1) 
data1.to_excel('Naive Bayes 5 Muestras.xlsx', sheet_name='Muestra1')
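One likely reading of the code above is that calling `tfidf_vectorizador.fit(x1)` on the new file rebuilds the vocabulary, so `tfidf_x1` no longer has the columns `clf` was trained on. A minimal sketch of the usual pattern follows: fit the vectorizer once on the training text, then only call `transform` on new data. The toy sentences and variable names are assumptions for illustration, not the asker's data, and the custom tokenizer and stop words are left out to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the training CSV
train_texts = ["buen servicio", "mal servicio", "muy buen producto", "muy mal producto"]
train_labels = ["pos", "neg", "pos", "neg"]
# Hypothetical toy data standing in for the new, unlabeled CSV
new_texts = ["producto excelente y buen precio", "servicio lento y mal trato"]

# Fit the vectorizer ONCE, on the training text only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB(alpha=0.6, fit_prior=False).fit(X_train, train_labels)

# For new data: transform only, so the columns match the training matrix.
# Words the vectorizer never saw during fit are simply ignored.
X_new = vectorizer.transform(new_texts)
assert X_new.shape[1] == X_train.shape[1]

# predict returns one label per row, which fits a single column
labels = clf.predict(X_new)

# predict_proba returns one column per class, in the order of clf.classes_
probs = clf.predict_proba(X_new)

print(list(labels))
print(list(clf.classes_), probs.shape)
```

Applied to the question's code, this would mean dropping the `tfidf_vectorizador.fit(x1)` line and fitting the vectorizer on `X_train` only, not on `X_test` afterwards. Since `predict_proba` returns a 2-D array, `data1['Prediccion']` would more naturally hold `clf.predict(tfidf_x1)`, with the per-class probabilities stored in separate columns if they are needed.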