Deploying a text classification model on new (unseen) text in Python

python machine-learning scikit-learn nlp

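I am working on a text classification problem. I have attached a simple dummy snippet of a text classification model that I trained.

How do I deploy the model on new text? When the model is used to check predictions on the test set, it classifies the text correctly; however, when I use new data, the classification is incorrect.

Is this because the new text needs to be vectorized? Am I missing something fundamental?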

from collections import Counter

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("/Users/veg.csv")
print(df)

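Whatever (new) text you feed into the model must go through exactly the same preprocessing steps as the training data. Here, that is the CountVectorizer already fitted on X_train: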

new_data_vectorized = cv.transform(new_data) # NOT fit_transform
new_predictions = naive_bayes.predict(new_data_vectorized)
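To make the fit-once, transform-later pattern concrete, here is a minimal end-to-end sketch. It assumes a tiny in-memory dataset in place of veg.csv (the training texts, the numeric label encoding, and the `label_names` mapping are all hypothetical stand-ins, not the asker's data), and reuses the names `cv` and `naive_bayes` from the snippet above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for veg.csv (not from the original post).
train_texts = ["carrot", "potato", "carrot potato",
               "grapes", "banana", "banana grapes",
               "birch", "oak", "oak birch"]
train_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # assumed encoding

# Assumed mapping from numeric labels to names.
label_names = {0: "vegetable", 1: "fruit", 2: "tree"}

# Fit the vectorizer ONCE, on the training text only.
cv = CountVectorizer()
X_train_vectorized = cv.fit_transform(train_texts)

naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_vectorized, train_labels)

# New text: reuse the fitted vocabulary with transform, never fit_transform.
new_data = ["carrot", "grapes", "banana", "potato", "birch"]
new_data_vectorized = cv.transform(new_data)
new_predictions = naive_bayes.predict(new_data_vectorized)

# Map numeric predictions back to readable labels.
new_labels = [label_names[int(p)] for p in new_predictions]
for text, label in zip(new_data, new_labels):
    print(f"{text}: {label}")
```

Calling `fit_transform` on the new data instead would build a different vocabulary, so the feature columns would no longer line up with the ones the classifier was trained on.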

check_predictions = []
for i in range(len(X_test)):   
    if predictions[i] == 0:
        check_predictions.append('vegetable')
    if predictions[i] == 1:
        check_predictions.append('fruit')
    if predictions[i] == 2:
        check_predictions.append('tree')
        
dummy_df = pd.DataFrame({'actual_label': list(y_test), 'prediction': check_predictions, 'Text':list(X_test)})
dummy_df.replace(to_replace=0, value='vegetable', inplace=True)
dummy_df.replace(to_replace=1, value='fruit', inplace=True)
dummy_df.replace(to_replace=2, value='tree', inplace=True)
print("DUMMY DF")
print(dummy_df.head(10))

new_data=['carrot', 'grapes',
          'banana', 'potato',
          'birch','carrot', 'grapes',
          'banana', 'potato', 'birch','carrot','grapes',
          'banana', 'potato',
          'birch','carrot', 'grapes',
          'banana', 'potato', 'birch','grapes',
          'banana', 'potato', 'birch']

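# NOTE: this loop indexes `predictions` (the test-set predictions), so
# new_data is never vectorized or passed to the model.
new_predictions = []
for i in range(len(new_data)):
    if predictions[i] == 0:
        new_predictions.append('vegetable')
    if predictions[i] == 1:
        new_predictions.append('fruit')
    if predictions[i] == 2:
        new_predictions.append('tree')

# NOTE: y_test holds the test-set labels, not labels for new_data.
new_df = pd.DataFrame({'actual_label': list(y_test), 'prediction': new_predictions, 'Text':list(new_data)})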
new_df.replace(to_replace=0, value='vegetable', inplace=True)
new_df.replace(to_replace=1, value='fruit', inplace=True)
new_df.replace(to_replace=2, value='tree', inplace=True)
print("NEW DF")
print(new_df.head(10))

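One way to make the vectorization step harder to forget is to bundle the vectorizer and the classifier into a single scikit-learn pipeline, so that `predict` accepts raw strings. This is only a sketch of that alternative, not the asker's code: the training texts and string labels below are hypothetical, and `model` is a name introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (not from the original post); string labels are
# used directly, so no numeric-to-name mapping is needed afterwards.
train_texts = ["carrot", "potato", "carrot potato",
               "grapes", "banana", "banana grapes",
               "birch", "oak", "oak birch"]
train_labels = ["vegetable", "vegetable", "vegetable",
                "fruit", "fruit", "fruit",
                "tree", "tree", "tree"]

# The pipeline fits the vectorizer and the classifier together on the
# training text, and applies the same fitted vectorizer inside predict().
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Raw strings go straight in; vectorization happens inside the pipeline.
new_data = ["carrot", "grapes", "birch"]
new_predictions = model.predict(new_data)
print(list(zip(new_data, new_predictions)))
```

With this design there is a single fitted object to keep around for deployment, and no separate `cv.transform` call to remember.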