scikit-learn: movie rating prediction using TF-IDF


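I have a dataset in the following format:

Movie_name, TomatoCritics, Target_variable

Here, the TomatoCritics attribute holds free text written by different users about each movie, and Target_variable is a binary value (0 or 1) indicating whether the movie should be watched.

I am using TF-IDF for this problem. My code is as follows: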

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer


# Read textual training data-
text_training = pd.read_csv("Textual-Training_Data.csv")

# Read textual testing data-
text_testing = pd.read_csv("Textual-Testing_Data.csv")

# Get dimensions of training data-
text_training.shape
# (95, 3)

# Get dimensions of testing data-
text_testing.shape
# (224, 3)


# Check for missing values in training data-
text_training.isnull().values.any()
# True

# Check for missing values in testing data-
text_testing.isnull().values.any()
# True

# Remove any row having missing value from training data-
text_training_nona = text_training.dropna(axis = 0, how='any')

# Remove any row having missing value from testing data-
text_testing_nona = text_testing.dropna(axis = 0, how = 'any')

# Get dimensions of training data AFTER removing empty rows-
text_training_nona.shape
# (73, 3)

# Get dimensions of testing data AFTER removing empty rows-
text_testing_nona.shape
# (158, 3)


# Attributes to use for training and testing sets for ML-
cols_train = ['tomatoConsensus', 'goodforairplanes']
cols_test = ['tomatoConsensus', 'goodforairplanes']



# Split training dataset into features (X) and label (y) for training-
X_train = text_training_nona['tomatoConsensus']
y_train = text_training_nona['goodforairplanes']


# Split training dataset into features (X) and label (y) for testing-
X_test = text_testing_nona["tomatoConsensus"]
y_test = text_testing_nona['goodforairplanes']




# Initialize Count Vectorizer using TF-IDF ->
cv = TfidfVectorizer(min_df = 1, stop_words='english')

# Convert text to TF-IDF ->
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.fit_transform(X_test)

# Multinomial Naive Bayes classifier-
mnb = MultinomialNB()

# Train model on training data-
mnb.fit(X_train_cv, y_train)

print(X_test_cv[0])
'''
(0, 1168)   0.20066499253877468
  (0, 31)   0.2419027475877309
  (0, 1090) 0.22790133982975397
  (0, 5)    0.2616366234663056
  (0, 877)  0.2616366234663056
  (0, 1279) 0.2419027475877309
  (0, 850)  0.1786670002268731
  (0, 1341) 0.2616366234663056
  (0, 2)    0.2616366234663056
  (0, 695)  0.2616366234663056
  (0, 1221) 0.2419027475877309
  (0, 884)  0.1786670002268731
  (0, 1070) 0.2616366234663056
  (0, 782)  0.2616366234663056
  (0, 252)  0.20066499253877468
  (0, 1259) 0.2419027475877309
  (0, 1093) 0.20816746395117927
  (0, 122)  0.2170410042381541
'''

y_pred = mnb.predict(X_test_cv[0])

The last line, which calls mnb.predict(), raises the error:

ValueError: dimension mismatch

What is going wrong?

Thanks

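You should call fit_transform only once, on the training data, and then use the already fitted cv object to transform the test data. Change

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.fit_transform(X_test)

to

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

This will solve your problem.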

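If you call fit_transform again on other data, that data may contain a different set of unique words, so the vectorizer builds a vocabulary of a different size. mnb was trained on features from the first vocabulary, so the matrix you pass to predict has a different number of columns than the model expects. That is the ValueError: dimension mismatch.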
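
To make the mismatch concrete, here is a minimal sketch that reproduces it. The sentences and labels are hypothetical toy data standing in for the tomatoConsensus and goodforairplanes columns, and the exact wording of the error message depends on the installed scikit-learn version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, not taken from the original CSV files.
train_texts = [
    "a gripping thriller with strong performances",
    "dull plot and flat characters",
    "funny charming and heartfelt comedy",
]
train_labels = [1, 0, 1]
test_texts = ["a charming thriller", "flat and forgettable sequel"]

# Fit the vectorizer on the training text and train the classifier.
cv = TfidfVectorizer(stop_words="english")
X_train_cv = cv.fit_transform(train_texts)
mnb = MultinomialNB().fit(X_train_cv, train_labels)
train_width = X_train_cv.shape[1]

# Fitting a vectorizer on the test text builds a NEW vocabulary,
# so the resulting matrix has a different number of columns.
refit_cv = TfidfVectorizer(stop_words="english")
X_test_refit = refit_cv.fit_transform(test_texts)
refit_width = X_test_refit.shape[1]
print("train columns:", train_width, "refit test columns:", refit_width)

# The classifier expects train_width columns, so predict fails.
try:
    mnb.predict(X_test_refit)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```

Even when two vocabularies happen to have the same size, the column indices would refer to different words, so the predictions would be meaningless without any error being raised.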

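Edit:
Just check the shapes of X_test_cv and X_train_cv in both cases. If you call fit_transform on both X_train and X_test, the two matrices have different numbers of columns. If you replace the second fit_transform with transform, the column counts match.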
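
The shape check can be sketched as follows, again on hypothetical toy data rather than the original CSV files. The second half uses make_pipeline, which is not part of the original code. It is shown only as one possible way to keep the fit and transform steps from being mixed up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not taken from the original CSV files.
train_texts = [
    "a gripping thriller with strong performances",
    "dull plot and flat characters",
    "funny charming and heartfelt comedy",
]
train_labels = [1, 0, 1]
test_texts = ["a charming thriller", "flat and forgettable sequel"]

# Fit on the training text only, then reuse the fitted vocabulary.
cv = TfidfVectorizer(stop_words="english")
X_train_cv = cv.fit_transform(train_texts)
X_test_cv = cv.transform(test_texts)

# Row counts differ (one row per document); column counts match.
print(X_train_cv.shape, X_test_cv.shape)

mnb = MultinomialNB().fit(X_train_cv, train_labels)
y_pred = mnb.predict(X_test_cv)

# Alternative (an assumption, not the asker's approach): a pipeline calls
# fit_transform during fit() and transform during predict() automatically.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)
y_pred_pipeline = model.predict(test_texts)
```

With the pipeline, raw text goes straight into fit and predict, so there is no separate vectorizer call that could be refit by accident.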