Python ValueError: Input has n_features=10 while the model has been trained with n_features=4261


python machine-learning scikit-learn


I am trying to make predictions with a trained bag-of-words (BoW), tf-idf, and SVM model:

def bagOfWords(files_data):
    count_vector = sklearn.feature_extraction.text.CountVectorizer()
    return count_vector.fit_transform(files_data)

files = sklearn.datasets.load_files(dir_path)
word_counts = util.bagOfWords(files.data)
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True).fit(word_counts)
X = tf_transformer.transform(word_counts)
clf = sklearn.svm.LinearSVC()
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y, test_size=test_size)

I can run the following without problems:

clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)

But the following raises the error:

clf.fit(X_train, y_train)
new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"]) 
ready_to_be_predicted = tf_transformer.transform(new_word_counts)
predicted = clf.predict(ready_to_be_predicted)

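I thought I was already reusing the previously fitted tf_transformer, so I don't understand why the error still occurs. Any help would be much appreciated.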

You are not keeping the CountVectorizer that you originally fit on the data.

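This bagOfWords call fits a separate, brand-new CountVectorizer in its own scope, so its vocabulary (and therefore its number of columns) comes only from the new sentence: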

new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"])

You want to use the one that was fit on your training set.
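To make the mismatch concrete, here is a minimal sketch on a made-up toy corpus (the documents and variable names below are illustrative assumptions, not the asker's data). It compares the column count produced by a freshly fitted vectorizer with the one produced by reusing the fitted vectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training corpus, invented purely for illustration.
train_docs = ["the cat sat on the mat", "dogs chase cats in the park"]
new_doc = ["a place to listen to music"]

# Fit once on the training documents and keep the object around.
fitted_vectorizer = CountVectorizer()
train_counts = fitted_vectorizer.fit_transform(train_docs)

# Wrong: a fresh vectorizer learns its vocabulary from the new text only,
# so its column count has nothing to do with the training matrix.
fresh_counts = CountVectorizer().fit_transform(new_doc)

# Right: transform() reuses the training vocabulary, so the columns line up.
reused_counts = fitted_vectorizer.transform(new_doc)

print("training columns:", train_counts.shape[1])
print("fresh vectorizer columns:", fresh_counts.shape[1])
print("reused vectorizer columns:", reused_counts.shape[1])
```

Only the reused vectorizer yields a matrix whose width matches what the downstream transformer and classifier were fitted on.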

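You are also fitting your transformers on the whole of X, including X_test. You want to exclude the test set from any fitting, including the transformations.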

Try something like this:

import sklearn.datasets
import sklearn.feature_extraction.text
import sklearn.model_selection
import sklearn.svm

files = sklearn.datasets.load_files(dir_path)

# Split into train/test before fitting anything.
# (sklearn.cross_validation was removed in newer scikit-learn releases;
# train_test_split now lives in sklearn.model_selection.)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(files.data, files.target)

# Fit and transform with X_train only
count_vector = sklearn.feature_extraction.text.CountVectorizer()
word_counts = count_vector.fit_transform(X_train)
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
X_train = tf_transformer.fit_transform(word_counts)

clf = sklearn.svm.LinearSVC()

clf.fit(X_train, y_train)

# Transform X_test with the already fitted objects (transform, not fit_transform)
test_word_counts = count_vector.transform(X_test)
ready_to_be_predicted = tf_transformer.transform(test_word_counts)
y_predicted = clf.predict(ready_to_be_predicted)

# Test example: reuse the same fitted vectorizer and transformer
new_word_counts = count_vector.transform(["a place to listen to music it s making its way to the us"])

ready_to_be_predicted = tf_transformer.transform(new_word_counts)
predicted = clf.predict(ready_to_be_predicted)

Of course, it is much simpler to combine these transformers into a Pipeline.
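A rough sketch of what the Pipeline version might look like follows. The toy texts, labels, and step names ("counts", "tfidf", "svm") are assumptions made up for illustration; in the real case the texts and labels would come from files.data and files.target.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration (1 = music, 0 = sports).
texts = [
    "a great place to listen to music",
    "the band played a new song last night",
    "streaming music services are growing fast",
    "she bought tickets for the jazz concert",
    "the team won the football match",
    "he scored two goals in the final game",
    "the coach praised the players after training",
    "fans filled the stadium for the cup final",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits every step on the training texts only and replays
# the same fitted steps at prediction time.
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("svm", LinearSVC()),
])
text_clf.fit(X_train, y_train)

# Raw strings go straight in; no manual transform calls are needed.
y_predicted = text_clf.predict(X_test)
predicted = text_clf.predict(
    ["a place to listen to music it s making its way to the us"]
)
print(y_predicted, predicted)
```

Because the vectorizer, the tf-idf transformer, and the classifier live in one object, there is no way to accidentally refit one of them on new text, which is exactly the mistake that caused the n_features mismatch.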
