Python PySpark: SVM_model.predict() -> "dimension mismatch"


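For a binary classification task, I am trying to test various candidate models on a set of about 20k labeled points with about 5,000 features each.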

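I used a VectorAssembler over the input columns to build a feature vector that combines numeric features, sparse vectors, and so on, and then converted those vectors (along with the labels) into labeled points: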

assembler = VectorAssembler(inputCols = ["fileSize", "hour", "day", "month", "title_tfidf", "ct_tfidf", "excerpt_tfidf", "regex_tfidf"], outputCol="features")
training_set_vector = assembler.transform(training_set).select("Target", "features")
training_set_labeled = training_set_vector.rdd.map(lambda x: Row(label=x[0],features=DenseVector(x[1].toArray()))).map(lambda row: LabeledPoint(row[1], [row[0]]))
The output of training_set_vector and training_set_labeled looks like this:

[Row(Target=0, features=SparseVector(2273, {0: 8397.0, 1: 13.0, 2: 12.0, 3: 1.0, 8: 3.3147, 82: 5.721, 370: 0.25466, 410: 7.0356, 418: 5.3786, ...}))]
[..., 0.0, 0.0])]
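For reference, VectorAssembler concatenates its input columns in order, so the length of the assembled vector is the sum of the widths of the inputs (one slot per numeric column plus the size of each vector column). The block below is a plain-Python sketch of that bookkeeping, not Spark's implementation; the helper name assemble and the sample row are made up for illustration.

```python
def assemble(columns):
    """Concatenate scalars and sparse vectors into one sparse vector.

    Each column is either a number or a (size, {index: value}) pair.
    Returns (total_size, {index: value}).
    Hypothetical helper: it only mimics the index arithmetic.
    """
    total_size = 0
    merged = {}
    for col in columns:
        if isinstance(col, tuple):
            size, entries = col
            for idx, val in entries.items():
                # Shift each index by the width of everything before it.
                merged[total_size + idx] = val
            total_size += size
        else:
            if col != 0:
                merged[total_size] = float(col)
            total_size += 1
    return total_size, merged


# Made-up row: four numeric columns followed by two small sparse vectors.
row = [8397.0, 13.0, 12.0, 1.0, (5, {4: 3.3147}), (3, {0: 5.721})]
size, entries = assemble(row)
print(size)     # 4 + 5 + 3 = 12
print(entries)
```

If the training and test sets are assembled from vector columns of different sizes, the resulting feature vectors will differ in length even when the column names match.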

Then I fit random forest, SVM, and GBT models on the labeled points:

RF_model = RandomForest.trainClassifier(training_set_labeled, numClasses=2, categoricalFeaturesInfo = {}, numTrees=10, featureSubsetStrategy = "auto", impurity='gini', maxDepth=4, maxBins=32)
SVM_model = SVMWithSGD.train(training_set_labeled, iterations=500)
GBT_model = GradientBoostedTrees.trainClassifier(training_set_labeled, categoricalFeaturesInfo = {}, maxDepth=4, maxBins=32, numIterations=5)
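Everything works up to this point. The problem appears when I try to apply these models to the test set (which has the same dimensions as the training set). Here is the code I use to apply them to the test set: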

predictions_RF = RF_model.predict(test_set_labeled.map(lambda r: r.features))
predictions_SVM = SVM_model.predict(test_set_labeled.map(lambda r: r.features))
predictions_GBT = GBT_model.predict(test_set_labeled.map(lambda r: r.features))
The RF and GBT models complete successfully:

print predictions_RF.take(5)
print predictions_GBT.take(5)
[0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 1.0, 0.0, 0.0]

But when I apply the SVM model, I get the following error:

AssertionError: dimension mismatch

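Running print SVM_model shows only about 500 weights, even though there are about 5k features. I assume that is the problem, but I am not sure how to handle it. Has anyone run into a similar issue, and can you tell me how to apply this SVM model (or another SVM model) to the test set?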
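One way to narrow this down is to compare the length of the model's weight vector with the length of every feature vector passed to predict(). A linear model scores a point as a dot product between weights and features, so the two lengths have to agree, which would explain why only the SVM complains. The block below is a minimal plain-Python sketch of that check, not PySpark's actual code; linear_predict, mismatched_sizes, and the toy vectors are hypothetical.

```python
def linear_predict(weights, intercept, features, threshold=0.0):
    """Score one point the way a linear classifier does: w . x + b."""
    # A dot product is only defined when both vectors have the same length.
    assert len(weights) == len(features), "dimension mismatch"
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 if margin > threshold else 0.0


def mismatched_sizes(weights, feature_vectors):
    """Return the distinct feature-vector lengths that differ from len(weights)."""
    expected = len(weights)
    return sorted({len(v) for v in feature_vectors if len(v) != expected})


# Toy data (made up): a model trained on 3 features, one test row with 5.
weights = [0.5, -1.0, 2.0]
test_rows = [[1.0, 0.0, 1.0], [1.0, 2.0, 3.0, 4.0, 5.0]]

print(linear_predict(weights, 0.0, test_rows[0]))
print(mismatched_sizes(weights, test_rows))
```

On the real data, the analogous comparison would be len(SVM_model.weights) against the result of test_set_labeled.map(lambda r: len(r.features)).distinct().collect(), and the same for training_set_labeled. If the lengths differ, the feature vectors were probably not built the same way for both sets.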