PySpark: SVM_model.predict() -> "dimension mismatch"

python apache-spark pyspark

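For a binary classification task, I am trying to test various candidate models on a collection of ~20k labeled points with ~5,000 features each.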

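I used VectorAssembler to build a feature vector from the input columns (numeric features, sparse vectors, and so on), then converted these vectors, together with the labels, to LabeledPoints: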

assembler = VectorAssembler(inputCols = ["fileSize", "hour", "day", "month", "title_tfidf", "ct_tfidf", "excerpt_tfidf", "regex_tfidf"], outputCol="features")
training_set_vector = assembler.transform(training_set).select("Target", "features")
training_set_labeled = training_set_vector.rdd.map(lambda x: Row(label=x[0],features=DenseVector(x[1].toArray()))).map(lambda row: LabeledPoint(row[1], [row[0]]))

The output of training_set_vector and training_set_labeled looks like this:

[Row(Target=0, features=SparseVector(2273, {0: 8397.0, 1: 13.0, 2: 12.0, 3: 1.0, 8: 3.3147, 82: 5.721, 370: 0.25466, 410: 7.0356, ...
..., 0.0, 0.0])]

I then fit random forest, SVM, and GBT models on the labeled points:

RF_model = RandomForest.trainClassifier(training_set_labeled, numClasses=2, categoricalFeaturesInfo = {}, numTrees=10, featureSubsetStrategy = "auto", impurity='gini', maxDepth=4, maxBins=32)
SVM_model = SVMWithSGD.train(training_set_labeled, iterations=500)
GBT_model = GradientBoostedTrees.trainClassifier(training_set_labeled, categoricalFeaturesInfo = {}, maxDepth=4, maxBins=32, numIterations=5)

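Everything works fine up to this point. The problem appears when I try to apply these models to the test set (which has the same dimensions as the training set). Here is the code I use to apply them to the test set: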

predictions_RF = RF_model.predict(test_set_labeled.map(lambda r: r.features))
predictions_SVM = SVM_model.predict(test_set_labeled.map(lambda r: r.features))
predictions_GBT = GBT_model.predict(test_set_labeled.map(lambda r: r.features))

The RF and GBT models complete successfully:

print predictions_RF.take(5)
print predictions_GBT.take(5)

[0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 1.0, 0.0, 0.0]

But when I apply the SVM model, I get the following error:

AssertionError: dimension mismatch
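A rough illustration of why a linear model can raise this assertion while tree-based models do not: a linear model scores a vector with a dot product between its weights and the features, which is only defined when both have the same length, whereas a tree split only reads individual features by index. The sketch below is plain Python with hypothetical helper names (dot, linear_predict, stump_predict). It is not the MLlib implementation, only an assumption about the general mechanism.

```python
def dot(weights, features):
    # A dot product is only defined for vectors of equal length.
    assert len(weights) == len(features), "dimension mismatch"
    return sum(w * x for w, x in zip(weights, features))


def linear_predict(weights, intercept, features, threshold=0.0):
    # Linear classifier: compare the margin against a threshold.
    margin = dot(weights, features) + intercept
    return 1 if margin > threshold else 0


def stump_predict(feature_index, split_value, features):
    # A single tree split reads one feature by index; no length check happens.
    return 1 if features[feature_index] > split_value else 0


# Stand-in data: 500 weights versus a 5000-dimensional feature vector.
weights = [0.1] * 500
features = [1.0] * 5000

print(stump_predict(3, 0.5, features))  # the tree-style lookup succeeds

try:
    linear_predict(weights, 0.0, features)
except AssertionError as err:
    print("AssertionError:", err)  # the linear model fails on the length check
```

Under this assumption, the RF and GBT predictions succeeding does not by itself show that the training and test vectors are shaped as intended.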

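Running print SVM_model shows only ~500 weights, even though there are ~5k features. I assume that is the problem, but I'm not entirely sure how to address it. Has anyone run into a similar issue and can tell me how to apply this SVM model (or a different SVM model) to the test set?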
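One way to confirm the suspicion is to compare the length of the weight vector with the lengths of the feature vectors that are actually passed to predict(), rather than relying on the printed model. The helper names below (feature_length_counts, mismatched_lengths) are hypothetical, and the sketch uses plain lists as stand-ins. It assumes only that the weights and feature vectors support len().

```python
from collections import Counter


def feature_length_counts(vectors):
    """Count how many vectors there are of each length."""
    return Counter(len(v) for v in vectors)


def mismatched_lengths(weights, vectors):
    """Return {length: count} for vectors whose length differs from len(weights)."""
    expected = len(weights)
    counts = feature_length_counts(vectors)
    return {n: c for n, c in counts.items() if n != expected}


# Stand-in data: 500 weights, two 5000-dimensional vectors and one 500-dimensional one.
weights = [0.0] * 500
sample = [[0.0] * 5000, [0.0] * 5000, [0.0] * 500]

print(mismatched_lengths(weights, sample))  # {5000: 2}
```

With the objects from the question, the same check could presumably be run on a small sample, for example mismatched_lengths(SVM_model.weights, test_set_labeled.map(lambda r: r.features).take(100)), and likewise on training_set_labeled to see what dimension the model was actually trained on.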