Spark 2.4.4 metrics attribute error with Scala BinaryClassificationMetrics
I am trying to replicate this, but when I try to extract some metrics from the processed .csv file I get an error. My code snippet:
val splitSeed = 5043
val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3), splitSeed)
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
trainingData.show(20)
// Fit the model
val model = lr.fit(trainingData)
// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
// Run the model on the test features to get predictions
val predictions = model.transform(testData)
// The transform produced new columns: rawPrediction, probability and prediction
predictions.show()
// Use MLlib to evaluate: convert the DataFrame to an RDD
val myRdd = predictions.select("rawPrediction", "label").rdd
val predictionAndLabels = myRdd.map(x => (x(0).asInstanceOf[DenseVector](1), x(1).asInstanceOf[Double]))
// Instantiate the metrics object
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
println("area under the precision-recall curve: " + metrics.areaUnderPR)
println("area under the receiver operating characteristic (ROC) curve: " + metrics.areaUnderROC)
// A precision-recall curve plots (precision, recall) points for different threshold values, while a
// receiver operating characteristic (ROC) curve plots (recall, false positive rate) points.
// The closer the area under ROC is to 1, the better the model's predictions.
+------+---------+----+-----+----+------+----+------+----+---+----+------------+--------------------+-----+--------------------+--------------------+----------+
| id|thickness|size|shape|madh|epsize|bnuc|bchrom|nNuc|mit|clas|clasLogistic| features|label| rawPrediction| probability|prediction|
+------+---------+----+-----+----+------+----+------+----+---+----+------------+--------------------+-----+--------------------+--------------------+----------+
| 63375| 9.0| 1.0| 2.0| 6.0| 4.0|10.0| 7.0| 7.0|2.0| 4| 1|[9.0,1.0,2.0,6.0,...| 1.0|[0.36391634252951...|[0.58998813846052...| 0.0|
|128059| 1.0| 1.0| 1.0| 1.0| 2.0| 5.0| 5.0| 1.0|1.0| 2| 0|[1.0,1.0,1.0,1.0,...| 0.0|[0.81179252636135...|[0.69249134920886...| 0.0|
|145447| 8.0| 4.0| 4.0| 1.0| 2.0| 9.0| 3.0| 3.0|1.0| 4| 1|[8.0,4.0,4.0,1.0,...| 1.0|[0.06964047482828...|[0.51740308582457...| 0.0|
|183913| 1.0| 2.0| 2.0| 1.0| 2.0| 1.0| 1.0| 1.0|1.0| 2| 0|[1.0,2.0,2.0,1.0,...| 0.0|[0.96139876234944...|[0.72340177322811...| 0.0|
|342245| 1.0| 1.0| 3.0| 1.0| 2.0| 1.0| 1.0| 1.0|1.0| 2| 0|[1.0,1.0,3.0,1.0,...| 0.0|[0.95750903648839...|[0.72262279564412...| 0.0|
|434518| 3.0| 1.0| 1.0| 1.0| 2.0| 1.0| 2.0| 1.0|1.0| 2| 0|[3.0,1.0,1.0,1.0,...| 0.0|[1.10995557408198...|[0.75212082898242...| 0.0|
|493452| 1.0| 1.0| 3.0| 1.0| 2.0| 1.0| 1.0| 1.0|1.0| 2| 0|[1.0,1.0,3.0,1.0,...| 0.0|[0.95750903648839...|[0.72262279564412...| 0.0|
|508234| 7.0| 4.0| 5.0|10.0| 2.0|10.0| 3.0| 8.0|2.0| 4| 1|[7.0,4.0,5.0,10.0...| 1.0|[-0.0809133769755...|[0.47978268474014...| 1.0|
|521441| 5.0| 1.0| 1.0| 2.0| 2.0| 1.0| 2.0| 1.0|1.0| 2| 0|[5.0,1.0,1.0,2.0,...| 0.0|[1.10995557408198...|[0.75212082898242...| 0.0|
|527337| 4.0| 1.0| 1.0| 1.0| 2.0| 1.0| 1.0| 1.0|1.0| 2| 0|[4.0,1.0,1.0,1.0,...| 0.0|[1.11079628977456...|[0.75227753466134...| 0.0|
|534555| 1.0| 1.0| 1.0| 1.0| 2.0| 1.0| 1.0| 1.0|1.0| 2| 0|[1.0,1.0,1.0,1.0,...| 0.0|[1.11079628977456...|[0.75227753466134...| 0.0|
|535331| 3.0| 1.0| 1.0| 1.0| 3.0| 1.0| 2.0| 1.0|1.0| 2| 0|[3.0,1.0,1.0,1.0,...| 0.0|[1.10995557408198...|[0.75212082898242...| 0.0|
|558538| 4.0| 1.0| 3.0| 3.0| 2.0| 1.0| 1.0| 1.0|1.0| 2| 0|[4.0,1.0,3.0,3.0,...| 0.0|[0.95750903648839...|[0.72262279564412...| 0.0|
|560680| 1.0| 1.0| 1.0| 1.0| 2.0| 1.0| 1.0| 1.0|1.0| 2| 0|[1.0,1.0,1.0,1.0,...| 0.0|[1.11079628977456...|[0.75227753466134...| 0.0|
|601265| 10.0| 4.0| 4.0| 6.0| 2.0|10.0| 2.0| 3.0|1.0| 4| 1|[10.0,4.0,4.0,6.0...| 1.0|[-0.0034290346398...|[0.49914274218002...| 1.0|
|603148| 4.0| 1.0| 1.0| 1.0| 2.0| 1.0| 1.0| 1.0|1.0| 2| 0|[4.0,1.0,1.0,1.0,...| 0.0|[1.11079628977456...|[0.75227753466134...| 0.0|
|606722| 5.0| 5.0| 7.0| 8.0| 6.0|10.0| 7.0| 4.0|1.0| 4| 1|[5.0,5.0,7.0,8.0,...| 1.0|[-0.3103173938140...|[0.42303726852941...| 1.0|
|616240| 5.0| 3.0| 4.0| 3.0| 4.0| 5.0| 4.0| 7.0|1.0| 2| 0|[5.0,3.0,4.0,3.0,...| 0.0|[0.43719456056061...|[0.60759034803682...| 0.0|
|640712| 1.0| 1.0| 1.0| 1.0| 2.0| 1.0| 2.0| 1.0|1.0| 2| 0|[1.0,1.0,1.0,1.0,...| 0.0|[1.10995557408198...|[0.75212082898242...| 0.0|
|654546| 1.0| 1.0| 1.0| 1.0| 2.0| 1.0| 1.0| 1.0|8.0| 2| 0|[1.0,1.0,1.0,1.0,...| 0.0|[1.11079628977456...|[0.75227753466134...| 0.0|
+------+---------+----+-----+----+------+----+------+----+---+----+------------+--------------------+-----+--------------------+--------------------+----------+
only showing top 20 rows
When I try to read the areaUnderPR attribute, I get the following error:
20/01/10 10:41:02 WARN TaskSetManager: Lost task 0.0 in stage 56.0
(TID 246, 10.10.252.172, executor 1):
java.lang.ClassNotFoundException: prediction.TestCancerOriginal$$anonfun$1
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1868)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:88)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
My predictions.show() output is the table shown above.
One error I can see here is that you are passing the rawPrediction column to the BinaryClassificationMetrics object instead of the prediction column. rawPrediction contains a vector with a kind of "probability" for each class, while BinaryClassificationMetrics expects a Double score, as specified by its signature:

new BinaryClassificationMetrics(scoreAndLabels: RDD[(Double, Double)])

You can see the details. I did a quick test with this modification and it seems to work; here is the code snippet:
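As a side note on the (score, label) contract in that signature: the areaUnderROC the metrics object ultimately reports is equivalent to the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted as 1/2). A minimal plain-Scala illustration of that rank formulation, independent of Spark; `aucRoc` and the sample data here are hypothetical, not part of the question's code:

```scala
// AUC-ROC via the rank-statistic (Mann-Whitney) formulation:
// the fraction of (positive, negative) pairs where the positive
// example scores higher, counting ties as 1/2.
def aucRoc(scoreAndLabels: Seq[(Double, Double)]): Double = {
  val pos = scoreAndLabels.collect { case (s, 1.0) => s }
  val neg = scoreAndLabels.collect { case (s, 0.0) => s }
  val wins = for (p <- pos; n <- neg)
    yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  wins.sum / (pos.size.toDouble * neg.size.toDouble)
}

// Three positives (scores 0.9, 0.8, 0.3) and two negatives (0.4, 0.2):
// five of the six cross pairs rank the positive higher, so AUC = 5/6.
val sample = Seq((0.9, 1.0), (0.8, 1.0), (0.4, 0.0), (0.3, 1.0), (0.2, 0.0))
println(aucRoc(sample))
```

A perfectly separating scorer gives 1.0, and random scores hover around 0.5, which is why the comments in the question's snippet say "the closer to 1, the better".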
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

case class Obs(id: Int, thickness: Double, size: Double, shape: Double, madh: Double,
  epsize: Double, bnuc: Double, bchrom: Double, nNuc: Double, mit: Double, clas: Double)

val obsSchema = Encoders.product[Obs].schema

val spark = SparkSession.builder
  .appName("StackoverflowQuestions")
  .master("local[*]")
  .getOrCreate()

// Needed to convert a DataFrame to a Dataset with the .as[] method
import spark.implicits._

val df = spark.read
  .schema(obsSchema)
  .csv("breast-cancer-wisconsin.data") // Wisconsin breast cancer data (filename assumed; the original path was machine-translated)
  .drop("id")
  .withColumn("clas", when(col("clas").equalTo(4.0), 1.0).otherwise(0.0))
  .na.drop() // Make sure to drop nulls, otherwise the feature assembler will fail

// Define the feature columns to put in the feature vector
val featureCols = Array("thickness", "size", "shape", "madh", "epsize", "bnuc", "bchrom", "nNuc", "mit")
// Set the input and output column names
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
// Return a DataFrame with all of the feature columns in a vector column
val df2 = assembler.transform(df)
// Create a label column with the StringIndexer
val labelIndexer = new StringIndexer().setInputCol("clas").setOutputCol("label")
val df3 = labelIndexer.fit(df2).transform(df2)

val splitSeed = 5043
val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3), splitSeed)
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
trainingData.show(20)
// Fit the model
val model = lr.fit(trainingData)
// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
// Run the model on the test features to get predictions
val predictions = model.transform(testData)
// As you can see, the transform produced new columns: rawPrediction, probability and prediction
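The snippet is cut off at this point. For the metrics step itself, following the answer's point that BinaryClassificationMetrics wants plain Doubles, one sketch of the last step, assuming the predictions DataFrame built above, is to use the probability of the positive class as the score (a common alternative to passing the prediction column):

```scala
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Both tuple elements are Doubles, matching the expected
// BinaryClassificationMetrics(scoreAndLabels: RDD[(Double, Double)]) input.
val scoreAndLabels = predictions
  .select("probability", "label")
  .rdd
  .map(row => (row(0).asInstanceOf[DenseVector](1), row(1).asInstanceOf[Double]))

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println("area under the precision-recall curve: " + metrics.areaUnderPR)
println("area under the ROC curve: " + metrics.areaUnderROC)
```

Using the continuous probability rather than the hard 0/1 prediction lets the metrics object sweep thresholds, which is what the PR and ROC curves are defined over.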