
Why does Spark MLlib's parameter grid reduce accuracy?

Tags: scala, apache-spark, machine-learning, apache-spark-mllib, logistic-regression

What I'm doing:
Binary classification of train.csv using LogisticRegression.
train.csv is the Titanic passenger-list CSV file.
The label is "Survived".
After splitting, the first 100 rows are the test set and the rest are the training set.
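The split described above could be sketched like this, using plain Scala collections in place of a DataFrame (the total row count 736 is a placeholder, not taken from the actual data):

```scala
// Hypothetical sketch of the row split: first 100 rows -> test set,
// the rest -> training set. 736 is a placeholder row count.
val rows = (1 to 736).toVector
val (testRows, trainingRows) = rows.splitAt(100)
println(testRows.size)      // 100
println(trainingRows.size)  // 636
```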

The problem:
First: when I use this parameter grid:

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.threshold, Array(0.2, 0.3, 0.35, 0.4))
  .build()
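For reference, ParamGridBuilder expands the grid above into the cross product of all values added per parameter, so CrossValidator evaluates 2 × 4 = 8 candidate parameter combinations. The expansion can be sketched conceptually in plain Scala (no Spark needed):

```scala
// Conceptual sketch of what ParamGridBuilder.build() produces:
// the cross product of the value lists for each parameter.
val regParams  = Seq(0.1, 0.01)
val thresholds = Seq(0.2, 0.3, 0.35, 0.4)
val grid = for (r <- regParams; t <- thresholds) yield (r, t)
println(grid.size)  // 8 combinations
```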
The result log is:

training set:  
marked "Survived" of [Prediction/Label]'s Count : 273 / 259
marked "Death" of [Prediction/Label]'s Count : 363 / 377
Accuracy is : 97.79874213836479% (622 / 636)

test set:  
marked "Survived" of [Prediction/Label]'s Count : 33 / 31
marked "Death" of [Prediction/Label]'s Count : 45 / 47
Accuracy is : 97.43589743589743% (76 / 78)
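The accuracy figures in these logs are simply correct predictions over total rows; as a sanity check, using the counts from the log above (plain Scala):

```scala
// Accuracy (%) = 100 * correct / total, with counts taken from the log.
def accuracyPct(correct: Int, total: Int): Double = 100.0 * correct / total
println(accuracyPct(622, 636))  // training set, ~97.8%
println(accuracyPct(76, 78))    // test set, ~97.4%
```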
Second: without using a parameter grid:

// Just to avoid changing the rest of the code; this effectively
// does not use a parameter grid (a single value per parameter).
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1))
  .addGrid(lr.threshold, Array(0.4))
  .build()
The result log is:

training set:  
marked "Survived" of [Prediction/Label]'s Count : 259 / 259
marked "Death" of [Prediction/Label]'s Count : 377 / 377
Accuracy is : 100.0% (636 / 636)

test set:  
marked "Survived" of [Prediction/Label]'s Count : 31 / 31
marked "Death" of [Prediction/Label]'s Count : 47 / 47
Accuracy is : 100.0% (78 / 78)
My first parameter grid includes the second run's single parameter values (regParam = 0.1, threshold = 0.4).
So Spark's model selection should have chosen the model with 100% accuracy,
but a model with 97% accuracy seems to have been selected instead.

If a 99% model existed and a 98% model were chosen, that would be understandable, because the model-selection metric is not necessarily accuracy.
But when the unselected model has 100% accuracy, I think the story is different: in classification, 100% accuracy means that the F1 score, precision, confusion matrix, and every other evaluation are "perfect",
and here the accuracy is 100% even on the test set.
So I don't understand why the 100% model was not selected.
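One possible factor (this is an assumption on my part, along the lines of the evaluation-metric caveat above): BinaryClassificationEvaluator's default metric is areaUnderROC, which is computed from the model's raw scores and depends only on how the scores rank the examples, so it is unaffected by lr.threshold, while accuracy changes with the threshold. A toy illustration in plain Scala (made-up scores and labels, no Spark needed):

```scala
// Toy (score, label) pairs. The ranking is perfect (all positives score
// above all negatives), so areaUnderROC would be 1.0 at ANY threshold,
// yet the thresholded accuracy differs between thresholds.
val scored = Seq((0.9, 1), (0.8, 1), (0.45, 0), (0.1, 0))

def accuracyAt(t: Double): Double = {
  val correct = scored.count { case (s, y) => (if (s >= t) 1 else 0) == y }
  correct.toDouble / scored.size
}

println(accuracyAt(0.5))  // 1.0
println(accuracyAt(0.4))  // 0.75 (0.45 is now predicted positive)
```

So two models that the evaluator scores identically can still produce different accuracies once a threshold is applied.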

Code:

My environment:
IntelliJ (U)
Win7 x64
Spark 2.2.0
Scala 2.11.5

ML setup:
The features used are:
1) pclass (ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd)
2) age
3) sibsp (number of siblings/spouses aboard the Titanic)
4) parch (number of parents/children aboard the Titanic)
5) fare (passenger fare)
The label is "Survived".

// df_for_columns is an already-preprocessed DataFrame
// (some columns dropped, rows with age == 0 dropped)
val features = df_for_columns.columns
val lr = new LogisticRegression()
.setMaxIter(100)
.setFeaturesCol("features")

val assembler = new VectorAssembler()
  .setInputCols(features)
  .setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(assembler, lr))

//cross-validation check
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(new BinaryClassificationEvaluator)  // using the default BinaryClassificationEvaluator
.setEstimatorParamMaps(paramGrid)  // paramGrid is built by the code above
.setNumFolds(10) 

// df_training is the training set
val lrModel = cv.fit(df_training)

val bmodel = lrModel.bestModel
val result = bmodel.transform(df_training)
val result_test = bmodel.transform(df_test)
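For context on what setNumFolds(10) does: CrossValidator trains each parameter combination 10 times, each time on 9/10 of df_training, evaluates on the held-out 1/10, and averages the evaluator's metric over the 10 folds; the winning parameters are then refit on the full training set. The fold bookkeeping can be sketched in plain Scala (636 is the training-row count from the logs; real fold assignment is randomized, the modulo here is just for illustration):

```scala
// Conceptual sketch of 10-fold splitting: each row index lands in exactly
// one validation fold; the other 9 folds form that fold's training set.
val k = 10
val rowIds = (0 until 636).toVector
val folds = rowIds.groupBy(_ % k).values.toVector
println(folds.size)             // 10 folds
println(folds.map(_.size).sum)  // 636 -- every row validated exactly once
```

This also means the metric CrossValidator compares is an average over held-out folds, not the accuracy of the final model on the full training set shown in the logs.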