
Python pyspark.ml: TypeError when computing precision and recall

Tags: python, apache-spark, machine-learning, pyspark, apache-spark-ml

I am trying to compute the precision, recall, and F1 score of a classifier using pyspark.ml:

import pandas
from pyspark.mllib.evaluation import MulticlassMetrics

model = completePipeline.fit(training)
predictions = model.transform(test)

mm = MulticlassMetrics(predictions.select(["label", "prediction"]).rdd)

labels = sorted(predictions.select("prediction").rdd.distinct().map(lambda r: r[0]).collect())

for label in labels:
    print(label)
    print("Precision = %s" % mm.precision(label=label))
    print("Recall = %s" % mm.recall(label=label))
    print("F1 Score = %s" % mm.fMeasure(label=label))

metrics = pandas.DataFrame(
    [(label, mm.precision(label=label), mm.recall(label=label), mm.fMeasure(label=label))
     for label in labels],
    columns=["Label", "Precision", "Recall", "F1"])
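For reference, the per-label precision, recall, and F1 that MulticlassMetrics reports can be sketched in plain Python with no Spark required. This is a minimal sketch, not pyspark's implementation; `per_label_metrics` is a hypothetical helper taking a list of (label, prediction) pairs:

```python
from collections import defaultdict

def per_label_metrics(pairs):
    """Per-label (precision, recall, F1) from (label, prediction) pairs."""
    tp = defaultdict(int)  # predicted as label and actually label
    fp = defaultdict(int)  # predicted as label, but actually something else
    fn = defaultdict(int)  # actually label, but predicted as something else
    for label, prediction in pairs:
        if label == prediction:
            tp[label] += 1
        else:
            fp[prediction] += 1
            fn[label] += 1
    metrics = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / float(tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / float(tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        metrics[label] = (p, r, f1)
    return metrics
```

For example, `per_label_metrics([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])` gives label 0.0 a precision of 0.5 (one of the two 0.0 predictions is correct) and a recall of 1.0.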
Schema of the resulting DataFrame predictions:
[('features', 'vector'), ('label', 'int'), ('rawPrediction', 'vector'), ('probability', 'vector'), ('prediction', 'double')]
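The int-typed column stands out in that schema; listing the integer columns that would need a cast to double can be sketched with a plain list comprehension (a sketch over a hard-coded copy of the dtypes above):

```python
dtypes = [('features', 'vector'), ('label', 'int'), ('rawPrediction', 'vector'),
          ('probability', 'vector'), ('prediction', 'double')]

# integer-typed columns that are not yet double
needs_cast = [name for name, typ in dtypes
              if typ in ('tinyint', 'smallint', 'int', 'bigint')]
# needs_cast == ['label']
```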
The error message raised when calling mm.precision:

Traceback (most recent call last):
  File "ml_pipeline_factory_test", line 1, in <module>
  File "pipeline_factory_test", line 92, in pipeline_factory_test
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/mllib/evaluation.py", line 240, in precision
    return self.call("precision", float(label))
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/mllib/common.py", line 146, in call
    return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/mllib/common.py", line 123, in callJavaFunc
    return _java2py(sc, func(*args))
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o371.precision.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 22.0 failed 4 times, most recent failure: Lost task 7.3 in stage 22.0 (TID 153, dhbpdn12.de.t-internal.com, executor 4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/tmp/conda-e6ac7105-4788-4b4c-9163-ba8763f29ead/real/envs/conda-env/lib/python2.7/site-packages/pyspark/worker.py", line 245, in main
    process()
  File "/tmp/conda-e6ac7105-4788-4b4c-9163-ba8763f29ead/real/envs/conda-env/lib/python2.7/site-packages/pyspark/worker.py", line 240, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/tmp/conda-e6ac7105-4788-4b4c-9163-ba8763f29ead/real/envs/conda-env/lib/python2.7/site-packages/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/session.py", line 677, in prepare
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1421, in verify
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1402, in verify_struct
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1421, in verify
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1415, in verify_default
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1310, in verify_acceptable_types
TypeError: field prediction: DoubleType can not accept object 0 in type <type 'int'>

As the error message says:

TypeError: field prediction: DoubleType can not accept object 0 in type <type 'int'>

the offending field has to be of type double, so cast the integer column before evaluating:

predictions = (predictions
    .withColumn("label", predictions["label"].cast("double")))

Comment: The problem is "label" (('label', 'int')), not "prediction" — please check whether the edit helps.

Comment: That fixed the TypeError, thanks. Is it just me, or is it odd that evaluating a classifier expects the labels to be of type double?

Comment: Yes, but that is standard Spark behavior, and it is all about simplicity of implementation. The original mllib API uses LabeledPoint for both classification and regression models, which is why Double is used here. Moreover, all the low-level libraries used under the hood operate on floating-point numbers and do not support integers. The same thing happens elsewhere (there is always some BLAS underneath that does not support integers); it is just transparent.
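Why the cast matters can be illustrated without Spark: pyspark.sql.types verifies each value against the declared field type, and a DoubleType field rejects Python ints outright rather than coercing them. The following is a simplified sketch of that kind of check, not pyspark's actual code; `verify_double` is a hypothetical stand-in for the internal verifier:

```python
def verify_double(value, field_name="label"):
    # Simplified stand-in for the strict acceptance check pyspark.sql.types
    # applies to a DoubleType field: floats pass, ints are rejected.
    if not isinstance(value, float):
        raise TypeError("field %s: DoubleType can not accept object %r in type %s"
                        % (field_name, value, type(value)))
    return value
```

Here verify_double(0.0) passes while verify_double(0) raises the same kind of TypeError seen above; casting the column to "double" turns every value into a float before this verification runs.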