PySpark to PMML - "Field "label" does not exist" error

Tags: pyspark, apache-spark-ml, pmml

I'm new to PySpark, so this might be a basic question. I'm trying to export PySpark code to PMML using the JPMML-SparkML library. When running the example from its website:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula

df = spark.read.csv("Iris.csv", header = True, inferSchema = True)
formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)
I get the error "Field "label" does not exist". Running the Scala code from the same page throws the same error. Does anyone know what this "label" field refers to? It seems to be something inside the Spark code that runs behind the scenes; I doubt that this label field is part of the Iris dataset.
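
For what it's worth, a quick way to double-check which columns Spark actually sees (just a sanity check, not part of the example above) would be something like:

# Inspect the loaded dataframe - the Iris CSV should only contain the
# measurement columns plus "Species", with no "label" column.
df.printSchema()
print(df.columns)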

Full error message:

Traceback (most recent call last):
  File "/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o48.fit.
: java.lang.IllegalArgumentException: Field "label" does not exist.
    at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
    at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
    at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
    at scala.collection.AbstractMap.getOrElse(Map.scala:59)
    at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
    at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
    at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
    at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
    at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
    at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
    at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
    at org.apache.spark.ml.classification.ProbabilisticClassifier.validateAndTransformSchema(ProbabilisticClassifier.scala:53)
    at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:122)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

Thanks, Michal

You need to supply the column you want to predict as the label. You can either rename/alias that column in the dataframe to "label" before fitting the classifier, or pass the column name to the classifier's constructor via the labelCol parameter:

classifier = DecisionTreeClassifier(labelCol='some prediction field')
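
For example, a minimal sketch of the two options (using a hypothetical numeric target column called "outcome" rather than the Iris data) could look like:

from pyspark.ml.classification import DecisionTreeClassifier

# "outcome" is a hypothetical numeric target column, used only for illustration.

# Option 1: rename the target column to the default name "label"
df_renamed = df.withColumnRenamed("outcome", "label")
classifier = DecisionTreeClassifier()  # labelCol defaults to "label"

# Option 2: keep the original column name and point the classifier at it
classifier = DecisionTreeClassifier(labelCol = "outcome")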

Indeed - providing the labelCol parameter helped, and a working example can also be found here:
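
For reference, a minimal end-to-end sketch of a fixed pipeline might look like the following (assuming the same Iris.csv as above; a StringIndexer turns the Species strings into a numeric column that is then passed via labelCol - the column and variable names here are illustrative, and the linked working example may differ):

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler

df = spark.read.csv("Iris.csv", header = True, inferSchema = True)

# Turn the string target "Species" into a numeric index column
indexer = StringIndexer(inputCol = "Species", outputCol = "speciesIndex")

# Assemble the remaining numeric columns into a single feature vector
assembler = VectorAssembler(
    inputCols = [c for c in df.columns if c != "Species"],
    outputCol = "features")

# Point the classifier at the indexed label and the assembled features
classifier = DecisionTreeClassifier(labelCol = "speciesIndex", featuresCol = "features")

pipeline = Pipeline(stages = [indexer, assembler, classifier])
pipelineModel = pipeline.fit(df)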