Unable to collect a PySpark DataFrame of model statistics or convert it to an RDD

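While trying to extract the threshold associated with the highest recall value from the DataFrame returned by recallByThreshold, I ran into confusing PySpark errors. Interestingly, these errors only occur when the application runs in cluster mode.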

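from pyspark.ml.classification import LogisticRegression

# data is an existing DataFrame with 'label', 'features' and 'importance' columns.
# Split it 70/30, reduce the partition counts, and cache both parts.
training, testing = data.randomSplit([0.7, 0.3], seed=100)
train = training.coalesce(200)
test = testing.coalesce(100)
train.persist()
test.persist()

# Weighted logistic regression with elastic-net regularization.
model = LogisticRegression(labelCol='label',
                           featuresCol='features',
                           weightCol='importance',
                           maxIter=30,
                           regParam=0.3,
                           elasticNetParam=0.2)
trained_model = model.fit(train)

# Take the threshold of the row with the highest recall; this is the line that fails.
threshold = trained_model.summary.recallByThreshold.rdd.max(key=lambda x: x["recall"])["threshold"]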

The last line of code raises

AttributeError: 'NoneType' object has no attribute 'setCallSite'

Breaking it down further, when I try

trained_model.summary.recallByThreshold.rdd

I get a different error:

*** AttributeError: 'NoneType' object has no attribute 'sc'

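This seems to be related to an existing issue, but in this case I cannot collect the DataFrame at all (it produces the same error). I launched my application from IPython on the master node, so shouldn't the SparkContext be available through the SparkSession (using Spark 2.1.0)?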
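
For reference, the selection that the failing line is meant to perform can be illustrated without Spark. This is only a minimal sketch: the rows below are made-up sample values standing in for the (threshold, recall) pairs of recallByThreshold, and threshold_with_max_recall is a hypothetical helper name, not part of PySpark.

```python
# Hypothetical sample rows standing in for the (threshold, recall) pairs
# exposed by recallByThreshold; the numbers are invented for illustration.
rows = [
    {"threshold": 0.9, "recall": 0.25},
    {"threshold": 0.5, "recall": 0.75},
    {"threshold": 0.1, "recall": 1.0},
]


def threshold_with_max_recall(rows):
    """Return the threshold of the row whose recall is largest."""
    # max() compares rows by their recall value only; if several rows tie,
    # Python's built-in max returns the first one it encounters.
    best = max(rows, key=lambda r: r["recall"])
    return best["threshold"]


best_threshold = threshold_with_max_recall(rows)
print(best_threshold)
```

The same key function is what the original line passes to rdd.max, so once the rows can actually be collected, the result should be the threshold of a row with maximal recall.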