Apache Spark: How do I actually apply a saved RF model in Spark 2 and make predictions?

Tags: apache-spark, hadoop, pyspark, apache-spark-sql, rdd

This is a newbie question, since I can't seem to find a simple way to do this.

I'm joining the airline dataset with weather data to predict whether a flight is delayed by more than 15 minutes.

Airline dataset (2007 and 2008):

Weather data:

wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2007.csv.gz -O /tmp/weather_2007.csv.gz
wget ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2008.csv.gz -O /tmp/weather_2008.csv.gz
My code comes from this URL, but I changed it to run on Spark 2.3.
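For reference, the sqlContext handle used throughout the snippets below is never created in the original post; a minimal Spark 2 setup sketch (the SparkSession-based wiring and the app name are my assumptions) would be:

#Minimal Spark 2 entry point (sketch, not part of the original post)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flight-delay-rf").getOrCreate()  # app name is a placeholder
sqlContext = spark              # SparkSession provides the .read / .sql / .udf.register API used below
sc = spark.sparkContext         # SparkContext, needed for the mllib model save/load sketch later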

df_airline_2007 = sqlContext.read.format("csv").option("header", "true").load("/ACMEAirDB/2007/2007.csv")
df_weather_2007 = sqlContext.read.format("csv").option("header", "false").load("/ACMEAirDB/weather_2007/weather_2007.csv")
df_airline_2008 = sqlContext.read.format("csv").option("header", "true").load("/ACMEAirDB/2008/2008.csv")
df_weather_2008 = sqlContext.read.format("csv").option("header", "false").load("/ACMEAirDB/weather_2008/weather_2008.csv")

df_airline_raw = df_airline_2007.unionAll(df_airline_2008)
df_weather_raw = df_weather_2007.unionAll(df_weather_2008)


#Function to turn year, month, day into a yyyymmdd date string so airline rows can be joined to weather
def to_date(year,month,day): 
    dt = "%04d%02d%02d" % (year, month, day)
    return dt

sqlContext.udf.register("to_date", to_date)

#Function to discretize departure time in the airline data
def discretize_tod(val):
    hour = int(val[:2])
    if hour < 8:
        return 0
    if hour < 16:
        return 1
    return 2

sqlContext.udf.register("discretize_tod", discretize_tod)

df_airline_raw.registerTempTable("df_airpline_raw")
df_weather_raw.registerTempTable("df_weather_raw")

#Create Final Airline transformation
df_airline = sqlContext.sql("""SELECT 
                            Year as year, Month as month, DayofMonth as day, DayOfWeek as dow,
                            CarrierDelay as carrier, Origin as origin, Dest as dest, Distance as distance, 
                            discretize_tod(DepTime) as tod, CASE WHEN DepDelay >= 15 THEN 1 ELSE 0 END as delay, 
                            to_date(cast(Year as int), cast(Month as int), cast(DayofMonth as int)) As date 
                            FROM df_airpline_raw
                            WHERE Cancelled = 0 AND Origin = 'ORD'""")

#Create Base Weather Transformation Table
df_weather = sqlContext.sql("""SELECT 
                                _C0 AS station,
                                _C1 As date,
                                _C2 As metric,
                                _C3 As value, 
                                _C4 As t1, 
                                _C5 As t2, 
                                _C6 As t3, 
                                _C7 As time
                                FROM df_weather_raw
                                """)


# df_weather.show(10)    


#Create Tmin and Tmax Weather DF
df_weather.registerTempTable("df_weather")

#Create DFs for Weather Tmin and Tmax Values 
df_weather_tmin = sqlContext.sql("""SELECT 
                                        date, 
                                        value as temp_min 
                                    FROM df_weather 
                                    WHERE station = 'USW00094846' 
                                    AND metric = 'TMIN'""")

df_weather_tmax = sqlContext.sql("""SELECT 
                                        date, 
                                        value as temp_max 
                                    FROM df_weather 
                                    WHERE station = 'USW00094846' 
                                    AND metric = 'TMAX'""")

#Join Airline with Weather Tmin and Tmax Dataframes
df_airline_tmin = df_airline.join(df_weather_tmin, 
                                  df_weather_tmin.date == df_airline.date, 
                                  "inner").drop(df_weather_tmin.date)

df_airline_tmin_and_tmax = df_airline_tmin.join(df_weather_tmax, 
                                                df_weather_tmax.date == df_airline_tmin.date, 
                                                "inner").drop(df_weather_tmax.date)

df_airline_tmin_and_tmax.registerTempTable("df_airline_tmin_and_tmax")
df_all = sqlContext.sql("""SELECT 
                                delay,
                                year,
                                month, 
                                day, 
                                dow, 
                                cast (tod AS int) tod, 
                                distance, 
                                temp_min, 
                                temp_max
                            FROM df_airline_tmin_and_tmax""")

#Cache Dataframe because we split it later on
df_all.cache()

#Random Forest classification
#Import necessary libraries
from pyspark.mllib.regression import  LabeledPoint
# from pyspark.mllib.tree import DecisionTree, RandomForest
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.linalg import DenseVector


#Create labeledPoint Parser
def parseDF(row):
    values = [row.delay, row.month, row.day, row.dow, row.tod, row.distance, row.temp_min, row.temp_max]
    return LabeledPoint(values[0], DenseVector(values[1:]))

#Convert Dataframes to LabeledPoint for modeling
train_data = df_all.filter("year=2007").rdd.map(parseDF)
test_data = df_all.filter("year=2008").rdd.map(parseDF)


#Train Models

modelRF = RandomForest.trainClassifier(train_data, numClasses=2, categoricalFeaturesInfo={},
                                      numTrees=500, impurity='gini', maxDepth=5)
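Since the title is about applying a saved RF model but the post never actually persists one, here is a minimal save/load sketch for the mllib model trained above; the path is a placeholder of mine, and sc is the underlying SparkContext:

#Persist the trained RandomForest model and load it back later (sketch, not in the original post)
from pyspark.mllib.tree import RandomForestModel

modelRF.save(sc, "/tmp/rf_delay_model")                              # path is a placeholder
modelRF_loaded = RandomForestModel.load(sc, "/tmp/rf_delay_model")   # reload for later scoring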


#Apply the RF model to the test data
predictionsRF = modelRF.predict(test_data.map(lambda x: x.features))
predictionsAndLabelsRFRDD = predictionsRF.zip(test_data.map(lambda lp: lp.label))
predictionsAndLabelsRF = predictionsAndLabelsRFRDD.collect()

import pandas as pd

#Confusion matrix and accuracy helper; expects (prediction, label) pairs
def confusion_matrix(predAndLabel):
    y_actual = pd.Series([label for pred, label in predAndLabel], name = 'Actual')
    y_pred = pd.Series([pred for pred, label in predAndLabel], name = 'Predicted')

    matrix = pd.crosstab(y_actual, y_pred)
    accuracy = float(matrix[0][0] + matrix[1][1])/(matrix[0][0] + matrix[0][1] + matrix[1][0] + matrix[1][1])

    return matrix, accuracy


#RandomForest Confusion Matrix and Model Accuracy
df_confusion_RF, accuracy_RF = confusion_matrix(predictionsAndLabelsRF)

print('RF Confusion Matrix:')
print(df_confusion_RF)
print('\nRF Model Accuracy: {0}'.format(accuracy_RF))
So my question is: now that I have the model and its predictions (predictionsRF), how do I apply it to a single "real world" record?

Here is my newbie attempt:

df_validation = sqlContext.sql("""SELECT 
                                1 delay,
                                2008 year,
                                6 month, 
                                19 day, 
                                4 dow, 
                                1 tod, 
                                925 distance, 
                                111 temp_min, 
                                272 temp_max
                            """)

validation_data = df_validation.rdd.map(parseDF)

df_validation.show(1)

validationsRF = modelRF.predict(validation_data.map(lambda x: x.features))
validationsAndLabelsRFRDD = validationsRF.zip(validation_data.map(lambda lp: lp.label))
validationsAndLabelsRF = validationsAndLabelsRFRDD.collect()

print(validationsRF.collect())
1. Is using validationsRF.collect() the correct way to get the predicted delay result?

2. How can I remove the delay column from df_validation without getting the error below?

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 284.0 failed 4 times, most recent failure: Lost task 0.3 in stage 284.0 (TID 4544, ip-172-31-40-184.us-west-2.compute.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 229, in main
    process()
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<stdin>", line 5, in parseDF
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/types.py", line 1561, in __getattr__
    raise AttributeError(item)
AttributeError: delay
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:204)
    at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:153)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.GeneratedMethodAccessor184.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 229, in main
    process()
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<stdin>", line 5, in parseDF
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/types.py", line 1561, in __getattr__
    raise AttributeError(item)
The AttributeError comes from the parseDF function used on the validation data: it reads the label with

values = [row.delay, ...]

so once the delay column is dropped from df_validation there is no row.delay attribute left and the map fails. For a single "real world" record you do not need a LabeledPoint at all; pass the model a dense feature vector in the same order it was trained on, i.e. month, day, dow, tod, distance, temp_min, temp_max (year was only used for the train/test split, so it is not a feature):

from pyspark.mllib.linalg import Vectors

modelRF.predict(Vectors.dense([6, 19, 4, 1, 925, 111, 272]))

predict returns the predicted class (0.0 or 1.0), so collecting such predictions, as in question 1, does give you the predicted delay labels.
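To score a whole DataFrame that has no delay column (question 2), the same idea works in bulk: map each row to a features-only vector instead of a LabeledPoint. A minimal sketch, assuming the trained (or reloaded) modelRF is still in scope; parseFeatures is a name I made up, not from the original code:

#Features-only parser: same feature order as training, no label lookup
def parseFeatures(row):
    return Vectors.dense([row.month, row.day, row.dow, row.tod,
                          row.distance, row.temp_min, row.temp_max])

validation_features = df_validation.drop("delay").rdd.map(parseFeatures)
validation_predictions = modelRF.predict(validation_features)
print(validation_predictions.collect())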