PySpark ValueError: No axis named Column<rand() AS `CrossValidator_rand`> for object type <class 'pandas.core.frame.DataFrame'>

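I am trying to run classification.RandomForest from pyspark.ml with cross-validation.

I have converted my CSV input file to a DataFrame. When I run the code below, I get the ValueError shown in the traceback further down.

Here is the Python code: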

import pyspark
import pandas as pd
import numpy as np
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics

sc = pyspark.SparkContext()
sql = SQLContext(sc)

trainingData= pd.read_csv("CSVfilepath", index_col=0, parse_dates=True)

print trainingData
numFolds = 10 


rf = RandomForestClassifier(numTrees=100, maxDepth=5, maxBins=5, labelCol="label", featuresCol="features", seed=42)
evaluator = MulticlassClassificationEvaluator().setLabelCol("V5409").setPredictionCol("prediction").setMetricName("accuracy") 

paramGrid = ParamGridBuilder().build()

pipeline = Pipeline(stages=[rf])
paramGrid=ParamGridBuilder().build()
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=numFolds)

model = crossval.fit(trainingData)

The error I get is:

Traceback (most recent call last):
  File "randomforest_cv.py", line 46, in <module>
    model = crossval.fit(trainingData)
  File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/pyspark/ml/tuning.py", line 224, in _fit
    df = dataset.select("*", rand(seed).alias(randCol))
  File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 2085, in select
    axis = self._get_axis_number(axis)
  File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 353, in _get_axis_number
    .format(axis, type(self)))
ValueError: No axis named Column<rand(-4372709618522015412) AS `CrossValidator_42cab674dd6c1d100ef0_rand`> for object type <class 'pandas.core.frame.DataFrame'>

Can anyone help me solve this? I think the problem is with the DataFrame format.

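You have created a pandas DataFrame. You need to create a Spark DataFrame instead. pd.read_csv returns a pandas.core.frame.DataFrame, while CrossValidator.fit expects a Spark DataFrame: as the traceback shows, tuning.py calls dataset.select("*", rand(seed).alias(randCol)), and pandas' own select() treats the second argument as an axis, which is why it complains that there is no axis named Column<rand(...)>.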
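
As a rough illustration of that conversion, here is a minimal sketch. The helper names to_spark_dataframe and fit_with_cross_validation are hypothetical and not part of PySpark; the only PySpark call relied on is createDataFrame(), which SQLContext (the object named sql in the question) accepts a pandas DataFrame for. The sketch assumes anything that is not a pandas object is already a Spark DataFrame.

```python
def to_spark_dataframe(sql_context, dataset):
    """Return `dataset` as a Spark DataFrame (hypothetical helper).

    `sql_context` is anything exposing createDataFrame(), such as the
    SQLContext named `sql` in the question.
    """
    # Detect pandas by the type's module name (e.g. 'pandas.core.frame'),
    # so this helper does not need to import pandas itself.
    if type(dataset).__module__.split(".")[0] == "pandas":
        return sql_context.createDataFrame(dataset)
    # Assumption: anything else is already a Spark DataFrame.
    return dataset


def fit_with_cross_validation(sql_context, crossval, dataset):
    """Hypothetical wrapper: convert first, then fit the CrossValidator."""
    spark_df = to_spark_dataframe(sql_context, dataset)
    return crossval.fit(spark_df)
```

In the question's script this would amount to replacing the last line with model = crossval.fit(to_spark_dataframe(sql, trainingData)). Note that the pandas index (index_col=0 in the question) is not carried over into the Spark DataFrame, and the classifier will still need the "label" and "features" columns it was configured with to exist in the data.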