
Python pyspark.sql.utils.IllegalArgumentException: "requirement failed: Invalid initial capacity"

Tags: python, apache-spark, machine-learning, pyspark

I'm trying to run cross-validation with a decision tree in Spark using the ML library, but I get this error when calling cv.fit(train_dataset):

pyspark.sql.utils.IllegalArgumentException: u"requirement failed: Invalid initial capacity"

Other than the DataFrame being empty, I haven't found much information about what could cause it, and that isn't the case here. Here is my code:

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Load the UCI abalone data with pandas and convert it to a Spark DataFrame
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data')
df.columns = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
train_dataset = sqlContext.createDataFrame(df)

# Split the columns into categorical (string) and numeric ones by dtype
column_types = train_dataset.dtypes

categoricalCols = []
numericCols = []

for ct in column_types:
    if ct[1] == 'string':
        categoricalCols += [ct[0]]
    else:
        numericCols += [ct[0]]

# Index each categorical column into a numeric "<col>Index" column
stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
    stages += [stringIndexer]

# Note: map() returns a list only in Python 2; see the comments below
assemblerInputs = map(lambda c: c + "Index", categoricalCols) + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Index the label column ('Rings') as well
labelIndexer = StringIndexer(inputCol='Rings', outputCol='indexedLabel')
stages += [labelIndexer]

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")

evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1,2,6])
             .addGrid(dt.maxBins, [20,40])
             .build())

stages += [dt]
pipeline = Pipeline(stages=stages)

# numFolds=1 turns out to be the cause of the error; CrossValidator needs at least 2 folds (see the answer below)
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=1)

cvModel = cv.fit(train_dataset)
train_dataset = cvModel.transform(train_dataset)
I'm running Spark in local standalone mode. Is there something wrong here?


Thanks.

It turns out the problem was setting CrossValidator's numFolds parameter to 1. If I want to tune parameters with a ParamGrid over just a single train/test split, I apparently need to use TrainValidationSplit instead.
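For reference, a minimal sketch of that fix, reusing the pipeline, paramGrid, and evaluator built above (the trainRatio of 0.8 is an assumed value, not from the original post):

from pyspark.ml.tuning import TrainValidationSplit

# TrainValidationSplit evaluates each parameter combination on a single
# random train/validation split, which is what numFolds=1 was trying to express
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=evaluator,
                           trainRatio=0.8)  # assumed: 80% train, 20% validation

tvsModel = tvs.fit(train_dataset)
train_dataset = tvsModel.transform(train_dataset)

If k-fold cross-validation is actually wanted, keeping CrossValidator with numFolds set to 2 or more also avoids the error.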

Comment: Does that line even work? assemblerInputs = map(lambda c: c + "Index", categoricalCols) + numericCols — you're trying to concatenate a map with the numericCols list there. Shouldn't it be assemblerInputs = [x + "Index" for x in categoricalCols] + numericCols?

Reply: I changed it to what you suggested, but I still get the same error.
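As background for that comment, a small self-contained sketch (reusing a few of the question's column names) of why the original line only works on Python 2:

categoricalCols = ['Sex']
numericCols = ['Length', 'Diameter']

# Python 2: map() returns a list, so the + concatenation works.
# Python 3: map() returns a lazy iterator, and iterator + list raises
#   TypeError: unsupported operand type(s) for +: 'map' and 'list'
# Portable alternatives:
assemblerInputs = [c + "Index" for c in categoricalCols] + numericCols
# or: list(map(lambda c: c + "Index", categoricalCols)) + numericCols
print(assemblerInputs)  # ['SexIndex', 'Length', 'Diameter']

Either way, as the reply notes, this wasn't the cause of the IllegalArgumentException; that was the numFolds=1 setting.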