Python: fitting training data with a DecisionTreeRegressor causes a crash (python, apache-spark, pyspark)

I am trying to fit a decision tree regression model on some training data, but calling fit() raises an error.

    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    vecAssembler = VectorAssembler(inputCols=["_1", "_2", "_3", "_4", "_5", "_6", "_7", "_8", "_9", "_10"], outputCol="features")

    dt = DecisionTreeRegressor(featuresCol="features", labelCol="_11")

    dt_model = dt.fit(trainingData)

This produces the following error:

  File "spark.py", line 100, in <module>
    main()
  File "spark.py", line 87, in main
    dt_model = dt.fit(trainingData)
  File "/opt/spark/python/pyspark/ml/base.py", line 132, in fit
    return self._fit(dataset)
  File "/opt/spark/python/pyspark/ml/wrapper.py", line 295, in _fit
    java_model = self._fit_java(dataset)
  File "/opt/spark/python/pyspark/ml/wrapper.py", line 292, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'

But the two struct types in the message are exactly the same.

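You are missing two steps: 1. the transform step, and 2. selecting the features and the label from the transformed data. I am assuming the data contains only numeric columns, i.e. no categorical data. To help you, I will write a generic model-training flow using pyspark.ml: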

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import DecisionTreeRegressor

    # data processing part

    # replace the ... with the remaining feature column names
    vecAssembler = VectorAssembler(inputCols=['col_1', 'col_2', ..., 'col_10'], outputCol='features')

    # you missed these two steps
    trans_data = vecAssembler.transform(data)

    final_data = trans_data.select('features', 'col_11')  # your label column name is col_11

    train_data, test_data = final_data.randomSplit([0.7, 0.3])

    # ml part

    # regression model trained on the assembled features
    dt = DecisionTreeRegressor(featuresCol='features', labelCol='col_11')

    dt_model = dt.fit(train_data)

    dt_predictions = dt_model.transform(test_data)

    # proceed with the model evaluation part after this