Apache Spark: fitting multiple numeric columns into a Spark ML model with PySpark


I am working with Spark 1.6.2 and I have a DataFrame with 102 columns:

f0, f1, ..., f101

f0 contains the index, f101 contains the label, and the other columns are numeric features (floats).

I want to train a random forest model (spark.ml) on this DataFrame, so I used VectorAssembler to produce a single features column to fit the model:

from pyspark.ml.feature import VectorAssembler
ignore = ['f0', 'f101']
assembler = VectorAssembler(inputCols=[x for x in df.columns if x not in ignore], outputCol='features')

df = assembler.transform(df)  # transform returns a new DataFrame; it does not modify df in place
df.show()
However, this was unsuccessful and produced the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o255.transform.
: org.apache.spark.SparkException: VectorAssembler does not support the StringType type
Is there any other way to fit these multiple columns into the model?

Here are the first two rows of my DataFrame (note that all of my columns are string type, which may be the cause of the problem):


We will use a `parse_` udf that we define, together with `concat_ws`:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors, VectorUDT

rdd = sc.parallelize(['0|-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796','1|0.6452699899673462|0.528219997882843|-0.5653899908065796|-0.4328500032424927|0.9352899789810181']).map(lambda x: x.split('|'))

df = sqlContext.createDataFrame(rdd, ['f1','f2','f3','f4','f5','f6'])

ignore = ['f1','f4'] # columns to ignore
keep = [x for x in df.columns if x not in ignore] # columns to keep

parse_ = udf(Vectors.parse, VectorUDT())
parsed = df.withColumn("features", F.concat(F.lit('['), F.concat_ws(",", *keep), F.lit(']'))). \
            withColumn("features", parse_("features"))

parsed.show(truncate=False)
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+
# |f1 |f2                  |f3                 |f4                 |f5                 |f6                |features                                                                        |
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+
# |0  |-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796|[-0.38672998547554016,-1.5183000564575195,1.2288000583648682,0.7216399908065796]|
# |1  |0.6452699899673462  |0.528219997882843  |-0.5653899908065796|-0.4328500032424927|0.9352899789810181|[0.6452699899673462,0.528219997882843,-0.4328500032424927,0.9352899789810181]   |
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+

This should work. I just used a smaller example than yours.

Have you tried assigning the result of the list comprehension outside of
VectorAssembler()
and then passing it in as an arg? — Yes, I ran into the same error; I also tried passing the list ['f1', 'f2', ...] and got the same error. — Could you add the schema of your DataFrame? — @eliash Done :)