Apache Spark: fitting multiple numeric columns into a Spark ML model with PySpark

Tags: apache-spark, pyspark, apache-spark-ml

I'm working on Spark 1.6.2 and I have a DataFrame with 102 columns:

f0, f1, ..., f101

f0 holds an index, f101 holds the label, and the remaining columns are numeric features (floats). I want to train a random forest model (spark.ml) on this DataFrame, so I used VectorAssembler to produce a single features column to fit the model:
from pyspark.ml.feature import VectorAssembler

ignore = ['f0', 'f101']
assembler = VectorAssembler(inputCols=[x for x in df.columns if x not in ignore], outputCol='features')
df = assembler.transform(df)
df.show()
But this fails with the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o255.transform.
: org.apache.spark.SparkException: VectorAssembler does not support the StringType type
Is there another way to feed these multiple columns into the model?
Here are the first two rows of my DataFrame (note that all of my columns are of string type, which is probably the cause of the problem).
We'll use a parse_ udf that we define, together with concat_ws:
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors, VectorUDT

rdd = sc.parallelize(['0|-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796',
                      '1|0.6452699899673462|0.528219997882843|-0.5653899908065796|-0.4328500032424927|0.9352899789810181']) \
        .map(lambda x: x.split('|'))
df = sqlContext.createDataFrame(rdd, ['f1', 'f2', 'f3', 'f4', 'f5', 'f6'])

ignore = ['f1', 'f4']                              # columns to ignore
keep = [x for x in df.columns if x not in ignore]  # columns to keep

# build a '[v1,v2,...]' string from the kept columns, then parse it into a vector
parse_ = udf(Vectors.parse, VectorUDT())
parsed = df.withColumn("features", F.concat(F.lit('['), F.concat_ws(",", *keep), F.lit(']'))) \
           .withColumn("features", parse_("features"))
parsed.show(truncate=False)
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+
# |f1 |f2                  |f3                 |f4                 |f5                 |f6                |features                                                                        |
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+
# |0  |-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796|[-0.38672998547554016,-1.5183000564575195,1.2288000583648682,0.7216399908065796]|
# |1  |0.6452699899673462  |0.528219997882843  |-0.5653899908065796|-0.4328500032424927|0.9352899789810181|[0.6452699899673462,0.528219997882843,-0.4328500032424927,0.9352899789810181]   |
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+
This should work; I just used a smaller example than yours. Have you tried assigning the result of the list comprehension outside VectorAssembler() and then passing it in as an arg? — Yes, I hit the same error; I also tried passing the list ['f1', 'f2', ...] directly and got the same error. — Can you add your DataFrame's schema? — @eliash Done :)