Split a column of lists into multiple columns in the same PySpark dataframe

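I have the following dataframe, which contains two columns:

The first column holds the column names.

The second column holds lists of values.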

+--------------------+--------------------+
|              Column|            Quantile|
+--------------------+--------------------+
|                rent|[4000.0, 4500.0, ...|
|     is_rent_changed|[0.0, 0.0, 0.0, 0...|
|               phone|[7.022372888E9, 7...|
|          Area_house|[1000.0, 1000.0, ...|
|       bedroom_count|[1.0, 1.0, 1.0, 1...|
|      bathroom_count|[1.0, 1.0, 1.0, 1...|
|    maintenance_cost|[0.0, 0.0, 0.0, 0...|
|            latitude|[12.8217605, 12.8...|
|            Max_rent|[9000.0, 10000.0,...|
|                Beds|[2.0, 2.0, 2.0, 2...|
|                Area|[1000.0, 1000.0, ...|
|            Avg_Rent|[3500.0, 4000.0, ...|
|      deposit_amount|[0.0, 0.0, 0.0, 0...|
|          commission|[0.0, 0.0, 0.0, 0...|
|        monthly_rent|[0.0, 0.0, 0.0, 0...|
|is_min_rent_guara...|[0.0, 0.0, 0.0, 0...|
|min_guarantee_amount|[0.0, 0.0, 0.0, 0...|
|min_guarantee_dur...|[1.0, 1.0, 1.0, 1...|
|        furnish_cost|[0.0, 0.0, 0.0, 0...|
|  owner_furnish_part|[0.0, 0.0, 0.0, 0...|
+--------------------+--------------------+

How can I split the second column into multiple columns while keeping the same dataset?

I can access the values like this:

univar_df10.select("Column", univar_df10.Quantile[0],univar_df10.Quantile[1],univar_df10.Quantile[2]).show()

+--------------------+-------------+-------------+------------+
|              Column|  Quantile[0]|  Quantile[1]| Quantile[2]|
+--------------------+-------------+-------------+------------+
|                rent|       4000.0|       4500.0|      5000.0|
|     is_rent_changed|          0.0|          0.0|         0.0|
|               phone|7.022372888E9|7.042022842E9|7.07333021E9|
|          Area_house|       1000.0|       1000.0|      1000.0|
|       bedroom_count|          1.0|          1.0|         1.0|
|      bathroom_count|          1.0|          1.0|         1.0|
|    maintenance_cost|          0.0|          0.0|         0.0|
|            latitude|   12.8217605|   12.8490502|   12.863517|
|            Max_rent|       9000.0|      10000.0|     11500.0|
|                Beds|          2.0|          2.0|         2.0|
|                Area|       1000.0|       1000.0|      1000.0|
|            Avg_Rent|       3500.0|       4000.0|      4125.0|
|      deposit_amount|          0.0|          0.0|         0.0|
|          commission|          0.0|          0.0|         0.0|
|        monthly_rent|          0.0|          0.0|         0.0|
|is_min_rent_guara...|          0.0|          0.0|         0.0|
|min_guarantee_amount|          0.0|          0.0|         0.0|
|min_guarantee_dur...|          1.0|          1.0|         1.0|
|        furnish_cost|          0.0|          0.0|         0.0|
|  owner_furnish_part|          0.0|          0.0|         0.0|
+--------------------+-------------+-------------+------------+
only showing top 20 rows

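I want my new dataframe to have the lists in the second column split into multiple columns, like the dataset above. Thanks in advance.

Assuming that your problem (the question was marked for closure because it was unclear what you were asking) is that the lists in your Quantile column have some length, so that building the respective command by hand is not convenient, here is a solution that uses list addition and a list comprehension as the argument to select: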

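spark.version
# u'2.2.1'

# make some toy data
from pyspark.sql import Row

df = spark.createDataFrame([Row([0,45,63,0,0,0,0]),
                            Row([0,0,0,85,0,69,0]),
                            Row([0,89,56,0,0,0,0])],
                           ['features'])

df.show()
# result:
# +-----------------------+
# |               features|
# +-----------------------+
# |[0, 45, 63, 0, 0, 0, 0]|
# |[0, 0, 0, 85, 0, 69, 0]|
# |[0, 89, 56, 0, 0, 0, 0]|
# +-----------------------+

Get the length of your lists, if you don't already know it (here it is 7):

length = len(df.select('features').take(1)[0][0])
length
# 7

df.select([df.features] + [df.features[i] for i in range(length)]).show()
# result:
# +--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# |            features|features[0]|features[1]|features[2]|features[3]|features[4]|features[5]|features[6]|
# +--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# |[0, 45, 63, 0, 0,...|          0|         45|         63|          0|          0|          0|          0|
# |[0, 0, 0, 85, 0, ...|          0|          0|          0|         85|          0|         69|          0|
# |[0, 89, 56, 0, 0,...|          0|         89|         56|          0|          0|          0|          0|
# +--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

So, in your case,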

univar_df10.select([univar_df10.Column] + [univar_df10.Quantile[i] for i in range(length)])

should do the job, after you have calculated the length as

length = len(univar_df10.select('Quantile').take(1)[0][0])

Here is pseudocode for doing this in Scala:

import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions.col

// Names for the new columns.
val quantileColumn = Seq("quantile1", "quantile2", "quantile3")

// Number of columns to create.
val numberOfColumns = quantileColumn.size

// Build one column expression per position.
// Note: split() expects a comma-separated string column; for an array column,
// col("Quantile").getItem(i) can be used directly instead.
val columnList = for (i <- 0 until numberOfColumns) yield split(col("Quantile"), ",").getItem(i).alias(quantileColumn(i))

// Perform the select; df is the input DataFrame.
df.select(columnList: _*)

// To add or drop further columns, use withColumn and drop on df.

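So what is wrong with univar_df10.select? What is the question? You seem to have already found what you want: new_df = univar_df10.select("Column", univar_df10.Quantile[0], univar_df10.Quantile[1], univar_df10.Quantile[2])

How can this be done in Scala Spark?

@jxn Sorry, I'm not aware of the Scala details.

Hi @jxn, we can implement this in Scala too. I am using for and yield to achieve it. Check my answer, I hope it helps.

Please use the imports below: import org.apache.spark.sql.functions.split import org.apache.spark.sql.functions.col

Please do not use comments to add material; edit and update your post instead. Also, please avoid answering follow-up questions in comments; the post as it stands is clearly about pyspark.

Cool, I'll note that down.

Please add the imports to the answer!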