Splitting an array column into rows in PySpark

I have a DataFrame similar to the following:
new_df = spark.createDataFrame([
([['hello', 'productcode'], ['red','color']], 7),
([['hi', 'productcode'], ['blue', 'color']], 8),
([['hoi', 'productcode'], ['black','color']], 7)
], ["items", "frequency"])
new_df.show(3, False)
# +------------------------------------------------------------+---------+
# |items |frequency|
# +------------------------------------------------------------+---------+
# |[WrappedArray(hello, productcode), WrappedArray(red, color)]|7 |
# |[WrappedArray(hi, productcode), WrappedArray(blue, color)] |8 |
# |[WrappedArray(hoi, productcode), WrappedArray(black, color)]|7 |
# +------------------------------------------------------------+---------+
I need to generate a new DataFrame that looks like this:

# +------------+-------+---------+
# |productcode | color |frequency|
# +------------+-------+---------+
# |hello       | red   | 7       |
# |hi          | blue  | 8       |
# |hoi         | black | 7       |
# +------------+-------+---------+
You can convert items into a map:

from pyspark.sql.functions import col, udf

@udf("map<string, string>")
def as_map(vks):
    # each inner list is [value, key]; swap them to build {key: value}
    return {k: v for v, k in vks}

remapped = new_df.select("frequency", as_map("items").alias("items"))
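The UDF body is plain Python, so the swap can be checked without Spark; a minimal sketch using one of the sample rows from the question:

```python
# Each inner list stores [value, key]; the comprehension flips each
# pair so the label ("productcode", "color") becomes the dict key.
def as_map(vks):
    return {k: v for v, k in vks}

row = [['hello', 'productcode'], ['red', 'color']]
print(as_map(row))  # {'productcode': 'hello', 'color': 'red'}
```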
Then select the individual keys (here `keys = ["color", "productcode"]`):

keys = ["color", "productcode"]
remapped.select([col("items")[key] for key in keys] + ["frequency"]).show()
+------------+------------------+---------+
|items[color]|items[productcode]|frequency|
+------------+------------------+---------+
| red| hello| 7|
| blue| hi| 8|
| black| hoi| 7|
+------------+------------------+---------+
Thank you for the reply, but my DataFrame has 3 elements and the expected result is different; I don't really need it split into rows.
new_df.select(
    col("items").getItem(0).getItem(0).alias('productcode'),
    col("items").getItem(1).getItem(0).alias('color'),
    col("frequency")
).show()
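Since the [value, key] pairs always arrive in the same order, positional indexing works without the map UDF; a plain-Python sketch of what `getItem(i).getItem(0)` extracts from each row:

```python
# Sample data in the same shape as the question's DataFrame.
rows = [
    ([['hello', 'productcode'], ['red', 'color']], 7),
    ([['hi', 'productcode'], ['blue', 'color']], 8),
    ([['hoi', 'productcode'], ['black', 'color']], 7),
]

# items[0][0] is the productcode value, items[1][0] is the color value.
flattened = [(items[0][0], items[1][0], freq) for items, freq in rows]
print(flattened)
# [('hello', 'red', 7), ('hi', 'blue', 8), ('hoi', 'black', 7)]
```

This mirrors the positional `getItem` approach, but note it silently breaks if the order of the inner lists ever varies, which is why the map-based answer is more robust.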