
Python: Flatten a Spark DataFrame column of maps/dictionaries into multiple columns


We have a DataFrame that looks like this:

DataFrame[event: string, properties: map<string,string>]
We can pull a single key out into its own column with

newDf = df.withColumn("foo", col("properties")["foo"])

producing the DataFrame

DataFrame[event: string, properties: map<string,string>, foo: String]
You can use the explode() function - it flattens the map by creating two additional columns, key and value, one row per map entry:

>>> df.printSchema()
root
 |-- event: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

>>> df.select('event', explode('properties')).printSchema()
root
 |-- event: string (nullable = true)
 |-- key: string (nullable = false)
 |-- value: string (nullable = true)
If you have a column with unique values that you can group by, you can use pivot. For example:

df.withColumn('id', monotonically_increasing_id()) \
    .select('id', 'event', explode('properties')) \
    .groupBy('id', 'event').pivot('key').agg(first('value'))

This is interesting, but I was hoping it could expand into N columns based on whatever keys are available. Do you know of a way to do that?

If you have a column with unique values, you can use pivot. For example:
df.withColumn('id', monotonically_increasing_id()) \
    .select('id', 'event', explode('properties')) \
    .groupBy('id', 'event').pivot('key').agg(first('value'))