Convert a list into DataFrame columns in PySpark
I have a DataFrame in which one string-type column contains a list of items. I want to explode these items so that they become part of the parent DataFrame. How can I do this? Here is the code to create a sample DataFrame:
from pyspark.sql import Row
from collections import OrderedDict
def convert_to_row(d: dict) -> Row:
    return Row(**OrderedDict(sorted(d.items())))
df=sc.parallelize([{"arg1": "first", "arg2": "John", "arg3" : '[{"name" : "click", "datetime" : "1570103345039", "event" : "entry" }, {"name" : "drag", "datetime" : "1580133345039", "event" : "exit" }]'},{"arg1": "second", "arg2": "Joe", "arg3": '[{"name" : "click", "datetime" : "1670105345039", "event" : "entry" }, {"name" : "drop", "datetime" : "1750134345039", "event" : "exit" }]'},{"arg1": "third", "arg2": "Jane", "arg3" : '[{"name" : "click", "datetime" : "1580105245039", "event" : "entry" }, {"name" : "drop", "datetime" : "1650134345039", "event" : "exit" }]'}]) \
.map(convert_to_row).toDF()
Running this code creates a DataFrame that looks like this:
+------+----+--------------------+
|  arg1|arg2|                arg3|
+------+----+--------------------+
| first|John|[{"name" : "click...|
|second| Joe|[{"name" : "click...|
| third|Jane|[{"name" : "click...|
+------+----+--------------------+
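The arg3 column contains a list (stored as a JSON string) that I want to explode into detail columns. I want the DataFrame to look like this: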
arg1 | arg2 | arg3 | name | datetime | event
How can I achieve this?
You need to specify an array type as the schema in the from_json function:
from pyspark.sql.functions import explode, from_json
schema = 'array<struct<name:string,datetime:string,event:string>>'
df.withColumn('data', explode(from_json('arg3', schema))) \
.select(*df.columns, 'data.*') \
.show()
+------+----+--------------------+-----+-------------+-----+
|  arg1|arg2|                arg3| name|     datetime|event|
+------+----+--------------------+-----+-------------+-----+
| first|John|[{"name" : "click...|click|1570103345039|entry|
| first|John|[{"name" : "click...| drag|1580133345039| exit|
|second| Joe|[{"name" : "click...|click|1670105345039|entry|
|second| Joe|[{"name" : "click...| drop|1750134345039| exit|
| third|Jane|[{"name" : "click...|click|1580105245039|entry|
| third|Jane|[{"name" : "click...| drop|1650134345039| exit|
+------+----+--------------------+-----+-------------+-----+
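To make the row-multiplying behaviour above easier to follow, here is a minimal plain-Python sketch of what the from_json plus explode combination does to each row. It does not use Spark at all: `explode_json_column` is a hypothetical helper written only for illustration, and the sample rows are a shortened copy of the question's data.

```python
import json

# Shortened copy of the question's data; arg3 is a JSON string, not a Python list.
rows = [
    {"arg1": "first", "arg2": "John",
     "arg3": '[{"name": "click", "datetime": "1570103345039", "event": "entry"}, '
             '{"name": "drag", "datetime": "1580133345039", "event": "exit"}]'},
    {"arg1": "second", "arg2": "Joe",
     "arg3": '[{"name": "click", "datetime": "1670105345039", "event": "entry"}]'},
]


def explode_json_column(rows, column):
    """Hypothetical helper: parse `column` as a JSON array and emit one row per element."""
    exploded = []
    for row in rows:
        # json.loads plays the role of from_json: string -> list of dicts.
        for item in json.loads(row[column]):
            # Merging the parent row with the item mimics explode followed by 'data.*'.
            exploded.append({**row, **item})
    return exploded


result = explode_json_column(rows, "arg3")
for r in result:
    print(r["arg1"], r["arg2"], r["name"], r["datetime"], r["event"])
```

One difference worth keeping in mind: with its default options, from_json yields null for a string it cannot parse and explode then drops that row, whereas json.loads in this sketch raises an exception.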
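Possible duplicate.
Not exactly the same. One difference is the third column I want to expand, which is a list of items. I want to explode it so that the items are returned in multiple rows. Then I can apply from_json as shown in the link you provided. I want to know how to split the list of items into multiple rows.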
from pyspark.sql.types import ArrayType, StringType, StructType, StructField
# Same schema as the DDL string above, built from pyspark.sql.types objects
schema = ArrayType(
    StructType([
        StructField('name', StringType()),
        StructField('datetime', StringType()),
        StructField('event', StringType()),
    ])
)