Flatten nested struct in PySpark array
Given the following schema:
root
|-- first_name: string
|-- last_name: string
|-- degrees: array
| |-- element: struct
| | |-- school: string
| | |-- advisors: struct
| | | |-- advisor1: string
| | | |-- advisor2: string
How can I get a schema like the following:
root
|-- first_name: string
|-- last_name: string
|-- degrees: array
| |-- element: struct
| | |-- school: string
| | |-- advisor1: string
| | |-- advisor2: string
Currently, I explode the array, flatten the struct by selecting advisor.*, and then rebuild the array by grouping on first_name and last_name and using collect_list. I was hoping for a cleaner/shorter way to do this; as it stands there is a lot of hassle renaming some fields and other details I don't want to get into here. Thanks!

You can use a udf to change the datatype of the nested columns in your dataframe.
Assuming you have read the dataframe as df1:
from pyspark.sql.functions import udf
from pyspark.sql.types import *

def foo(data):
    # Rebuild each array element as a flat (school, advisor1, advisor2) tuple.
    return list(map(
        lambda x: (
            x["school"],
            x["advisors"]["advisor1"],
            x["advisors"]["advisor2"]
        ),
        data
    ))

struct = ArrayType(
    StructType([
        StructField("school", StringType()),
        StructField("advisor1", StringType()),
        StructField("advisor2", StringType())
    ])
)

udf_foo = udf(foo, struct)
df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()
Output:
root
|-- degrees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- school: string (nullable = true)
| | |-- advisor1: string (nullable = true)
| | |-- advisor2: string (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
Here is a more generic solution that can flatten multiple layers of nested structs:
from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    flat_cols = []
    nested_cols = []
    flat_df = []

    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])

    flat_df.append(nested_df.select(flat_cols[0] +
                                    [col(nc + '.' + c).alias(nc + '_' + c)
                                     for nc in nested_cols[0]
                                     for c in nested_df.select(nc + '.*').columns])
                   )
    for i in range(1, layers):
        print(flat_cols[i - 1])
        flat_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] == 'struct'])

        flat_df.append(flat_df[i - 1].select(flat_cols[i] +
                                             [col(nc + '.' + c).alias(nc + '_' + c)
                                              for nc in nested_cols[i]
                                              for c in flat_df[i - 1].select(nc + '.*').columns])
                       )
    return flat_df[-1]
Then simply call:
my_flattened_df = flatten_df(my_df_having_structs, 3)
(The second argument is the number of layers to flatten; in my case it is 3.)