Recursively flatten children and return the full schema (PySpark)
I have a JSON file that contains nesting of identically named collections under the attribute "Tag". The number of these nested entries varies. For example:
{
    "Id" : "001",
    "Type" : "Work",
    "Tag" : [
        {
            "Id" : "a123",
            "Location" : [
                {
                    "LocName" : "Astro",
                    "LocCode" : "AST"
                }
            ],
            "displayName" : "Al"
        },
        {
            "Id" : "e789",
            "Location" : [
                {
                    "LocName" : "Cosmos",
                    "LocCode" : "COS"
                }
            ],
            "displayName" : "Tom"
        }
    ],
    "version" : 2
}
I am trying to recursively flatten the nested children so that they follow this schema, and to obtain the final output in this form:
root
|-- Id: string (nullable = true)
|-- Type: string (nullable = true)
|-- Tag: struct (nullable = true)
| |-- Tag.Id: string (nullable = true)
| |-- Tag.Location: struct (nullable = true)
| | |-- Location.LocName: string (nullable = true)
| | |-- Location.LocCode: string (nullable = true)
| |-- Tag.displayName: string (nullable = true)
|-- version: string (nullable = true)
+---+----+------+--------------------+--------------------+---------------+-------+
| Id|Type|Tag_Id|Tag_Location_LocName|Tag_Location_LocCode|Tag_displayName|version|
+---+----+------+--------------------+--------------------+---------------+-------+
|001|Work|  a123|               Astro|                 AST|             Al|      2|
|001|Work|  e789|              Cosmos|                 COS|            Tom|      2|
+---+----+------+--------------------+--------------------+---------------+-------+
So far we have managed to use explode and de-nesting for the first level of nesting, but we are stuck on the recursive part (and on piping out the flattened children, together with the remaining attributes, as new rows). Can anyone share an approach to accomplish this?

There is currently no built-in Spark function that does this. However, I have put together one way of achieving it below. One assumption this code makes is that the dictionary you are trying to process is not so large that it cannot be read into memory in one go.
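Since the question starts from a JSON file, the dictionary processed below could first be loaded with the standard library; a minimal sketch, where the file name is a placeholder (this answer simply hard-codes the same data instead):

import json

# Hypothetical path; the rest of the answer hard-codes this dictionary.
with open("input.json") as f:
    inputs = json.load(f)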
from pyspark.sql import Row
inputs = {
    "Id": "001",
    "Type": "Work",
    "Tag": [
        {
            "Id": "a123",
            "Location": [
                {"LocName": "Astro", "LocCode": "AST"}
            ],
            "displayName": "Al"
        },
        {
            "Id": "e789",
            "Location": [
                {"LocName": "Cosmos", "LocCode": "COS"}
            ],
            "displayName": "Tom"
        }
    ],
    "version": 2
}
# Need to get all possible column names beforehand
# This is so we can avoid schema conflicts
def get_column_map(input_dict, columns=None, key_stack=None):
    # Use None defaults: mutable default arguments would leak state
    # between separate calls to this function.
    columns = [] if columns is None else columns
    key_stack = [] if key_stack is None else key_stack
    for k, v in input_dict.items():
        if type(v) is list:
            key_stack.append(k)
            for list_item in v:
                get_column_map(list_item, columns, key_stack)
            key_stack.pop()
        elif type(v) is dict:
            key_stack.append(k)
            get_column_map(v, columns, key_stack)
            key_stack.pop()
        else:
            column_name = "_".join(key_stack + [k])
            columns.append(column_name)
    # Deduplicate the collected names and initialise each column to None
    mapper = {}
    for item in set(columns):
        mapper[item] = None
    return mapper
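As a quick sanity check, here is a sketch of what this returns for the sample inputs above: one key per flattened column name, each initialised to None.

mapper = get_column_map(inputs)
print(sorted(mapper))
# ['Id', 'Tag_Id', 'Tag_Location_LocCode', 'Tag_Location_LocName',
#  'Tag_displayName', 'Type', 'version']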
# After knowing the column names, I can populate them.
# One trick is to process all non-dict/list items first,
# so you can easily append a row when you are at the last child in the nest.
def process_map(input_dict, column_dict, key_stack=None, rows=None):
    key_stack = [] if key_stack is None else key_stack
    rows = [] if rows is None else rows

    def order_dict(x):
        # Scalars rank higher, so they are visited before lists/dicts
        if type(x[1]) is not list and type(x[1]) is not dict:
            return 1
        return 0

    items = sorted(
        input_dict.items(),
        key=order_dict,
        reverse=True
    )
    last_child = True
    for k, v in items:
        if type(v) is list:
            last_child = False
            key_stack.append(k)
            for list_item in v:
                process_map(list_item, column_dict, key_stack, rows)
            key_stack.pop()
        elif type(v) is dict:
            last_child = False
            key_stack.append(k)
            process_map(v, column_dict, key_stack, rows)
            key_stack.pop()
        else:
            column_name = "_".join(key_stack + [k])
            column_dict[column_name] = v
    if last_child:
        # Snapshot every accumulated column value into one Row
        rows.append(Row(**column_dict))
    return rows
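To illustrate the ordering trick, here is a sketch of how the sorting key reorders one of the tag dictionaries so that scalar attributes are visited before the nested Location list (order_dict is local to process_map, so this is purely illustrative):

tag = {"Id": "a123", "Location": [{"LocName": "Astro", "LocCode": "AST"}], "displayName": "Al"}
# sorted(tag.items(), key=order_dict, reverse=True) visits the items as:
# [('Id', 'a123'), ('displayName', 'Al'), ('Location', [...])]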
# Can put this in a main or leave it in a functional way at bottom
mapper = get_column_map(inputs)
rows = process_map(inputs, mapper)
final_df = spark.createDataFrame(rows)
Running this code in my environment, I get the flattened rows described in the question.
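A sketch of those rows for the sample input (the field order may differ: before Spark 3.0, Row(**kwargs) sorted its fields alphabetically):

for row in rows:
    print(row)
# Roughly, one Row per innermost Location entry:
# Row(Id='001', Type='Work', version=2, Tag_Id='a123', Tag_displayName='Al',
#     Tag_Location_LocName='Astro', Tag_Location_LocCode='AST')
# Row(Id='001', Type='Work', version=2, Tag_Id='e789', Tag_displayName='Tom',
#     Tag_Location_LocName='Cosmos', Tag_Location_LocCode='COS')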
Given that you have already declared the Spark dataframe, we can use it to flatten your schema. You can do this in two steps: first explode every array, then flatten the remaining structs.
from pyspark.sql.types import StructType, StructField, ArrayType
from pyspark.sql.functions import explode_outer


def flatten(df):
    """
    Creates a new dataframe based on a flat schema,
    exploding arrays and flattening structs.
    """
    f_df = df
    select_expr = explode_array(element=f_df.schema)
    # Explode while there is at least one array.
    while "ArrayType(" in f"{f_df.schema}":
        f_df = f_df.selectExpr(select_expr)
        select_expr = explode_array(element=f_df.schema)
    # Flatten the structs.
    select_expr = flatten_expr(f_df.schema)
    f_df = f_df.selectExpr(select_expr)
    return f_df


def explode_array(element, root=None):
    """
    Explodes arrays into new rows.
    It only explodes one level of arrays.
    """
    el_type = type(element)
    expr = []
    try:
        _path = f"{root + '.' if root else ''}{element.name}"
    except AttributeError:
        _path = ""
    if el_type == StructType:
        for t in element:
            res = explode_array(t, root)
            expr.extend(res)
    elif el_type == StructField and type(element.dataType) == ArrayType:
        expr.append(f"explode_outer({_path}) as {_path.replace('.', '_')}")
    elif el_type == StructField and type(element.dataType) == StructType:
        expr.extend(explode_array(element.dataType, _path))
    else:
        expr.append(f"{_path} as {_path.replace('.', '_')}")
    return expr


def flatten_expr(element, root=None):
    """
    Flattens the structs of a dataframe
    (using '_' between level names).
    It does not work on arrays, so you need to make
    sure there are no arrays in the input schema.
    """
    expr = []
    el_type = type(element)
    try:
        _path = f"{root + '.' if root else ''}{element.name}"
    except AttributeError:
        _path = ""
    if el_type == StructType:
        for t in element:
            expr.extend(flatten_expr(t, root))
    elif el_type == StructField and type(element.dataType) == StructType:
        expr.extend(flatten_expr(element.dataType, _path))
    elif el_type == StructField and type(element.dataType) == ArrayType:
        # You should run explode_array first so this case never happens
        expr.extend(flatten_expr(element.dataType.elementType, f"{_path}[0]"))
    else:
        expr.append(f"{_path} as {_path.replace('.', '_')}")
    return expr
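To make the two steps concrete, here is roughly what the helpers generate for the sample schema shown further down (a sketch; the field order is alphabetical because spark.read.json sorts fields):

# Pass 1 of explode_array: explode the top-level Tag array
# ['Id as Id', 'explode_outer(Tag) as Tag', 'Type as Type', 'version as version']
#
# Pass 2: Tag is now a struct, so its nested Location array gets exploded
# ['Id as Id', 'Tag.Id as Tag_Id', 'explode_outer(Tag.Location) as Tag_Location',
#  'Tag.displayName as Tag_displayName', 'Type as Type', 'version as version']
#
# Final flatten_expr pass: no arrays remain, the leftover struct is flattened
# ['Id as Id', 'Tag_Id as Tag_Id', 'Tag_Location.LocCode as Tag_Location_LocCode',
#  'Tag_Location.LocName as Tag_Location_LocName',
#  'Tag_displayName as Tag_displayName', 'Type as Type', 'version as version']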
With those functions in place, we can do the following:
json_test = spark.read.json(sc.parallelize(['{"Id": "001", "Type": "Work", "Tag": [{"Id": "a123", "Location": [{"LocName": "Astro", "LocCode": "AST"}], "displayName": "Al"}, {"Id": "e789", "Location": [{"LocName": "Cosmos", "LocCode": "COS"}], "displayName": "Tom"}], "version": 2}']))
json_test.printSchema()
f_df = flatten(json_test)
f_df.printSchema()
f_df.show()
This gives you the original schema:
root
|-- Id: string (nullable = true)
|-- Tag: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Location: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- LocCode: string (nullable = true)
| | | | |-- LocName: string (nullable = true)
| | |-- displayName: string (nullable = true)
|-- Type: string (nullable = true)
|-- version: long (nullable = true)
The new schema:
root
|-- Id: string (nullable = true)
|-- Tag_Id: string (nullable = true)
|-- Tag_Location_LocCode: string (nullable = true)
|-- Tag_Location_LocName: string (nullable = true)
|-- Tag_displayName: string (nullable = true)
|-- Type: string (nullable = true)
|-- version: long (nullable = true)
And the dataframe:
+---+------+--------------------+--------------------+---------------+----+-------+
| Id|Tag_Id|Tag_Location_LocCode|Tag_Location_LocName|Tag_displayName|Type|version|
+---+------+--------------------+--------------------+---------------+----+-------+
|001| a123| AST| Astro| Al|Work| 2|
|001| e789| COS| Cosmos| Tom|Work| 2|
+---+------+--------------------+--------------------+---------------+----+-------+
I hope this helps you think through your solution.