Recursively flatten children and return the full schema (PySpark)


I have a JSON file that contains nested collections of the same name under the attribute "Tag". The depth of this particular nesting varies. For example:

{ 
    "Id" : "001", 
    "Type" : "Work", 
    "Tag" : [
        {
            "Id" : "a123", 
            "Location" : [
                {
                    "LocName" : "Astro", 
                    "LocCode" : "AST"
                }
            ],  
            "displayName" : "Al"
        }, 
        {
            "Id" : "e789", 
            "Location" : [
                {
                    "LocName" : "Cosmos", 
                    "LocCode" : "COS"
                }
            ], 
            "displayName" : "Tom"
        }
    ], 
    "version" : 2
}
I am trying to recursively flatten the nested children to follow this schema, so that the final output comes out in this form:

root
 |-- Id: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Tag: struct (nullable = true)
 |    |-- Tag.Id: string (nullable = true)
 |    |-- Tag.Location: struct (nullable = true)
 |    |    |-- Location.LocName: string (nullable = true)
 |    |    |-- Location.LocCode: string (nullable = true)
 |    |-- Tag.displayName: string (nullable = true)
 |-- version: string (nullable = true)


+---+----+------+--------------------+--------------------+---------------+-------+
| Id|Type|Tag_Id|Tag_Location_LocName|Tag_Location_LocCode|Tag_displayName|version|
+---+----+------+--------------------+--------------------+---------------+-------+
|001|Work|  a123|               Astro|                 AST|             Al|      2|
|001|Work|  e789|              Cosmos|                 COS|            Tom|      2|
+---+----+------+--------------------+--------------------+---------------+-------+

So far we have had success using explode to de-nest the first level of nesting, and are stuck on the recursive part (and on exploding the flattened children, together with the rest of the attributes, out into new rows). Can anyone share an approach to accomplishing this? A sketch of that first, non-recursive step is shown below for reference.
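A minimal sketch of that first level of de-nesting, assuming the file is loaded with spark.read.json (the path and variable names here are illustrative):

from pyspark.sql.functions import col, explode

df = spark.read.json("input.json")  # hypothetical path

# Explode the top-level "Tag" array into one row per element,
# then lift its scalar fields up to the top level.
first_level = df.withColumn("Tag", explode(col("Tag"))).select(
    "Id",
    "Type",
    col("Tag.Id").alias("Tag_Id"),
    col("Tag.displayName").alias("Tag_displayName"),
    "Tag.Location",  # still a nested array; this is where the recursion is needed
    "version",
)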

There is currently no built-in Spark function that does this. However, I have created a way of achieving it below. One assumption this code makes is that the dictionary you are trying to process is not so large that it cannot be read into memory in one go.
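If your data starts out as a file rather than an in-memory dict, a minimal sketch of the loading step (the file name is hypothetical):

import json

# Per the assumption above: the whole file fits in memory at once.
with open("input.json") as f:
    inputs = json.load(f)

The hard-coded inputs dict below stands in for the result of that load.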

from pyspark.sql import Row

inputs = { 
    "Id" : "001", 
    "Type" : "Work", 
    "Tag" : [
        {
            "Id" : "a123", 
            "Location" : [
                {
                    "LocName" : "Astro", 
                    "LocCode" : "AST"
                }
            ],  
            "displayName" : "Al"
        }, 
        {
            "Id" : "e789", 
            "Location" : [
                {
                    "LocName" : "Cosmos", 
                    "LocCode" : "COS"
                }
            ], 
            "displayName" : "Tom"
        }
    ], 
    "version" : 2
}

# Need to get all possible columns names beforehand
# This is so we can avoid schema conflicts
def get_column_map(input_dict, columns=None, key_stack=None):
  # Use None defaults so repeated calls don't share mutable state.
  if columns is None:
    columns = []
  if key_stack is None:
    key_stack = []
  for k, v in input_dict.items():
    if type(v) is list:
      key_stack.append(k)
      for list_item in v:
        get_column_map(list_item, columns, key_stack)
      key_stack.pop()
    elif type(v) is dict:
      key_stack.append(k)
      get_column_map(v, columns, key_stack)
      key_stack.pop()
    else:
      column_name = "_".join(key_stack + [k])
      columns.append(column_name)
  # Deduplicate the collected names and map each one to None as a placeholder.
  return {name: None for name in set(columns)}
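# For the example input above, get_column_map produces a placeholder dict
# along the lines of (set() ordering is arbitrary, so key order may vary):
# {'Id': None, 'Type': None, 'version': None, 'Tag_Id': None,
#  'Tag_displayName': None, 'Tag_Location_LocName': None,
#  'Tag_Location_LocCode': None}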

# After knowing the column names, I can populate them
# One trick is that you should process all non-dict or list items first
# So you can easily append when you are at the last child in the nest
def process_map(input_dict, column_dict, key_stack=None, rows=None):
  # Use None defaults here as well to avoid shared mutable state.
  if key_stack is None:
    key_stack = []
  if rows is None:
    rows = []
  def order_dict(x):
    if type(x[1]) != list and type(x[1]) != dict: 
      return 1 
    else: 
      return 0
    
  input_dict = sorted(
    input_dict.items(), 
    key=lambda x: order_dict(x), 
    reverse=True
  )
  
  last_child = True
  for k, v in input_dict:
    if type(v) is list:
      last_child = False
      key_stack.append(k)
      for list_item in v:
        process_map(list_item, column_dict, key_stack, rows)
      key_stack.pop()
    elif type(v) is dict:
      last_child = False
      key_stack.append(k)
      process_map(v, column_dict, key_stack, rows)
      key_stack.pop()
    else:
      column_name = "_".join(key_stack + [k])
      column_dict[column_name] = v
  if last_child:
    rows.append(Row(**column_dict))
  return rows


# Can put this in a main or leave it in a functional way at bottom
mapper = get_column_map(inputs)
rows = process_map(inputs, mapper)
final_df = spark.createDataFrame(rows)
Running this code in my environment, I get a DataFrame with one flattened row per nested Tag entry, matching the target output above.
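One caveat worth noting: on Spark versions before 3.0, Row(**kwargs) sorts the field names alphabetically, so the resulting column order may not match the question. If so, an explicit select (column names taken from the desired output above) restores it:

final_df = final_df.select(
    "Id", "Type", "Tag_Id",
    "Tag_Location_LocName", "Tag_Location_LocCode",
    "Tag_displayName", "version",
)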

Given that you already have your data declared as a Spark DataFrame, we can use it to flatten your schema. You can do this in two steps:

  • Explode the arrays
  • Flatten the structs
from pyspark.sql.types import StructType, StructField, ArrayType
from pyspark.sql.functions import explode_outer

def flatten(df):
    """
    Create a new DataFrame based on a flat schema,
    exploding arrays and flattening structs.
    """
    f_df = df
    select_expr = _array(element=f_df.schema)
    # Keep exploding while there is at least one array left in the schema.
    while "ArrayType(" in f"{f_df.schema}":
        f_df = f_df.selectExpr(select_expr)
        select_expr = _array(element=f_df.schema)
    # Flatten the remaining structs.
    select_expr = flatten_expr(f_df.schema)
    f_df = f_df.selectExpr(select_expr)
    return f_df

def _array(element, root=None):
    """
    Explode arrays into new rows.
    It only explodes one level of arrays at a time.
    """
    el_type = type(element)
    expr = []
    try:
        _path = f"{root + '.' if root else ''}{element.name}"
    except AttributeError:
        _path = ""
    if el_type == StructType:
        for t in element:
            res = _array(t, root)
            expr.extend(res)
    elif el_type == StructField and type(element.dataType) == ArrayType:
        expr.append(f"explode_outer({_path}) as {_path.replace('.', '_')}")
    elif el_type == StructField and type(element.dataType) == StructType:
        expr.extend(_array(element.dataType, _path))
    else:
        expr.append(f"{_path} as {_path.replace('.', '_')}")
    return expr

def flatten_expr(element, root=None):
    """
    Flatten the structs of a DataFrame
    (using '_' between level names).
    It does not work on arrays, so you need to make sure
    there are no arrays left in the input schema.
    """
    expr = []
    el_type = type(element)
    try:
        _path = f"{root + '.' if root else ''}{element.name}"
    except AttributeError:
        _path = ""
    if el_type == StructType:
        for t in element:
            expr.extend(flatten_expr(t, root))
    elif el_type == StructField and type(element.dataType) == StructType:
        expr.extend(flatten_expr(element.dataType, _path))
    elif el_type == StructField and type(element.dataType) == ArrayType:
        # You should explode the arrays first to make sure this never happens.
        expr.extend(flatten_expr(element.dataType.elementType, f"{_path}[0]"))
    else:
        expr.append(f"{_path} as {_path.replace('.', '_')}")
    return expr
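As a quick sanity check on how these helpers build the selectExpr strings, here is what the first pass would produce for the json_test DataFrame constructed below (a hand-derived illustration; spark.read.json orders top-level columns alphabetically):

_array(element=json_test.schema)
# ['Id as Id', 'explode_outer(Tag) as Tag', 'Type as Type', 'version as version']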
    
So we can do the following:

json_test = spark.read.json(sc.parallelize(['{"Id": "001", "Type": "Work", "Tag": [{"Id": "a123", "Location": [{"LocName": "Astro", "LocCode": "AST"}], "displayName": "Al"}, {"Id": "e789", "Location": [{"LocName": "Cosmos", "LocCode": "COS"}], "displayName": "Tom"}], "version": 2}']))
json_test.printSchema()
f_df = flatten(json_test)
f_df.printSchema()
f_df.show()
    
So you get the original schema:

    
root
 |-- Id: string (nullable = true)
 |-- Tag: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Location: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- LocCode: string (nullable = true)
 |    |    |    |    |-- LocName: string (nullable = true)
 |    |    |-- displayName: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- version: long (nullable = true)
    
The new schema:

root
 |-- Id: string (nullable = true)
 |-- Tag_Id: string (nullable = true)
 |-- Tag_Location_LocCode: string (nullable = true)
 |-- Tag_Location_LocName: string (nullable = true)
 |-- Tag_displayName: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- version: long (nullable = true)
    
And the DataFrame:

+---+------+--------------------+--------------------+---------------+----+-------+
| Id|Tag_Id|Tag_Location_LocCode|Tag_Location_LocName|Tag_displayName|Type|version|
+---+------+--------------------+--------------------+---------------+----+-------+
|001|  a123|                 AST|               Astro|             Al|Work|      2|
|001|  e789|                 COS|              Cosmos|            Tom|Work|      2|
+---+------+--------------------+--------------------+---------------+----+-------+
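To run this against your actual file instead of the inline test string, something like the following should work (the path is hypothetical):

df = spark.read.json("/path/to/your/file.json")
flatten(df).show()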
    
I hope this helps you think through your solution.