Recursively flatten children and return the full schema (PySpark)


I have a JSON file that contains nested collections of the same name under the attribute "Tag". The depth of this particular nesting varies. For example:

{ 
    "Id" : "001", 
    "Type" : "Work", 
    "Tag" : [
        {
            "Id" : "a123", 
            "Location" : [
                {
                    "LocName" : "Astro", 
                    "LocCode" : "AST"
                }
            ],  
            "displayName" : "Al"
        }, 
        {
            "Id" : "e789", 
            "Location" : [
                {
                    "LocName" : "Cosmos", 
                    "LocCode" : "COS"
                }
            ], 
            "displayName" : "Tom"
        }
    ], 
    "version" : 2
}
I am trying to recursively flatten the nested children to follow this schema, so that the final output comes out in this form:

root
 |-- Id: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- Tag: struct (nullable = true)
 |    |-- Tag.Id: string (nullable = true)
 |    |-- Tag.Location: struct (nullable = true)
 |    |    |-- Location.LocName: string (nullable = true)
 |    |    |-- Location.LocCode: string (nullable = true)
 |    |-- Tag.displayName: string (nullable = true)
 |-- version: string (nullable = true)


+---+----+------+--------------------+--------------------+---------------+-------+
| Id|Type|Tag_Id|Tag_Location_LocName|Tag_Location_LocCode|Tag_displayName|version|
+---+----+------+--------------------+--------------------+---------------+-------+
|001|Work|  a123|               Astro|                 AST|             Al|      2|
|001|Work|  e789|              Cosmos|                 COS|            Tom|      2|
+---+----+------+--------------------+--------------------+---------------+-------+

So far we have had success using explode to de-nest the first level of nesting, and are stuck on the recursive part (and on exploding the flattened children, together with the rest of the attributes, out into new rows). Can anyone share an approach to accomplishing this? A sketch of that first, non-recursive step is shown below for reference.
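A minimal sketch of that first level of de-nesting, assuming the file is loaded with spark.read.json (the path and variable names here are illustrative):

from pyspark.sql.functions import col, explode

df = spark.read.json("input.json")  # hypothetical path

# Explode the top-level "Tag" array into one row per element,
# then lift its scalar fields up to the top level.
first_level = df.withColumn("Tag", explode(col("Tag"))).select(
    "Id",
    "Type",
    col("Tag.Id").alias("Tag_Id"),
    col("Tag.displayName").alias("Tag_displayName"),
    "Tag.Location",  # still a nested array; this is where the recursion is needed
    "version",
)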

There is currently no built-in Spark function that does this. However, I have created a way of achieving it below. One assumption this code makes is that the dictionary you are trying to process is not so large that it cannot be read into memory in one go.
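If your data starts out as a file rather than an in-memory dict, a minimal sketch of the loading step (the file name is hypothetical):

import json

# Per the assumption above: the whole file fits in memory at once.
with open("input.json") as f:
    inputs = json.load(f)

The hard-coded inputs dict below stands in for the result of that load.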

from pyspark.sql import Row

inputs = { 
    "Id" : "001", 
    "Type" : "Work", 
    "Tag" : [
        {
            "Id" : "a123", 
            "Location" : [
                {
                    "LocName" : "Astro", 
                    "LocCode" : "AST"
                }
            ],  
            "displayName" : "Al"
        }, 
        {
            "Id" : "e789", 
            "Location" : [
                {
                    "LocName" : "Cosmos", 
                    "LocCode" : "COS"
                }
            ], 
            "displayName" : "Tom"
        }
    ], 
    "version" : 2
}

# Need to get all possible columns names beforehand
# This is so we can avoid schema conflicts
def get_column_map(input_dict, columns=None, key_stack=None):
  # Use None defaults so repeated calls don't share mutable state.
  if columns is None:
    columns = []
  if key_stack is None:
    key_stack = []
  for k, v in input_dict.items():
    if type(v) is list:
      key_stack.append(k)
      for list_item in v:
        get_column_map(list_item, columns, key_stack)
      key_stack.pop()
    elif type(v) is dict:
      key_stack.append(k)
      get_column_map(v, columns, key_stack)
      key_stack.pop()
    else:
      column_name = "_".join(key_stack + [k])
      columns.append(column_name)
  # Deduplicate the collected names and map each one to None as a placeholder.
  return {name: None for name in set(columns)}
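# For the example input above, get_column_map produces a placeholder dict
# along the lines of (set() ordering is arbitrary, so key order may vary):
# {'Id': None, 'Type': None, 'version': None, 'Tag_Id': None,
#  'Tag_displayName': None, 'Tag_Location_LocName': None,
#  'Tag_Location_LocCode': None}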

# After knowing the column names, I can populate them
# One trick is that you should process all non-dict or list items first
# So you can easily append when you are at the last child in the nest
def process_map(input_dict, column_dict, key_stack=None, rows=None):
  # Use None defaults here as well to avoid shared mutable state.
  if key_stack is None:
    key_stack = []
  if rows is None:
    rows = []
  def order_dict(x):
    if type(x[1]) != list and type(x[1]) != dict: 
      return 1 
    else: 
      return 0
    
  input_dict = sorted(
    input_dict.items(), 
    key=lambda x: order_dict(x), 
    reverse=True
  )
  
  last_child = True
  for k, v in input_dict:
    if type(v) is list:
      last_child = False
      key_stack.append(k)
      for list_item in v:
        process_map(list_item, column_dict, key_stack, rows)
      key_stack.pop()
    elif type(v) is dict:
      last_child = False
      key_stack.append(k)
      process_map(v, column_dict, key_stack, rows)
      key_stack.pop()
    else:
      column_name = "_".join(key_stack + [k])
      column_dict[column_name] = v
  if last_child:
    rows.append(Row(**column_dict))
  return rows


# Can put this in a main or leave it in a functional way at bottom
mapper = get_column_map(inputs)
rows = process_map(inputs, mapper)
final_df = spark.createDataFrame(rows)
Running this code in my environment, I get a DataFrame with one flattened row per nested Tag entry, matching the target output above.
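One caveat worth noting: on Spark versions before 3.0, Row(**kwargs) sorts the field names alphabetically, so the resulting column order may not match the question. If so, an explicit select (column names taken from the desired output above) restores it:

final_df = final_df.select(
    "Id", "Type", "Tag_Id",
    "Tag_Location_LocName", "Tag_Location_LocCode",
    "Tag_displayName", "version",
)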

Given that you already have your data declared as a Spark DataFrame, we can use it to flatten your schema. You can do this in two steps:

  • Explode the arrays
  • Flatten the structs
from pyspark.sql.types import StructType, StructField, ArrayType
from pyspark.sql.functions import explode_outer

def flatten(df):
    """
    Create a new DataFrame based on a flat schema,
    exploding arrays and flattening structs.
    """
    f_df = df
    select_expr = _array(element=f_df.schema)
    # Keep exploding while there is at least one array left in the schema.
    while "ArrayType(" in f"{f_df.schema}":
        f_df = f_df.selectExpr(select_expr)
        select_expr = _array(element=f_df.schema)
    # Flatten the remaining structs.
    select_expr = flatten_expr(f_df.schema)
    f_df = f_df.selectExpr(select_expr)
    return f_df

def _array(element, root=None):
    """
    Explode arrays into new rows.
    It only explodes one level of arrays at a time.
    """
    el_type = type(element)
    expr = []
    try:
        _path = f"{root + '.' if root else ''}{element.name}"
    except AttributeError:
        _path = ""
    if el_type == StructType:
        for t in element:
            res = _array(t, root)
            expr.extend(res)
    elif el_type == StructField and type(element.dataType) == ArrayType:
        expr.append(f"explode_outer({_path}) as {_path.replace('.', '_')}")
    elif el_type == StructField and type(element.dataType) == StructType:
        expr.extend(_array(element.dataType, _path))
    else:
        expr.append(f"{_path} as {_path.replace('.', '_')}")
    return expr

def flatten_expr(element, root=None):
    """
    Flatten the structs of a DataFrame
    (using '_' between level names).
    It does not work on arrays, so you need to make sure
    there are no arrays left in the input schema.
    """
    expr = []
    el_type = type(element)
    try:
        _path = f"{root + '.' if root else ''}{element.name}"
    except AttributeError:
        _path = ""
    if el_type == StructType:
        for t in element:
            expr.extend(flatten_expr(t, root))
    elif el_type == StructField and type(element.dataType) == StructType:
        expr.extend(flatten_expr(element.dataType, _path))
    elif el_type == StructField and type(element.dataType) == ArrayType:
        # You should explode the arrays first to make sure this never happens.
        expr.extend(flatten_expr(element.dataType.elementType, f"{_path}[0]"))
    else:
        expr.append(f"{_path} as {_path.replace('.', '_')}")
    return expr
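As a quick sanity check on how these helpers build the selectExpr strings, here is what the first pass would produce for the json_test DataFrame constructed below (a hand-derived illustration; spark.read.json orders top-level columns alphabetically):

_array(element=json_test.schema)
# ['Id as Id', 'explode_outer(Tag) as Tag', 'Type as Type', 'version as version']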
    
So we can do the following:

json_test = spark.read.json(sc.parallelize(['{"Id": "001", "Type": "Work", "Tag": [{"Id": "a123", "Location": [{"LocName": "Astro", "LocCode": "AST"}], "displayName": "Al"}, {"Id": "e789", "Location": [{"LocName": "Cosmos", "LocCode": "COS"}], "displayName": "Tom"}], "version": 2}']))
json_test.printSchema()
f_df = flatten(json_test)
f_df.printSchema()
f_df.show()
    
So you get the original schema:

    
root
 |-- Id: string (nullable = true)
 |-- Tag: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Location: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- LocCode: string (nullable = true)
 |    |    |    |    |-- LocName: string (nullable = true)
 |    |    |-- displayName: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- version: long (nullable = true)
    
The new schema:

root
 |-- Id: string (nullable = true)
 |-- Tag_Id: string (nullable = true)
 |-- Tag_Location_LocCode: string (nullable = true)
 |-- Tag_Location_LocName: string (nullable = true)
 |-- Tag_displayName: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- version: long (nullable = true)
    
And the DataFrame:

+---+------+--------------------+--------------------+---------------+----+-------+
| Id|Tag_Id|Tag_Location_LocCode|Tag_Location_LocName|Tag_displayName|Type|version|
+---+------+--------------------+--------------------+---------------+----+-------+
|001|  a123|                 AST|               Astro|             Al|Work|      2|
|001|  e789|                 COS|              Cosmos|            Tom|Work|      2|
+---+------+--------------------+--------------------+---------------+----+-------+
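To run this against your actual file instead of the inline test string, something like the following should work (the path is hypothetical):

df = spark.read.json("/path/to/your/file.json")
flatten(df).show()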
    
I hope this helps you think through your solution.