Arrays: How to flatten an array inside nested JSON in AWS Glue using PySpark?

I am trying to flatten a JSON file so I can load it into PostgreSQL, all in AWS Glue. I am using PySpark. I use a crawler to crawl the JSON in S3 and generate a table. Then I use a Glue ETL script to:

  • read the crawled table
  • flatten the file with the "Relationalize" transform
  • convert the dynamic frame to a data frame
  • try to "explode" the request.data field

The script so far:

from pyspark.sql.functions import explode, col

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")

df0 = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")

df1 = df0.select(dfc_root_table_name)  # pick the root table out of the DynamicFrameCollection

df2 = df1.toDF()

df2 = df2.select(explode(col('`request.data`')).alias("request_data"))

<then I write the frame to a PostgreSQL database, which works fine>

Once you have relationalized the JSON column, you don't need to explode it. Relationalize transforms the nested JSON into key-value pairs at the outermost level of the JSON document. The transformed data keeps the original keys from the nested JSON, separated by periods.

Example:

Nested JSON:

{
    "player": {
        "username": "user1",
        "characteristics": {
            "race": "Human",
            "class": "Warlock",
            "subclass": "Dawnblade",
            "power": 300,
            "playercountry": "USA"
        },
        "arsenal": {
            "kinetic": {
                "name": "Sweet Business",
                "type": "Auto Rifle",
                "power": 300,
                "element": "Kinetic"
            },
            "energy": {
                "name": "MIDA Mini-Tool",
                "type": "Submachine Gun",
                "power": 300,
                "element": "Solar"
            },
            "power": {
                "name": "Play of the Game",
                "type": "Grenade Launcher",
                "power": 300,
                "element": "Arc"
            }
        },
        "armor": {
            "head": "Eye of Another World",
            "arms": "Philomath Gloves",
            "chest": "Philomath Robes",
            "leg": "Philomath Boots",
            "classitem": "Philomath Bond"
        },
        "location": {
            "map": "Titan",
            "waypoint": "The Rig"
        }
    }
}
The flattened JSON after relationalizing:

{
    "player.username": "user1",
    "player.characteristics.race": "Human",
    "player.characteristics.class": "Warlock",
    "player.characteristics.subclass": "Dawnblade",
    "player.characteristics.power": 300,
    "player.characteristics.playercountry": "USA",
    "player.arsenal.kinetic.name": "Sweet Business",
    "player.arsenal.kinetic.type": "Auto Rifle",
    "player.arsenal.kinetic.power": 300,
    "player.arsenal.kinetic.element": "Kinetic",
    "player.arsenal.energy.name": "MIDA Mini-Tool",
    "player.arsenal.energy.type": "Submachine Gun",
    "player.arsenal.energy.power": 300,
    "player.arsenal.energy.element": "Solar",
    "player.arsenal.power.name": "Play of the Game",
    "player.arsenal.power.type": "Grenade Launcher",
    "player.arsenal.power.power": 300,
    "player.arsenal.power.element": "Arc",
    "player.armor.head": "Eye of Another World",
    "player.armor.arms": "Philomath Gloves",
    "player.armor.chest": "Philomath Robes",
    "player.armor.leg": "Philomath Boots",
    "player.armor.classitem": "Philomath Bond",
    "player.location.map": "Titan",
    "player.location.waypoint": "The Rig"
}
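Conceptually, what Relationalize does to nested structs can be sketched in plain Python: every leaf value is promoted to a top-level key whose name joins the path with periods. This is an illustration only (the `flatten` helper is my own, and Glue performs the real transform on DynamicFrames, not dicts):

```python
# Sketch of Relationalize's struct flattening: promote every leaf value
# to a top-level key whose name joins the nesting path with periods.
def flatten(obj, prefix=""):
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))   # recurse into nested structs
        else:
            flat[path] = value                  # leaf: keep the dotted key
    return flat

nested = {"player": {"username": "user1",
                     "characteristics": {"race": "Human", "power": 300}}}
print(flatten(nested))
# {'player.username': 'user1', 'player.characteristics.race': 'Human',
#  'player.characteristics.power': 300}
```

Note that arrays are not handled here, which is exactly the limitation discussed below: Relationalize pivots arrays out into separate tables rather than dotted keys.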
So in your case, request.data is already a new column flattened out of the request column, and its type is interpreted by Spark as bigint.

Reference:

It will replace all the periods with underscores. Note that it uses explode_outer and not explode, so that null values are included when the array itself is null. explode_outer is only available in Spark v2.4+.
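The distinction matters whenever an array can be null or empty. In plain-Python terms (function names mine, shown for one row at a time), the semantics look roughly like this:

```python
# Plain-Python sketch of explode vs explode_outer semantics on one row.
# explode drops the row entirely when the array is null/empty;
# explode_outer emits one row with None so the other columns survive.
def explode(row, array_col):
    values = row.get(array_col) or []
    return [{**row, array_col: v} for v in values]

def explode_outer(row, array_col):
    rows = explode(row, array_col)
    return rows if rows else [{**row, array_col: None}]

row = {"id": 1, "data": None}
print(explode(row, "data"))        # [] -> the row is lost
print(explode_outer(row, "data"))  # [{'id': 1, 'data': None}] -> row kept
```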

Also keep in mind that exploding arrays adds more rows, while flattening structs adds more columns. In short, your original df will blow up both horizontally and vertically, which can slow down processing of the data later.
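To see the vertical blow-up concretely: exploding two independent array columns in the same row yields their cross product. A pure-Python illustration (field names are made up for the example):

```python
from itertools import product

# Exploding every array column in one row yields the cross product of
# the arrays: 3 kinetic mods x 2 energy mods -> 6 rows from 1 row.
row = {"player": "user1",
       "kinetic_mods": ["a", "b", "c"],
       "energy_mods": ["x", "y"]}
exploded = [{"player": row["player"], "kinetic_mods": k, "energy_mods": e}
            for k, e in product(row["kinetic_mods"], row["energy_mods"])]
print(len(exploded))  # 6
```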


So my recommendation is to identify the data that is relevant to your features, store only that data in PostgreSQL, and keep the original JSON files in S3.

True, but the problem is that there is an array inside the JSON structure (request.data) that needs to be flattened. Otherwise it just returns a bigint of 1 (i.e. it ignores the actual data inside it), which is incorrect. "Relationalize" works great otherwise. — @charlesperry, you are right. Relationalize only works on the outermost level of the JSON, and the documentation should be explicit about that. I am still trying to figure out the best way to relationalize a JSON file with five levels of nested arrays and structs. — This works for most of my JSON files, but when a struct/array is NULL I get the error "No such struct field style in list.ordered"; how do I handle the null condition? Is it possible to use the posexplode_outer function instead of explode_outer? — Nice one; the first line of the columns in the nested column needs to be removed.
import pyspark.sql.functions as F

# Flatten a nested df: explode array columns, expand struct columns, recurse
def flatten_df(nested_df):
    # Explode every array column (explode_outer keeps rows whose array is null)
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for col in array_cols:
        nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))

    # Collect struct columns; if none remain, flattening is done
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df

    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']

    # Promote each struct field to a top-level column, joining names with '_'
    flat_df = nested_df.select(flat_cols +
                            [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])

    return flatten_df(flat_df)

df = flatten_df(df)