
Python: reading a large number of JSON files (3K+) from S3 and selecting specific keys from an array

I need to read multiple JSON files (3K+) stored in S3, all of which have the same structure. The structure is very large and nested. Within these files is an array containing objects made up of key:value pairs. I need to select some of those keys and write the values into a PySpark dataframe. I am writing the code in AWS Glue with PySpark/Python 3.

So far I have tried creating a dataframe from the S3 files and then inferring the schema. I am not sure that this is correct, or that it is the most efficient approach. I am also not sure where to go next to find the Products array and pull a few keys out of it.

# Read all of the JSON files under the S3 prefix into a dataframe
json_data_frame = spark.read.json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

# Attempt to infer a schema by re-reading each row (this assumes the dataframe has a column named "json")
json_schema = spark.read.json(json_data_frame.rdd.map(lambda row: row.json)).schema
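Since all 3K+ files share the same structure, one way to avoid inferring the schema over every file is to infer it once from a single representative file and reuse it for the full read. A minimal sketch, where sample_path is a hypothetical placeholder for one of the JSON files:

# Infer the schema once from a single representative file (sample_path is a hypothetical placeholder)
sample_schema = spark.read.json(sample_path).schema

# Reuse that schema for the full read so Spark does not have to scan every file to infer it again
json_data_frame = spark.read.schema(sample_schema).json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])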
The result I want is a dataframe with one column per key from the array, holding all of the values from across the S3 files.
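To find where the products array sits inside the nested structure, printing the inferred schema is a reasonable first step. A minimal sketch (the field names mentioned here are assumptions taken from the later edits):

# Print the inferred schema to locate the nested "products" array and its fields
json_data_frame.printSchema()

The relevant portion should show products as an array of structs, something along the lines of products: array&lt;struct&lt;name, ndc_product_code, dosage_form, strength, ...&gt;&gt;.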

Edit: I have gotten a bit further:

# Read the JSON files, allowing records to span multiple lines
json_data_frame = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

# Pull the "name" and "ndc_product_code" fields out of the products array
final_data_frame_prep = json_data_frame.withColumn("name", json_data_frame["products"].getItem("name"))\
                                       .withColumn("ndc_product_code", json_data_frame["products"].getItem("ndc_product_code"))

final_data_frame = final_data_frame_prep.select("name","ndc_product_code")

final_data_frame.show(20,False)
Where I am at now: the dataframe is being created, but, as I suspected, every value is a list, some with one item and some with several. This is because products is an array of structs, so products.getItem("name") returns the whole array of name values for each record rather than a single value. I now need to split those lists into separate rows. If you have any suggestions, I am happy to hear them. The current dataframe:

+------------------+----------------------+
|name              |ndc_product_code      |
+------------------+----------------------+
|[Refludan]        |[50419-150]           |
|[Erbitux, Erbitux]|[66733-948, 66733-958]|
+------------------+----------------------+
Edit 2:

from pyspark.sql.functions import explode

# Read the JSON files, allowing records to span multiple lines
json_data_frame = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

# Explode each field's array separately (this produces a cross product, as the result below shows)
final_data_frame_prep = json_data_frame.withColumn("name", explode(json_data_frame["products"].getItem("name")))\
                                       .withColumn("ndc_product_code", explode(json_data_frame["products"].getItem("ndc_product_code")))\
                                       .withColumn("dosage_form", explode(json_data_frame["products"].getItem("dosage_form")))\
                                       .withColumn("strength", explode(json_data_frame["products"].getItem("strength")))

final_data_frame = final_data_frame_prep.select("name","ndc_product_code","dosage_form","strength")

final_data_frame.show(20,False)
I was able to add an explode to the code, along with the remaining two columns, but I am seeing duplication in the dataframe, as if the lists were matched against every possible combination rather than against the object in the array that the keys came from. The dataframe is now:

+--------+----------------+-----------+---------+
|name    |ndc_product_code|dosage_form|strength |
+--------+----------------+-----------+---------+
|Refludan|50419-150       |Powder     |50 mg/1mL|
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
+--------+----------------+-----------+---------+
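The duplication follows from chaining the explodes: each explode multiplies the rows by the length of its array. Assuming the Erbitux record's four arrays hold 2 elements each, it comes out as 2 x 2 x 2 x 2 = 16 rows, which is exactly what the table above shows, while the Refludan record with single-element arrays stays at 1 row.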
Edit 3: I do not believe explode used this way is what I want. I have reverted the code to Edit 1. The table renders as

+------------------+----------------------+
|name              |ndc_product_code      |
+------------------+----------------------+
|[Refludan]        |[50419-150]           |
|[Erbitux, Erbitux]|[66733-948, 66733-958]|
+------------------+----------------------+
What I want is:

+------------------+----------------------+
|name              |ndc_product_code      |
+------------------+----------------------+
|[Refludan]        |[50419-150]           |
|[Erbitux]         |[66733-948]           |
|[Erbitux]         |[66733-958]           |
+------------------+----------------------+
Is there a way to do this, i.e. match on position within the arrays and create new rows on that basis?
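One way to keep elements matched by position is to explode the products array itself rather than each field's array separately, so every element of the array becomes its own row with its fields still together. A minimal sketch using posexplode, which also keeps the array index, assuming products is an array of structs as the schema suggests:

from pyspark.sql.functions import posexplode

# Each element of "products" becomes one row; "pos" is its index within the array
exploded = json_data_frame.select(posexplode(json_data_frame["products"]).alias("pos", "product"))
exploded.select("pos", "product.name", "product.ndc_product_code").show(20, False)

The answer below does essentially the same thing with a plain explode.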

I figured it out.

+--------+----------------+-----------+---------+
|name    |ndc_product_code|dosage_form|strength |
+--------+----------------+-----------+---------+
|Refludan|50419-150       |Powder     |50 mg/1mL|
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
+--------+----------------+-----------+---------+
The code is:

from pyspark.sql.functions import explode

# Read in the json files from s3
json_data_frame = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

# Explode the products array so that each element (one product struct) becomes its own row
final_data_frame_prepprep = json_data_frame.withColumn("products_exp", explode(json_data_frame["products"]))

# Pull the individual fields out of the exploded struct
final_data_frame_prep = final_data_frame_prepprep.withColumn("name", final_data_frame_prepprep["products_exp"].getItem("name"))\
                                                 .withColumn("ndc_product_code", final_data_frame_prepprep["products_exp"].getItem("ndc_product_code"))\
                                                 .withColumn("dosage_form", final_data_frame_prepprep["products_exp"].getItem("dosage_form"))\
                                                 .withColumn("strength", final_data_frame_prepprep["products_exp"].getItem("strength"))

# Keep only the columns of interest
final_data_frame = final_data_frame_prep.select("name","ndc_product_code","dosage_form","strength")

final_data_frame.show(20,False)
The key was to explode the array as a whole, then get the items out of it, and then select what you want to keep. I hope this helps someone else. Cheers!
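For what it is worth, the same result can be written a bit more compactly by exploding inside a select and reading the struct fields directly; a sketch that should be equivalent in behavior to the code above:

from pyspark.sql.functions import explode

# Explode the products array, then address the struct's fields by name
final_data_frame = json_data_frame.select(explode("products").alias("p"))\
                                  .select("p.name", "p.ndc_product_code", "p.dosage_form", "p.strength")

final_data_frame.show(20, False)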
