
Python: reading a large number of JSON files (3K+) from S3 and selecting specific keys from an array

I need to read multiple JSON files (3K+) stored in S3, all of which have the same structure. The structure is very large and nested. Within these files is an array containing objects made up of key:value pairs. I need to select some of those keys and write the values into a PySpark dataframe. I am writing the code in AWS Glue with PySpark/Python 3.

So far I have tried creating a dataframe from the S3 files and then inferring the schema. I am not sure that this is correct, or that it is the most efficient approach. I am also not sure where to go next to find the Products array and pull a few keys out of it.

# Read all of the JSON files under the S3 prefix into a dataframe
json_data_frame = spark.read.json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

# Attempt to infer a schema by re-reading each row (this assumes the dataframe has a column named "json")
json_schema = spark.read.json(json_data_frame.rdd.map(lambda row: row.json)).schema
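Since all 3K+ files share the same structure, one way to avoid inferring the schema over every file is to infer it once from a single representative file and reuse it for the full read. A minimal sketch, where sample_path is a hypothetical placeholder for one of the JSON files:

# Infer the schema once from a single representative file (sample_path is a hypothetical placeholder)
sample_schema = spark.read.json(sample_path).schema

# Reuse that schema for the full read so Spark does not have to scan every file to infer it again
json_data_frame = spark.read.schema(sample_schema).json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])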
The result I want is a dataframe with one column per key from the array, holding all of the values from across the S3 files.
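To find where the products array sits inside the nested structure, printing the inferred schema is a reasonable first step. A minimal sketch (the field names mentioned here are assumptions taken from the later edits):

# Print the inferred schema to locate the nested "products" array and its fields
json_data_frame.printSchema()

The relevant portion should show products as an array of structs, something along the lines of products: array&lt;struct&lt;name, ndc_product_code, dosage_form, strength, ...&gt;&gt;.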

Edit: I have gotten a bit further:

# Read the JSON files, allowing records to span multiple lines
json_data_frame = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

# Pull the "name" and "ndc_product_code" fields out of the products array
final_data_frame_prep = json_data_frame.withColumn("name", json_data_frame["products"].getItem("name"))\
                                       .withColumn("ndc_product_code", json_data_frame["products"].getItem("ndc_product_code"))

final_data_frame = final_data_frame_prep.select("name","ndc_product_code")

final_data_frame.show(20,False)
Where I am at now: the dataframe is being created, but, as I suspected, every value is a list, some with one item and some with several. This is because products is an array of structs, so products.getItem("name") returns the whole array of name values for each record rather than a single value. I now need to split those lists into separate rows. If you have any suggestions, I am happy to hear them. The current dataframe:

+------------------+----------------------+
|name              |ndc_product_code      |
+------------------+----------------------+
|[Refludan]        |[50419-150]           |
|[Erbitux, Erbitux]|[66733-948, 66733-958]|
+------------------+----------------------+
Edit 2:

from pyspark.sql.functions import explode

# Read the JSON files, allowing records to span multiple lines
json_data_frame = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

# Explode each field's array separately (this produces a cross product, as the result below shows)
final_data_frame_prep = json_data_frame.withColumn("name", explode(json_data_frame["products"].getItem("name")))\
                                       .withColumn("ndc_product_code", explode(json_data_frame["products"].getItem("ndc_product_code")))\
                                       .withColumn("dosage_form", explode(json_data_frame["products"].getItem("dosage_form")))\
                                       .withColumn("strength", explode(json_data_frame["products"].getItem("strength")))

final_data_frame = final_data_frame_prep.select("name","ndc_product_code","dosage_form","strength")

final_data_frame.show(20,False)
I was able to add an explode to the code, along with the remaining two columns, but I am seeing duplication in the dataframe, as if the lists were matched against every possible combination rather than against the object in the array that the keys came from. The dataframe is now:

+--------+----------------+-----------+---------+
|name    |ndc_product_code|dosage_form|strength |
+--------+----------------+-----------+---------+
|Refludan|50419-150       |Powder     |50 mg/1mL|
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
+--------+----------------+-----------+---------+
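The duplication follows from chaining the explodes: each explode multiplies the rows by the length of its array. Assuming the Erbitux record's four arrays hold 2 elements each, it comes out as 2 x 2 x 2 x 2 = 16 rows, which is exactly what the table above shows, while the Refludan record with single-element arrays stays at 1 row.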
Edit 3: I do not believe explode used this way is what I want. I have reverted the code to Edit 1. The table renders as

+------------------+----------------------+
|name              |ndc_product_code      |
+------------------+----------------------+
|[Refludan]        |[50419-150]           |
|[Erbitux, Erbitux]|[66733-948, 66733-958]|
+------------------+----------------------+
What I want is:

+------------------+----------------------+
|name              |ndc_product_code      |
+------------------+----------------------+
|[Refludan]        |[50419-150]           |
|[Erbitux]         |[66733-948]           |
|[Erbitux]         |[66733-958]           |
+------------------+----------------------+
Is there a way to do this, i.e. match on position within the arrays and create new rows on that basis?
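One way to keep elements matched by position is to explode the products array itself rather than each field's array separately, so every element of the array becomes its own row with its fields still together. A minimal sketch using posexplode, which also keeps the array index, assuming products is an array of structs as the schema suggests:

from pyspark.sql.functions import posexplode

# Each element of "products" becomes one row; "pos" is its index within the array
exploded = json_data_frame.select(posexplode(json_data_frame["products"]).alias("pos", "product"))
exploded.select("pos", "product.name", "product.ndc_product_code").show(20, False)

The answer below does essentially the same thing with a plain explode.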

I figured it out.

+--------+----------------+-----------+---------+
|name    |ndc_product_code|dosage_form|strength |
+--------+----------------+-----------+---------+
|Refludan|50419-150       |Powder     |50 mg/1mL|
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
+--------+----------------+-----------+---------+
The code is:

from pyspark.sql.functions import explode

# Read in the json files from s3
json_data_frame = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

# Explode the products array so that each element (one product struct) becomes its own row
final_data_frame_prepprep = json_data_frame.withColumn("products_exp", explode(json_data_frame["products"]))

# Pull the individual fields out of the exploded struct
final_data_frame_prep = final_data_frame_prepprep.withColumn("name", final_data_frame_prepprep["products_exp"].getItem("name"))\
                                                 .withColumn("ndc_product_code", final_data_frame_prepprep["products_exp"].getItem("ndc_product_code"))\
                                                 .withColumn("dosage_form", final_data_frame_prepprep["products_exp"].getItem("dosage_form"))\
                                                 .withColumn("strength", final_data_frame_prepprep["products_exp"].getItem("strength"))

# Keep only the columns of interest
final_data_frame = final_data_frame_prep.select("name","ndc_product_code","dosage_form","strength")

final_data_frame.show(20,False)
The key was to explode the array as a whole, then get the items out of it, and then select what you want to keep. I hope this helps someone else. Cheers!
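For what it is worth, the same result can be written a bit more compactly by exploding inside a select and reading the struct fields directly; a sketch that should be equivalent in behavior to the code above:

from pyspark.sql.functions import explode

# Explode the products array, then address the struct's fields by name
final_data_frame = json_data_frame.select(explode("products").alias("p"))\
                                  .select("p.name", "p.ndc_product_code", "p.dosage_form", "p.strength")

final_data_frame.show(20, False)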
