PySpark with AWS Glue: joining a 1-N relationship into a JSON array

I don't know how to join a 1-N relationship in AWS Glue and export the result as a JSON file like this:

{"id": 123, "name": "John Doe", "profiles": [ {"id": 1111, "channel": "twitter"}, {"id": 2222, "channel": "twitter"}, {"id": 3333, "channel": "instagram"} ]}
{"id": 345, "name": "Test", "profiles": []}

The profiles JSON array should be built from the other tables. I would also like to add the channel column to each profile.

The 3 tables in the AWS Glue Data Catalog are:

person_json

{"id": 123, "name": "John Doe"}
{"id": 345, "name": "Test"}

instagram_json

{"id": 3333, "person_id": 123}
{"id": 3333, "person_id": null}

twitter_json

{"id": 1111, "person_id": 123}
{"id": 2222, "person_id": 123}
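
To pin down the target shape, the mapping from the three tables above to the desired output can be sketched in plain Python, without Spark or Glue. This is only an illustration of the expected result, not the Glue job itself. The helper name `nest_profiles` is hypothetical, and the sketch assumes that the person rows carry a `name` field (as in the desired output) and that child rows with a null `person_id` are simply dropped.

```python
import json
from collections import defaultdict

# Sample rows copied from the three tables (null becomes None in Python).
person = [{"id": 123, "name": "John Doe"}, {"id": 345, "name": "Test"}]
instagram = [{"id": 3333, "person_id": 123}, {"id": 3333, "person_id": None}]
twitter = [{"id": 1111, "person_id": 123}, {"id": 2222, "person_id": 123}]


def nest_profiles(person_rows, channels):
    """Attach a 'profiles' list to each person (hypothetical helper).

    channels maps a channel name to the rows of the matching child table.
    """
    by_person = defaultdict(list)
    for channel, rows in channels.items():
        for row in rows:
            # Assumption: rows that point at no person are skipped.
            if row["person_id"] is not None:
                by_person[row["person_id"]].append(
                    {"id": row["id"], "channel": channel}
                )
    # People without any profile get an empty list, as in the desired output.
    return [{**p, "profiles": by_person.get(p["id"], [])} for p in person_rows]


result = nest_profiles(person, {"twitter": twitter, "instagram": instagram})

# One JSON document per line, the same layout Spark's JSON writer produces.
for record in result:
    print(json.dumps(record))
```

The point of the sketch is that the N side has to be tagged with its channel and grouped per `person_id` before it is attached to the person rows. A plain join of the three tables cannot produce the nested array.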

This is my script so far:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

# Catalog: database and table names
db_name = "test_database"
tbl_person = "person_json"
tbl_instagram = "instagram_json"
tbl_twitter = "twitter_json"

# Create dynamic frames from the source tables
person = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_person)
instagram = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_instagram)
twitter = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_twitter)

# Join the frames
joined_instagram = Join.apply(person, instagram, 'id', 'person_id').drop_fields(['person_id'])
joined_all = Join.apply(joined_instagram, twitter, 'id', 'person_id').drop_fields(['person_id'])

# Write the output to S3 as a single JSON file
output_s3_path = "s3://xxx/xxx/person.json"
output = joined_all.toDF().repartition(1)
output.write.mode("overwrite").json(output_s3_path)

How can I change the script to get the desired output?

Thanks

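.show() does not display the full structure of the structs in the profiles field, so the rows are printed with collect() instead: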

print(joined_all.collect())
[Row(id=123, name='John Doe', profiles=[Row(id=1111, channel='twitter'), Row(id=2222, channel='twitter'), Row(id=3333, channel='instagram')]), Row(id=345, name='Test', profiles=None)]
