Python / Apache Spark: aggregating a JSONL DataFrame while preserving null values
I'm new to Spark, so I may need your help. I have some similar JSONL files in an S3 bucket, and each record contains a field with the name of its original file (more on this later). Since I read all of the bucket's files into a single DataFrame, I lose the relationship between a record and its original file, which is why I need that field:
df = spark.read.json('s3://bucket/prefix')
I then applied some transformations and obtained an updated, enriched dataset with some additional fields added to some of the records:
df2 = df.rdd.map(lambda x: my_transformations(x))
{"additional_field": "blabla", "id": "abc-123-abc", "first_name": "John", "last_name": "Simonis", "s3_original_file": "s3://bucket/prefix/file1.jsonl"}
{"additional_field": "blabla", "id": "def-563-abc", "first_name": "Mary", "last_name": "Culkin", "s3_original_file": "s3://bucket/prefix/file1.jsonl"}
{"additional_field": "blabla", "id": "abc-532-def", "first_name": "James", "s3_original_file": "s3://bucket/prefix/file2.jsonl"}
{"id": "abc-445-abc", "first_name": "Fiona", "last_name": "Goodwill", "s3_original_file": "s3://bucket/prefix/file3.jsonl"}
{"additional_field": "blabla", "id": "abc-167-def", "last_name": "Matz", "s3_original_file": "s3://bucket/prefix/file4.jsonl"}
{"additional_field": "blabla", "id": "ghj-134-abc", "first_name": "Adam", "last_name": "Gleason", "s3_original_file": "s3://bucket/prefix/file4.jsonl"}
{"id": "abc-523-abc", "first_name": "Phil", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file4.jsonl"}
{"additional_field": "blabla", "id": "ghj-823-abc", "first_name": "Jack", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file5.jsonl"}
{"id": "abc-128-abc", "first_name": "Mary", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}
{"additional_field": "blabla", "id": "abc-124-ghj", "last_name": "Foster", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}
{"additional_field": "blabla", "id": "ghj-133-abc", "first_name": "Julius", "last_name": "Bull", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}
{"additional_field": "blabla", "id": "abc-723-abc", "first_name": "Gareth", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file7.jsonl"}
Then I need to group them back together by "s3_original_file".
Since I need to rewrite the files while keeping the original association (I can't use df.write.json because I would lose that association; I will do it with df.foreach() and boto3 inside the lambda), I build up the aggregation expressions by iterating over the column names, skipping the grouping column:
from pyspark.sql import functions as F

def fetch_columns(dataframe, grouping):
    # Build a collect_list aggregation for every column except the grouping one
    output = []
    for column in dataframe.columns:
        if column != grouping:
            output.append(F.collect_list(column).alias(column))
    return output

grouped = df2.groupBy('s3_original_file')
resultDF = grouped.agg(*fetch_columns(df2, 's3_original_file'))
Then I need to save the rows of the resulting DataFrame as JSON lines back into their specific files, which would happen inside a save_back_to_s3 function:
resultDF.foreach(lambda x: save_back_to_s3(x))
The problem is:
I get an aggregated list of values per column, whereas I would like a single column containing the list of grouped rows. Also, nothing takes the possible null values into account, which scrambles the ordering. I would like the nulls to be preserved, so that I know the data is missing:
>>> resultDF.show(20, False)
+------------------------------+----------------+--------------+---------------------------------------+----------------------+
|s3_original_file |additional_field|first_name |id |last_name |
+------------------------------+----------------+--------------+---------------------------------------+----------------------+
|s3://bucket/prefix/file3.jsonl|[] |[Fiona] |[abc-445-abc] |[Goodwill] |
|s3://bucket/prefix/file7.jsonl|[blabla] |[Gareth] |[abc-723-abc] |[Smith] |
|s3://bucket/prefix/file5.jsonl|[blabla] |[Jack] |[ghj-823-abc] |[Smith] |
|s3://bucket/prefix/file4.jsonl|[blabla, blabla]|[Adam, Phil] |[abc-167-def, ghj-134-abc, abc-523-abc]|[Matz, Gleason, Smith]|
|s3://bucket/prefix/file6.jsonl|[blabla, blabla]|[Mary, Julius]|[abc-128-abc, abc-124-ghj, ghj-133-abc]|[Foster, Bull] |
|s3://bucket/prefix/file1.jsonl|[blabla, blabla]|[John, Mary] |[abc-123-abc, def-563-abc] |[Simonis, Culkin] |
|s3://bucket/prefix/file2.jsonl|[blabla] |[James] |[abc-532-def] |[] |
+------------------------------+----------------+--------------+---------------------------------------+----------------------+
Is it possible to produce a DataFrame like this?
+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|s3_original_file |records |
+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|s3://bucket/prefix/file3.jsonl|{"id": "abc-445-abc", "first_name": "Fiona", "last_name": "Goodwill", "s3_original_file": "s3://bucket/prefix/file3.jsonl"} |
|s3://bucket/prefix/file7.jsonl|{"additional_field": "blabla", "id": "abc-723-abc", "first_name": "Gareth", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file7.jsonl"} |
|s3://bucket/prefix/file5.jsonl|{"additional_field": "blabla", "id": "ghj-823-abc", "first_name": "Jack", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file5.jsonl"} |
|s3://bucket/prefix/file4.jsonl|{"additional_field": "blabla", "id": "ghj-134-abc", "first_name": "Adam", "last_name": "Gleason", "s3_original_file": "s3://bucket/prefix/file4.jsonl"}\n{"id": "abc-523-abc", "first_name": "Phil", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file4.jsonl"} |
|s3://bucket/prefix/file6.jsonl|{"id": "abc-128-abc", "first_name": "Mary", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}\n{"additional_field": "blabla", "id": "abc-124-ghj", "last_name": "Foster", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}\n{"additional_field": "blabla", "id": "ghj-133-abc", "first_name": "Julius", "last_name": "Bull", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}|
|s3://bucket/prefix/file1.jsonl|{"additional_field": "blabla", "id": "abc-123-abc", "first_name": "John", "last_name": "Simonis", "s3_original_file": "s3://bucket/prefix/file1.jsonl"}\n{"additional_field": "blabla", "id": "def-563-abc", "first_name": "Mary", "last_name": "Culkin", "s3_original_file": "s3://bucket/prefix/file1.jsonl"} |
|s3://bucket/prefix/file2.jsonl|{"additional_field": "blabla", "id": "abc-532-def", "first_name": "James", "s3_original_file": "s3://bucket/prefix/file2.jsonl"} |
+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Or, at the very least, I could manage with separate lists per column instead of a dictionary, provided the null values are kept at the correct index positions, for example:
+------------------------------+----------------------+--------------------+---------------------------------------+----------------------+
|s3_original_file |additional_field |first_name |id |last_name |
+------------------------------+----------------------+--------------------+---------------------------------------+----------------------+
|s3://bucket/prefix/file3.jsonl|[null] |[Fiona] |[abc-445-abc] |[Goodwill] |
|s3://bucket/prefix/file7.jsonl|[blabla] |[Gareth] |[abc-723-abc] |[Smith] |
|s3://bucket/prefix/file5.jsonl|[blabla] |[Jack] |[ghj-823-abc] |[Smith] |
|s3://bucket/prefix/file4.jsonl|[blabla, blabla] |[null, Adam, Phil] |[abc-167-def, ghj-134-abc, abc-523-abc]|[Matz, Gleason, Smith]|
|s3://bucket/prefix/file6.jsonl|[null, blabla, blabla]|[Mary, null, Julius]|[abc-128-abc, abc-124-ghj, ghj-133-abc]|[null, Foster, Bull] |
|s3://bucket/prefix/file1.jsonl|[blabla, blabla] |[John, Mary] |[abc-123-abc, def-563-abc] |[Simonis, Culkin] |
|s3://bucket/prefix/file2.jsonl|[blabla] |[James] |[abc-532-def] |[null] |
+------------------------------+----------------------+--------------------+---------------------------------------+----------------------+
Thanks.

You can create a struct column made of whichever columns you want to include, and then use the to_json function to turn it into a single JSON string for export:
scala> val df = Seq((1, "a", Seq("a", "b", "c")), (2, "b", Seq("d", "e", "f"))).toDF("x", "y", "z")
df: org.apache.spark.sql.DataFrame = [x: int, y: string ... 1 more field]
scala> val df_json = df.select(to_json(struct($"x", $"y", $"z")).as("json_field"))
df_json: org.apache.spark.sql.DataFrame = [json_field: string]
scala> df_json.show(false)
+---------------------------------+
|json_field |
+---------------------------------+
|{"x":1,"y":"a","z":["a","b","c"]}|
|{"x":2,"y":"b","z":["d","e","f"]}|
+---------------------------------+
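In PySpark, a rough adaptation of the same idea (an untested sketch, assuming df2 has been converted back into a DataFrame) could first serialize every row with to_json, which simply omits null fields, and then join the per-file JSON strings into a single records column:

from pyspark.sql import functions as F

# Serialize each row to a JSON string; null columns are omitted from the string
json_df = df2.withColumn("json_record", F.to_json(F.struct(*df2.columns)))

# One row per original file, with all of its records joined as JSON lines
resultDF = (json_df
            .groupBy("s3_original_file")
            .agg(F.concat_ws("\n", F.collect_list("json_record")).alias("records")))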
I partially solved the problem with this approach. It works, even though I don't think it's the best way, because it is terribly slow:
import json
from pyspark.sql import functions as F

# Read the S3 bucket into a DataFrame and add an input_file column holding the original file name
df = spark.read.json(path_source_bucket).withColumn("input_file", F.input_file_name())

# Get an enriched RDD (invoke_aws_lambda adds an output_file field, among others)
rdd = df.rdd.map(lambda payload: invoke_aws_lambda(region, payload, source_bucket, destination_bucket))

# Group the RDD by output_file
grouped = rdd.groupBy(lambda x: x.output_file)

# Collect on the driver and rebuild one JSONL blob per output file
for s3path, records in grouped.collect():
    output_json = ''
    for record in records:
        row_dict = record.asDict()
        del row_dict["output_file"]
        output_json += json.dumps(row_dict) + "\n"
    save_to_s3(region, output_json.rstrip("\n"), s3path, destination_bucket)
Is there a reason not to use input_file_name() for the input file name? You could still use write.json: write the files to a temporary folder, partitioned by s3_original_file, then copy from that temporary folder to S3, renaming each file based on the name of its partition folder.

Thanks, I have already partially solved it, but I will try this approach as well!
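A minimal sketch of that suggestion (the temporary output path is a placeholder, and df2 is assumed to be a DataFrame containing the s3_original_file column):

# Write one folder per original file; the partition value (a full S3 URI) gets escaped
# in the folder name, and the partition column itself is not repeated inside the records.
(df2
 .write
 .mode("overwrite")
 .partitionBy("s3_original_file")
 .json("s3://bucket/tmp-output/"))

# Each "s3_original_file=..." folder can then be copied to its final key (e.g. with boto3),
# using the folder name to recover the original file name.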