Python Apache Spark: aggregate a JSONL DataFrame while preserving null values

Tags: python, apache-spark, pyspark, pyspark-dataframes

I'm new to Spark, so I may need your help.

I have a number of similar JSONL files in an S3 bucket, and each record has a field containing the name of the file it came from (more on that later).

Since I read all of the bucket's files into a single DataFrame, I lose the link between each record and its original file, which is why I need that field:

df = spark.read.json('s3://bucket/prefix')
I applied some transformations with a map over the RDD and obtained an updated DataFrame, with some extra fields added to some of the records:

df2 = df.rdd.map(lambda x: my_transformations(x))

{"additional_field": "blabla", "id": "abc-123-abc", "first_name": "John", "last_name": "Simonis", "s3_original_file": "s3://bucket/prefix/file1.jsonl"}
{"additional_field": "blabla", "id": "def-563-abc", "first_name": "Mary", "last_name": "Culkin", "s3_original_file": "s3://bucket/prefix/file1.jsonl"}
{"additional_field": "blabla", "id": "abc-532-def", "first_name": "James", "s3_original_file": "s3://bucket/prefix/file2.jsonl"}
{"id": "abc-445-abc", "first_name": "Fiona", "last_name": "Goodwill", "s3_original_file": "s3://bucket/prefix/file3.jsonl"}
{"additional_field": "blabla", "id": "abc-167-def", "last_name": "Matz", "s3_original_file": "s3://bucket/prefix/file4.jsonl"}
{"additional_field": "blabla", "id": "ghj-134-abc", "first_name": "Adam", "last_name": "Gleason", "s3_original_file": "s3://bucket/prefix/file4.jsonl"}
{"id": "abc-523-abc", "first_name": "Phil", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file4.jsonl"}
{"additional_field": "blabla", "id": "ghj-823-abc", "first_name": "Jack", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file5.jsonl"}
{"id": "abc-128-abc", "first_name": "Mary", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}
{"additional_field": "blabla", "id": "abc-124-ghj", "last_name": "Foster", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}
{"additional_field": "blabla", "id": "ghj-133-abc", "first_name": "Julius", "last_name": "Bull", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}
{"additional_field": "blabla", "id": "abc-723-abc", "first_name": "Gareth", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file7.jsonl"}
Then I need to regroup the records by s3_original_file.

Since I need to rewrite the files while keeping the original association (I can't use df.write.json because I would lose that association; I will do it with df.foreach() and boto3 inside the lambda), I build the aggregation expressions from the column names, excluding the grouping column:

import pyspark.sql.functions as F

def fetch_columns(dataframe, grouping):
    # Build a collect_list aggregation for every column except the grouping one
    output = []
    for column in dataframe.columns:
        if column != grouping:
            output.append(F.collect_list(column).alias(column))
    return output

grouped = df2.groupBy('s3_original_file')  # df2 converted back to a DataFrame beforehand
resultDF = grouped.agg(*fetch_columns(df2, 's3_original_file'))
Then I need to save each resulting DataFrame row as JSON lines in its specific file, which would be done inside a save_back_to_s3 function:

resultDF.foreach(lambda x: save_back_to_s3(x))
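
save_back_to_s3 isn't shown here; a minimal sketch of what it could look like, assuming boto3 and that each grouped row carries a records column holding the newline-joined JSON strings (as in the desired output further down). The target key simply reuses s3_original_file, but a different destination bucket or prefix could be substituted:

import boto3

def save_back_to_s3(row):
    # Hypothetical sketch: upload the JSON-lines string of one group back to
    # the key taken from s3_original_file
    bucket, key = row.s3_original_file.replace('s3://', '').split('/', 1)
    boto3.client('s3').put_object(Bucket=bucket, Key=key, Body=row.records.encode('utf-8'))
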
The problem is this: I get an aggregated list of values per column, whereas what I want is a single column containing the list of grouped rows. Moreover, nothing accounts for possible null values, which throws the ordering off. I would like columns that end up null to be kept, so I can tell the data was missing.

>>> resultDF.show(20, False)
+------------------------------+----------------+--------------+---------------------------------------+----------------------+
|s3_original_file              |additional_field|first_name    |id                                     |last_name             |
+------------------------------+----------------+--------------+---------------------------------------+----------------------+
|s3://bucket/prefix/file3.jsonl|[]              |[Fiona]       |[abc-445-abc]                          |[Goodwill]            |
|s3://bucket/prefix/file7.jsonl|[blabla]        |[Gareth]      |[abc-723-abc]                          |[Smith]               |
|s3://bucket/prefix/file5.jsonl|[blabla]        |[Jack]        |[ghj-823-abc]                          |[Smith]               |
|s3://bucket/prefix/file4.jsonl|[blabla, blabla]|[Adam, Phil]  |[abc-167-def, ghj-134-abc, abc-523-abc]|[Matz, Gleason, Smith]|
|s3://bucket/prefix/file6.jsonl|[blabla, blabla]|[Mary, Julius]|[abc-128-abc, abc-124-ghj, ghj-133-abc]|[Foster, Bull]        |
|s3://bucket/prefix/file1.jsonl|[blabla, blabla]|[John, Mary]  |[abc-123-abc, def-563-abc]             |[Simonis, Culkin]     |
|s3://bucket/prefix/file2.jsonl|[blabla]        |[James]       |[abc-532-def]                          |[]                    |
+------------------------------+----------------+--------------+---------------------------------------+----------------------+
Is it possible to produce a DataFrame like this?

+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|s3_original_file              |records                                                                                                                                                                                                                                                                                                                                                                                    |
+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|s3://bucket/prefix/file3.jsonl|{"id": "abc-445-abc", "first_name": "Fiona", "last_name": "Goodwill", "s3_original_file": "s3://bucket/prefix/file3.jsonl"}                                                                                                                                                                                                                                                                |
|s3://bucket/prefix/file7.jsonl|{"additional_field": "blabla", "id": "abc-723-abc", "first_name": "Gareth", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file7.jsonl"}                                                                                                                                                                                                                                    |
|s3://bucket/prefix/file5.jsonl|{"additional_field": "blabla", "id": "ghj-823-abc", "first_name": "Jack", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file5.jsonl"}                                                                                                                                                                                                                                      |
|s3://bucket/prefix/file4.jsonl|{"additional_field": "blabla", "id": "ghj-134-abc", "first_name": "Adam", "last_name": "Gleason", "s3_original_file": "s3://bucket/prefix/file4.jsonl"}\n{"id": "abc-523-abc", "first_name": "Phil", "last_name": "Smith", "s3_original_file": "s3://bucket/prefix/file4.jsonl"}                                                                                                           |
|s3://bucket/prefix/file6.jsonl|{"id": "abc-128-abc", "first_name": "Mary", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}\n{"additional_field": "blabla", "id": "abc-124-ghj", "last_name": "Foster", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}\n{"additional_field": "blabla", "id": "ghj-133-abc", "first_name": "Julius", "last_name": "Bull", "s3_original_file": "s3://bucket/prefix/file6.jsonl"}|
|s3://bucket/prefix/file1.jsonl|{"additional_field": "blabla", "id": "abc-123-abc", "first_name": "John", "last_name": "Simonis", "s3_original_file": "s3://bucket/prefix/file1.jsonl"}\n{"additional_field": "blabla", "id": "def-563-abc", "first_name": "Mary", "last_name": "Culkin", "s3_original_file": "s3://bucket/prefix/file1.jsonl"}                                                                            |
|s3://bucket/prefix/file2.jsonl|{"additional_field": "blabla", "id": "abc-532-def", "first_name": "James", "s3_original_file": "s3://bucket/prefix/file2.jsonl"}                                                                                                                                                                                                                                                           |
+------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Or, at the very least, I could manage by outputting separate lists per column rather than dictionaries, as long as null values are kept at the correct index positions, for example:

+------------------------------+----------------------+--------------------+---------------------------------------+----------------------+
|s3_original_file              |additional_field      |first_name          |id                                     |last_name             |
+------------------------------+----------------------+--------------------+---------------------------------------+----------------------+
|s3://bucket/prefix/file3.jsonl|[null]                |[Fiona]             |[abc-445-abc]                          |[Goodwill]            |
|s3://bucket/prefix/file7.jsonl|[blabla]              |[Gareth]            |[abc-723-abc]                          |[Smith]               |
|s3://bucket/prefix/file5.jsonl|[blabla]              |[Jack]              |[ghj-823-abc]                          |[Smith]               |
|s3://bucket/prefix/file4.jsonl|[blabla, blabla]      |[null, Adam, Phil]  |[abc-167-def, ghj-134-abc, abc-523-abc]|[Matz, Gleason, Smith]|
|s3://bucket/prefix/file6.jsonl|[null, blabla, blabla]|[Mary, null, Julius]|[abc-128-abc, abc-124-ghj, ghj-133-abc]|[null, Foster, Bull]  |
|s3://bucket/prefix/file1.jsonl|[blabla, blabla]      |[John, Mary]        |[abc-123-abc, def-563-abc]             |[Simonis, Culkin]     |
|s3://bucket/prefix/file2.jsonl|[blabla]              |[James]             |[abc-532-def]                          |[null]                |
+------------------------------+----------------------+--------------------+---------------------------------------+----------------------+

Thanks.

You can create a struct column made up of whichever columns you want to include, and then use the to_json function to turn it into a single JSON string for export:

scala> val df = Seq((1, "a", Seq("a", "b", "c")), (2, "b", Seq("d", "e", "f"))).toDF("x", "y", "z")
df: org.apache.spark.sql.DataFrame = [x: int, y: string ... 1 more field]

scala> val df_json = df.select(to_json(struct($"x", $"y", $"z")).as("json_field"))
df_json: org.apache.spark.sql.DataFrame = [json_field: string]

scala> df_json.show(false)
+---------------------------------+
|json_field                       |
+---------------------------------+
|{"x":1,"y":"a","z":["a","b","c"]}|
|{"x":2,"y":"b","z":["d","e","f"]}|
+---------------------------------+
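
The same idea in PySpark, applied to the question's data (a sketch, assuming df2 has been turned back into a DataFrame after the map). Because to_json omits null fields by default, and collect_list keeps a struct even when some of its fields are null, the grouped JSON strings can be joined into a single records column:

import pyspark.sql.functions as F

# Serialize each row to a JSON string, then collect and join the strings per original file
resultDF = (
    df2.withColumn("json", F.to_json(F.struct(*df2.columns)))
       .groupBy("s3_original_file")
       .agg(F.concat_ws("\n", F.collect_list("json")).alias("records"))
)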

I partially solved the problem with this approach. It works, even though I don't think it's the best way, since it is very slow:

import json
import pyspark.sql.functions as F

# Read the S3 bucket into a DataFrame and add an input_file column storing the original filename
df = spark.read.json(path_source_bucket).withColumn("input_file", F.input_file_name())

# Get an enriched RDD by invoking an AWS Lambda for each record
rdd = df.rdd.map(lambda payload: invoke_aws_lambda(region, payload, source_bucket, destination_bucket))

# Group the RDD by output_file
grouped = rdd.groupBy(lambda x: x.output_file)

# collect() pulls every group back to the driver, which is what makes this slow
for s3path, records in grouped.collect():
    output_json = ''
    for record in records:
        row_dict = record.asDict()
        del row_dict["output_file"]
        output_json += json.dumps(row_dict) + "\n"
    save_to_s3(region, output_json.rstrip("\n"), s3path, destination_bucket)

Is there a reason not to use input_file_name() to get the input file name? You could still use write.json, partitioning by s3_original_file to write the files into a temporary folder, and then copy from that temporary folder to S3, renaming each file according to its partition folder name.

Thanks, I have partially solved it, but I will also give this approach a try!
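
A minimal sketch of that suggested approach; the read path, temporary output location, and the final copy/rename step are assumptions, not spelled out in the comment:

import pyspark.sql.functions as F

# Capture the source filename at read time, then let Spark partition the output by it
df = (spark.read.json("s3://bucket/prefix")
      .withColumn("s3_original_file", F.input_file_name()))

(df.write
   .mode("overwrite")
   .partitionBy("s3_original_file")
   .json("s3://bucket/tmp-output/"))

# Each partition folder (s3_original_file=.../part-*.json) would then be copied and
# renamed to its final key, e.g. with boto3 copy_object, outside of Spark.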