Python: Aggregate and create an array of "Dictionary" objects in a Spark DataFrame column
I created a toy Spark DataFrame:
import numpy as np
import pyspark
from pyspark.sql import functions as sf
from pyspark.sql import functions as F
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = spark.createDataFrame([('csc123','sr1', 'tac1', 'abc'),
('csc123','sr2', 'tac1', 'abc'),
('csc234','sr3', 'tac2', 'bvd'),
('csc345','sr5', 'tac2', 'bvd')
],
['bug_id', 'sr_link', 'TAC_engineer','de_manager'])
df.show()
Then I tried to aggregate and build an array of [sr_link, sr_link] for each bug_id:
#df = spark.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df_drop_dup = df.select('bug_id', 'de_manager').dropDuplicates()
df = df.withColumn('joined_column',
sf.concat(sf.col('sr_link'),sf.lit(' '), sf.col('TAC_engineer')))
df_sev_arr = df.groupby("bug_id").agg(F.collect_set("joined_column")).withColumnRenamed("collect_set(joined_column)","sr_array")
df = df_drop_dup.join(df_sev_arr, on=['bug_id'], how='inner')
df.show()
Here is the output:
+------+----------+--------------------+
|bug_id|de_manager| sr_array|
+------+----------+--------------------+
|csc345| bvd| [sr5 tac2]|
|csc123| abc|[sr2 tac1, sr1 tac1]|
|csc234| bvd| [sr3 tac2]|
+------+----------+--------------------+
But the output I actually expect is:
+------+----------+----------------------------------------------------------------------+
|bug_id|de_manager| sr_array|
+------+----------+----------------------------------------------------------------------+
|csc345| bvd| [{sr_link: sr5, TAC_engineer:tac2}]|
|csc123| abc|[{sr_link: sr2, TAC_engineer:tac1},{sr_link: sr1, TAC_engineer: tac1}]|
|csc234| bvd| [{sr_link: sr3, TAC_engineer: tac2}]|
+------+----------+----------------------------------------------------------------------+
because I want the final output to be saved in JSON format, like:
{'bug_id': 'csc123',
 'de_manager': 'abc',
 'sr_array': [
  {'sr_link': 'sr2', 'TAC_engineer': 'tac1'},
  {'sr_link': 'sr1', 'TAC_engineer': 'tac1'}
 ]}
Can anyone help? Sorry, I am very unfamiliar with MapType in Spark DataFrames.
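I only modified a few functions and added new ones to match your requirement.
The first part stays the same.
I only changed the second part:
I replaced withColumnRenamed with alias, added the to_json and struct functions to get the desired output, and renamed the intermediate DataFrame (df -> df1).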
>>> df1 = df.withColumn('joined_column', F.to_json(F.struct(F.col('sr_link'), F.col('TAC_engineer'))))
>>> df_sev_arr = df1.groupby("bug_id").agg(F.collect_set("joined_column").alias("sr_array"))
>>> df = df_drop_dup.join(df_sev_arr, on=['bug_id'], how='inner')
>>> df.show(truncate=False)
+------+----------+----------------------------------------------------------------------------------+
|bug_id|de_manager|sr_array |
+------+----------+----------------------------------------------------------------------------------+
|csc345|bvd |[{"sr_link":"sr5","TAC_engineer":"tac2"}] |
|csc123|abc |[{"sr_link":"sr1","TAC_engineer":"tac1"}, {"sr_link":"sr2","TAC_engineer":"tac1"}]|
|csc234|bvd |[{"sr_link":"sr3","TAC_engineer":"tac2"}] |
+------+----------+----------------------------------------------------------------------------------+
If you have any questions about this, please let me know.
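One detail worth noting about the output above: because to_json is applied before collect_set, each element of sr_array is a JSON string rather than a nested object. If a consumer needs dictionaries, the strings can be decoded after the fact. The following is a minimal plain-Python sketch of that step; the `row` literal is copied by hand from the csc123 line of the output, and `decode_sr_array` is a hypothetical helper name, not part of PySpark.

```python
import json

# One row as displayed in the output: sr_array holds JSON *strings*, not objects.
row = {
    "bug_id": "csc123",
    "de_manager": "abc",
    "sr_array": [
        '{"sr_link":"sr1","TAC_engineer":"tac1"}',
        '{"sr_link":"sr2","TAC_engineer":"tac1"}',
    ],
}


def decode_sr_array(record):
    """Return a copy of record with every sr_array element parsed into a dict."""
    decoded = dict(record)
    decoded["sr_array"] = [json.loads(item) for item in record["sr_array"]]
    return decoded


decoded_row = decode_sr_array(row)

# Serialising the decoded row yields nested objects instead of quoted strings.
print(json.dumps(decoded_row, indent=2))
```

Without this decoding step, saving the row as JSON would keep each element as an escaped string inside the array.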
from pyspark.sql import functions as F
# sc = pyspark.SparkContext()
# sqlc = pyspark.SQLContext(sc)
df = spark.createDataFrame([('csc123','sr1', 'tac1', 'abc'),
('csc123','sr2', 'tac1', 'abc'),
('csc234','sr3', 'tac2', 'bvd'),
('csc345','sr5', 'tac2', 'bvd')
],
['bug_id', 'sr_link', 'TAC_engineer','de_manager'])
df.show()
>>> df_drop_dup = df.select('bug_id', 'de_manager').dropDuplicates()
>>> df1 = df.withColumn('joined_column', F.to_json(F.struct(F.col('sr_link'), F.col('TAC_engineer'))))
>>> df_sev_arr = df1.groupby("bug_id").agg(F.collect_set("joined_column").alias("sr_array"))
>>> df = df_drop_dup.join(df_sev_arr, on=['bug_id'], how='inner')
>>> df.show(truncate=False)
+------+----------+----------------------------------------------------------------------------------+
|bug_id|de_manager|sr_array |
+------+----------+----------------------------------------------------------------------------------+
|csc345|bvd |[{"sr_link":"sr5","TAC_engineer":"tac2"}] |
|csc123|abc |[{"sr_link":"sr1","TAC_engineer":"tac1"}, {"sr_link":"sr2","TAC_engineer":"tac1"}]|
|csc234|bvd |[{"sr_link":"sr3","TAC_engineer":"tac2"}] |
+------+----------+----------------------------------------------------------------------------------+