How to use groupByKey in a Spark Java Dataset and then perform custom logic along with the aggregation?

I have just started learning Spark and am using Spark with Java for a specific requirement. I have a dataset in the following format:
+------+------+---------+---------+---------+------+------+-------+---------+-----+
|field1|field2|   field3|   field4|   field5|field6|field7| field8|    BoxId|Index|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
|     1|   ABC|     1234|385a3d24e|  3913647|751923|   191|9977908|321799809| 1334|
|     1|   DFC|385a3d24e|  3913637| 40010625|751923| 357.0|9877908|321799841| 1332|
|     1|   SDC|385a3d24e|  3913637|399787631|751923| 245.0| 363908|321799835| 1332|
|     1|   GFF|385a3d24e|  3913637|399146918|751923| 275.0|6977908|321799809| 1334|
|     1|   GFF|385a3d24e|  3913637|399146918|751923| 275.0|7975908|321799809| 1335|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
I want to group by BoxId and, for each group, save a dataframe per Index.

For example, for BoxId 321799809 the dataframe would be:
+------+------+---------+---------+---------+------+------+-------+---------+-----+
|field1|field2|   field3|   field4|   field5|field6|field7| field8|    BoxId|Index|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
|     1|   ABC|     1234|385a3d24e|  3913647|751923|   191|9977908|321799809| 1334|
|     1|   GFF|385a3d24e|  3913637|399146918|751923| 275.0|6977908|321799809| 1334|
|     1|   GFF|385a3d24e|  3913637|399146918|751923| 275.0|7975908|321799809| 1335|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
The files need to be saved as 321799809/1334.csv (this csv will contain two rows) and 321799809/1335.csv (containing only one row).
For BoxId 321799841 the dataframe would be:
+------+------+---------+---------+---------+------+------+-------+---------+-----+
|field1|field2|   field3|   field4|   field5|field6|field7| field8|    BoxId|Index|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
|     1|   DFC|385a3d24e|  3913637| 40010625|751923| 357.0|9877908|321799841| 1332|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
The file needs to be saved as 321799841/1332.csv.
To achieve this, I thought of writing a custom function, since the file-saving logic is custom, as shown below. The idea came from reading this thread.
// Sample custom function to write csv (pseudocode)
writeCsv(df) {
    if (!file.exists(df.col(Index)))
        csvWrite(df);  // where csv fields are filled
    else
        append row
}

// then use
df.groupByKey(t -> t.BoxId).mapGroups(df -> writeCsv(df));
But the Java signature of groupByKey takes two arguments, a function and an encoder, and I could not find any examples. I tried creating an encoder from a POJO, where the POJO class contains the fields of the above dataframe:

Encoder<POJO> pojoEncoder = Encoders.bean(POJO.class);
df.as(pojoEncoder).groupByKey(t -> { return t.BoxId; }, pojoEncoder).mapGroups(...);

but it gives an error at groupByKey().
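For reference, a minimal sketch of the two-argument Java groupByKey, assuming Spark 3.x and a bean class POJO with a getBoxId() getter (both names illustrative). Note that the second argument must be an encoder for the key type, not for the POJO itself, which is the likely cause of the error above:

```java
// Sketch only: assumes `df` is a Dataset<Row> with the columns shown
// above and POJO is a bean class with matching getters/setters.
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.KeyValueGroupedDataset;

Encoder<POJO> pojoEncoder = Encoders.bean(POJO.class);
Dataset<POJO> typed = df.as(pojoEncoder);

// The second groupByKey argument encodes the KEY type (here a String),
// not the row type.
KeyValueGroupedDataset<String, POJO> grouped = typed.groupByKey(
    (MapFunction<POJO, String>) p -> p.getBoxId(),
    Encoders.STRING());

// mapGroups receives the key plus an iterator over the group's rows,
// and needs an encoder for its RESULT type.
Dataset<String> summaries = grouped.mapGroups(
    (MapGroupsFunction<String, POJO, String>) (boxId, rows) -> {
        int n = 0;
        while (rows.hasNext()) { rows.next(); n++; }
        return boxId + " has " + n + " rows";
    },
    Encoders.STRING());
```

The per-group custom logic (such as the writeCsv idea above) would go inside the mapGroups lambda, in place of the row counting.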
Can someone point me to some Java examples of groupByKey?

Would this help: df.write.option("header", true).partitionBy("BoxId").csv("output")
@werner I tried using it, but there are two issues. First, I want to save files with a custom location format like boxId/indexId.csv, e.g. 321799841/1332.csv; the write option does not allow a custom file name and folder structure, and with the format above it saves as BoxId=321799841/part-0000-as2324343.csv. Second, when I tried writing files this way it was very slow.
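If the Spark writer's fixed part-file naming is the blocker, the boxId/index.csv layout itself can be produced with plain Java after collecting each group locally (for small groups). A runnable illustration of that layout, with the grouping done via java.util.stream (the class name and sample rows are illustrative, taken from the question's data):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupedCsvWriter {
    // A row reduced to the two grouping fields plus its CSV line.
    record Row(String boxId, String index, String csvLine) {}

    // Writes one file per (boxId, index) pair: <outDir>/<boxId>/<index>.csv
    static void writeGrouped(List<Row> rows, Path outDir) throws IOException {
        Map<String, Map<String, List<Row>>> grouped = rows.stream()
            .collect(Collectors.groupingBy(Row::boxId,
                     Collectors.groupingBy(Row::index)));
        for (var byBox : grouped.entrySet()) {
            Path boxDir = outDir.resolve(byBox.getKey());
            Files.createDirectories(boxDir);
            for (var byIndex : byBox.getValue().entrySet()) {
                List<String> lines = byIndex.getValue().stream()
                    .map(Row::csvLine).toList();
                Files.write(boxDir.resolve(byIndex.getKey() + ".csv"), lines);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<Row> rows = List.of(
            new Row("321799809", "1334", "1,ABC,1234,385a3d24e,3913647,751923,191,9977908"),
            new Row("321799809", "1334", "1,GFF,385a3d24e,3913637,399146918,751923,275.0,6977908"),
            new Row("321799809", "1335", "1,GFF,385a3d24e,3913637,399146918,751923,275.0,7975908"),
            new Row("321799841", "1332", "1,DFC,385a3d24e,3913637,40010625,751923,357.0,9877908"));
        Path out = Files.createTempDirectory("boxes");
        writeGrouped(rows, out);
        // 321799809/1334.csv gets two rows; 1335.csv and 321799841/1332.csv one each.
        System.out.println(Files.readAllLines(out.resolve("321799809/1334.csv")).size()); // 2
        System.out.println(Files.exists(out.resolve("321799841/1332.csv")));              // true
    }
}
```

In a Spark job this writing step could run inside mapGroups (one group per call), with the caveat that the files then land on the executors' local disks rather than on the driver.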