
How to use groupByKey on a Spark Java Dataset and then run custom logic along with the aggregation?


I have just started learning Spark and am using Spark with Java for a specific requirement. I have a dataset in the following format:

+------+------+---------+---------+---------+------+------+-------+---------+-----+
|field1|field2|   field3|   field4|   field5|field6|field7| field8|    BoxId|Index|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
|     1|   ABC|     1234|385a3d24e|  3913647|751923|   191|9977908|321799809| 1334|
|     1|   DFC|385a3d24e|  3913637| 40010625|751923| 357.0|9877908|321799841| 1332|
|     1|   SDC|385a3d24e|  3913637|399787631|751923| 245.0| 363908|321799835| 1332|
|     1|   GFF|385a3d24e|  3913637|399146918|751923| 275.0|6977908|321799809| 1334|
|     1|   GFF|385a3d24e|  3913637|399146918|751923| 275.0|7975908|321799809| 1335|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
I want to group by BoxId and, within each group, save the rows by Index.

For example, for BoxId 321799809 the dataframe would be:

+------+------+---------+---------+---------+------+------+-------+---------+-----+
|field1|field2|   field3|   field4|   field5|field6|field7| field8|    BoxId|Index|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
|     1|   ABC|     1234|385a3d24e|  3913647|751923|   191|9977908|321799809| 1334|
|     1|   GFF|385a3d24e|  3913637|399146918|751923| 275.0|6977908|321799809| 1334|
|     1|   GFF|385a3d24e|  3913637|399146918|751923| 275.0|7975908|321799809| 1335|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
and the files would need to be saved as 321799809/1334.csv (this csv will contain two rows) and 321799809/1335.csv (containing only one row).

For BoxId 321799841 the dataframe would be:

+------+------+---------+---------+---------+------+------+-------+---------+-----+
|field1|field2|   field3|   field4|   field5|field6|field7| field8|    BoxId|Index|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
|     1|   DFC|385a3d24e|  3913637| 40010625|751923| 357.0|9877908|321799841| 1332|
+------+------+---------+---------+---------+------+------+-------+---------+-----+
and the file would need to be saved as 321799841/1332.csv.

To do this, I was thinking of writing a custom function, since the file-saving logic is custom, something like the following.

The idea came from reading this thread.

Sample custom pseudocode to write the csv:

writeCsv(df) {
    if (!file.exists(df.col("Index")))   // no csv for this Index yet
        csvWrite(df);                    // create the file and fill in the csv fields
    else
        appendRow(df);                   // the csv already exists, append the row
}

and then use it like this:

df.groupByKey(t=>t.BoxId).mapGroups((df)=> writeCsv(df));
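
For illustration only, here is a minimal sketch of what such a helper could look like on the Java side, assuming each group's rows arrive as an Iterator<Row>, that the files go to a local path, and that no CSV quoting or header handling is needed (writeCsv and the path layout are my placeholders, not an existing API):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Iterator;

import org.apache.spark.sql.Row;

// Hypothetical helper: writes all rows of one BoxId group into <boxId>/<Index>.csv,
// creating each file on first use and appending to it afterwards.
static void writeCsv(String boxId, Iterator<Row> rows) throws IOException {
    while (rows.hasNext()) {
        Row row = rows.next();
        Path target = Paths.get(boxId, row.getAs("Index").toString() + ".csv");
        Files.createDirectories(target.getParent());
        String line = row.mkString(",") + System.lineSeparator();
        // CREATE + APPEND covers both the "file does not exist yet" and the "append row" cases
        Files.write(target, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}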

However, the Java version of groupByKey takes two arguments, a function and an encoder, and I could not find any examples of it.

I tried creating an encoder from a POJO, where the POJO class contains the fields of the dataframe above:

Encoder<POJO> pojoEncoder = Encoders.bean(POJO.class);

df.as(probeEncoder).groupByKey(t -> { return t.BoxId; }, pojoEncoder).mapGroups();

but it gives an error at groupByKey().

Can someone point me to some Java examples of groupByKey?
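
For reference, a minimal sketch of the two-argument Java form, assuming BoxId is an int column and reusing the hypothetical writeCsv helper sketched above; note that the Encoder argument of groupByKey describes the key type, not the row type:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Hypothetical driver-side method: groups df by BoxId and runs writeCsv on each group.
static void writePerBox(Dataset<Row> df) {
    Dataset<String> processed = df
        .groupByKey(
            (MapFunction<Row, Integer>) row -> row.getAs("BoxId"),   // assumes BoxId is an int column
            Encoders.INT())                                          // encoder for the KEY type
        .mapGroups(
            (MapGroupsFunction<Integer, Row, String>) (boxId, rows) -> {
                writeCsv(String.valueOf(boxId), rows);               // hypothetical helper from above
                return String.valueOf(boxId);
            },
            Encoders.STRING());                                      // encoder for the result type
    processed.count();  // trigger the job so the files actually get written
}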

Would this help?
df.write.option("header", True).partitionBy("BoxId").csv("output")
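
In Java, that suggestion would look roughly like the following sketch (assuming df is the Dataset<Row> from the question and "output" is just a placeholder path):

// Spark creates one sub-folder per distinct BoxId value, e.g. output/BoxId=321799809/
df.write()
  .option("header", "true")
  .partitionBy("BoxId")
  .csv("output");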
@werner I tried using it, but there are two issues. First, I want to save the files with a custom location format, like
boxid/indexid.csv
321799841/1332.csv
With the write option there is no way to control the file name and folder structure; with the format above it gets saved as BoxId=321799841/part-0000-as2324343.csv. Second, when I try to write the files this way it is very slow.