
Java: how to group data in Spark


I have the data below - a small sample, but in real life this dataset is huge:

A B 1-1-2018  10
A B 2-1-2018  20
C D 1-1-2018  15
C D 2-1-2018  25 
I need to group the data above by date and produce key/value pairs, with the date as the key:

1-1-2018->key
-----------------
A B 1-1-2018  10 
C D 1-1-2018  15 

2-1-2018->key
-----------------
A B 2-1-2018  20
C D 2-1-2018  25 

Can anyone tell me how to do this in Spark in the most optimized way (using Java, if possible)?
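For reference, a minimal sketch of one way to produce date -> rows pairs with Spark's Java API. The file path, column names, and the final collect() are assumptions for demonstration only, not part of the original question:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GroupByDate {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("GroupByDate")
                .getOrCreate();

        // Hypothetical loading step: assume the data ends up as a DataFrame
        // with columns value_1, value_2, date, amount (names are assumptions)
        Dataset<Row> input = spark.read()
                .option("header", "true")
                .csv("/path/to/input.csv")
                .toDF("value_1", "value_2", "date", "amount");

        // groupBy on the underlying RDD keeps all rows sharing a date together,
        // producing exactly the "date -> rows" pairs sketched above
        JavaPairRDD<String, Iterable<Row>> byDate =
                input.javaRDD().groupBy(row -> row.<String>getAs("date"));

        // collect() is only for demonstration; on a huge dataset keep applying
        // transformations to byDate instead of pulling everything to the driver
        byDate.collect().forEach(pair ->
                System.out.println(pair._1() + " -> " + pair._2()));

        spark.stop();
    }
}

If all that is needed downstream is an aggregation per date rather than the raw rows, input.groupBy("date") with agg(...) on the DataFrame itself is usually the more optimized route.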

Not Java, but from your example above it looks like you want to recursively split the DataFrame into sub-groups by key. The best way I know of is a while loop, which is not the most elegant approach in the world.

//You will also need the DataFrame import in Scala (and spark.implicits._ for the $"column" syntax); not sure whether Java needs an equivalent for the code below.
import org.apache.spark.sql.DataFrame
import spark.implicits._

//Inputting your DF, with columns as Value_1, Value_2, Key, Output_Amount
val inputDF = //DF From above

//Need to get an empty DF, I just like doing it this way
val testDF = spark.sql("select 'foo' as bar")

var arrayOfDataFrames: Array[DataFrame] = Array(testDF)

val arrayOfKeys = inputDF.selectExpr("Key").distinct.rdd.map(x=>x.mkString).collect

var keyIterator = 1

//Need to overwrite the placeholder foo/bar DF with the first key's rows
arrayOfDataFrames = Array(inputDF.where($"Key"===arrayOfKeys(keyIterator - 1)))
keyIterator = keyIterator + 1

//Loop through the remaining keys and append each key's DataFrame to the array
while(keyIterator <= arrayOfKeys.length) {
  arrayOfDataFrames = arrayOfDataFrames ++ Array(inputDF.where($"Key"===arrayOfKeys(keyIterator - 1)))
  keyIterator = keyIterator + 1
}

Can you share your code? I want to process the input in parallel, because the input file is huge and this while loop may turn it into sequential processing.
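If the goal is mainly to get each date's rows physically separated, one alternative that avoids the driver-side while loop is to let Spark write one sub-directory per key with partitionBy, which stays fully distributed. A minimal sketch, assuming the same hypothetical column names and paths as above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SplitByKey {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SplitByKey")
                .getOrCreate();

        // Hypothetical schema matching the answer above: Value_1, Value_2, Key, Output_Amount
        Dataset<Row> input = spark.read()
                .option("header", "true")
                .csv("/path/to/input.csv")
                .toDF("Value_1", "Value_2", "Key", "Output_Amount");

        // partitionBy writes one sub-directory per distinct Key (here, per date),
        // so the split happens on the executors with no sequential loop on the driver
        input.write()
                .partitionBy("Key")
                .csv("/path/to/output");

        spark.stop();
    }
}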