Sorting by key in an Apache Spark JavaPairRDD
I have a JavaPairRDD whose key type is Tuple2<Integer, Integer>. I want to sort the JavaPairRDD by its keys, so I wrote a comparator as follows:
JavaPairRDD<Tuple2<Integer, Integer>, Integer> Rresult = result.sortByKey(new Comparator<Tuple2<Integer, Integer>>() {
    @Override
    public int compare(Tuple2<Integer, Integer> o1, Tuple2<Integer, Integer> o2) {
        if (o1._1() == o2._1())
            return o1._2() - o2._2();
        return o1._1() - o2._1();
    }
}, true);
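How are you creating the JavaPairRDD? Check that before applying the sort. In addition, passing a new anonymous Comparator directly to the sortByKey method will give you a Task not serializable exception. You should implement Comparator and Serializable in a separate class and pass an instance of it to sortByKey. Here is a sample for your reference: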
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class SparkSortSample {
    public static void main(String[] args) {
        // SparkSession running locally with a single thread
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSortSample")
                .master("local[1]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        // Sample data: ((key1, key2), value)
        List<Tuple2<Tuple2<Integer, Integer>, Integer>> inputList = new ArrayList<>();
        inputList.add(new Tuple2<>(new Tuple2<>(2, 444), 4444));
        inputList.add(new Tuple2<>(new Tuple2<>(3, 333), 3333));
        inputList.add(new Tuple2<>(new Tuple2<>(1, 111), 1111));
        inputList.add(new Tuple2<>(new Tuple2<>(2, 222), 2222));
        // JavaPairRDD
        JavaPairRDD<Tuple2<Integer, Integer>, Integer> javaPairRdd = jsc.parallelizePairs(inputList);
        // Sorted RDD, ascending by key, using the serializable comparator
        JavaPairRDD<Tuple2<Integer, Integer>, Integer> sortedPairRDD = javaPairRdd.sortByKey(new TupleComparator(), true);
        sortedPairRDD.foreach(pair -> System.out.println("sort = " + pair));
        // Stop
        jsc.stop();
        jsc.close();
    }
}
Here is the TupleComparator class, which implements both the Comparator and Serializable interfaces:
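import java.io.Serializable;
import java.util.Comparator;
import scala.Tuple2;

class TupleComparator implements Comparator<Tuple2<Integer, Integer>>, Serializable {
    @Override
    public int compare(Tuple2<Integer, Integer> o1, Tuple2<Integer, Integer> o2) {
        // Integer.compare compares values; == on boxed Integers compares references,
        // and subtraction can overflow for large values.
        int byFirst = Integer.compare(o1._1(), o2._1());
        return byFirst != 0 ? byFirst : Integer.compare(o1._2(), o2._2());
    }
}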
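To see why the separate class matters, here is a minimal sketch that runs without Spark. It assumes that Spark ships the comparator to executors using plain Java serialization, so it simply tries to write each comparator to an ObjectOutputStream. The names ComparatorSerializationCheck, PairComparator and isSerializable are hypothetical, and Map.Entry stands in for Tuple2 so that only the JDK is needed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Comparator;
import java.util.Map;

public class ComparatorSerializationCheck {

    // Hypothetical stand-in for TupleComparator: Map.Entry replaces Tuple2
    // so the example has no Spark or Scala dependency.
    public static class PairComparator
            implements Comparator<Map.Entry<Integer, Integer>>, Serializable {
        private static final long serialVersionUID = 1L;

        @Override
        public int compare(Map.Entry<Integer, Integer> a, Map.Entry<Integer, Integer> b) {
            // Compare the first component, then the second, by value.
            int byFirst = Integer.compare(a.getKey(), b.getKey());
            return byFirst != 0 ? byFirst : Integer.compare(a.getValue(), b.getValue());
        }
    }

    // Returns true if Java serialization accepts the object, false otherwise.
    public static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException.
            return false;
        }
    }

    public static void main(String[] args) {
        // Anonymous comparator in the style of the question: it implements
        // Comparator only, not Serializable.
        Comparator<Map.Entry<Integer, Integer>> anonymous =
                new Comparator<Map.Entry<Integer, Integer>>() {
                    @Override
                    public int compare(Map.Entry<Integer, Integer> a,
                                       Map.Entry<Integer, Integer> b) {
                        return Integer.compare(a.getKey(), b.getKey());
                    }
                };

        System.out.println("anonymous comparator serializable: " + isSerializable(anonymous));
        System.out.println("PairComparator serializable: " + isSerializable(new PairComparator()));
    }
}
```

The anonymous comparator is rejected because its class does not implement Serializable, while the named class passes. The same check can be applied to any object you plan to hand to sortByKey before submitting the job.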