Java: how to perform a simple reduceByKey in Apache Spark?
I am new to Spark and trying to learn. This is a fairly simple question: I have the following code, which reduces duplicate keys with respect to their values. The values in the DataFrame look like this:
subject object
node1 node5
node1 node6
node1 node7
node2 node5
node2 node7
I want them reduced like this:
subject object
node1 [node5,node6,node7]
node2 [node5,node7]
I can achieve this using the groupByKey method, but I want to use reduceByKey here, and I cannot figure out the correct syntax for it.
Here is my code:
DataFrame records = Service.sqlCtx().sql("SELECT subject,object FROM Graph");
JavaPairRDD<String, Iterable<String>> rows = records.select("subject", "object").toJavaRDD().mapToPair(
        new PairFunction<Row, String, String>() {
            @Override
            public Tuple2<String, String> call(Row row) throws Exception {
                return new Tuple2<String, String>(row.getString(0), row.getString(1));
            }
            // this can be optimized if we use reduceByKey instead of groupByKey
        }).distinct().groupByKey().cache();
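As a point of reference for the shape being asked about, here is a minimal plain-Java sketch (no Spark; a local List of pairs stands in for the RDD, and the class and variable names are hypothetical) of the "wrap each value in a one-element list, then merge lists per key" idea that reduceByKey requires:

```java
import java.util.*;
import java.util.stream.*;

public class ReduceByKeyLocal {
    public static void main(String[] args) {
        // Local stand-in for the (subject, object) rows from the question
        List<Map.Entry<String, String>> rows = List.of(
                Map.entry("node1", "node5"), Map.entry("node1", "node6"),
                Map.entry("node1", "node7"), Map.entry("node2", "node5"),
                Map.entry("node2", "node7"));

        // Wrap each value in a one-element list, then merge lists per key;
        // the merge lambda plays the role of reduceByKey's (V, V) => V function.
        Map<String, List<String>> reduced = rows.stream().collect(
                Collectors.toMap(
                        Map.Entry::getKey,
                        e -> List.of(e.getValue()),
                        (a, b) -> Stream.concat(a.stream(), b.stream())
                                        .collect(Collectors.toList())));

        System.out.println(reduced);
    }
}
```

Note that the merge function only ever sees two values of the same type (here, lists), which is why the raw String values have to be wrapped first.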
- In the general case, this cannot be optimized with reduceByKey. The inefficient part is the operation itself, not a particular implementation.
- Moreover, it cannot be expressed directly with reduceByKey because of the incompatible signature: reduceByKey needs a function that combines two values into one of the same type, while here single String values must be collected into a sequence. It could be expressed with aggregateByKey or combineByKey, but it still would not be an optimization.
- Finally, if you work with DataFrames, just use collect_list:

import static org.apache.spark.sql.functions.*;

records.groupBy("subject").agg(collect_list(col("object")));
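To make the signature point from the answer above concrete: aggregateByKey takes a zero value plus two functions, a seqFunc that folds one value into a per-partition accumulator and a combFunc that merges two accumulators. Here is a hedged plain-Java sketch of those three pieces (no Spark; partitions are simulated with local lists, and all names are illustrative):

```java
import java.util.*;
import java.util.function.*;

public class AggregateByKeyLocal {
    public static void main(String[] args) {
        // The three pieces aggregateByKey takes:
        Supplier<List<String>> zeroValue = ArrayList::new;       // empty accumulator U
        BiFunction<List<String>, String, List<String>> seqFunc = // (U, V) => U
                (acc, v) -> { acc.add(v); return acc; };
        BinaryOperator<List<String>> combFunc =                  // (U, U) => U
                (a, b) -> { a.addAll(b); return a; };

        // Two simulated "partitions" of (subject, object) pairs
        List<String[]> part1 = List.of(new String[]{"node1", "node5"},
                                       new String[]{"node1", "node6"});
        List<String[]> part2 = List.of(new String[]{"node1", "node7"},
                                       new String[]{"node2", "node5"},
                                       new String[]{"node2", "node7"});

        // Fold each partition with seqFunc, then merge the partial maps with combFunc
        Map<String, List<String>> merged = fold(part1, zeroValue, seqFunc);
        fold(part2, zeroValue, seqFunc)
                .forEach((k, v) -> merged.merge(k, v, combFunc));

        System.out.println(merged);
    }

    static Map<String, List<String>> fold(List<String[]> partition,
                                          Supplier<List<String>> zero,
                                          BiFunction<List<String>, String, List<String>> seq) {
        Map<String, List<String>> acc = new HashMap<>();
        for (String[] kv : partition) {
            acc.put(kv[0], seq.apply(acc.computeIfAbsent(kv[0], k -> zero.get()), kv[1]));
        }
        return acc;
    }
}
```

The accumulator type (a list) differs from the value type (a String), which is exactly the shape reduceByKey cannot express, since its function must take and return the same value type.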
There is a way to optimize this with reduceByKey, but we have to do one transformation before applying it. First wrap each value of the (K, V) pair in a Seq:

val keyValuePairs = sc.parallelize(List(("node1","node5"),("node1","node6"),("node1","node7"),("node2","node5"),("node2","node7"))) // Input
// Transform each value of the K,V pair to a 'Seq' (extra transformation)
val mappedKV = keyValuePairs.map(x => (x._1, Seq(x._2)))

Then apply reduceByKey with '++':

val reducedKV = mappedKV.reduceByKey(_ ++ _)

Output:

scala> reducedKV.collect
Array[(String, Seq[String])] = Array((node2,List(node5, node7)), (node1,List(node5, node6, node7)))