Java - How do I perform a simple reduceByKey in Apache Spark?

I'm new to Spark and trying to learn. This is a fairly simple question: I have the following code that reduces duplicate keys with respect to their values.

The values of the DataFrame look like this:

 subject      object    

  node1        node5
  node1        node6
  node1        node7
  node2        node5
  node2        node7
and I want them reduced like this:

 subject      object    
 subject      object    

  node1        [node5,node6,node7]
  node2        [node5,node7]
I can achieve this using the groupByKey method, but I want to use reduceByKey here, and I can't figure out the correct syntax for it.

Here is my code:

    DataFrame records = Service.sqlCtx().sql("SELECT subject,object FROM Graph");


    JavaPairRDD<String,Iterable<String>> rows = records.select("subject","object").toJavaRDD().mapToPair(
            new PairFunction<Row,String,String>(){

                @Override
                public Tuple2<String, String> call(Row row) throws Exception {
                    return new Tuple2<String, String>(row.getString(0), row.getString(1));
                }

            // this can be optimized if we use reduceByKey instead of groupByKey
    }).distinct().groupByKey().cache();
  • In general, this cannot be optimized with reduceByKey. The inefficient part is the operation itself, not any particular implementation.
  • Moreover, it cannot be implemented directly with reduceByKey because the signatures are incompatible. It can be expressed with aggregateByKey or combineByKey, but that is still not an optimization.
  • Finally, if you are working with DataFrames, just use collect_list:

    import static org.apache.spark.sql.functions.*;
    
    records.groupBy("subject").agg(collect_list(col("object")));
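To make the aggregateByKey/combineByKey remark above concrete, here is a minimal sketch of that contract (a zero value, a function that folds one value into an accumulator, and a function that merges two accumulators) simulated on plain Java collections rather than a live Spark cluster; the class and method names are illustrative, not a real Spark API.

```java
import java.util.*;

// Sketch of the aggregateByKey contract, simulated without a Spark runtime:
// the zero value is an empty list, mergeValue appends one element, and
// mergeCombiners (run across partitions in real Spark) would concatenate lists.
public class AggregateByKeySketch {
    static Map<String, List<String>> aggregateByKey(List<Map.Entry<String, String>> pairs) {
        // TreeMap only for deterministic print order in this sketch
        Map<String, List<String>> acc = new TreeMap<>();
        for (Map.Entry<String, String> p : pairs) {
            // zero value: new ArrayList; mergeValue: add the element to the accumulator
            acc.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = List.of(
            Map.entry("node1", "node5"), Map.entry("node1", "node6"),
            Map.entry("node1", "node7"), Map.entry("node2", "node5"),
            Map.entry("node2", "node7"));
        System.out.println(aggregateByKey(input));
        // {node1=[node5, node6, node7], node2=[node5, node7]}
    }
}
```

Note that this still materializes the full list per key, which is why the answer calls it an expressible alternative rather than an optimization.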
    

There is a way to optimize this with reduceByKey, but we have to apply one extra transformation before the reduceByKey:

      val keyValuePairs = sc.parallelize(List(("node1","node5"),("node1","node6"),("node1","node7"),("node2","node5"),("node2","node7")))    // Input
      
      // Transform each value of the (K, V) pair into a Seq (the extra transformation)
      val mappedKV = keyValuePairs.map(x => (x._1, Seq(x._2)))
      
      // Then apply '++' with reduceByKey
      val reducedKV = mappedKV.reduceByKey(_ ++ _)

      
      Output:

      scala> reducedKV.collect
      Array[(String, Seq[String])] = Array((node2,List(node5, node7)), (node1,List(node5, node6, node7)))
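Since the question was asked in Java, the same wrap-then-concatenate trick can be sketched with plain Java streams. This is only an illustration of the merge semantics that reduceByKey(_ ++ _) relies on, run locally without a Spark runtime; the class name is hypothetical.

```java
import java.util.*;
import java.util.stream.*;

// The Scala trick above, on plain Java streams: wrap each value in a
// one-element list, then reduce by key with list concatenation.
public class ReduceByKeySketch {
    static Map<String, List<String>> reduceByKey(List<String[]> pairs) {
        return pairs.stream().collect(Collectors.toMap(
            p -> p[0],
            p -> List.of(p[1]),        // wrap: value -> singleton list
            (a, b) -> {                // reduce: concatenate the two lists
                List<String> merged = new ArrayList<>(a);
                merged.addAll(b);
                return merged;
            },
            TreeMap::new));            // TreeMap only for deterministic print order
    }

    public static void main(String[] args) {
        List<String[]> input = List.of(
            new String[]{"node1", "node5"}, new String[]{"node1", "node6"},
            new String[]{"node1", "node7"}, new String[]{"node2", "node5"},
            new String[]{"node2", "node7"});
        System.out.println(reduceByKey(input));
        // {node1=[node5, node6, node7], node2=[node5, node7]}
    }
}
```

As the accepted answer points out, this still builds the full list per key, so wrapping values this way makes reduceByKey expressible but not cheaper than groupByKey.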

