
Get distinct value counts for multiple columns from a DataFrame using Spark and Java 8


I want to get the distinct value counts of multiple columns from a DataFrame using Spark and Java 8.

Input DataFrame - the code needs to handle dynamic columns; columns may be added later:

+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
|A1|Y|B2|Y|C3|Y|
|A1|Y|B2|N|C3|Y|
|A1|Y|B2|Y|C3|N|
+----+----+----+
Output DataFrame:

+--------+---------------------+---------------------+
|Col1    | Col2                | Col3                |
+--------+---------------------+---------------------+
|A1|Y - 3| B2|Y - 2 & B2|N - 1 | C3|Y - 3 & C3|N - 1 |
+--------+---------------------+---------------------+

Maybe this will help you. It uses RDDs in Scala, but it should be very similar in Java:

  val df = Seq(("a", "a", "a"), ("a", "b", "c"), ("b", "b", "c")).toDF("Col1", "Col2", "Col3")
  df.show()

  val ok = df.rdd.map { row =>
      // pair each cell value with its column name
      val arr = new Array[(String, String)](row.size)
      for (i <- 0 until row.size) {
        arr(i) = (row.getString(i), row.schema.fieldNames(i))
      }
      arr
    }
    .flatMap(pairs => pairs.map { case (value, col) => ((col, value), 1) })
    .reduceByKey(_ + _)                                  // count each (column, value) pair
    .map { case ((col, value), count) => (col, s"$value=$count") }
    .reduceByKey(_ + "," + _)                            // join all value=count entries per column

  ok.foreach(println(_))
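The same per-column aggregation can be sketched in plain Java 8 streams, without Spark, to show the shape of the logic the asker wants. The class and method names here are hypothetical, and the rows are an in-memory stand-in for the DataFrame:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ColumnValueCounts {

    // For each column name, count how many times each value occurs.
    // cols holds the column names; rows holds the cell values per row.
    static Map<String, Map<String, Long>> countPerColumn(List<String> cols,
                                                         List<List<String>> rows) {
        return IntStream.range(0, cols.size())
                .boxed()
                .collect(Collectors.toMap(
                        cols::get,                       // key: column name
                        i -> rows.stream()               // value: counts of that column's values
                                .map(r -> r.get(i))
                                .collect(Collectors.groupingBy(
                                        v -> v, Collectors.counting()))));
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("Col1", "Col2", "Col3");
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("a", "a", "a"),
                Arrays.asList("a", "b", "c"),
                Arrays.asList("b", "b", "c"));
        countPerColumn(cols, rows)
                .forEach((col, counts) -> System.out.println(col + " -> " + counts));
    }
}
```

In real Spark/Java code the per-column loop would typically be replaced by a loop over `df.columns()` issuing a `groupBy(col).count()` per column, but the counting step itself is the same idea.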

Check out the groupingBy collector and its downstream counting collector method.

This is what I tried, for a single column:

 Dataset<Row> df_dup = spark.read().format("json").load("src/main/resources/new2.json");
 df_dup = df_dup.groupBy("Col2").agg(org.apache.spark.sql.functions.count("Col2").as("Count"));
 df_dup = df_dup.withColumn("Final", org.apache.spark.sql.functions.concat(
         df_dup.col("Col2"), org.apache.spark.sql.functions.lit("-"), df_dup.col("Count")));
 df_dup = df_dup.drop(df_dup.col("Col2"));
 df_dup = df_dup.drop(df_dup.col("Count"));
 df_dup.show();

This cannot produce the expected output in an efficient way - help writing the code in Java 8 would be much appreciated.
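As a plain Java 8 illustration of the groupingBy/counting suggestion above (no Spark; the input list and the "value - count" formatting are assumptions made to mirror the expected output table):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SingleColumnCount {

    // Count occurrences of each distinct value in one column and
    // join them as "value - count" pairs separated by " & ".
    static String formatCounts(List<String> columnValues) {
        Map<String, Long> counts = columnValues.stream()
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
        return counts.entrySet().stream()
                .map(e -> e.getKey() + " - " + e.getValue())
                .collect(Collectors.joining(" & "));
    }

    public static void main(String[] args) {
        // Col2 values from the input table above
        System.out.println(formatCounts(Arrays.asList("B2|Y", "B2|N", "B2|Y")));
    }
}
```

Note that a `HashMap` does not guarantee entry order, so the two pairs may print in either order; use a `TreeMap` downstream in `groupingBy` if a stable order matters.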

 +----+----+----+
 |Col1|Col2|Col3|
 +----+----+----+
 |   a|   a|   a|
 |   a|   b|   c|
 |   b|   b|   c|
 +----+----+----+

 (Col1,a=2,b=1)
 (Col2,b=2,a=1)
 (Col3,a=1,c=2)