Group count on 2 columns in a DataFrame - Spark Java
I have a Spark DataFrame with two columns, as shown below:
Date | Area
1/1/2016 | 1
3/1/2016 | 4
1/1/2016 | 1
5/1/2016 | 2
1/1/2016 | 3
1/1/2016 | 1
3/1/2016 | 4
1/1/2016 | 2
3/1/2016 | 3
3/1/2016 | 3
1/1/2016 | 4
1/1/2016 | 4
1/1/2016 | 2
I want output like this:
Day: 1/1/2016 -> There are 3 rows at Area1
-> There are 2 rows at Area2
-> There are 1 rows at Area3
-> There are 2 rows at Area4
Day: 3/1/2016 -> There are 0 rows at Area1
-> There are 0 rows at Area2
-> There are 2 rows at Area3
-> There are 2 rows at Area4
Day: 5/1/2016 -> ..........
My Java 8 code is:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.sql.*;

public class Main {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("My 1st Spark app");
        conf.setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SparkSession sparkSession = SparkSession.builder().sparkContext(sc.sc()).getOrCreate();
        // Read the CSV with a header row and let Spark infer the column types
        Dataset<Row> df = sparkSession.read().option("header", true).option("inferSchema", "true").option("timestampFormat", "yyyy-MM-dd hh:mm:ss").csv("hdfs://quickstart.cloudera:8020//user//cloudera//fares.csv");
        // Counts rows per Date only; Area is not part of the grouping
        Dataset<Row> countsByDate = df.groupBy("Date").count();
    }
}
But this gives me a result grouped by Date only, not by Area. So how do I group by both Date and Area?

You can groupBy a list of columns, for example df.groupBy("Date", "Area").count().

This can also be done with Spark SQL window functions plus a for-each loop over the Spark DataFrame using the collect function (not suitable for big data, because collect pulls every row to the driver and the job slows down). Below is PySpark code; you can convert it to Java, since the main Spark SQL query does not change. Afterwards, use a Java for loop to access each element of the array returned by sparkDataFrame.collect().
from pyspark.sql.functions import *

data.createOrReplaceTempView("tmp")

# final = data.groupBy("Area").agg(count("Date"))
# final.show(20,False)

df = spark.sql("""
    SELECT distinct date,
           area,
           count(area) over (partition by date,area order by date,area) as area_cnt,
           min(area) over (partition by date order by date,area) as area_first,
           max(area) over (partition by date order by date,area desc) as area_last
    from tmp
    order by date, area
""")
df.show(20,False)

for i in df.collect() :
    if i.area_first == i.area :
        print("Day: " + i.date + " -> There are " + str(i.area_cnt) + " rows at Area" + str(i.area))
    else :
        print(" -> There are " + str(i.area_cnt) + " rows at Area" + str(i.area))
Query result (input to the loop) :
+--------+----+--------+----------+---------+
|date    |area|area_cnt|area_first|area_last|
+--------+----+--------+----------+---------+
|1/1/2016|1   |3       |1         |4        |
|1/1/2016|2   |2       |1         |4        |
|1/1/2016|3   |1       |1         |4        |
|1/1/2016|4   |2       |1         |4        |
|3/1/2016|3   |2       |3         |4        |
|3/1/2016|4   |2       |3         |4        |
|5/1/2016|2   |1       |2         |2        |
+--------+----+--------+----------+---------+
Output :
Day: 1/1/2016 -> There are 3 rows at Area1
-> There are 2 rows at Area2
-> There are 1 rows at Area3
-> There are 2 rows at Area4
Day: 3/1/2016 -> There are 2 rows at Area3
-> There are 2 rows at Area4
Day: 5/1/2016 -> There are 1 rows at Area2
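
Note that the output above skips areas that have no rows on a given day, while the question asks for them to be listed with a count of 0. The following is a minimal sketch of the print loop in plain Java that fills in those zeros. It is not taken from the answer: the class AreaReport, the method report, and the maxArea parameter are hypothetical names, and it assumes the areas are numbered 1 to maxArea and that the (date, area, count) triples have already been copied out of the collected rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: formats (date, area, count) triples
// into the layout from the question, printing 0 for areas with no rows.
class AreaReport {

    // Each element is {date (String), area (Integer), count (Long)}.
    // Assumption: on the Spark side these would come from something like
    //   df.groupBy("Date", "Area").count().collectAsList()
    // which is not executed here so the sketch runs without a cluster.
    static List<String> report(List<Object[]> counts, int maxArea) {
        // date -> (area -> count); dates stay in the order they first appear
        Map<String, Map<Integer, Long>> byDate = new LinkedHashMap<>();
        for (Object[] c : counts) {
            byDate.computeIfAbsent((String) c[0], d -> new HashMap<>())
                  .put((Integer) c[1], (Long) c[2]);
        }

        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Map<Integer, Long>> e : byDate.entrySet()) {
            // Assumption: areas are numbered 1..maxArea
            for (int area = 1; area <= maxArea; area++) {
                long n = e.getValue().getOrDefault(area, 0L);
                // Only the first line of each day carries the "Day:" prefix
                String prefix = (area == 1) ? "Day: " + e.getKey() + " " : "";
                lines.add(prefix + "-> There are " + n + " rows at Area" + area);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Counts copied from the query result table above
        List<Object[]> counts = new ArrayList<>();
        counts.add(new Object[] {"1/1/2016", 1, 3L});
        counts.add(new Object[] {"1/1/2016", 2, 2L});
        counts.add(new Object[] {"1/1/2016", 3, 1L});
        counts.add(new Object[] {"1/1/2016", 4, 2L});
        counts.add(new Object[] {"3/1/2016", 3, 2L});
        counts.add(new Object[] {"3/1/2016", 4, 2L});
        counts.add(new Object[] {"5/1/2016", 2, 1L});
        report(counts, 4).forEach(System.out::println);
    }
}
```

With the counts from the table above, the block for 3/1/2016 now starts with "Day: 3/1/2016 -> There are 0 rows at Area1". Like the collect loop, this works on data already pulled to the driver, so it only suits results small enough to fit there.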