How to include non-aggregated columns after aggregation in Apache Spark?
I am using spark-sql-2.4.1v. Here I have the following scenario:
import org.apache.spark.sql.functions.{col, lit, sum, year, quarter}
import org.apache.spark.sql.types.{DoubleType, IntegerType}
val df = Seq(
(2010,"2018-11-24",71285,"USA","0.9192019", "0.1992019", "0.9955999"),
(2010,"2017-08-24",71286,"USA","0.9292018", "0.2992019", "0.99662018"),
(2010,"2019-02-24",71287,"USA","0.9392017", "0.3992019", "0.99772000")).toDF("seq_id","load_date","company_id","country_code","item1_value","item2_value","item3_value")
.withColumn("item1_value", $"item1_value".cast(DoubleType))
.withColumn("item2_value", $"item2_value".cast(DoubleType))
.withColumn("item3_value", $"item3_value".cast(DoubleType))
.withColumn("fiscal_year", year(col("load_date")).cast(IntegerType))
.withColumn("fiscal_quarter", quarter(col("load_date")).cast(IntegerType))
df.show()
val aggregateColumns = Seq("item1_value","item2_value","item3_value")
var aggDFs = aggregateColumns.map( c => {
df.groupBy("country_code").agg(lit(c).as("col_name"),sum(c).as("sum_of_column"))
})
var combinedDF = aggDFs.reduce(_ union _)
combinedDF.show
The output data I get is as follows.
I also need the other columns, i.e. "seq_id", "load_date" and "company_id", in the output.
How do I get them after the dataframe aggregation?

You can use a window function to show the non-aggregated columns, i.e. display the per-group sum on each row. See the snippet below if it helps:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, year, quarter}
import org.apache.spark.sql.types.{DoubleType, IntegerType}
val df = Seq(
(2010,"2018-11-24",71285,"USA","0.9192019", "0.1992019", "0.9955999"),
(2010,"2017-08-24",71286,"USA","0.9292018", "0.2992019", "0.99662018"),
(2010,"2019-02-24",71287,"USA","0.9392017", "0.3992019", "0.99772000")).
toDF("seq_id","load_date","company_id","country_code","item1_value","item2_value","item3_value").
withColumn("item1_value", $"item1_value".cast(DoubleType)).
withColumn("item2_value", $"item2_value".cast(DoubleType)).
withColumn("item3_value", $"item3_value".cast(DoubleType)).
withColumn("fiscal_year", year(col("load_date")).cast(IntegerType)).
withColumn("fiscal_quarter", quarter(col("load_date")).cast(IntegerType))
val byCountry = Window.partitionBy(col("country_code"))
val aggregateColumns = Seq("item1_value","item2_value","item3_value")
var aggDFs = aggregateColumns.map( c => {
df.withColumn("col_name",lit(c)).withColumn("sum_country", sum(c) over byCountry)
})
var combinedDF = aggDFs.reduce(_ union _)
combinedDF.
select("seq_id","load_date","company_id","country_code","col_name","sum_country").
distinct.show(100,false)
The output looks like this:
+------+----------+----------+------------+-----------+------------------+
|seq_id|load_date |company_id|country_code|col_name |sum_country |
+------+----------+----------+------------+-----------+------------------+
|2010 |2019-02-24|71287 |USA |item1_value|2.7876054 |
|2010 |2018-11-24|71285 |USA |item1_value|2.7876054 |
|2010 |2017-08-24|71286 |USA |item1_value|2.7876054 |
|2010 |2018-11-24|71285 |USA |item2_value|0.8976057000000001|
|2010 |2019-02-24|71287 |USA |item2_value|0.8976057000000001|
|2010 |2017-08-24|71286 |USA |item2_value|0.8976057000000001|
|2010 |2019-02-24|71287 |USA |item3_value|2.9899400800000002|
|2010 |2018-11-24|71285 |USA |item3_value|2.9899400800000002|
|2010 |2017-08-24|71286 |USA |item3_value|2.9899400800000002|
+------+----------+----------+------------+-----------+------------------+
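As an alternative to the window, you could aggregate with groupBy and join the sums back to the base dataframe. A sketch, assuming the same df as above:

```scala
// Sketch: groupBy + join back, assuming the df defined above.
// The aggregated per-country sums are joined onto the original
// rows, so seq_id, load_date and company_id are preserved.
import org.apache.spark.sql.functions.{lit, sum}

val aggregateColumns = Seq("item1_value", "item2_value", "item3_value")

val aggDFs = aggregateColumns.map { c =>
  val sums = df.groupBy("country_code").agg(sum(c).as("sum_country"))
  df.select("seq_id", "load_date", "company_id", "country_code")
    .join(sums, Seq("country_code"))
    .withColumn("col_name", lit(c))
}
val combinedDF = aggDFs.reduce(_ union _)
combinedDF.show(false)
```

With a small number of distinct country codes the join side is tiny, so Spark will typically broadcast it; the result is the same rows as the window approach without the need for distinct.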
Replace your code with the snippet below:
scala> val W = Window.partitionBy("country_code")
scala> val aggDFs = aggregateColumns.map( c => {
| df.withColumn("col_name", lit(c)).withColumn("sum_of_column" ,sum(c).over(W)).select("seq_id","load_date", "company_id","col_name","sum_of_column")
| })
scala> val combinedDF = aggDFs.reduce(_ union _)
scala> combinedDF.show()
+------+----------+----------+-----------+------------------+
|seq_id| load_date|company_id| col_name| sum_of_column|
+------+----------+----------+-----------+------------------+
| 2010|2018-11-24| 71285|item1_value| 2.7876054|
| 2010|2017-08-24| 71286|item1_value| 2.7876054|
| 2010|2019-02-24| 71287|item1_value| 2.7876054|
| 2010|2018-11-24| 71285|item2_value| 0.8976057|
| 2010|2017-08-24| 71286|item2_value| 0.8976057|
| 2010|2019-02-24| 71287|item2_value| 0.8976057|
| 2010|2018-11-24| 71285|item3_value|2.9899400800000002|
| 2010|2017-08-24| 71286|item3_value|2.9899400800000002|
| 2010|2019-02-24| 71287|item3_value|2.9899400800000002|
+------+----------+----------+-----------+------------------+
What would your output look like after getting "seq_id" and the other fields? Are we considering a join back from the base dataframe? Can you add the desired output?

You could avoid this expensive reduce(_ union _) (which is iterative and slow) by creating the needed columns item1_value, item2_value, etc. directly: instead of a 6-column long-format result dataframe, build one wider dataset and do the aggregations in their respective columns. Just saying…

@BdEngineer, with the code above, try using val instead of var. Can you tell me what is wrong with this broadcast variable access?
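The wider-dataset suggestion in the comments can be sketched as a single groupBy pass that produces one sum column per item, avoiding the iterative reduce(_ union _) entirely (assuming the same df as above; the sum_ column names are illustrative):

```scala
// Sketch: one aggregation pass, one sum column per item,
// instead of unioning N long-format dataframes.
import org.apache.spark.sql.functions.sum

val aggregateColumns = Seq("item1_value", "item2_value", "item3_value")
val sumExprs = aggregateColumns.map(c => sum(c).as(s"sum_$c"))

// agg takes a head expression plus varargs, hence head/tail.
val wideSums = df.groupBy("country_code").agg(sumExprs.head, sumExprs.tail: _*)

// Join back once if the non-aggregated columns are needed:
val result = df
  .select("seq_id", "load_date", "company_id", "country_code")
  .join(wideSums, Seq("country_code"))
result.show(false)
```

This scans df once for all three sums instead of once per column, at the cost of a wide rather than long output shape.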