Scala: grouping data in spark-shell and finding per-date averages from JSON files
I have a directory containing 10 separate date folders. Each date folder has a JSON file that looks like this:
[
  {"value": 5,  "count": 16,  "currency": "EUR", "date": "2021-01-10"},
  {"value": 7,  "count": 166, "currency": "USD", "date": "2021-01-10"},
  {"value": 2,  "count": 188, "currency": "USD", "date": "2021-01-10"},
  {"value": 3,  "count": 114, "currency": "GBP", "date": "2021-01-11"},
  {"value": 5,  "count": 80,  "currency": "USD", "date": "2021-01-11"},
  {"value": 10, "count": 41,  "currency": "GBP", "date": "2021-01-12"},
  {"value": 7,  "count": 84,  "currency": "USD", "date": "2021-01-12"},
  {"value": 3,  "count": 147, "currency": "EUR", "date": "2021-01-15"},
  {"value": 2,  "count": 172, "currency": "USD", "date": "2021-01-15"},
  {"value": 10, "count": 118, "currency": "USD", "date": "2021-01-15"}
]
I have read it in with:
// sc is the SparkContext that spark-shell provides
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("/Users/khan/directory/*/*.json")
I want to take the last 3 days of available data, group it by date and currency, and find the average value where the currency is USD.
My idea:
dates_currency = df.select('date', 'currency').distinct().groupBy(desc('date', 'currency')).limit(3)
dates_currency.select('date', 'currency').distinct().where('currency'=='USD').mean()
Is there a problem with my syntax?

**Answer:** You can use
dense_rank
to get the last 3 days:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, desc, mean}
val usd_mean = df.withColumn("rank", dense_rank().over(Window.partitionBy("currency").orderBy(desc("date"))))
.filter("rank <= 3 and currency = 'USD'")
.groupBy("date")
.agg(mean("value"))
usd_mean.show()
+----------+----------+
| date|avg(value)|
+----------+----------+
|2021-01-15| 6.0|
|2021-01-12| 7.0|
|2021-01-11| 5.0|
+----------+----------+
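For reference, the same grouping and averaging logic can be checked with plain Scala collections (no Spark needed). This is only an illustrative sketch of the computation on the sample records above; the `Rec` case class is made up for the example:

```scala
// Sketch (no Spark): keep USD rows, take the 3 most recent dates,
// then average "value" per date. Rec is a hypothetical helper type.
case class Rec(value: Int, count: Int, currency: String, date: String)

val recs = Seq(
  Rec(5, 16, "EUR", "2021-01-10"), Rec(7, 166, "USD", "2021-01-10"),
  Rec(2, 188, "USD", "2021-01-10"), Rec(3, 114, "GBP", "2021-01-11"),
  Rec(5, 80, "USD", "2021-01-11"), Rec(10, 41, "GBP", "2021-01-12"),
  Rec(7, 84, "USD", "2021-01-12"), Rec(3, 147, "EUR", "2021-01-15"),
  Rec(2, 172, "USD", "2021-01-15"), Rec(10, 118, "USD", "2021-01-15")
)

val usd = recs.filter(_.currency == "USD")
// ISO dates sort lexicographically, so a string sort is chronological.
val last3 = usd.map(_.date).distinct.sorted.reverse.take(3)
val avgs: Map[String, Double] = usd
  .filter(r => last3.contains(r.date))
  .groupBy(_.date)
  .map { case (d, rs) => d -> rs.map(_.value).sum.toDouble / rs.size }

avgs.toSeq.sortBy(_._1).foreach(println)
```

This reproduces the averages in the table above (6.0 for 2021-01-15, 7.0 for 2021-01-12, 5.0 for 2021-01-11), which is the same ranking-then-averaging the `dense_rank` window performs over the full dataset.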
**Comment:** I want to get the average for each individual date.

**Comment:** No partition is defined for the window operation! That moves all the data to a single partition, which can cause serious performance degradation.

**Reply:** @… did you see my edited answer?