
Scala: using spark-shell to group data and find the average for different dates from JSON files


I have a directory containing separate folders for 10 days of data. Each date folder contains a JSON file like the one below:

[{"value": 5,"count" : 16,"currency":"EUR","date" : "2021-01-10"},{"value": 7,"count" : 166,"currency":"USD","date" : "2021-01-10"},{"value": 2,"count" : 188,"currency":"USD","date" : "2021-01-10"},{"value": 3,"count" : 114,"currency":"GBP","date" : "2021-01-11"},{"value": 5,"count" : 80,"currency":"USD","date" : "2021-01-11"},{"value": 10,"count" : 41,"currency":"GBP","date" : "2021-01-12"},{"value": 7,"count" : 84,"currency":"USD","date" : "2021-01-12"},{"value": 3,"count" : 147,"currency":"EUR","date" : "2021-01-15"},{"value": 2,"count" : 172,"currency":"USD","date" : "2021-01-15"},{"value": 10,"count" : 118,"currency":"USD","date" : "2021-01-15"}]
I read it in with:

// in spark-shell, `sc` (the SparkContext) is already defined
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("/Users/khan/directory/*/*.json")
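
For reference, on Spark 2.x and later the same read can also go through the SparkSession entry point; this is only a minimal sketch assuming the default spark session that spark-shell provides and the same directory layout:

// Spark 2.x+ spark-shell already exposes a SparkSession named `spark`
val df = spark.read.json("/Users/khan/directory/*/*.json")
df.printSchema()  // fields count, currency, date, value are inferred from the JSON records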
I want to take the data for the last 3 available dates, group it by date and currency, and find the average value where the currency is USD.

My idea:

dates_currency = df.select('date', 'currency').distinct().groupBy(desc('date', 'currency')).limit(3)
dates_currency.select('date', 'currency').distinct().where('currency'=='USD').mean()

Is there something wrong with my syntax?

You can use dense_rank to get the last 3 days:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, desc, mean}

val usd_mean = df.withColumn("rank", dense_rank().over(Window.partitionBy("currency").orderBy(desc("date"))))
                 .filter("rank <= 3 and currency = 'USD'")
                 .groupBy("date")
                 .agg(mean("value"))

usd_mean.show()
+----------+----------+
|      date|avg(value)|
+----------+----------+
|2021-01-15|       6.0|
|2021-01-12|       7.0|
|2021-01-11|       5.0|
+----------+----------+
Comment: I want to get the average for each individual date, and there is no partition defined for the window operation! Moving all the data to a single partition can cause a severe performance degradation.

Reply: did you see my edited answer?
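
For completeness, the last three dates can also be found without a window function, which sidesteps the partitioning concern raised in the comment. This is only a sketch assuming the df read above; lastThreeDates and usd_mean_no_window are illustrative names:

import org.apache.spark.sql.functions.{avg, col, desc}

// collect the three most recent distinct dates that have USD records
val lastThreeDates = df.filter(col("currency") === "USD")
  .select("date").distinct()
  .orderBy(desc("date"))
  .limit(3)
  .collect()
  .map(_.getString(0))

// keep only those dates and average the value per date
val usd_mean_no_window = df
  .filter(col("currency") === "USD" && col("date").isin(lastThreeDates: _*))
  .groupBy("date")
  .agg(avg("value"))

usd_mean_no_window.show()

Collecting three date strings to the driver is cheap, while the dense_rank version keeps everything inside a single query plan; either is reasonable at this data size.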