
Scala: aggregating JSON objects in a DataFrame and converting string timestamps to dates


I'm receiving JSON rows that look like this:

    [{"time":"2017-03-23T12:23:05","user":"randomUser","action":"sleeping"}]
    [{"time":"2017-03-23T12:24:05","user":"randomUser","action":"sleeping"}]
    [{"time":"2017-03-23T12:33:05","user":"randomUser","action":"sleeping"}]
    [{"time":"2017-03-23T15:33:05","user":"randomUser2","action":"eating"}]
    [{"time":"2017-03-23T15:33:06","user":"randomUser2","action":"eating"}]
So I have two problems. First, the time is stored as a String in my DataFrame, and I believe it has to be a date/timestamp before I can aggregate on it.

Second, I need to aggregate the data into 5-minute intervals; for example, everything that happened between 2017-03-23T12:20:00 and 2017-03-23T12:24:59 should be aggregated and counted under the 2017-03-23T12:20:00 timestamp.

The expected output is:

    [{"time":"2017-03-23T12:20:00","user":"randomUser","action":"sleeping","count":2}]
    [{"time":"2017-03-23T12:30:00","user":"randomUser","action":"sleeping","count":1}]
    [{"time":"2017-03-23T15:30:00","user":"randomUser2","action":"eating","count":2}]

Thanks

You can use a cast to convert the StringType column into a TimestampType column; you can then cast the timestamp to IntegerType to make "rounding" down to the last 5-minute boundary easier, and group by that rounded time (along with all the other columns):

// importing SparkSession's implicits and the Spark SQL types used below
import spark.implicits._
import org.apache.spark.sql.types.{IntegerType, TimestampType}

// use casting to convert the String column into a Timestamp:
val withTime = df.withColumn("time", $"time" cast TimestampType)

// round each timestamp down to the most recent 5-minute boundary
// (seconds since the epoch, minus the remainder modulo 300 seconds),
// then group by the rounded time together with all other columns
val result = withTime.withColumn("time", $"time" cast IntegerType)
  .withColumn("time", ($"time" - ($"time" mod (60 * 5))) cast TimestampType)
  .groupBy("time", "user", "action").count()

result.show(truncate = false)
// +---------------------+-----------+--------+-----+
// |time                 |user       |action  |count|
// +---------------------+-----------+--------+-----+
// |2017-03-23 12:20:00.0|randomUser |sleeping|2    |
// |2017-03-23 15:30:00.0|randomUser2|eating  |2    |
// |2017-03-23 12:30:00.0|randomUser |sleeping|1    |
// +---------------------+-----------+--------+-----+
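As a side note (not part of the original answer), the same 5-minute buckets can be produced with Spark's built-in window function (available since Spark 2.0), which avoids the manual cast to IntegerType; it groups rows into tumbling windows and exposes each bucket's start time as a struct field:

import org.apache.spark.sql.functions.window

// group into 5-minute tumbling windows and keep the window start as the bucket timestamp
val byWindow = withTime
  .groupBy(window($"time", "5 minutes"), $"user", $"action")
  .count()
  .select($"window.start".as("time"), $"user", $"action", $"count")

byWindow.show(truncate = false)

If the result has to be emitted as JSON lines, as in the expected output above, result.toJSON (or byWindow.toJSON) returns a Dataset[String] with one JSON object per row.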