
Apache Spark: change a column's value in pyspark before using groupby on it


I have this JSON data, and I want to aggregate on the "timestamp" column by hour while summing the data in columns "a" and "b":

{"a":1, "b":1, "timestamp":"2017-01-26T01:14:55.719214Z"}
{"a":1, "b":1, "timestamp":"2017-01-26T01:14:55.719214Z"}
{"a":1, "b":1, "timestamp":"2017-01-26T02:14:55.719214Z"}
{"a":1, "b":1, "timestamp":"2017-01-26T03:14:55.719214Z"}
This is the final output I want:

{"a":2, "b":2, "timestamp":"2017-01-26T01:00:00"}
{"a":1, "b":1, "timestamp":"2017-01-26T02:00:00"}
{"a":1, "b":1, "timestamp":"2017-01-26T03:00:00"}
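To make the desired grouping concrete, the transformation can be sketched in plain Python, outside Spark (a reference implementation only; the record list mirrors the sample data above):

```python
from collections import defaultdict
from datetime import datetime

records = [
    {"a": 1, "b": 1, "timestamp": "2017-01-26T01:14:55.719214Z"},
    {"a": 1, "b": 1, "timestamp": "2017-01-26T01:14:55.719214Z"},
    {"a": 1, "b": 1, "timestamp": "2017-01-26T02:14:55.719214Z"},
    {"a": 1, "b": 1, "timestamp": "2017-01-26T03:14:55.719214Z"},
]

def hourly_sums(rows):
    """Floor each timestamp to the hour and sum columns a and b per bucket."""
    buckets = defaultdict(lambda: {"a": 0, "b": 0})
    for row in rows:
        ts = datetime.strptime(row["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        # Zero out everything below the hour to build the grouping key
        hour = ts.replace(minute=0, second=0, microsecond=0)
        key = hour.strftime("%Y-%m-%dT%H:%M:%S")
        buckets[key]["a"] += row["a"]
        buckets[key]["b"] += row["b"]
    return dict(buckets)

print(hourly_sums(records))
# {'2017-01-26T01:00:00': {'a': 2, 'b': 2},
#  '2017-01-26T02:00:00': {'a': 1, 'b': 1},
#  '2017-01-26T03:00:00': {'a': 1, 'b': 1}}
```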
This is what I have written so far:

from pyspark.sql import functions as f

df = spark.read.json(inputfile)
df2 = df.groupby("timestamp").agg(f.sum(df["a"]), f.sum(df["b"]))

But how should I change the value of the 'timestamp' column before calling groupby? Thanks in advance.

I think this is the only way:

df2 = df.withColumn("r_timestamp", df["timestamp"].substr(1, 13)).groupby("r_timestamp").agg(f.sum(df["a"]), f.sum(df["b"]))
Is there a better solution to get the timestamp in the required format?
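One thing to watch with the substring approach: cutting the string to 12 characters stops in the middle of the hour field, so different hours would be merged into the same bucket; 13 characters keep the full hour. A quick plain-Python check of the lengths (string slicing here stands in for `Column.substr`, whose start position is 1-based in Spark):

```python
ts = "2017-01-26T01:14:55.719214Z"

# 12 characters stop mid-hour: hours 01..09 would all collapse into one key
print(ts[:12])  # 2017-01-26T0

# 13 characters keep the full hour component
print(ts[:13])  # 2017-01-26T01
```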

from pyspark.sql import functions as f

df = spark.read.load(path='file:///home/zht/PycharmProjects/test/disk_file', format='json')
# Parse the ISO-8601 string into a proper timestamp column
df = df.withColumn('ts', f.to_utc_timestamp(df['timestamp'], 'EST'))
# Bucket rows into fixed one-hour windows
win = f.window(df['ts'], windowDuration='1 hour')
df = df.groupBy(win).agg(f.sum(df['a']).alias('sumA'), f.sum(df['b']).alias('sumB'))
# The window key is a struct; pull out just its start field
res = df.select(df['window']['start'].alias('start_time'), df['sumA'], df['sumB'])
res.show(truncate=False)

# output:
+---------------------+----+----+                                               
|start_time           |sumA|sumB|
+---------------------+----+----+
|2017-01-26 15:00:00.0|1   |1   |
|2017-01-26 16:00:00.0|1   |1   |
|2017-01-26 14:00:00.0|2   |2   |
+---------------------+----+----+

f.window is more flexible.

This might help. It shows how to round a parsed timestamp object.

Thanks for the answer. Actually I need just "2017-01-26 15:00:00.0" in the timestamp column, not "[2017-01-26 15:00:00.0, 2017-01-26 16:00:00.0]". Do you know how I can get that?
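On that last point: the `select` in the answer already extracts only the start of each window via `df['window']['start']`, which drops the `[start, end]` struct. If the exact original string format is needed, the start value can be reformatted afterwards; a plain-Python sketch of that final formatting step, assuming a `datetime` start value like the ones in the output above:

```python
from datetime import datetime

# A window start value, as shown in the answer's output
start = datetime(2017, 1, 26, 15, 0, 0)

# Reformat into the shape requested in the question
print(start.strftime("%Y-%m-%dT%H:%M:%S"))  # 2017-01-26T15:00:00
```

Inside PySpark itself, `f.date_format` with a Java-style pattern such as `"yyyy-MM-dd'T'HH:mm:ss"` applied to the `start_time` column should produce the same string, though the exact pattern syntax depends on the Spark version.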