PySpark: create a summary table with computed values


I have a dataframe that looks like this:

+--------------------+---------------------+-------------+------------+-----+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|total_amount|isDay|
+--------------------+---------------------+-------------+------------+-----+
| 2019-01-01 09:01:00|  2019-01-01 08:53:20|          1.5|        2.00| true|
| 2019-01-01 21:59:59|  2019-01-01 21:18:59|          2.6|        5.00|false|
| 2019-01-01 10:01:00|  2019-01-01 08:53:20|          1.5|        2.00| true|
| 2019-01-01 22:59:59|  2019-01-01 21:18:59|          2.6|        5.00|false|
+--------------------+---------------------+-------------+------------+-----+
I want to create a summary table that computes the trip rate (the total_amount column divided by the trip_distance column) for all night trips and all day trips. So the final result should look like this:

+------------+-----------+
| day_night  | trip_rate |
+------------+-----------+
|Day         | 1.33      |
|Night       | 1.92      |
+------------+-----------+
Here is what I tried:

import pyspark.sql.functions as F

df2 = spark.createDataFrame(
    [
        ('2019-01-01 09:01:00','2019-01-01 08:53:20','1.5','2.00','true'),   # day
        ('2019-01-01 21:59:59','2019-01-01 21:18:59','2.6','5.00','false'),  # night
        ('2019-01-01 10:01:00','2019-01-01 08:53:20','1.5','2.00','true'),   # day
        ('2019-01-01 22:59:59','2019-01-01 21:18:59','2.6','5.00','false'),  # night
    ],
    ['tpep_pickup_datetime','tpep_dropoff_datetime','trip_distance','total_amount','day_night'] # add your column labels here
)

day_trip_rate = df2.where(df2.day_night == 'Day').withColumn("trip_rate", F.sum("total_amount")/F.sum("trip_distance"))
night_trip_rate = df2.where(df2.day_night == 'Night').withColumn("trip_rate", F.sum("total_amount")/F.sum("trip_distance"))
I'm not even sure my approach is right. I get this error:

raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is empty, and 'tpep_pickup_datetime' is not an aggregate function."

Can someone help me figure out how to approach this to get the summary table?

from pyspark.sql import functions as F
from pyspark.sql.functions import *

df2.groupBy("day_night").agg(F.round(F.sum("total_amount")/F.sum("trip_distance"),2).alias('trip_rate'))\
        .withColumn("day_night", F.when(col("day_night")=="true", "Day").otherwise("Night")).show()

+---------+---------+
|day_night|trip_rate|
+---------+---------+
|      Day|     1.33|
|    Night|     1.92|
+---------+---------+
Without the rounding:

df2.groupBy("day_night").agg((F.sum("total_amount")/F.sum("trip_distance")).alias('trip_rate'))\
        .withColumn("day_night", F.when(col("day_night")=="true", "Day").otherwise("Night")).show()

(You have day_night in your df2 construction code, but isDay in the displayed table. I've assumed the field is named day_night.)
Thanks again, Cena. You're my hero; you even caught the mistake in my post. This works perfectly.