PySpark groupBy: get the minimum of one column, but retrieve the value of a different column from the same row
I'm trying to group my data with PySpark. The data comes from cars driving around a track.

I want to group by race id, car, driver and so on, and for each group record the first and last recorded time, which I have done below. I also want the tyre pressure from the first recorded row. I tried the following, but I get this error:

"... due to data type mismatch: WHEN expressions in CaseWhen should all be boolean type"

Any suggestions would be appreciated. Thanks.

Original data:
+---------+-----------+----------+--------+---------------+---------------+
| race_id | car_type | car_make | driver | time_recorded | tyre_pressure |
+---------+-----------+----------+--------+---------------+---------------+
| 1 | manual | ford | juan | 09:32 | 35 |
+---------+-----------+----------+--------+---------------+---------------+
| 1 | manual | ford | juan | 09:45 | 34 |
+---------+-----------+----------+--------+---------------+---------------+
| 1 | manual | ford | juan | 09:53 | 33 |
+---------+-----------+----------+--------+---------------+---------------+
| 1 | automatic | mazda | bob | 09:32 | 31 |
+---------+-----------+----------+--------+---------------+---------------+
| 1 | automatic | mazda | bob | 09:43 | 30 |
+---------+-----------+----------+--------+---------------+---------------+
| 2 | automatic | merc | linda | 10:11 | 33 |
+---------+-----------+----------+--------+---------------+---------------+
| 2 | automatic | merc | linda | 10:18 | 32 |
+---------+-----------+----------+--------+---------------+---------------+
| 2 | automatic | merc | linda | 10:27 | 32 |
+---------+-----------+----------+--------+---------------+---------------+
Target:
+---------+-----------+----------+--------+------------+----------+---------------------+
| race_id | car_type | car_make | driver | start_time | end_time | start_tyre_pressure |
+---------+-----------+----------+--------+------------+----------+---------------------+
| 1 | manual | ford | juan | 09:32 | 09:53 | 35 |
+---------+-----------+----------+--------+------------+----------+---------------------+
| 1 | automatic | mazda | bob | 09:32 | 09:43 | 31 |
+---------+-----------+----------+--------+------------+----------+---------------------+
| 2 | automatic | merc | linda | 10:11 | 10:27 | 33 |
+---------+-----------+----------+--------+------------+----------+---------------------+
Code:
from pyspark.sql import functions as f

EVENTS_GROUPED = EVENTS \
    .groupBy(['race_id', 'car_type', 'car_make', 'driver']) \
    .agg(
        f.min(f.col('time_recorded')).alias('start_time'),
        f.max(f.col('time_recorded')).alias('end_time'),
        # This is the line that raises the CaseWhen data type mismatch error
        f.when(f.min(f.col('time_recorded')), f.col('tyre_pressure')).alias('start_tyre_pressure'),
    )
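Create a window function, then use groupBy. The idea is to create the first tyre pressure column before doing the groupBy, and creating that column requires a window function.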
from pyspark.sql import functions as F
from pyspark.sql import Window

# Order each group's rows by time so that first() picks the earliest reading
w = Window.partitionBy('race_id', 'car_type', 'car_make', 'driver').orderBy('time_recorded')

# start_tyre_pressure is constant within a group, so it can be part of the grouping key
df.withColumn('start_tyre_pressure', F.first('tyre_pressure').over(w))\
    .groupby('race_id', 'car_type', 'car_make', 'driver', 'start_tyre_pressure')\
    .agg(F.min('time_recorded').alias('start_time'),
         F.max('time_recorded').alias('end_time')).show()
Output:
+-------+---------+--------+------+-------------------+----------+--------+
|race_id| car_type|car_make|driver|start_tyre_pressure|start_time|end_time|
+-------+---------+--------+------+-------------------+----------+--------+
| 1|automatic| mazda| bob| 31| 09:32| 09:43|
| 2|automatic| merc| linda| 33| 10:11| 10:27|
| 1| manual| ford| juan| 35| 09:32| 09:53|
+-------+---------+--------+------+-------------------+----------+--------+
Can you share your data?
@pythonic833 Done!
That's great, thank you very much!