PySpark groupBy: get the minimum of one column, but retrieve the value of a different column from the same row

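I am trying to group my data with PySpark. The data comes from cars driving around a track.

I want to group by race id, car type, car make and driver, and for each group I want the first and last recorded time, which I have already done below. I also want the tyre pressure from the first recorded row of each group. I tried the following, but I get this error:

"…due to data type mismatch: WHEN expressions in CaseWhen should all be boolean type"

Any suggestions would be much appreciated.

Thanks

Original data: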

+---------+-----------+----------+--------+---------------+---------------+
| race_id | car_type  | car_make | driver | time_recorded | tyre_pressure |
+---------+-----------+----------+--------+---------------+---------------+
| 1       | manual    | ford     | juan   | 09:32         | 35            |
+---------+-----------+----------+--------+---------------+---------------+
| 1       | manual    | ford     | juan   | 09:45         | 34            |
+---------+-----------+----------+--------+---------------+---------------+
| 1       | manual    | ford     | juan   | 09:53         | 33            |
+---------+-----------+----------+--------+---------------+---------------+
| 1       | automatic | mazda    | bob    | 09:32         | 31            |
+---------+-----------+----------+--------+---------------+---------------+
| 1       | automatic | mazda    | bob    | 09:43         | 30            |
+---------+-----------+----------+--------+---------------+---------------+
| 2       | automatic | merc     | linda  | 10:11         | 33            |
+---------+-----------+----------+--------+---------------+---------------+
| 2       | automatic | merc     | linda  | 10:18         | 32            |
+---------+-----------+----------+--------+---------------+---------------+
| 2       | automatic | merc     | linda  | 10:27         | 32            |
+---------+-----------+----------+--------+---------------+---------------+

Desired output:

+---------+-----------+----------+--------+------------+----------+---------------------+
| race_id | car_type  | car_make | driver | start_time | end_time | start_tyre_pressure |
+---------+-----------+----------+--------+------------+----------+---------------------+
| 1       | manual    | ford     | juan   | 09:32      | 09:53    | 35                  |
+---------+-----------+----------+--------+------------+----------+---------------------+
| 1       | automatic | mazda    | bob    | 09:32      | 09:43    | 31                  |
+---------+-----------+----------+--------+------------+----------+---------------------+
| 2       | automatic | merc     | linda  | 10:11      | 10:27    | 33                  |
+---------+-----------+----------+--------+------------+----------+---------------------+

Code:

from pyspark.sql import functions as f

EVENTS_GROUPED = EVENTS \
    .groupBy(['race_id', 'car_type', 'car_make', 'driver']) \
    .agg(
        f.min(f.col('time_recorded')).alias('start_time'),
        f.max(f.col('time_recorded')).alias('end_time'),
        # This is the line that raises the data type mismatch error
        f.when(f.min(f.col('time_recorded')), f.col('tyre_pressure')).alias('start_tyre_pressure'),
    )

Create a window function and then use groupBy. The idea is to create the first tyre pressure column before doing the groupBy, and to create that column we need the window function.

from pyspark.sql import functions as F
from pyspark.sql import Window

# Order each group by time so that F.first() returns the earliest row's value
w = Window.partitionBy('race_id', 'car_type', 'car_make', 'driver').orderBy('time_recorded')

# df is the events DataFrame (EVENTS in the question)
df.withColumn('start_tyre_pressure', F.first('tyre_pressure').over(w)) \
    .groupBy('race_id', 'car_type', 'car_make', 'driver', 'start_tyre_pressure') \
    .agg(F.min('time_recorded').alias('start_time'),
         F.max('time_recorded').alias('end_time')) \
    .show()

Output:
+-------+---------+--------+------+-------------------+----------+--------+
|race_id| car_type|car_make|driver|start_tyre_pressure|start_time|end_time|
+-------+---------+--------+------+-------------------+----------+--------+
|      1|automatic|   mazda|   bob|                 31|     09:32|   09:43|
|      2|automatic|    merc| linda|                 33|     10:11|   10:27|
|      1|   manual|    ford|  juan|                 35|     09:32|   09:53|
+-------+---------+--------+------+-------------------+----------+--------+


Can you share your data?
@pythonic833 Done!
Fantastic, thank you very much!