PySpark groupBy: get the minimum of one column, but retrieve the value of a different column from the same row

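I am trying to group my data with PySpark. The data comes from cars driving around a track.

I want to group by race id, car, driver, etc., but for each group I want the first and last recorded times, which I have already done below. I also want to take the tyre pressure from the first recorded row. I tried the following, but I get an error:

"... due to data type mismatch: WHEN expressions in CaseWhen should all be boolean type"

Any suggestions would be greatly appreciated.

Thanks

Original data: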

+---------+-----------+----------+--------+---------------+---------------+
| race_id | car_type  | car_make | driver | time_recorded | tyre_pressure |
+---------+-----------+----------+--------+---------------+---------------+
| 1       | manual    | ford     | juan   | 09:32         | 35            |
+---------+-----------+----------+--------+---------------+---------------+
| 1       | manual    | ford     | juan   | 09:45         | 34            |
+---------+-----------+----------+--------+---------------+---------------+
| 1       | manual    | ford     | juan   | 09:53         | 33            |
+---------+-----------+----------+--------+---------------+---------------+
| 1       | automatic | mazda    | bob    | 09:32         | 31            |
+---------+-----------+----------+--------+---------------+---------------+
| 1       | automatic | mazda    | bob    | 09:43         | 30            |
+---------+-----------+----------+--------+---------------+---------------+
| 2       | automatic | merc     | linda  | 10:11         | 33            |
+---------+-----------+----------+--------+---------------+---------------+
| 2       | automatic | merc     | linda  | 10:18         | 32            |
+---------+-----------+----------+--------+---------------+---------------+
| 2       | automatic | merc     | linda  | 10:27         | 32            |
+---------+-----------+----------+--------+---------------+---------------+

Goal:

+---------+-----------+----------+--------+------------+----------+---------------------+
| race_id | car_type  | car_make | driver | start_time | end_time | start_tyre_pressure |
+---------+-----------+----------+--------+------------+----------+---------------------+
| 1       | manual    | ford     | juan   | 09:32      | 09:53    | 35                  |
+---------+-----------+----------+--------+------------+----------+---------------------+
| 1       | automatic | mazda    | bob    | 09:32      | 09:43    | 31                  |
+---------+-----------+----------+--------+------------+----------+---------------------+
| 2       | automatic | merc     | linda  | 10:11      | 10:27    | 33                  |
+---------+-----------+----------+--------+------------+----------+---------------------+

Code:

from pyspark.sql import functions as f

EVENTS_GROUPED = EVENTS \
    .groupBy(['race_id', 'car_type', 'car_make', 'driver']) \
    .agg(
        f.min(f.col('time_recorded')).alias('start_time'),
        f.max(f.col('time_recorded')).alias('end_time'),
        # This line raises the error: f.when() expects a boolean condition as its first argument
        f.when(f.min(f.col('time_recorded')), f.col('tyre_pressure')).alias('start_tyre_pressure'),
    )

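Create a window function, then use groupby. The idea is to create the first-tyre-pressure column (start_tyre_pressure in the code below) before doing the groupby. To create this column, we need a window function.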

from pyspark.sql import functions as F
from pyspark.sql import Window

# One window per race/car/driver group, ordered by recording time
w = Window.partitionBy('race_id', 'car_type', 'car_make', 'driver').orderBy('time_recorded')

# Attach each group's first tyre pressure to every row, then aggregate the times
df.withColumn('start_tyre_pressure', F.first('tyre_pressure').over(w))\
  .groupby('race_id', 'car_type', 'car_make', 'driver', 'start_tyre_pressure')\
  .agg(F.min('time_recorded').alias('start_time'),
       F.max('time_recorded').alias('end_time')).show()
Output:

+-------+---------+--------+------+-------------------+----------+--------+
|race_id| car_type|car_make|driver|start_tyre_pressure|start_time|end_time|
+-------+---------+--------+------+-------------------+----------+--------+
|      1|automatic|   mazda|   bob|                 31|     09:32|   09:43|
|      2|automatic|    merc| linda|                 33|     10:11|   10:27|
|      1|   manual|    ford|  juan|                 35|     09:32|   09:53|
+-------+---------+--------+------+-------------------+----------+--------+
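
To make the logic of "take the minimum of one column, but return another column from that same row" concrete, here is a minimal plain-Python sketch that runs without Spark. It is only an illustration of the idea, not the answer's method: the `rows` list is copied from the question's original data table, and the `summarise` helper is a hypothetical name introduced for this example. It assumes the times are zero-padded HH:MM strings, so that string order matches time order.

```python
from collections import defaultdict

# Rows copied from the question's original data:
# (race_id, car_type, car_make, driver, time_recorded, tyre_pressure)
rows = [
    (1, "manual", "ford", "juan", "09:32", 35),
    (1, "manual", "ford", "juan", "09:45", 34),
    (1, "manual", "ford", "juan", "09:53", 33),
    (1, "automatic", "mazda", "bob", "09:32", 31),
    (1, "automatic", "mazda", "bob", "09:43", 30),
    (2, "automatic", "merc", "linda", "10:11", 33),
    (2, "automatic", "merc", "linda", "10:18", 32),
    (2, "automatic", "merc", "linda", "10:27", 32),
]


def summarise(rows):
    """Hypothetical helper: map each group key to
    (start_time, end_time, start_tyre_pressure)."""
    groups = defaultdict(list)
    for race_id, car_type, car_make, driver, time_recorded, pressure in rows:
        key = (race_id, car_type, car_make, driver)
        groups[key].append((time_recorded, pressure))

    result = {}
    for key, readings in groups.items():
        # Tuples compare element by element, so min() finds the earliest
        # time and carries that row's pressure along with it. If two rows
        # share the earliest time, the lower pressure wins the tie.
        start_time, start_pressure = min(readings)
        end_time = max(readings)[0]
        result[key] = (start_time, end_time, start_pressure)
    return result


summary = summarise(rows)
for key, value in sorted(summary.items()):
    print(key, value)
```

The same pairing trick has a Spark counterpart: taking `F.min` over an `F.struct('time_recorded', 'tyre_pressure')` compares the struct fields left to right, so it may serve as an alternative to the window when only the first row's value is needed.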


Can you share your data?
@pythonic833 Done!
Awesome, thank you very much!