How do I keep the row with the maximum value of one column, along with the other columns, in a PySpark dataframe?


Suppose I have this dataframe in PySpark:

+--------+----------------+---------+---------+
|DeviceID| TimeStamp      |range    | zipcode |
+--------+----------------+---------+---------+
|   00236|11-03-2014 07:33|      4.5| 90041   |
|   00236|11-04-2014 05:43|      7.2| 90024   |
|   00236|11-05-2014 05:43|      8.5| 90026   |
|   00234|11-06-2014 05:55|      5.6| 90037   |
|   00234|11-01-2014 05:55|      9.2| 90032   |
|   00235|11-05-2014 05:33|      4.3| 90082   |
|   00235|11-02-2014 05:33|      4.3| 90029   |
|   00235|11-09-2014 05:33|      4.2| 90047   |
+--------+----------------+---------+---------+
How can I write a PySpark script that keeps, for each DeviceID, the maximum value of the range column together with the corresponding values of the other columns? The output would look like this:

+--------+----------------+---------+---------+
|DeviceID| TimeStamp      |range    | zipcode |
+--------+----------------+---------+---------+
|   00236|11-05-2014 05:43|      8.5| 90026   |
|   00234|11-01-2014 05:55|      9.2| 90032   |
|   00235|11-05-2014 05:33|      4.3| 90082   |
+--------+----------------+---------+---------+

Use a Window with row_number():
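
A minimal sketch of that window approach, assuming the sample data above (DeviceID and zipcode are built as strings so the leading zeros survive):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("00236", "11-03-2014 07:33", 4.5, "90041"),
    ("00236", "11-04-2014 05:43", 7.2, "90024"),
    ("00236", "11-05-2014 05:43", 8.5, "90026"),
    ("00234", "11-06-2014 05:55", 5.6, "90037"),
    ("00234", "11-01-2014 05:55", 9.2, "90032"),
    ("00235", "11-05-2014 05:33", 4.3, "90082"),
    ("00235", "11-02-2014 05:33", 4.3, "90029"),
    ("00235", "11-09-2014 05:33", 4.2, "90047"),
]
df = spark.createDataFrame(data, ["DeviceID", "TimeStamp", "range", "zipcode"])

# Number the rows within each DeviceID by descending range and keep
# only the top-ranked row per device.
w = Window.partitionBy("DeviceID").orderBy(F.col("range").desc())
result = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
result.show()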

Output:

+--------+----------------+-----+-------+
|DeviceID|       TimeStamp|range|zipcode|
+--------+----------------+-----+-------+
|   00236|11-05-2014 05:43|  8.5|  90026|
|   00234|11-01-2014 05:55|  9.2|  90032|
|   00235|11-05-2014 05:33|  4.3|  90082|
+--------+----------------+-----+-------+

Thank you for your reply. The solution you suggested works fine on a small dataset, but as the dataset grows it takes days to run. If you have a faster way to solve this, I'd appreciate you sharing it. Thanks.
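
One commonly used faster alternative, sketched here under the same assumptions as above (it is not spelled out in the original answer), is to replace the window with a single groupBy aggregation: take max over a struct whose first field is range. Structs compare field by field, so max picks the row with the largest range per DeviceID without row-numbering every partition:

# max over a struct compares fields left to right, so putting range
# first makes max() select the row with the largest range per device;
# ties fall through to the TimeStamp comparison.
result = (df.groupBy("DeviceID")
            .agg(F.max(F.struct("range", "TimeStamp", "zipcode")).alias("s"))
            .select("DeviceID", "s.TimeStamp", "s.range", "s.zipcode"))
result.show()

Output: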
+--------+----------------+-----+-------+
|DeviceID|       TimeStamp|range|zipcode|
+--------+----------------+-----+-------+
|   00236|11-05-2014 05:43|  8.5|  90026|
|   00234|11-01-2014 05:55|  9.2|  90032|
|   00235|11-05-2014 05:33|  4.3|  90082|
+--------+----------------+-----+-------+