How to keep the maximum of one column along with the other columns in a PySpark DataFrame?
Suppose I have this DataFrame in PySpark:
+--------+----------------+---------+---------+
|DeviceID| TimeStamp |range | zipcode |
+--------+----------------+---------+---------+
| 00236|11-03-2014 07:33| 4.5| 90041 |
| 00236|11-04-2014 05:43| 7.2| 90024 |
| 00236|11-05-2014 05:43| 8.5| 90026 |
| 00234|11-06-2014 05:55| 5.6| 90037 |
| 00234|11-01-2014 05:55| 9.2| 90032 |
| 00235|11-05-2014 05:33| 4.3| 90082 |
| 00235|11-02-2014 05:33| 4.3| 90029 |
| 00235|11-09-2014 05:33| 4.2| 90047 |
+--------+----------------+---------+---------+
How can I write a PySpark script that keeps, for each DeviceID, the row with the maximum value of the range column along with the values of the other columns? The output would be:
+--------+----------------+---------+---------+
|DeviceID| TimeStamp |range | zipcode |
+--------+----------------+---------+---------+
| 00236|11-05-2014 05:43| 8.5| 90026 |
| 00234|11-01-2014 05:55| 9.2| 90032 |
| 00235|11-05-2014 05:33| 4.3| 90082 |
+--------+----------------+---------+---------+
Use a Window with row_number().
Output:
+--------+----------------+-----+-------+
|DeviceID| TimeStamp|range|zipcode|
+--------+----------------+-----+-------+
| 00236|11-05-2014 05:43| 8.5| 90026|
| 00234|11-01-2014 05:55| 9.2| 90032|
| 00235|11-05-2014 05:33| 4.3| 90082|
+--------+----------------+-----+-------+
Thank you for your reply. The solution you suggested works on small datasets, but when the dataset grows it takes days to run. If you have a faster way to solve this problem, I would appreciate you sharing it. Thanks.