Python: remove features below a threshold but keep the first and last entry of each group in a Spark DataFrame
I have a Spark DataFrame like this:
+----+---------+------------------+
|user|timestamp|          distance|
+----+---------+------------------+
|   A|        1|               0.0|
|   A|        2| 36.35191443247001|
|   A|        3|62.550475311048984|
|   A|        4|16.847739134139704|
|   A|        5|17.952563555225684|
|   A|        6|102.41261599024176|
|   A|        7| 95.82221771177366|
|   A|        8|104.63394547709433|
|   A|        9|26.506336419934364|
|   A|       10|157.00039533864333|
|   A|       11| 20.15671111217189|
|   A|       12|21.870381223509487|
|   A|       13|18.137363209583356|
|   A|       14|129.28661000398125|
|   A|       15|163.74993239641088|
|   A|       16| 267.4166754520851|
|   B|       17|               0.0|
|   B|       18|101.20396648774368|
|   B|       19|24.029134761698852|
|   B|       20| 97.04635170538656|
|   B|       21|13.411774011828113|
|   B|       22|14.631128012534537|
|   B|       23| 75.87504358867835|
|   B|       24|19.864402941978202|
|   B|       25|14.797121212341262|
|   B|       26| 10.53739042907292|
|   B|       27|   73.658902219453|
|   B|       28|252.58741644688948|
+----+---------+------------------+
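I want to write a function in Spark that removes all features whose distance is below a certain value, while keeping the first and last feature of each user group regardless of the distance threshold. What is the best way to achieve this in Spark?

You can assign row numbers that indicate whether a row is the first or last row for that user, then filter on the row numbers and the distance (the example uses a threshold of 50):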
from pyspark.sql import functions as F, Window

# Number each user's rows from the start ('first') and from the end ('last'),
# then keep rows that are first, last, or above the distance threshold.
df2 = df.withColumn(
    'first',
    F.row_number().over(Window.partitionBy('user').orderBy('timestamp'))
).withColumn(
    'last',
    F.row_number().over(Window.partitionBy('user').orderBy(F.desc('timestamp')))
).filter('first = 1 or last = 1 or distance > 50').drop('first', 'last')

df2.show()
+----+---------+------------------+
|user|timestamp|          distance|
+----+---------+------------------+
|   B|       28|252.58741644688948|
|   B|       27|   73.658902219453|
|   B|       23| 75.87504358867835|
|   B|       20| 97.04635170538656|
|   B|       18|101.20396648774368|
|   B|       17|               0.0|
|   A|       16| 267.4166754520851|
|   A|       15|163.74993239641088|
|   A|       14|129.28661000398125|
|   A|       10|157.00039533864333|
|   A|        8|104.63394547709433|
|   A|        7| 95.82221771177366|
|   A|        6|102.41261599024176|
|   A|        3|62.550475311048984|
|   A|        1|               0.0|
+----+---------+------------------+