Python: drop rows below a threshold but keep the first and last entry of each group in a Spark DataFrame


I have a Spark DataFrame like this:

+----+---------+------------------+
|user|timestamp|          distance|
+----+---------+------------------+
|   A|        1|               0.0|
|   A|        2| 36.35191443247001|
|   A|        3|62.550475311048984|
|   A|        4|16.847739134139704|
|   A|        5|17.952563555225684|
|   A|        6|102.41261599024176|
|   A|        7| 95.82221771177366|
|   A|        8|104.63394547709433|
|   A|        9|26.506336419934364|
|   A|       10|157.00039533864333|
|   A|       11| 20.15671111217189|
|   A|       12|21.870381223509487|
|   A|       13|18.137363209583356|
|   A|       14|129.28661000398125|
|   A|       15|163.74993239641088|
|   A|       16| 267.4166754520851|
|   B|       17|               0.0|
|   B|       18|101.20396648774368|
|   B|       19|24.029134761698852|
|   B|       20| 97.04635170538656|
|   B|       21|13.411774011828113|
|   B|       22|14.631128012534537|
|   B|       23| 75.87504358867835|
|   B|       24|19.864402941978202|
|   B|       25|14.797121212341262|
|   B|       26| 10.53739042907292|
|   B|       27|   73.658902219453|
|   B|       28|252.58741644688948|
+----+---------+------------------+

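I want to write a function in Spark that removes all rows whose distance is below a certain value, while always keeping the first and last row of each user group regardless of the distance threshold. What is the best way to do this in Spark?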

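You can assign two row numbers per user, one counting from the earliest timestamp and one from the latest, so that a value of 1 marks the user's first or last row. Then filter on those row numbers and the distance: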

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    # 1 for each user's earliest row
    'first',
    F.row_number().over(Window.partitionBy('user').orderBy('timestamp'))
).withColumn(
    # 1 for each user's latest row
    'last',
    F.row_number().over(Window.partitionBy('user').orderBy(F.desc('timestamp')))
# keep first/last rows unconditionally, plus rows with distance above 50
).filter('first = 1 or last = 1 or distance > 50').drop('first', 'last')

df2.show()
+----+---------+------------------+
|user|timestamp|          distance|
+----+---------+------------------+
|   B|       28|252.58741644688948|
|   B|       27|   73.658902219453|
|   B|       23| 75.87504358867835|
|   B|       20| 97.04635170538656|
|   B|       18|101.20396648774368|
|   B|       17|               0.0|
|   A|       16| 267.4166754520851|
|   A|       15|163.74993239641088|
|   A|       14|129.28661000398125|
|   A|       10|157.00039533864333|
|   A|        8|104.63394547709433|
|   A|        7| 95.82221771177366|
|   A|        6|102.41261599024176|
|   A|        3|62.550475311048984|
|   A|        1|               0.0|
+----+---------+------------------+