Apache Spark: How to drop several rows from a Spark DataFrame by position (rather than by value)?


apache-spark pyspark


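I want to do some data preprocessing with PySpark and remove rows from the beginning and the end of a DataFrame. Say I want to drop the first 30% and the last 30% of the data. So far I have only found ways to filter by value using where, and ways to get the first and the last row, but not several rows at once. Here is a basic example, so far without a solution: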

import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("foo").getOrCreate()
cut_factor_start = 0.3 # factor to cut the beginning of the data
cut_factor_stop = 1-cut_factor_start # factor to cut the end of the data
# create pandas dataframe
df = pd.DataFrame({'part':['foo','foo','foo','foo','foo', 'foo'], 'values':[9,1,2,2,6,9]})
# convert to spark dataframe
df = spark.createDataFrame(df)
df.show()

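Based on the computed cut positions, what I want is to keep only rows 2 through 4 of the 6 rows, i.e. the rows with the values 1, 2, 2.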

In Scala, you can add a unique id column and then use the limit and except functions:

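import org.apache.spark.sql.functions.monotonically_increasing_id

// startPositionToCut and stopPositionToCut are the 1-based positions to keep (2 and 4 in the example)
// the unique id keeps duplicate rows apart, because except() returns distinct rows
val dfWithIds = df.withColumn("uniqueId", monotonically_increasing_id())
dfWithIds
  .limit(stopPositionToCut)
  .except(dfWithIds.limit(startPositionToCut - 1))
  .drop("uniqueId")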

Another approach is to assign row numbers and then filter with between:

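import pyspark.sql.functions as F
from pyspark.sql import Window

# 1-based positions of the first and last row to keep (values from the example)
cut_start, cut_stop = 2, 4

# number the rows; ordering by a constant does not guarantee a particular row order
rnum = F.row_number().over(Window.orderBy(F.lit(0)))
output = (df.withColumn('Rnum', rnum)
        .filter(F.col("Rnum").between(cut_start, cut_stop)).drop('Rnum'))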

You can use the row_number window function to get each row's number, and then use that number in a filter clause to drop the rows you do not want.

Expected result from the question:

length of df: 6
start position to cut: 2
stop position to cut: 4

+----+------+
|part|values|
+----+------+
| foo|     1|
| foo|     2|
| foo|     2|
+----+------+
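The start and stop positions listed above could be derived from the cut factors in the question roughly as follows. This is a minimal sketch: `cut_positions` is a hypothetical helper, and the use of round() is an assumption, chosen only because it reproduces positions 2 and 4 for the 6-row example.

```python
def cut_positions(n_rows, cut_factor_start, cut_factor_stop):
    """Return the 1-based (start, stop) positions of the rows to keep.

    The rounding rule is an assumption; the question does not state one.
    """
    cut_start = round(n_rows * cut_factor_start)
    cut_stop = round(n_rows * cut_factor_stop)
    return cut_start, cut_stop


cut_factor_start = 0.3
cut_factor_stop = 1 - cut_factor_start
cut_start, cut_stop = cut_positions(6, cut_factor_start, cut_factor_stop)
print(cut_start, cut_stop)  # 2 4
```

In a Spark job, n_rows would typically come from df.count(), which triggers a full pass over the data.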

Running output.show() on the PySpark result above prints:

+----+------+
|part|values|
+----+------+
| foo|     1|
| foo|     2|
| foo|     2|
+----+------+