Apache Spark: Remove consecutive duplicates in a PySpark DataFrame

apache-spark pyspark

I have a DataFrame like this:

## +---+---+
## | id|num|
## +---+---+
## |  2|3.0|
## |  3|6.0|
## |  3|2.0|
## |  3|1.0|
## |  2|9.0|
## |  4|7.0|
## +---+---+

I want to remove the consecutive duplicates (rows whose id repeats the id of the row directly above) and get:

## +---+---+
## | id|num|
## +---+---+
## |  2|3.0|
## |  3|6.0|
## |  2|9.0|
## |  4|7.0|
## +---+---+

I found ways to do this in Pandas, but nothing for PySpark.

This answer should do what you want, though there may still be some room for optimization:

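from pyspark.sql.functions import col, lag, monotonically_increasing_id, when
from pyspark.sql.window import Window as W

# Assumes an active SparkSession named `spark`.
test_df = spark.createDataFrame(
    [(2, 3.0), (3, 6.0), (3, 2.0), (3, 1.0), (2, 9.0), (4, 7.0)],
    ("id", "num"),
)

# Create a temporary ID, because the window needs an ordered structure.
test_df = test_df.withColumn("idx", monotonically_increasing_id())
w = W.orderBy("idx")

# Check whether the previous row contains the same id.
get_last = when(lag("id", 1).over(w) == col("id"), False).otherwise(True)

# Select only the rows where the ID has changed.
test_df.withColumn("changed", get_last).filter(col("changed")).select("id", "num").show()

Output: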

## +---+---+
## | id|num|
## +---+---+
## |  2|3.0|
## |  3|6.0|
## |  2|9.0|
## |  4|7.0|
## +---+---+
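
The rule the answer applies (keep a row only when its id differs from the id of the row before it) can be checked without Spark. The following is a minimal plain-Python sketch of that rule, not part of the original answer; the helper name `keep_first_of_each_run` is hypothetical.

```python
def keep_first_of_each_run(rows, key=lambda row: row[0]):
    """Return the rows whose key differs from the key of the preceding row."""
    kept = []
    previous_key = object()  # sentinel that never equals a real key
    for row in rows:
        current_key = key(row)
        if current_key != previous_key:
            kept.append(row)  # first row of a new run of equal keys
        previous_key = current_key
    return kept


# Same (id, num) rows as in the question.
rows = [(2, 3.0), (3, 6.0), (3, 2.0), (3, 1.0), (2, 9.0), (4, 7.0)]
print(keep_first_of_each_run(rows))
```

The first row is always kept because nothing precedes it, which corresponds to `lag` returning null for the first row in the window.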

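By "consecutive", I am guessing you mean relative to a specific ordering, for example by num. Is that correct, or would you also expect ids such as [1,2,1,1,2] to produce [1,2,1,2]?

The order should be given by the ID, so your example is correct. [1,2,1,1,2] should result in [1,2,1,2].

Spark shuffles records when it reads them from a data source, so if I had to assign row numbers, how would you make sure the order is preserved? In other words, which column defines the order?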