PySpark: ranking based on the previous row / current row


Below are my source DataFrame and the expected output DataFrame.

I need to apply the logic below to compute the final rank value: if previous row (hdr) = current row (hdr) and previous row (dtl) = current row (dtl),
then assign the previous row's rank to the current row; otherwise use the previous rank + 1.

I can't get the dense rank to advance correctly. Can you share your thoughts? Given the potential performance overhead, I'm trying to avoid a window without a partition.

from pyspark.sql import functions as func
from pyspark.sql.window import Window

sample = [(100, 1000), (100, 1000), (100, 2000), (200, 1000), (200, 1000), (300, 1000), (300, 2000)]
test = spark.createDataFrame(sample, ['hdr', 'dtl'])
spec = Window.partitionBy('hdr').orderBy('hdr', 'dtl')
test.withColumn('dense', func.dense_rank().over(spec)).show()
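
For reference, a sketch of what the attempt above actually prints (values inferred from the sample data; row order across partitions may vary): partitioning by hdr restarts the dense rank inside each hdr group instead of producing one running rank over the whole dataset.

+---+----+-----+
|hdr| dtl|dense|
+---+----+-----+
|100|1000|    1|
|100|1000|    1|
|100|2000|    2|
|200|1000|    1|
|200|1000|    1|
|300|1000|    1|
|300|2000|    2|
+---+----+-----+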


I don't think ranking without a window is possible here. In your case the rank has to be computed across the entire dataset, so a window function without partitionBy cannot be avoided. However, we can drastically reduce the amount of data shuffled into a single partition with the code below.

from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

sample = [(100, 1000), (100, 1000), (100, 2000), (200, 1000), (200, 1000), (300, 1000), (300, 2000)]
test = spark.createDataFrame(sample, ['hdr', 'dtl'])

# Since we select only the distinct (hdr, dtl) pairs, a huge amount of data is eliminated.
dist_hdr_dtl = test.select("hdr", "dtl").distinct()

# With the data size reduced, an unpartitioned window spec becomes affordable.
spec = Window.orderBy('hdr', 'dtl')
dist_hdr_dtl = dist_hdr_dtl.withColumn('final_rank', dense_rank().over(spec))

# Join it back with the original data to pick up the ranks.
# Note: if the distinct dataset is not very large, a broadcast join will improve performance.
test.join(dist_hdr_dtl, ["hdr", "dtl"], "inner").orderBy('hdr', 'dtl').show()

+---+----+----------+
|hdr| dtl|final_rank|
+---+----+----------+
|100|1000|         1|
|100|1000|         1|
|100|2000|         2|
|200|1000|         3|
|200|1000|         3|
|300|1000|         4|
|300|2000|         5|
+---+----+----------+
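
Following up on the note above, a minimal sketch of the broadcast-join variant, assuming the distinct (hdr, dtl) set fits in executor memory (broadcast comes from pyspark.sql.functions):

from pyspark.sql.functions import broadcast

# Broadcasting the small distinct-keys DataFrame lets the join avoid
# shuffling the large `test` DataFrame.
test.join(broadcast(dist_hdr_dtl), ["hdr", "dtl"], "inner").orderBy('hdr', 'dtl').show()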


Thanks, Girish. Since I have a huge dataset with a billion rows, I'm not sure I have any option to avoid moving the dataset into a single partition.

I think that's unavoidable... but as I mentioned, you can reduce the amount of data being moved.
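
As a rough illustration of that reduction (a sketch reusing the names from the answer; the billion-row figure comes from the comment above), only the distinct (hdr, dtl) pairs ever pass through the single-partition window:

# Compare how many rows the unpartitioned window actually has to sort.
total_rows = test.count()              # ~1 billion rows in the real dataset
distinct_pairs = dist_hdr_dtl.count()  # only the unique (hdr, dtl) combinations
print(f"{distinct_pairs} rows go through the unpartitioned window instead of {total_rows}")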