
How to sort a large dataset with Spark SQL

Tags: apache-spark, apache-spark-sql


I have a partitioned table in Hive with about 2 billion rows, like:

id, num, num_partition
1, 1253742321.53124121, 12
4, 1253742323.53124121, 12
2, 1353742324.53124121, 13
3, 1253742325.53124121, 12
I want a table like this:

id, rank, rank_partition
89, 1, 0
...
1, 1253742321, 12
7, 1253742322, 12
4, 1253742323, 12
8, 1253742324, 12
3, 1253742325, 12
...
2, 1353742324, 13
...
I tried this:

df = spark.sql("select *, rank/10000000 from (select id, row_number() over(order by num asc) rank from table) t1")
It is very slow, because the global ORDER BY is executed by a single reducer.
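
One common way around the single-reducer bottleneck is to rank inside each num_partition in parallel and then promote the local ranks to global ranks by adding per-partition row-count offsets. Below is a minimal sketch of that idea; it assumes a SparkSession named spark, the Hive table and column names from the question, and that every num in a lower num_partition is smaller than every num in a higher one (which the sample data suggests but the question does not guarantee):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("table")

# Phase 1: rank rows inside each num_partition; these windows run in parallel.
w = Window.partitionBy("num_partition").orderBy("num")
local = df.withColumn("local_rank", F.row_number().over(w))

# Phase 2: count rows per num_partition and turn the counts into a running
# offset (total rows in all preceding partitions). counts has only one row
# per num_partition, so the unpartitioned window here is cheap.
counts = (local.groupBy("num_partition").count()
          .withColumn("offset", F.sum("count").over(
              Window.orderBy("num_partition")
                    .rowsBetween(Window.unboundedPreceding, -1)))
          .fillna(0, subset=["offset"]))

# Global rank = local rank + offset of the row's partition; rank_partition
# mirrors the rank/10000000 expression from the query above.
ranked = (local.join(counts.select("num_partition", "offset"), "num_partition")
               .withColumn("rank", F.col("local_rank") + F.col("offset"))
               .withColumn("rank_partition", (F.col("rank") / 10000000).cast("long")))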

Then I tried this:

df = spark.sql("select *, rank/10000000 from (select id, row_number() over(distribute by num_partition order by num_partition asc, num asc) rank from table) t1")
But then the result is not sorted by num_partition.
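
A window written as over(distribute by num_partition ...) restarts row_number() within every num_partition group and, more importantly, makes no promise about the order in which the final rows come back, which is why num_partition looks unsorted. If the goal is output that is physically sorted without funnelling everything through one task, one option is to range-partition on the sort keys and then sort each partition locally; a minimal sketch, reusing the spark session and table name above (the partition count 200 and the output table name table_sorted are made up):

# Range-partition by the sort keys, then sort each slice locally. The slices
# are themselves ordered by the range partitioner, so reading them in
# partition order yields a globally sorted result with no single-reducer sort.
out = (spark.table("table")
            .repartitionByRange(200, "num_partition", "num")
            .sortWithinPartitions("num_partition", "num"))
out.write.mode("overwrite").saveAsTable("table_sorted")  # hypothetical name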