
How to sort a large dataset with Spark SQL

Tags: apache-spark, apache-spark-sql


I have a partitioned table in Hive with about 2 billion rows, like:

id, num, num_partition
1, 1253742321.53124121, 12
4, 1253742323.53124121, 12
2, 1353742324.53124121, 13
3, 1253742325.53124121, 12
I want a table like this:

id, rank, rank_partition
89, 1, 0
...
1, 1253742321, 12
7, 1253742322, 12
4, 1253742323, 12
8, 1253742324, 12
3, 1253742325, 12
...
2, 1353742324, 13
...
I tried this:

df = spark.sql("select *, rank/10000000 from (select id, row_number() over(order by num asc) rank from table) t1")
It is very slow, because the global ORDER BY is executed by a single reducer.
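
One common way around the single-reducer bottleneck is to rank inside each num_partition in parallel and then promote the local ranks to global ranks by adding per-partition row-count offsets. Below is a minimal sketch of that idea; it assumes a SparkSession named spark, the Hive table and column names from the question, and that every num in a lower num_partition is smaller than every num in a higher one (which the sample data suggests but the question does not guarantee):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("table")

# Phase 1: rank rows inside each num_partition; these windows run in parallel.
w = Window.partitionBy("num_partition").orderBy("num")
local = df.withColumn("local_rank", F.row_number().over(w))

# Phase 2: count rows per num_partition and turn the counts into a running
# offset (total rows in all preceding partitions). counts has only one row
# per num_partition, so the unpartitioned window here is cheap.
counts = (local.groupBy("num_partition").count()
          .withColumn("offset", F.sum("count").over(
              Window.orderBy("num_partition")
                    .rowsBetween(Window.unboundedPreceding, -1)))
          .fillna(0, subset=["offset"]))

# Global rank = local rank + offset of the row's partition; rank_partition
# mirrors the rank/10000000 expression from the query above.
ranked = (local.join(counts.select("num_partition", "offset"), "num_partition")
               .withColumn("rank", F.col("local_rank") + F.col("offset"))
               .withColumn("rank_partition", (F.col("rank") / 10000000).cast("long")))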

Then I tried this:

df = spark.sql("select *, rank/10000000 from (select id, row_number() over(distribute by num_partition order by num_partition asc, num asc) rank from table) t1")
But then the result is not sorted by num_partition.
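
A window written as over(distribute by num_partition ...) restarts row_number() within every num_partition group and, more importantly, makes no promise about the order in which the final rows come back, which is why num_partition looks unsorted. If the goal is output that is physically sorted without funnelling everything through one task, one option is to range-partition on the sort keys and then sort each partition locally; a minimal sketch, reusing the spark session and table name above (the partition count 200 and the output table name table_sorted are made up):

# Range-partition by the sort keys, then sort each slice locally. The slices
# are themselves ordered by the range partitioner, so reading them in
# partition order yields a globally sorted result with no single-reducer sort.
out = (spark.table("table")
            .repartitionByRange(200, "num_partition", "num")
            .sortWithinPartitions("num_partition", "num"))
out.write.mode("overwrite").saveAsTable("table_sorted")  # hypothetical name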