
PySpark: Rank() over a column and an index?


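I'm having some trouble with a window function. I couldn't find any example that covers the case where the order matters. What I want to do is rank over ColumnA, taking SortOrder (and the first occurrence of each value) into account. So all B rows would get the value 1, A would get 2, and C would get 3. Can I achieve this with the rank function? I cannot simply order by these two columns.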

example = example.withColumn("rank", F.rank().over(Window.orderBy('ColumnA')))
This doesn't work either, because ordering by ColumnA alone loses the SortOrder sequence.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pyspark.sql.functions as F
from pyspark.sql.window import Window

data = [("B", "BA", 1),
        ("B", "BB", 2),
        ("B", "BC", 3),
        ("A", "AA", 4),
        ("A", "AB", 5),
        ("C", "CA", 6),
        ("A", "AC", 7)]

cols = ['ColumnA', 'ColumnB', 'SortOrder']

schema = StructType([StructField('ColumnA', StringType(), True),
                     StructField('ColumnB', StringType(), True),
                     StructField('SortOrder', IntegerType(), True)])

# sc and spark are the SparkContext and SparkSession that the pyspark shell provides
rdd = sc.parallelize(data)
example = spark.createDataFrame(rdd, schema)

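Ordering by both columns does not produce the desired ranking, because every SortOrder value is unique and each row ends up with its own rank: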
example = example.withColumn("rank", F.rank().over(Window.orderBy('SortOrder', 'ColumnA')))
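To see why neither ordering gives B=1, A=2, C=3, here is a minimal plain-Python sketch that emulates SQL-style rank() on the sample rows. The helper `sql_rank` is hypothetical and written only for this illustration; it is not part of PySpark.

```python
# Plain-Python emulation of SQL RANK(): tied keys share a rank and leave gaps.
rows = [("B", "BA", 1), ("B", "BB", 2), ("B", "BC", 3),
        ("A", "AA", 4), ("A", "AB", 5), ("C", "CA", 6), ("A", "AC", 7)]

def sql_rank(rows, key):
    """Return one rank per row (in input order), like rank() over an ORDER BY key."""
    keys = sorted(key(r) for r in rows)
    # A key's rank is the 1-based position of its first occurrence
    # in the sorted list of keys.
    first_pos = {}
    for i, k in enumerate(keys, start=1):
        first_pos.setdefault(k, i)
    return [first_pos[key(r)] for r in rows]

by_column_a = sql_rank(rows, key=lambda r: r[0])       # order by ColumnA only
by_both = sql_rank(rows, key=lambda r: (r[2], r[0]))   # order by SortOrder, ColumnA

print(by_column_a)  # [4, 4, 4, 1, 1, 7, 1]
print(by_both)      # [1, 2, 3, 4, 5, 6, 7]
```

Ordering by ColumnA ranks alphabetically (A before B), and ordering by both columns ranks every row separately, so neither matches the wanted result.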

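Get the minimum SortOrder for each ColumnA value, compute the rank over that minimum, and join the result back to the original DataFrame.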

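# Rank each ColumnA value by its smallest SortOrder, then join the ranks back.
example2 = example.join(
    example.groupBy('ColumnA')
           .min('SortOrder')  # result column is named 'min(SortOrder)'
           .select('ColumnA',
                   F.rank().over(Window.orderBy('min(SortOrder)')).alias('rank')
                  ),
    on='ColumnA'
).orderBy('SortOrder')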

example2.show()
+-------+-------+---------+----+
|ColumnA|ColumnB|SortOrder|rank|
+-------+-------+---------+----+
|      B|     BA|        1|   1|
|      B|     BB|        2|   1|
|      B|     BC|        3|   1|
|      A|     AA|        4|   2|
|      A|     AB|        5|   2|
|      C|     CA|        6|   3|
|      A|     AC|        7|   2|
+-------+-------+---------+----+
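
The three steps of this answer (minimum per group, rank the groups, join back) can also be traced without a Spark session. The following is a minimal plain-Python sketch on the same sample rows. The names `first_seen`, `group_rank`, and `ranked` are made up for this illustration, and it assumes SortOrder values are unique, so no two groups share a minimum.

```python
rows = [("B", "BA", 1), ("B", "BB", 2), ("B", "BC", 3),
        ("A", "AA", 4), ("A", "AB", 5), ("C", "CA", 6), ("A", "AC", 7)]

# Step 1: smallest SortOrder per ColumnA value (the groupBy + min step).
first_seen = {}
for col_a, _, sort_order in rows:
    first_seen[col_a] = min(sort_order, first_seen.get(col_a, sort_order))

# Step 2: rank the groups by that minimum (the rank-over-window step).
# Assumes distinct minima, so a plain 1-based enumeration matches rank().
ordered = sorted(first_seen, key=first_seen.get)
group_rank = {col_a: i for i, col_a in enumerate(ordered, start=1)}

# Step 3: attach each group's rank to its rows (the join step).
ranked = [(a, b, s, group_rank[a]) for a, b, s in rows]

for row in ranked:
    print(row)
```

The last element of each printed tuple corresponds to the rank column in the table above.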