How to create and sort by an ordered categorical variable in pyspark


I am migrating some code from pandas to pyspark. My source dataframe looks like this:

   a         b  c
0  1    insert  1
1  2    update  1
2  3      seed  1
3  4    insert  2
4  5    update  2
5  6    delete  2
6  7  snapshot  1

The operation I am applying (in python/pandas) is:

df.b = pd.Categorical(df.b, ordered=True, categories=['insert', 'seed', 'update', 'snapshot', 'delete'])    
df.sort_values(['c', 'b'])

This produces the output dataframe:

   a         b  c
0  1    insert  1
2  3      seed  1
1  2    update  1
6  7  snapshot  1
3  4    insert  2
4  5    update  2
5  6    delete  2
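To make the target behaviour concrete, here is a minimal pure-Python sketch of what the ordered categorical sort amounts to: each value of b is ranked by its position in the category list, and rows are sorted by c and then by that rank. The names rank, rows and ordered are illustrative only and do not come from the original code.

```python
categories = ['insert', 'seed', 'update', 'snapshot', 'delete']

# Position in the list is the sort rank of each category (illustrative name).
rank = {name: i for i, name in enumerate(categories)}

# The source dataframe as plain (a, b, c) tuples.
rows = [
    (1, 'insert', 1), (2, 'update', 1), (3, 'seed', 1), (4, 'insert', 2),
    (5, 'update', 2), (6, 'delete', 2), (7, 'snapshot', 1),
]

# Sort by c first, then by the rank of b rather than by its spelling.
ordered = sorted(rows, key=lambda r: (r[2], rank[r[1]]))

for row in ordered:
    print(row)
```

The pyspark answers below follow the same idea: derive a numeric rank from b and sort on it.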

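I'm not sure how best to set up an ordered categorical in pyspark. My initial approach was to create a new column with a case expression and then try to use that column afterwards: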

from pyspark.sql.functions import col, when

# helper column holding the sort rank of each category
df = df.withColumn(
    "_precedence",
    when(col("b") == "insert", 1)
    .when(col("b") == "seed", 2)
    .when(col("b") == "update", 3)
    .when(col("b") == "snapshot", 4)
    .when(col("b") == "delete", 5)
)

You can use a map:

from pyspark.sql.functions import create_map, lit, col

categories=['insert', 'seed', 'update', 'snapshot', 'delete']

# per @HaleemurAli, adjusted the below list comprehension to create map
# keys are wrapped in lit() because create_map treats bare strings as column names
map1 = create_map([val for (i, c) in enumerate(categories) for val in (lit(c), lit(i))])
#Column<b'map(insert, 0, seed, 1, update, 2, snapshot, 3, delete, 4)'>

df.orderBy('c', map1[col('b')]).show()
+---+---+--------+---+
| id|  a|       b|  c|
+---+---+--------+---+
|  0|  1|  insert|  1|
|  2|  3|    seed|  1|
|  1|  2|  update|  1|
|  6|  7|snapshot|  1|
|  3|  4|  insert|  2|
|  4|  5|  update|  2|
|  5|  6|  delete|  2|
+---+---+--------+---+
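One thing to watch with the map approach: a value of b that is not in the category list has no key in the map. Assuming the lookup then yields null (my understanding of current Spark behaviour, though some versions with ANSI mode enabled may raise instead), those rows sort first in ascending order, because Spark puts nulls first by default. The sketch below wraps the idea in a hypothetical helper, order_by_category, and uses asc_nulls_last() to push unknown values to the end. The helper names are mine, not part of pyspark.

```python
def category_ranks(categories):
    """Flatten ['insert', 'seed'] into ['insert', 0, 'seed', 1] (key, rank, ...)."""
    return [item for rank, name in enumerate(categories) for item in (name, rank)]


def order_by_category(df, group_col, cat_col, categories):
    """Sort df by group_col, then by cat_col in the given category order.

    Values of cat_col missing from `categories` are assumed to look up to
    null in the map; asc_nulls_last() then places them after known values.
    """
    # Imported here so category_ranks stays usable without a Spark install.
    from pyspark.sql import functions as F

    # Every key and value is wrapped in lit() so none is read as a column name.
    rank_map = F.create_map(*[F.lit(x) for x in category_ranks(categories)])
    return df.orderBy(group_col, rank_map[F.col(cat_col)].asc_nulls_last())
```

With the dataframe from the question, order_by_category(df, 'c', 'b', categories).show() should give the same ordering as the orderBy call above.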


To reverse the order on column b:

df.orderBy('c', map1[col('b')].desc()).show()

Alternatively, you can do this by coalescing a list of when expressions:

from pyspark.sql import functions as F

categories=['insert', 'seed', 'update', 'snapshot', 'delete']

# one when() per category, ranked 1..n in list order
cols = [F.when(F.col("b") == c, F.lit(i)) for i, c in enumerate(categories, start=1)]

df.orderBy("c",F.coalesce(*cols)).show()

#+---+--------+---+
#|  a|       b|  c|
#+---+--------+---+
#|  1|  insert|  1|
#|  3|    seed|  1|
#|  2|  update|  1|
#|  7|snapshot|  1|
#|  4|  insert|  2|
#|  5|  update|  2|
#|  6|  delete|  2|
#+---+--------+---+

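Thanks. I was hoping to create an ordered categorical natively, the way pandas does, but this is probably the closest we can get today.

sum is slow for non-numeric lists and is not considered good practice, so I replaced it with [val for (i, c) in enumerate(categories) for val in (c, lit(i))]. Could you update your answer for future readers?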
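For readers wondering what the last comment is about: the earlier version of the answer is not shown here, so the sum() line below is only an assumed reconstruction of its shape. The sketch uses plain integers in place of lit(i) so that it runs without Spark, and shows that the nested comprehension produces the same flattened key/value list in a single pass.

```python
from itertools import chain

categories = ['insert', 'seed', 'update', 'snapshot', 'delete']

# Assumed shape of the earlier approach: repeated list concatenation via sum().
# Each step copies the accumulated list, so the cost grows quadratically.
flat_sum = sum([[c, i] for i, c in enumerate(categories)], [])

# The nested comprehension from the comment builds the list in one pass.
flat_comp = [val for (i, c) in enumerate(categories) for val in (c, i)]

# itertools.chain.from_iterable is another single-pass option.
flat_chain = list(chain.from_iterable((c, i) for i, c in enumerate(categories)))

print(flat_comp)
```

For five categories the difference is negligible; the comprehension is mainly the more conventional way to flatten pairs.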