Sorting numeric strings in a Scala Spark Dataset


Suppose I have the following Dataset:

+-----------+----------+
|productCode|    amount|
+-----------+----------+
|      XX-13|       300|
|       XX-1|       250|
|       XX-2|       410|
|       XX-9|        50|
|      XX-10|        35|
|     XX-100|       870|
+-----------+----------+
where productCode is of type String and amount is of type Int.

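If I try to sort it by productCode, the rows come out in lexicographic order (XX-1, XX-10, XX-100, XX-13, XX-2, XX-9), which is expected given how String comparison works.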
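The difference between the two orderings can be reproduced without Spark. Below is a minimal sketch in plain Scala; the names `ProductCodeOrdering` and `numericSuffix` are hypothetical and introduced only for illustration, and the helper assumes every code ends in `-<digits>`.

```scala
object ProductCodeOrdering {
  val codes: Seq[String] = Seq("XX-13", "XX-1", "XX-2", "XX-9", "XX-10", "XX-100")

  // Default String ordering compares character by character,
  // so "XX-10" sorts before "XX-2" because '1' < '2'.
  val lexicographic: Seq[String] = codes.sorted

  // Hypothetical helper: parse the text after the last '-' as an Int.
  // Assumes the suffix is always a valid integer; it throws otherwise.
  def numericSuffix(code: String): Int =
    code.substring(code.lastIndexOf('-') + 1).toInt

  // Sorting by the parsed number gives the desired order.
  val numeric: Seq[String] = codes.sortBy(numericSuffix)

  def main(args: Array[String]): Unit = {
    println(lexicographic.mkString(", ")) // XX-1, XX-10, XX-100, XX-13, XX-2, XX-9
    println(numeric.mkString(", "))       // XX-1, XX-2, XX-9, XX-10, XX-13, XX-100
  }
}
```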

Using the Dataset API, how can I get output sorted by the Integer part of productCode?

+-----------+----------+
|productCode|    amount|
+-----------+----------+
|       XX-1|       250|
|       XX-2|       410|
|       XX-9|        50|
|      XX-10|        35|
|      XX-13|       300|
|     XX-100|       870|
+-----------+----------+

Use an expression in orderBy. Take a look at this:

scala> val df = Seq(("XX-13",300),("XX-1",250),("XX-2",410),("XX-9",50),("XX-10",35),("XX-100",870)).toDF("productCode", "amt")
df: org.apache.spark.sql.DataFrame = [productCode: string, amt: int]

scala> df.orderBy(split('productCode,"-")(1).cast("int")).show
+-----------+---+
|productCode|amt|
+-----------+---+
|       XX-1|250|
|       XX-2|410|
|       XX-9| 50|
|      XX-10| 35|
|      XX-13|300|
|     XX-100|870|
+-----------+---+
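The same idea can also be written as a standalone program against a typed Dataset, which may be closer to what the question asks for. This is only a sketch under assumptions: the case class `Product`, the object `SortByNumericSuffix`, and the method `sortByCode` are hypothetical names, `regexp_extract` is swapped in for `split` as one possible way to pull out the digits, and every code is assumed to match `<prefix>-<digits>`. Sorting by the prefix first is an addition that only matters if more than one prefix occurs.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_extract}

// Hypothetical row type mirroring the columns in the question.
case class Product(productCode: String, amount: Int)

object SortByNumericSuffix {
  // Returns the products ordered by prefix, then by the numeric suffix.
  def sortByCode(spark: SparkSession): Dataset[Product] = {
    import spark.implicits._

    val ds = Seq(
      Product("XX-13", 300), Product("XX-1", 250), Product("XX-2", 410),
      Product("XX-9", 50), Product("XX-10", 35), Product("XX-100", 870)
    ).toDS()

    // Text before the last '-' and the trailing digits, taken from the same column.
    val prefix = regexp_extract(col("productCode"), "^(.*)-\\d+$", 1)
    val number = regexp_extract(col("productCode"), "(\\d+)$", 1).cast("int")

    // orderBy on a Dataset[Product] keeps the element type.
    ds.orderBy(prefix, number)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: a local run, for illustration only
      .appName("sort-by-numeric-suffix")
      .getOrCreate()

    sortByCode(spark).show()
    spark.stop()
  }
}
```

Because the sort key is an expression rather than a new column, the result keeps the original schema and no helper column has to be dropped afterwards.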


Using a window function, you could do it like this:

scala> import org.apache.spark.sql.expressions.Window

scala> df.withColumn("row1",row_number().over(Window.orderBy(split('productCode,"-")(1).cast("int")))).show(false)
18/12/10 09:25:07 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----------+---+----+
|productCode|amt|row1|
+-----------+---+----+
|XX-1       |250|1   |
|XX-2       |410|2   |
|XX-9       |50 |3   |
|XX-10      |35 |4   |
|XX-13      |300|5   |
|XX-100     |870|6   |
+-----------+---+----+



Note that Spark warns about moving all the data to a single partition, because the window is defined without a partitionBy.
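One way to avoid that warning is to give the window a partition key. The sketch below partitions by the prefix of productCode, which is an assumption about the data and not part of the original answer; the names `PartitionedRowNumber` and `numberRows` are hypothetical. Note that this changes the meaning of row1: numbering restarts for each prefix, so it only matches the global numbering above when a single prefix (here XX) is present.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, split}

object PartitionedRowNumber {
  // Adds a per-prefix row number ordered by the numeric suffix.
  def numberRows(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df = Seq(("XX-13", 300), ("XX-1", 250), ("XX-2", 410),
                 ("XX-9", 50), ("XX-10", 35), ("XX-100", 870))
      .toDF("productCode", "amt")

    // parts(0) is the prefix, parts(1) the numeric suffix as a string.
    val parts = split(col("productCode"), "-")

    // Partitioning by prefix means Spark no longer has to pull
    // every row into one partition just to evaluate the window.
    val byPrefix = Window
      .partitionBy(parts.getItem(0))
      .orderBy(parts.getItem(1).cast("int"))

    df.withColumn("row1", row_number().over(byPrefix))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: a local run, for illustration only
      .appName("partitioned-row-number")
      .getOrCreate()

    numberRows(spark).show(false)
    spark.stop()
  }
}
```

If only the sorted output is needed and not the row number itself, the plain orderBy expression avoids the window entirely.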

Nice and simple