Scala Spark数据集中数字字符串的排序
假设我有以下Scala Spark数据集中数字字符串的排序,scala,apache-spark,apache-spark-dataset,Scala,Apache Spark,Apache Spark Dataset,假设我有以下数据集: +-----------+----------+ |productCode| amount| +-----------+----------+ | XX-13| 300| | XX-1| 250| | XX-2| 410| | XX-9| 50| | XX-10| 35| | XX-100| 870| +-----------+-
数据集
:
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-13| 300|
| XX-1| 250|
| XX-2| 410|
| XX-9| 50|
| XX-10| 35|
| XX-100| 870|
+-----------+----------+
其中,productCode
为String
类型,amount
为Int
如果试图通过productCode
对其排序,结果将是(由于String
比较的性质,这是意料之中的):
考虑到Dataset
API,如何获得按productCode
的Integer
部分排序的输出
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-1| 250|
| XX-2| 410|
| XX-9| 50|
| XX-10| 35|
| XX-13| 300|
| XX-100| 870|
+-----------+----------+
使用orderBy中的表达式。看看这个:
scala> val df = Seq(("XX-13",300),("XX-1",250),("XX-2",410),("XX-9",50),("XX-10",35),("XX-100",870)).toDF("productCode", "amt")
df: org.apache.spark.sql.DataFrame = [productCode: string, amt: int]
scala> df.orderBy(split('productCode,"-")(1).cast("int")).show
+-----------+---+
|productCode|amt|
+-----------+---+
| XX-1|250|
| XX-2|410|
| XX-9| 50|
| XX-10| 35|
| XX-13|300|
| XX-100|870|
+-----------+---+
scala>
使用窗口功能,您可以像
scala> df.withColumn("row1",row_number().over(Window.orderBy(split('productCode,"-")(1).cast("int")))).show(false)
18/12/10 09:25:07 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----------+---+----+
|productCode|amt|row1|
+-----------+---+----+
|XX-1 |250|1 |
|XX-2 |410|2 |
|XX-9 |50 |3 |
|XX-10 |35 |4 |
|XX-13 |300|5 |
|XX-100 |870|6 |
+-----------+---+----+
scala>
请注意,spark抱怨将所有数据移动到单个分区。Nice and simple
scala> df.withColumn("row1",row_number().over(Window.orderBy(split('productCode,"-")(1).cast("int")))).show(false)
18/12/10 09:25:07 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----------+---+----+
|productCode|amt|row1|
+-----------+---+----+
|XX-1 |250|1 |
|XX-2 |410|2 |
|XX-9 |50 |3 |
|XX-10 |35 |4 |
|XX-13 |300|5 |
|XX-100 |870|6 |
+-----------+---+----+
scala>