Scala Spark: UDF to get dirname from a path
I have a large number of path columns that I need to split into two columns, basename and dirname. I know how to easily get the basename of a path using:
val df = Seq("/test/coucou/jambon/hello/file"
,"/test/jambon/test")
.toDF("column1")
df.withColumn("basename", substring_index($"column1" , "/", -1))
.show(2, false)
+------------------------------+---------+
|column1 |basename |
+------------------------------+---------+
|/test/coucou/jambon/hello/file|file |
|/test/jambon/test |test |
+------------------------------+---------+
However, I am struggling to get the dirname, like this:
+------------------------------+--------------------------+
|column1 |dirname |
+------------------------------+--------------------------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello |
|/test/jambon/test |/test/jambon |
+------------------------------+--------------------------+
I have tried various solutions but could not find a functional column-based one. My best idea was to subtract $"basename" from $"column1", but I could not find a way to subtract strings in Spark.

You can use expr to take a substring of column1. The code should look like the following. I hope this helps:
//Creating Test Data
val df = Seq("/test/coucou/jambon/hello/file"
,"/test/jambon/prout/test")
.toDF("column1")
val test = df.withColumn("basename", substring_index($"column1" , "/", -1))
.withColumn("path", expr("substring(column1, 1, length(column1)-length(basename)-1)"))
test.show(false)
+------------------------------+--------+-------------------------+
|column1 |basename|path |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file |/test/coucou/jambon/hello|
|/test/jambon/prout/test |test |/test/jambon/prout |
+------------------------------+--------+-------------------------+
Another approach is to use a UDF:
import org.apache.spark.sql.functions.udf
val pathUDF = udf((s: String) => s.substring(0, s.lastIndexOf("/")))
val test = df.withColumn("basename", substring_index($"column1" , "/", -1))
.withColumn("path", pathUDF($"column1"))
test.show(false)
+------------------------------+--------+-------------------------+
|column1 |basename|path |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file |/test/coucou/jambon/hello|
|/test/jambon/prout/test |test |/test/jambon/prout |
+------------------------------+--------+-------------------------+
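One caveat worth flagging about the UDF above: `s.substring(0, s.lastIndexOf("/"))` throws for a path with no slash at all (lastIndexOf returns -1) and yields an empty string for a root-level file. A hedged sketch of a guarded body, written here as a plain Scala function (the name is mine) that could be wrapped with udf in the same way:

```scala
// Sketch of a guarded dirname, assuming Unix-style paths. The plain
// s.substring(0, s.lastIndexOf("/")) version throws on "file.txt"
// (lastIndexOf returns -1) and yields "" for "/file.txt".
def safeDirname(s: String): String = s.lastIndexOf("/") match {
  case -1 => ""   // no slash: no directory component
  case 0  => "/"  // file directly under the root, e.g. "/file.txt"
  case i  => s.substring(0, i)
}
```

It would be registered the same way, e.g. `val pathUDF = udf(safeDirname _)`.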
An alternative to the solutions already provided is to use a regular expression. Used correctly, regexp_extract will give you exactly what you need:
val df = Seq("/test/coucou/jambon/hello/file"
, "/test/jambon/prout/test")
.toDF("column1")
import org.apache.spark.sql.functions.regexp_extract
df.withColumn("path", regexp_extract('column1, "^\\/(\\w+\\/)+", 0))
  .withColumn("fileName", regexp_extract('column1, "\\w+$", 0))
  .show(false)
Output:
+------------------------------+--------------------------+--------+
|column1 |path |fileName|
+------------------------------+--------------------------+--------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello/|file |
|/test/jambon/prout/test |/test/jambon/prout/ |test |
+------------------------------+--------------------------+--------+
Edit: a regex without the trailing slash is easier to manage:
df.withColumn("path", regexp_extract($"column1", "^(.+)(/.+)$", 1))
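For what it is worth, the group-1 semantics of that pattern can be checked in plain Scala, since Spark's regexp_extract uses the same Java regex engine. Note that the pattern does not match a root-level file such as /file.txt, in which case regexp_extract returns an empty string (the helper name below is mine):

```scala
// Checking "^(.+)(/.+)$" outside Spark: group 1 is the dirname.
// A Scala pattern match against a Regex requires a full match.
val DirRe = "^(.+)(/.+)$".r

def dirnameViaRegex(path: String): String = path match {
  case DirRe(dir, _) => dir
  case _             => ""  // no match, e.g. "/file.txt"; regexp_extract also yields ""
}
```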
Looks interesting, but I was hoping there would be a simpler way :D Too complex IMO, it should be simpler... Thank you very much for your help. After more thought, keeping the last / (removing the -1 in the substring) seems to be the only way to handle short paths like /file.txt.

This does not work with shorter paths such as /file.txt: the path column comes out empty, while IMO it should be a /.
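If root-level paths like /file.txt should map to / instead of an empty string, one hedged option (my own variant, not from the answers above) is to strip the last /segment and fall back to / when nothing remains. The string logic can be verified in plain Scala with the same regex that Spark's regexp_replace would use:

```scala
// "/[^/]*$" removes the final "/segment"; an empty result means the
// file sat directly under the root, so substitute "/".
// In Spark this could be sketched (untested) as:
//   when(regexp_replace($"column1", "/[^/]*$", "") === "", "/")
//     .otherwise(regexp_replace($"column1", "/[^/]*$", ""))
def dirnameWithRoot(path: String): String = {
  val stripped = path.replaceAll("/[^/]*$", "")
  if (stripped.isEmpty) "/" else stripped
}
```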