Scala Spark: UDF to get dirname from a path
I have a large number of path columns that I need to split into two columns, basename and dirname. I know how to easily get the basename of a path using:
val df = Seq("/test/coucou/jambon/hello/file"
,"/test/jambon/test")
.toDF("column1")
df.withColumn("basename", substring_index($"column1" , "/", -1))
.show(2, false)
+------------------------------+---------+
|column1 |basename |
+------------------------------+---------+
|/test/coucou/jambon/hello/file|file |
|/test/jambon/test |test |
+------------------------------+---------+
However, I am struggling to get the dirname, like this:
+------------------------------+--------------------------+
|column1 |dirname |
+------------------------------+--------------------------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello |
|/test/jambon/test |/test/jambon |
+------------------------------+--------------------------+
I have tried various solutions but could not find a functional column-based one. My best idea was to subtract $"basename" from $"column1", but I could not find a way to subtract strings in Spark.

You can use expr to take a substring of column1. The code should look like the following. I hope this helps:
//Creating Test Data
val df = Seq("/test/coucou/jambon/hello/file"
,"/test/jambon/prout/test")
.toDF("column1")
val test = df.withColumn("basename", substring_index($"column1" , "/", -1))
.withColumn("path", expr("substring(column1, 1, length(column1)-length(basename)-1)"))
test.show(false)
+------------------------------+--------+-------------------------+
|column1 |basename|path |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file |/test/coucou/jambon/hello|
|/test/jambon/prout/test |test |/test/jambon/prout |
+------------------------------+--------+-------------------------+
Another approach is to use a UDF:
import org.apache.spark.sql.functions.udf
val pathUDF = udf((s: String) => s.substring(0, s.lastIndexOf("/")))
val test = df.withColumn("basename", substring_index($"column1" , "/", -1))
.withColumn("path", pathUDF($"column1"))
test.show(false)
+------------------------------+--------+-------------------------+
|column1 |basename|path |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file |/test/coucou/jambon/hello|
|/test/jambon/prout/test |test |/test/jambon/prout |
+------------------------------+--------+-------------------------+
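One caveat worth flagging about the UDF above: `s.substring(0, s.lastIndexOf("/"))` throws for a path with no slash at all (lastIndexOf returns -1) and yields an empty string for a root-level file. A hedged sketch of a guarded body, written here as a plain Scala function (the name is mine) that could be wrapped with udf in the same way:

```scala
// Sketch of a guarded dirname, assuming Unix-style paths. The plain
// s.substring(0, s.lastIndexOf("/")) version throws on "file.txt"
// (lastIndexOf returns -1) and yields "" for "/file.txt".
def safeDirname(s: String): String = s.lastIndexOf("/") match {
  case -1 => ""   // no slash: no directory component
  case 0  => "/"  // file directly under the root, e.g. "/file.txt"
  case i  => s.substring(0, i)
}
```

It would be registered the same way, e.g. `val pathUDF = udf(safeDirname _)`.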
An alternative to the solutions already provided is to use a regular expression. Used correctly, regexp_extract will give you exactly what you need:
val df = Seq("/test/coucou/jambon/hello/file"
, "/test/jambon/prout/test")
.toDF("column1")
import org.apache.spark.sql.functions.regexp_extract
df.withColumn("path", regexp_extract('column1, "^\\/(\\w+\\/)+", 0))
  .withColumn("fileName", regexp_extract('column1, "\\w+$", 0))
  .show(false)
Output:
+------------------------------+--------------------------+--------+
|column1 |path |fileName|
+------------------------------+--------------------------+--------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello/|file |
|/test/jambon/prout/test |/test/jambon/prout/ |test |
+------------------------------+--------------------------+--------+
Edit: a regex without the trailing slash is easier to manage:
df.withColumn("path", regexp_extract($"column1", "^(.+)(/.+)$", 1))
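For what it is worth, the group-1 semantics of that pattern can be checked in plain Scala, since Spark's regexp_extract uses the same Java regex engine. Note that the pattern does not match a root-level file such as /file.txt, in which case regexp_extract returns an empty string (the helper name below is mine):

```scala
// Checking "^(.+)(/.+)$" outside Spark: group 1 is the dirname.
// A Scala pattern match against a Regex requires a full match.
val DirRe = "^(.+)(/.+)$".r

def dirnameViaRegex(path: String): String = path match {
  case DirRe(dir, _) => dir
  case _             => ""  // no match, e.g. "/file.txt"; regexp_extract also yields ""
}
```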
Looks interesting, but I was hoping there would be a simpler way :D Too complex IMO, it should be simpler... Thank you very much for your help. After more thought, keeping the last / (removing the -1 in the substring) seems to be the only way to handle short paths like /file.txt.

This does not work with shorter paths such as /file.txt: the path column comes out empty, while IMO it should be a /.
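If root-level paths like /file.txt should map to / instead of an empty string, one hedged option (my own variant, not from the answers above) is to strip the last /segment and fall back to / when nothing remains. The string logic can be verified in plain Scala with the same regex that Spark's regexp_replace would use:

```scala
// "/[^/]*$" removes the final "/segment"; an empty result means the
// file sat directly under the root, so substitute "/".
// In Spark this could be sketched (untested) as:
//   when(regexp_replace($"column1", "/[^/]*$", "") === "", "/")
//     .otherwise(regexp_replace($"column1", "/[^/]*$", ""))
def dirnameWithRoot(path: String): String = {
  val stripped = path.replaceAll("/[^/]*$", "")
  if (stripped.isEmpty) "/" else stripped
}
```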