
Scala Spark: udf to get dirname from a path


I have a large column of paths that I need to split into two columns, basename and dirname. I know how to easily get the basename of a path with the following:

val df = Seq("/test/coucou/jambon/hello/file"
    ,"/test/jambon/test")
    .toDF("column1")
df.withColumn("basename", substring_index($"column1"  , "/", -1))
.show(2, false)
+------------------------------+---------+
|column1                       |basename |
+------------------------------+---------+
|/test/coucou/jambon/hello/file|file     |
|/test/jambon/test             |test     |
+------------------------------+---------+
However, I'm struggling to get the dirname, like this:

+------------------------------+--------------------------+
|column1                       |dirname                   |
+------------------------------+--------------------------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello |
|/test/jambon/test             |/test/jambon              |
+------------------------------+--------------------------+
I've tried various solutions, but I can't find a functional column-based one.

My best idea would be to subtract $"basename" from $"column1", but I couldn't find a way to subtract strings in Spark.
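For what it's worth, the "subtract the basename" idea does work on a plain string by dropping the basename's length from the end of the path. A minimal sketch (the helper name stripBasename is hypothetical, not from the thread):

```scala
// Hypothetical sketch: "subtract" the basename from a path by dropping
// its length (plus the separating "/") from the end of the string.
def stripBasename(path: String): String = {
  val basename = path.substring(path.lastIndexOf('/') + 1)
  path.dropRight(basename.length + 1)
}
```

Wrapped in udf(stripBasename _), this is essentially the same length arithmetic the answers below perform with expr and with a UDF.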

You can use expr to take a substring of column1. The code should look like the following. I hope this helps.

//Creating Test Data
val df = Seq("/test/coucou/jambon/hello/file"
  ,"/test/jambon/prout/test")
  .toDF("column1")

val test = df.withColumn("basename", substring_index($"column1"  , "/", -1))
    .withColumn("path", expr("substring(column1, 1, length(column1)-length(basename)-1)"))

test.show(false)
+------------------------------+--------+-------------------------+
|column1                       |basename|path                     |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file    |/test/coucou/jambon/hello|
|/test/jambon/prout/test       |test    |/test/jambon/prout       |
+------------------------------+--------+-------------------------+

Another alternative is to use a UDF:

import org.apache.spark.sql.functions.udf

val pathUDF = udf((s: String) => s.substring(0, s.lastIndexOf("/")))

val test = df.withColumn("basename", substring_index($"column1"  , "/", -1))
    .withColumn("path", pathUDF($"column1"))

test.show(false)
+------------------------------+--------+-------------------------+
|column1                       |basename|path                     |
+------------------------------+--------+-------------------------+
|/test/coucou/jambon/hello/file|file    |/test/coucou/jambon/hello|
|/test/jambon/prout/test       |test    |/test/jambon/prout       |
+------------------------------+--------+-------------------------+

A regex-based alternative to the solutions already provided

Using the regular-expression function regexp_extract correctly will give you what you need:

val df = Seq("/test/coucou/jambon/hello/file",
    "/test/jambon/prout/test")
  .toDF("column1")

import org.apache.spark.sql.functions.regexp_extract

df.withColumn("path", regexp_extract('column1, "^\\/(\\w+\\/)+", 0))
  .withColumn("fileName", regexp_extract('column1, "\\w+$", 0))
  .show(false)
Output

+------------------------------+--------------------------+--------+
|column1                       |path                      |fileName|
+------------------------------+--------------------------+--------+
|/test/coucou/jambon/hello/file|/test/coucou/jambon/hello/|file    |
|/test/jambon/prout/test       |/test/jambon/prout/       |test    |
+------------------------------+--------------------------+--------+
Edit:
A regex without the trailing slash is easier to manage:

df.withColumn("path", regexp_extract($"column1", "^(.+)(/.+)$", 1))
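The group-1 semantics of that pattern can be checked with plain Scala's Regex before running it through Spark. A small sketch (DirPattern is a hypothetical name):

```scala
// Hypothetical check of the "^(.+)(/.+)$" pattern: the greedy group 1
// captures everything before the last "/", i.e. the dirname.
val DirPattern = "^(.+)(/.+)$".r
val dir = "/test/coucou/jambon/hello/file" match {
  case DirPattern(d, _) => d
  case _                => ""
}
```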

Looks interesting, but I was hoping there was a simpler way :D Too complex IMO, it should be simpler.... Thanks a lot for your help. After thinking about it some more, keeping the last / (removing the -1 in the substring) seems to be the only way to handle short paths like /filex.txt.

Doesn't work with shorter paths like /file.txt: the path column is empty, and IMO it should be a /.
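Regarding the short-path problem raised in the comments: one hedged option is to fall back to "/" when the file sits at the root. A sketch in plain Scala (the dirname helper is hypothetical; it could be wrapped with udf(dirname _) just like the UDF answer):

```scala
// Hypothetical dirname helper that handles root-level files: for
// "/file.txt" there is nothing before the last "/", so return "/"
// instead of the empty string the plain substring approach produces.
def dirname(path: String): String = {
  val idx = path.lastIndexOf('/')
  if (idx <= 0) "/" else path.substring(0, idx)
}
```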