Scala: creating a new column in a DataFrame using a udf and recursion


scala apache-spark recursion


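I have two Parquet files: one that lists inode numbers, and one that describes each inode's name and parent inode. I need to rebuild the full path from the second file.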

The table with the inode descriptions is named idirs_table_read and has the following format (this is a complete example):

I would like to be able to rebuild a file path from an inode number.
For example, for inode 93767723 the path is:

/base/level2/name/level4

I defined two functions (one recursive, one procedural). Both work when called directly, for example newPathRecursive(1236549), but fail when used in withColumn:

def newPathRecursive( inumber : Int ):String = {
    var composite = ""
    var result = idirs_table_read.select("iparent", "iname").filter($"ichild"===inumber)

    if ( (result.count() == 1) && (result.first()(0) !=  inumber) )   {
       var num= result.first()(0).asInstanceOf[Int]
       composite=  newPath(num) + "/"  + result.first()(1).asInstanceOf[String] 
    }
    return composite
}


def newPathProcedurale  (inumber : Int ):String =  {

    var composite = ""
    var go = true
    var parentInode=inumber
    while(go){

        var result = idirs_table_read.select("iparent", "iname").filter($"ichild"===parentInode)

        if ( (result.count() == 1) && (result.first()(0) !=  inumber) )   {
            println(result.first()(0)+","+ result.first()(1)+","+ parentInode)
            parentInode = result.first()(0).asInstanceOf[Int]
            composite=  "/"  + result.first()(1).asInstanceOf[String] + composite
        }else{
            go=false
        }
    }
    return composite.asInstanceOf[String]
}
val buildpath2 = udf[String, Int](newPath2)
val buildpath = udf[String, Int](newPath)

My goal is to replace the inode numbers in another table with this path, but when I try to use the function in a select, I get the following:

 df.withColumn("newcol", buildpath($"inumber"))
 Caused by: java.lang.reflect.InvocationTargetException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1521.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1521.0 (TID 14859, ip.ip.ip.ip, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (int) => string)

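Could you help me with this code, and ultimately suggest a better way to implement this algorithm and to use it?

My goal is simply to build a new Parquet file from these two files, with full paths instead of inode numbers (which are not human-readable).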

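Because the duplicate referenced in this post is actually different from my problem, and because I did not understand its explanation (the real answer is in the comments, not in the actual answer), here is the answer:

What I am trying to do is not possible if the Dataset used inside the udf is not local: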

inodes_table_read.isLocal
false

So I will use a different approach and post an explanation.
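One way to work around this, sketched here under the assumption that the inode table is small enough to be collected to the driver, is to turn it into an ordinary local Map and walk that map instead of filtering a DataFrame. The object name InodePaths, the function buildPath, and the sample data are hypothetical; the root is assumed to be its own parent, as the stop condition in the question's code suggests.

```scala
object InodePaths {
  // Hypothetical sample data: child inode -> (parent inode, name).
  // Assumption: the root inode (1) is its own parent.
  val sampleTable: Map[Int, (Int, String)] = Map(
    1 -> (1, ""),
    2 -> (1, "base"),
    3 -> (2, "level2"),
    4 -> (3, "name"),
    5 -> (4, "level4")
  )

  // Walks from the given inode up to the root and returns the path,
  // or "" when the inode is the root or is not in the table.
  def buildPath(table: Map[Int, (Int, String)], inode: Int): String = {
    @annotation.tailrec
    def loop(current: Int, acc: List[String], seen: Set[Int]): List[String] =
      table.get(current) match {
        // Keep climbing while the entry has a distinct parent and
        // has not been visited yet (guards against cyclic data).
        case Some((parent, name)) if parent != current && !seen(current) =>
          loop(parent, name :: acc, seen + current)
        case _ => acc
      }

    loop(inode, Nil, Set.empty).map("/" + _).mkString
  }
}
```

In Spark, such a map could be built from idirs_table_read with collect(), shipped to the executors with sparkContext.broadcast, and read through the broadcast variable's value inside the udf, so that the udf never touches a DataFrame.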

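Possible duplicate

@user8371915, there is no trace of a NullPointerException here, or am I wrong?

There is no reproducible example or full traceback, and the names do not match, but if idirs_table is a (non-local) Dataset, then using newPathRecursive as a udf should result in a NullPointerException :) TL;DR: it is not a valid pattern in Apache Spark. If you really want to use Spark, take a look at graphframes; you need some form of iterative message passing to solve this.

@user8371915 Thank you for pointing me in the right direction. The linked question still does not really answer mine, but the comments under its answer do.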
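The iterative message passing mentioned in the comment can be illustrated without Spark. The following is a minimal sketch in plain Scala, not a GraphFrames program: in each round, every inode whose parent already has a path receives that path extended by its own name. The names PathPropagation and propagatePaths and the sample rows are hypothetical, and the root is assumed to be its own parent.

```scala
object PathPropagation {
  // Hypothetical rows of (ichild, iparent, iname).
  // Assumption: the root inode (1) is its own parent.
  val sampleRows: Seq[(Int, Int, String)] = Seq(
    (1, 1, ""),
    (2, 1, "base"),
    (3, 2, "level2"),
    (4, 3, "name"),
    (5, 4, "level4")
  )

  // Returns inode -> full path for every inode reachable from a root.
  // Inodes whose parent chain never reaches a root are left out.
  def propagatePaths(rows: Seq[(Int, Int, String)]): Map[Int, String] = {
    // Round 0: roots start with the empty path.
    val roots: Map[Int, String] =
      rows.collect { case (child, parent, _) if child == parent => child -> "" }.toMap

    @annotation.tailrec
    def round(known: Map[Int, String]): Map[Int, String] = {
      // "Message" from parent to child: the parent's path plus the child's name.
      val discovered = rows.collect {
        case (child, parent, name) if !known.contains(child) && known.contains(parent) =>
          child -> (known(parent) + "/" + name)
      }
      // Stop once a round resolves nothing new.
      if (discovered.isEmpty) known else round(known ++ discovered)
    }

    round(roots)
  }
}
```

In a distributed setting, each round would roughly correspond to one join between the paths resolved so far and the inode table, repeated until no new paths appear.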
