Apache Spark: question about the rdd.pipe() operator


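I am trying to use rdd.pipe() to run an external C++ program on Apache Spark. I could not find enough information in the documentation, so I am asking here. When using rdd.pipe(), does the external script need to be available on every node in the cluster?

What if I do not have permission to install anything on the cluster nodes? Is there another way to make the script available to the worker nodes?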

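Apache Spark has a special RDD, PipedRDD, which lets you call external programs, for example a CUDA-based C++ program, to speed up computation. I have added a small example here to explain.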

Shell script: test.sh
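The body of test.sh is not shown in this thread. As a minimal sketch only, a script consistent with the result listed further down might print one header line and then echo each input line with "!" appended. The script body and the scratch directory are assumptions for illustration, not the author's original file.

```shell
# Write a hypothetical test.sh into a scratch directory (assumed content).
workdir=$(mktemp -d)
cat > "$workdir/test.sh" <<'EOF'
#!/bin/sh
# pipe() sends each RDD element to stdin as one line; every line
# written to stdout becomes one element of the resulting RDD.
echo "Running shell script"
while read -r LINE; do
  echo "${LINE}!"
done
EOF

# Local check without Spark: feed two lines through the script.
out=$(printf 'are\nyou\n' | sh "$workdir/test.sh")
echo "$out"
```

Running the script locally like this is a quick way to confirm it reads stdin and writes stdout correctly before handing it to rdd.pipe().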

Pipe the RDD data to the shell script:

Then create a Scala program that calls the piped RDD:

Result:

Array[String] = Array(Running shell script, hi!, Running shell script, hello!, 
 Running shell script, how!, Running shell script, are!, you!)
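The header line appears four times because pipe() starts the external command once per partition, not once per element. The following local sketch imitates that behaviour without Spark. The partition layout [hi] [hello] [how] [are, you] is an assumption inferred from the result above, and run_script is a hypothetical stand-in for the piped script.

```shell
# Hypothetical stand-in for the piped script: one header line, then
# each input line with "!" appended.
run_script() {
  echo "Running shell script"
  while read -r line; do
    echo "${line}!"
  done
}

# One invocation per simulated partition, as pipe() does on the executors.
# The split into four partitions is an assumption inferred from the output.
result=$(
  printf 'hi\n'       | run_script
  printf 'hello\n'    | run_script
  printf 'how\n'      | run_script
  printf 'are\nyou\n' | run_script
)
echo "$result"
```

Any per-run setup the script performs is therefore repeated for every partition, which matters if the external program has a costly start-up.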

So it does seem that the external script has to be present on all executor nodes.

One way is to ship the script with spark-submit (e.g. --files script.sh); you should then be able to reference it in rdd.pipe() by a relative path, e.g. ./script.sh.
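A sketch of what such a submission might look like, assuming a YARN cluster. The main class and jar name are hypothetical placeholders, and the command is only printed here so that it can be reviewed before being run on a machine where spark-submit is installed.

```shell
# Script to ship to the executors (path taken from the example in this thread).
SCRIPT=/home/hadoop/test.sh
# Hypothetical placeholders for the application itself.
MAIN_CLASS=example.PipeApp
APP_JAR=pipe-app.jar

# --files places the listed files in the working directory of each executor,
# so the job can then call rdd.pipe("./test.sh").
cmd="spark-submit --master yarn --deploy-mode cluster --files $SCRIPT --class $MAIN_CLASS $APP_JAR"

# Print the command instead of running it.
echo "$cmd"
```

If the shipped copy turns out not to be executable, piping through the interpreter, e.g. rdd.pipe("sh ./test.sh"), may help, since pipe() splits the command string on whitespace.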

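Thanks for the input, but having the external script only on the driver node or on HDFS does not seem to be enough. The executors throw the error: Cannot run program "path/to/program": error=2, No such file or directory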

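// Pipe the RDD data to the shell script; dataRDD is an existing RDD[String].
// The script must exist at this path on every executor node.
val scriptPath = "/home/hadoop/test.sh"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.collect()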

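import java.io.PrintWriter
import scala.io.Source

// Assumption: command is the path of the external script, e.g. test.sh above.
val command = "/home/hadoop/test.sh"
val proc = Runtime.getRuntime.exec(Array(command))

// Forward the child process's stderr to this JVM's stderr.
new Thread("stderr reader for " + command) {
  override def run(): Unit = {
    for (line <- Source.fromInputStream(proc.getErrorStream).getLines())
      System.err.println(line)
  }
}.start()

// Write the input lines to the child's stdin, then close it to signal EOF.
val lineList = List("hello", "how", "are", "you")
new Thread("stdin writer for " + command) {
  override def run(): Unit = {
    val out = new PrintWriter(proc.getOutputStream)
    for (elem <- lineList)
      out.println(elem)
    out.close()
  }
}.start()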

// Spark RDD: build the input and pipe each partition through the script.
val dataRDD = sc.parallelize(List("hi", "hello", "how", "are", "you"))
val scriptPath = "/root/echo.sh"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.collect()
