Different behavior of sbt run and spark-submit when running Scala in Spark YARN mode


I am currently trying to execute some Scala code on a Cloudera cluster, using Apache Spark in yarn(-client) mode, but the sbt run is aborted by the following Java exception:

[error] (run-main-0) org.apache.spark.SparkException: YARN mode not available ?
org.apache.spark.SparkException: YARN mode not available ?
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1267)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:199)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:100)
        at SimpleApp$.main(SimpleApp.scala:7)
        at SimpleApp.main(SimpleApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)

Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.cluster.YarnClientClusterScheduler
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:191)
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1261)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:199)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:100)
        at SimpleApp$.main(SimpleApp.scala:7)
        at SimpleApp.main(SimpleApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)

[trace] Stack trace suppressed: run last compile:run for the full output.
java.lang.RuntimeException: Nonzero exit code: 1
        at scala.sys.package$.error(package.scala:27)

[trace] Stack trace suppressed: run last compile:run for the full output.
[error] (compile:run) Nonzero exit code: 1
15/11/24 17:18:03 INFO network.ConnectionManager: Selector thread was interrupted!
[error] Total time: 38 s, completed 24-nov-2015 17:18:04
Is there a way to configure these variables "automatically"? I mean, I can set SPARK_JAR, since that jar ships with the Spark installation, but what about SPARK_YARN_APP_JAR? When I set these variables by hand, I noticed that Spark does not take my custom configuration into account, even though I set the YARN_CONF_DIR variable. Is there a way to tell sbt to work with my local Spark configuration?
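To make the question concrete, this is the kind of wiring I have in mind on the sbt side. It is only a sketch, assuming sbt 0.13's fork/envVars settings; C:/hadoop/conf is a placeholder for wherever the YARN configuration actually lives:

// build.sbt (sketch): fork the run and hand the Spark/YARN variables to the forked JVM
fork in run := true

envVars in run := Map(
  "SPARK_JAR"          -> "C:/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar", // Spark assembly jar
  "SPARK_YARN_APP_JAR" -> "target/scala-2.10/file-searcher_2.10-1.0.jar",      // application jar
  "YARN_CONF_DIR"      -> "C:/hadoop/conf"                                     // placeholder: folder containing yarn-site.xml
)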

In case it helps, here is the current (ugly) code I am executing:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "src/data/sample.txt"
    // Positional SparkContext constructor: (master, appName, sparkHome, jars)
    val sc = new SparkContext("yarn-client", "Simple App", "C:/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar",
      List("target/scala-2.10/file-searcher_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    // Count the lines containing the word "the"
    val numTHEs = logData.filter(line => line.contains("the")).count()
    println("Lines with the: %s".format(numTHEs))
  }
}
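For reference, the non-deprecated way to express the same thing would presumably go through a SparkConf, with the assembly passed via Spark 1.x's spark.yarn.jar property. This is only a sketch (the object name is made up, and the paths are the same placeholders as above):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the same job, configured through SparkConf instead of the
// positional SparkContext constructor.
object SimpleAppConfSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("Simple App")
      .setJars(Seq("target/scala-2.10/file-searcher_2.10-1.0.jar"))               // application jar shipped to the cluster
      .set("spark.yarn.jar", "C:/spark/lib/spark-assembly-1.3.0-hadoop2.4.0.jar") // Spark assembly (spark.yarn.jar is the 1.2+ property)
    val sc = new SparkContext(conf)
    val count = sc.textFile("src/data/sample.txt", 2).filter(_.contains("the")).count()
    println("Lines with the: %s".format(count))
    sc.stop()
  }
}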
Thanks,
Cheloute

Well, I finally found where my problem was.

  • First, the sbt project must include spark-core and spark-yarn as runtime dependencies (see the build.sbt sketch right after this list).
  • Next, the Windows yarn-site.xml must declare the Cloudera cluster's shared classpath (the classpath valid on the Linux nodes) as the YARN classpath, not the Windows one. That is what lets the YARN ResourceManager know where its own pieces are, even when the job is submitted from Windows.
  • Finally, remove the topology.py section from the Windows core-site.xml, so that Spark does not try to execute it; it is not needed for this to work.
  • If needed, also remove any mapred-site.xml so that YARN/MR2 is used, and when running Spark through the spark-submit command line, pass explicitly all the Spark properties that are otherwise defined in spark-defaults.conf.
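A minimal build.sbt sketch of that first point; the version numbers are assumptions matched to the 1.3.0 assembly mentioned in the question, so adjust them to the cluster's actual Spark and Scala versions:

// build.sbt (sketch): spark-core and spark-yarn both on the classpath,
// so that sbt run can load org.apache.spark.scheduler.cluster.YarnClientClusterScheduler
name := "file-searcher"

version := "1.0"

scalaVersion := "2.10.4" // assumption: any 2.10.x matching the _2.10 artifacts

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0",
  "org.apache.spark" %% "spark-yarn" % "1.3.0"
)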

And that's it. Everything else should just work.

First: post your sources. Second: from where do you launch sbt? Third: is HADOOP_CONF_DIR set?

I have updated the post with the source code and the build.sbt. HADOOP_CONF_DIR and YARN_CONF_DIR are both set, but sbt run does not seem to pick them up: my custom configuration uses a ResourceManager port other than 8020, yet sbt tries to connect on 8020 to reach YARN... About sbt: I also run sbt on a Windows workstation. I created a root folder, ran sbt eclipse to generate an Eclipse-compatible project, developed with the Scala IDE, and then tried sbt clean package run from the root folder to run the job from the command line.

I have identified why the ResourceManager could not allocate a container for me. Looking at the ResourceManager logs, the container is requested from the ResourceManager scheduler at 0.0.0.0/0.0.0.0:8030, whereas the real address should be hostname.domain/hostname:26310 (my custom configuration). So it cannot give me anything, because that request can never be fulfilled. What I do not understand is why the request goes to 0.0.0.0/0.0.0.0:8030 when my yarn-site.xml specifies hostname.domain/hostname:26310 and the folder containing it is the one my YARN_CONF_DIR environment variable points to.
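One way to see which yarn-site.xml the sbt-run JVM actually resolves is to ask Hadoop's own configuration classes. This is only a diagnostic sketch, assuming the YARN classes are already on the classpath (they ship with the Spark assembly); the property name is the standard YARN one:

import org.apache.hadoop.yarn.conf.YarnConfiguration

// Diagnostic sketch: print what the JVM sees for the environment variables
// and what Hadoop resolves for the ResourceManager scheduler address.
object YarnConfCheck {
  def main(args: Array[String]) {
    println("YARN_CONF_DIR   = " + sys.env.getOrElse("YARN_CONF_DIR", "<not set>"))
    println("HADOOP_CONF_DIR = " + sys.env.getOrElse("HADOOP_CONF_DIR", "<not set>"))

    // YarnConfiguration loads yarn-site.xml from the classpath; if the folder
    // pointed to by YARN_CONF_DIR is not on the classpath, this falls back to
    // the default 0.0.0.0:8030 -- which would explain the behavior above.
    val conf = new YarnConfiguration()
    println("yarn.resourcemanager.scheduler.address = " +
      conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS))
  }
}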