
Scala: How to pass Spark broadcast and accumulator variables to map and reduce functions


Consider the following code snippet:

class SparkJob extends Serializable {
  // Some code and other functions
  def launchJob = {
    val broadcastConfiguration = sc.broadcast(options) // options is some case class
    val accumulator = // create instance of accumulator
    inputFile.mapPartitions(lines => testMap(lines, broadcastConfiguration, accumulator)) // this line will throw a serialization error
  }
}

object SparkJob {
  // apply and other functions
  def testMap(lines: Iterator[String], broadcastConfiguration: ... /* other params */) = // function definition
}
How can I pass the accumulator and broadcastConfiguration instances through to these other functions?

I tried using just inputFile.mapPartitions(lines => testMap(lines)) and it works fine, so it seems the shared variables are the problem when they are passed along. How can I do this?
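
For reference, here is a minimal self-contained sketch of the pattern being attempted, assuming the Spark 1.x API visible in the stack trace below. The Options case class, the object name, and all field names are hypothetical; the helper lives in an object so that the closure captures only its arguments:

import org.apache.spark.{Accumulator, SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

case class Options(prefix: String) // hypothetical configuration class

object TestMapSketch {
  // Defined on an object, so calling it captures no enclosing `this`
  def testMap(lines: Iterator[String], bc: Broadcast[Options],
              acc: Accumulator[Long]): Iterator[String] =
    lines.map { line =>
      acc += 1L              // count processed lines on the executors
      bc.value.prefix + line // read the broadcast value inside the task
    }

  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))
    val bc  = sc.broadcast(Options("> ")) // local vals, not class fields
    val acc = sc.accumulator(0L)
    val out = sc.parallelize(Seq("a", "b", "c"))
      .mapPartitions(lines => TestMapSketch.testMap(lines, bc, acc))
      .collect()
    println(out.mkString(", ") + s" -- processed ${acc.value} lines")
    sc.stop()
  }
}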

Edit: Adding the exception trace:

15/06/17 19:12:56 INFO SparkContext: Created broadcast 3 from textFile at SparkJob.scala:74
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:635)
    at com.auditude.databuild.steps.SparkJob.launchJob(SparkJob.scala:80)
    at com.auditude.databuild.steps.SparkJobDriver$.main(SparkJobDriver.scala:37)
    at com.auditude.databuild.steps.SparkJobDriver.main(SparkJobDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:150)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
    at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:58)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:39)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
    ... 11 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
    at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
    at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
    ... 29 more
Edit 2: Added @transient as suggested, but it did not help. I even tried the following and still get the same error:

val mapResult = inputFile.mapPartitions(lines =>  {
  println(broadcastConfiguration.value)
  lines
})
Edit 3: On further investigation I realized that, in simplifying the code for this question, I had moved things around: I actually initialize broadcastConfiguration in the class constructor, so the real code looks like this:

class SparkJob extends Serializable {
  // constructor
  val broadcastConfiguration = sc.broadcast(options) // options is some case class
  val accumulator = // create instance of accumulator

  // Some code and other functions
  def launchJob = {
    inputFile.mapPartitions(lines => testMap(lines, broadcastConfiguration, accumulator)) // this line will throw a serialization error
  }
}

I initialized a plain String in the constructor, without using the broadcast variable, and it still failed. If I move the String's declaration into launchJob it works, at least for the String; I will report back in a further edit whether the broadcast variable works the same way. I would still like to know why this happens, given that my class is declared Serializable.
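
What Edit 3 describes is the classic closure-capture problem: referencing a field of SparkJob inside the lambda captures `this`, and with it every field of the class, including the non-serializable SparkContext. A common workaround (a sketch of the usual Spark pattern, not something confirmed in this thread) is to copy the fields into local vals before building the closure:

def launchJob = {
  // Local copies: the closure now captures only these two vals instead of
  // `this`, so the non-serializable SparkContext field is never dragged in.
  val bc  = broadcastConfiguration
  val acc = accumulator
  inputFile.mapPartitions(lines => SparkJob.testMap(lines, bc, acc))
}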

Comments:

- Can you add the exception?
- @maasg Added the stack trace.
- @Sohab Your function should use the value of the broadcast object, not the bc itself, but the exception is unrelated to that. Strange. Try adding @transient to the declarations of the bc and the counter.
- I am not using the bc itself; inside the function there would be something like bc.value. In this case, however, the bc is merely passed along, as if testMap contained nothing but a few print lines. I will try @transient, but isn't that annotation deprecated?
- @transient is the way to indicate that certain fields should not be serialized. Perfectly fine to use.
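
For completeness, a sketch of what the suggested annotation might look like (the constructor shape and field names are assumptions): @transient is standard Scala/Java serialization, not Spark-specific, and marking a field with it excludes that field when the enclosing object is serialized:

import org.apache.spark.SparkContext

class SparkJob(@transient val sc: SparkContext) extends Serializable {
  // sc is skipped during serialization and must only be used on the driver.
  // Whatever the executors need (the broadcast handle, the accumulator)
  // should instead be copied into local vals before being captured in a
  // closure, as sketched in the edit above.
}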