Scala Spark accumulableCollection does not work with mutable.Map
I am using Spark to accumulate employee records, and for that I use Spark's accumulators. I use a Map[empId, emp] as the accumulableCollection so that I can look up employees by their ID. I have tried everything, but nothing works. Can someone point out whether there is a logical problem in the way I use accumulableCollection, or whether Map is simply not supported? Here is my code:
package demo

import org.apache.spark.{SparkContext, SparkConf, Logging}
import org.apache.spark.SparkContext._
import scala.collection.mutable

object MapAccuApp extends App with Logging {
  case class Employee(id:String, name:String, dept:String)

  val conf = new SparkConf().setAppName("Employees") setMaster ("local[4]")
  val sc = new SparkContext(conf)

  implicit def empMapToSet(empIdToEmp: mutable.Map[String, Employee]): mutable.MutableList[Employee] = {
    empIdToEmp.foldLeft(mutable.MutableList[Employee]()) { (l, e) => l += e._2}
  }

  val empAccu = sc.accumulableCollection[mutable.Map[String, Employee], Employee](mutable.Map[String,Employee]())

  val employees = List(
    Employee("10001", "Tom", "Eng"),
    Employee("10002", "Roger", "Sales"),
    Employee("10003", "Rafael", "Sales"),
    Employee("10004", "David", "Sales"),
    Employee("10005", "Moore", "Sales"),
    Employee("10006", "Dawn", "Sales"),
    Employee("10007", "Stud", "Marketing"),
    Employee("10008", "Brown", "QA")
  )

  System.out.println("employee count " + employees.size)

  sc.parallelize(employees).foreach(e => {
    empAccu += e
  })

  System.out.println("empAccumulator size " + empAccu.value.size)
}
Using accumulableCollection seems like overkill for your problem, as the following shows:
import org.apache.spark.{AccumulableParam, Accumulable, SparkContext, SparkConf}
import scala.collection.mutable

case class Employee(id:String, name:String, dept:String)

val conf = new SparkConf().setAppName("Employees") setMaster ("local[4]")
val sc = new SparkContext(conf)

implicit def mapAccum =
  new AccumulableParam[mutable.Map[String,Employee], Employee]
  {
    // Merges two partial maps.
    def addInPlace(t1: mutable.Map[String,Employee],
                   t2: mutable.Map[String,Employee])
        : mutable.Map[String,Employee] = {
      t1 ++= t2
      t1
    }
    // Adds one employee to a partial map, keyed by its id.
    def addAccumulator(t1: mutable.Map[String,Employee], e: Employee)
        : mutable.Map[String,Employee] = {
      t1 += (e.id -> e)
      t1
    }
    // Empty starting value.
    def zero(t: mutable.Map[String,Employee])
        : mutable.Map[String,Employee] = {
      mutable.Map[String,Employee]()
    }
  }

val empAccu = sc.accumulable(mutable.Map[String,Employee]())

val employees = List(
  Employee("10001", "Tom", "Eng"),
  Employee("10002", "Roger", "Sales"),
  Employee("10003", "Rafael", "Sales"),
  Employee("10004", "David", "Sales"),
  Employee("10005", "Moore", "Sales"),
  Employee("10006", "Dawn", "Sales"),
  Employee("10007", "Stud", "Marketing"),
  Employee("10008", "Brown", "QA")
)

System.out.println("employee count " + employees.size)

sc.parallelize(employees).foreach(e => {
  empAccu += e
})

println("empAccumulator size " + empAccu.value.size)
empAccu.value.foreach(entry =>
  println("emp id = " + entry._1 + " name = " + entry._2.name))
Although this is sparsely documented at the moment, the Spark codebase itself is quite illuminating.
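To make the roles of the three methods above more concrete, here is a minimal plain-Scala sketch of how they combine. It does not use Spark at all: `EmployeeMapParam`, `AccumulableSketch`, and `accumulate` are hypothetical names introduced only for illustration, and the "partitions" are ordinary sequences. It assumes the usual pattern in which each task starts from `zero`, folds its elements in with `addAccumulator`, and the partial results are then merged with `addInPlace`.

```scala
import scala.collection.mutable

case class Employee(id: String, name: String, dept: String)

// Hypothetical stand-in that mirrors the three AccumulableParam methods,
// so the sketch runs without a Spark dependency.
object EmployeeMapParam {
  type EmpMap = mutable.Map[String, Employee]

  // Fold one element into a partial result, keyed by employee id.
  def addAccumulator(acc: EmpMap, e: Employee): EmpMap = { acc += (e.id -> e); acc }

  // Merge two partial results into the first one.
  def addInPlace(a: EmpMap, b: EmpMap): EmpMap = { a ++= b; a }

  // Empty starting value for each simulated task.
  def zero(initial: EmpMap): EmpMap = mutable.Map[String, Employee]()
}

object AccumulableSketch {
  import EmployeeMapParam._

  // Simulates the pattern: one local map per partition, merged at the end.
  def accumulate(partitions: Seq[Seq[Employee]]): EmpMap = {
    val driverValue: EmpMap = mutable.Map[String, Employee]()
    val partials = partitions.map { part =>
      part.foldLeft(zero(driverValue))(addAccumulator)
    }
    partials.foldLeft(driverValue)(addInPlace)
  }
}
```

Because the result is a map keyed by id, an employee that shows up in more than one partition ends up as a single entry after the merge.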
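Edit: It turns out that using accumulableCollection does have value: you do not need to define an AccumulableParam, and the following works as well. I am leaving both solutions in case they are useful to people.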
import org.apache.spark.{SparkContext, SparkConf}
import scala.collection.mutable

case class Employee(id:String, name:String, dept:String)

val conf = new SparkConf().setAppName("Employees") setMaster ("local[4]")
val sc = new SparkContext(conf)

val empAccu = sc.accumulableCollection(mutable.HashMap[String,Employee]())

val employees = List(
  Employee("10001", "Tom", "Eng"),
  Employee("10002", "Roger", "Sales"),
  Employee("10003", "Rafael", "Sales"),
  Employee("10004", "David", "Sales"),
  Employee("10005", "Moore", "Sales"),
  Employee("10006", "Dawn", "Sales"),
  Employee("10007", "Stud", "Marketing"),
  Employee("10008", "Brown", "QA")
)

System.out.println("employee count " + employees.size)

sc.parallelize(employees).foreach(e => {
  // notice this is different from the previous solution
  empAccu += e.id -> e
})

println("empAccumulator size " + empAccu.value.size)
empAccu.value.foreach(entry =>
  println("emp id = " + entry._1 + " name = " + entry._2.name))
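Here `+=` takes an (id, employee) pair because a mutable map grows by key/value pairs, not by bare values. One possible reason the code in the question ends up with an empty map is the implicit conversion it defines. This is an interpretation rather than something stated in the answer: the conversion builds a fresh list from the map on every call, so an append made through it would land in a temporary copy and the map itself would never change. The plain-Scala sketch below illustrates that effect without Spark. `ViewCopyDemo`, `toBuffer`, `addViaView`, and `addAsPair` are hypothetical names, and `ListBuffer` stands in for `MutableList`.

```scala
import scala.collection.mutable
import scala.language.implicitConversions

object ViewCopyDemo {
  case class Employee(id: String, name: String, dept: String)

  // Same shape as the conversion in the question: it builds a NEW buffer
  // holding the map's values every time it is applied.
  implicit def toBuffer(m: mutable.Map[String, Employee]): mutable.ListBuffer[Employee] =
    mutable.ListBuffer.empty[Employee] ++= m.values

  // Appends through the conversion, the way a generic "growable" helper might.
  def addViaView(m: mutable.Map[String, Employee], e: Employee): mutable.Map[String, Employee] = {
    val copy: mutable.ListBuffer[Employee] = m // implicit conversion: temporary copy
    copy += e                                  // mutates the copy only
    m                                          // the map is returned unchanged
  }

  // Adds the element as a key/value pair directly on the map.
  def addAsPair(m: mutable.Map[String, Employee], e: Employee): mutable.Map[String, Employee] = {
    m += (e.id -> e)
    m
  }
}
```

Under these assumptions, `addViaView` leaves the map empty while `addAsPair` stores the entry, which is consistent with the second solution working once the element is passed as `e.id -> e`.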
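Both solutions were tested with Spark 1.0.2.

It looks like empAccu.value.size does not give the correct value, although the printing works fine. I get the following output: `employee count 8 emp id = 10007 name = Stud emp id = 10001 name = Tom emp id = 10004 name = David emp id = 10006 name = Dawn emp id = 10003 name = Rafael emp id = 10002 name = Roger emp id = 10005 name = Moore emp id = 10008 name = Brown`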