How do I translate a nested Scala map operation into a Scala Spark operation?


The following code computes the Euclidean distance between two lists in a dataset:

 val user1 = List("a", "1", "3", "2", "6", "9")  //> user1  : List[String] = List(a, 1, 3, 2, 6, 9)
  val user2 = List("b", "1", "2", "2", "5", "9")  //> user2  : List[String] = List(b, 1, 2, 2, 5, 9)

  val all = List(user1, user2)                    //> all  : List[List[String]] = List(List(a, 1, 3, 2, 6, 9), List(b, 1, 2, 2, 5,
                                                  //|  9))



  def euclDistance(userA: List[String], userB: List[String]) = {
    println("comparing " + userA(0) + " and " + userB(0))
    val zipped = userA.zip(userB)
    // drop the first pair (the user names); keep only the numeric fields
    val lastElements = zipped match {
      case (h :: t) => t
    }
    val subElements = lastElements.map(m => (m._1.toDouble - m._2.toDouble) * (m._1.toDouble - m._2.toDouble))
    val summed = subElements.sum
    val sqRoot = Math.sqrt(summed)

    sqRoot
  }                                               //> euclDistance: (userA: List[String], userB: List[String])Double

  all.map(m => (all.map(m2 => euclDistance(m,m2))))
                                                  //> comparing a and a
                                                  //| comparing a and b
                                                  //| comparing b and a
                                                  //| comparing b and b
                                                  //| res0: List[List[Double]] = List(List(0.0, 1.4142135623730951), List(1.414213
                                                  //| 5623730951, 0.0))
But how does this translate into a parallel Spark Scala operation?

Update 1:

When I print the contents of distAll:

scala> distAll.foreach(p => p.foreach(println))
14/10/24 23:09:42 INFO SparkContext: Starting job: foreach at <console>:21
14/10/24 23:09:42 INFO DAGScheduler: Got job 2 (foreach at <console>:21) with 4 output partitions (allowLocal=false)
14/10/24 23:09:42 INFO DAGScheduler: Final stage: Stage 2(foreach at <console>:21)
14/10/24 23:09:42 INFO DAGScheduler: Parents of final stage: List()
14/10/24 23:09:42 INFO DAGScheduler: Missing parents: List()
14/10/24 23:09:42 INFO DAGScheduler: Submitting Stage 2 (ParallelCollectionRDD[1] at parallelize at <console>:18), which has no missing parents
14/10/24 23:09:42 INFO MemoryStore: ensureFreeSpace(1152) called with curMem=1152, maxMem=278019440
14/10/24 23:09:42 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 1152.0 B, free 265.1 MB)
14/10/24 23:09:42 INFO DAGScheduler: Submitting 4 missing tasks from Stage 2 (ParallelCollectionRDD[1] at parallelize at <console>:18)
14/10/24 23:09:42 INFO TaskSchedulerImpl: Adding task set 2.0 with 4 tasks
14/10/24 23:09:42 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 8, localhost, PROCESS_LOCAL, 1169 bytes)
14/10/24 23:09:42 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 9, localhost, PROCESS_LOCAL, 1419 bytes)
14/10/24 23:09:42 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 10, localhost, PROCESS_LOCAL, 1169 bytes)
14/10/24 23:09:42 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 11, localhost, PROCESS_LOCAL, 1420 bytes)
14/10/24 23:09:42 INFO Executor: Running task 0.0 in stage 2.0 (TID 8)
14/10/24 23:09:42 INFO Executor: Running task 1.0 in stage 2.0 (TID 9)
14/10/24 23:09:42 INFO Executor: Running task 3.0 in stage 2.0 (TID 11)
a14/10/24 23:09:42 INFO Executor: Running task 2.0 in stage 2.0 (TID 10)

14/10/24 23:09:42 INFO Executor: Finished task 2.0 in stage 2.0 (TID 10). 585 bytes result sent to driver
114/10/24 23:09:42 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 10) in 16 ms on localhost (1/4)

314/10/24 23:09:42 INFO Executor: Finished task 0.0 in stage 2.0 (TID 8). 585 bytes result sent to driver

214/10/24 23:09:42 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 8) in 16 ms on localhost (2/4)

6
9
14/10/24 23:09:42 INFO Executor: Finished task 1.0 in stage 2.0 (TID 9). 585 bytes result sent to driver
b14/10/24 23:09:42 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 9) in 16 ms on localhost (3/4)

1
2
2
5
9
14/10/24 23:09:42 INFO Executor: Finished task 3.0 in stage 2.0 (TID 11). 585 bytes result sent to driver
14/10/24 23:09:42 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 11) in 31 ms on localhost (4/4)
14/10/24 23:09:42 INFO DAGScheduler: Stage 2 (foreach at <console>:21) finished in 0.031 s
14/10/24 23:09:42 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
14/10/24 23:09:42 INFO SparkContext: Job finished: foreach at <console>:21, took 0.037641021 s

Note how the printed list elements (a, 1, 3, 2, 6, 9, b, 1, 2, 2, 5, 9) appear interleaved with the log lines, because each task prints from its own thread.
Update 2:

I tried the code from maasg's answer, but got an error:

scala> val userDistanceRdd = usersRdd.map { case (user1, user2) => {
     |         val data = sc.broadcast.value
     |         val distance = euclidDistance(data(user1), data(user2))
     |         ((user1, user2),distance)
     |     }
     |     }
<console>:27: error: missing arguments for method broadcast in class SparkContext;
follow this method with `_' if you want to treat it as a partially applied function
               val data = sc.broadcast.value
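
The error happens because SparkContext.broadcast expects the value to broadcast as an argument and returns a Broadcast handle; .value is then read from that handle, not from the context. A minimal illustration using the same names:

val broadcastData = sc.broadcast(data) // returns Broadcast[Map[UserId, UserData]]
val localData = broadcastData.value    // read the broadcast value back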
To make maasg's code work, I also had to add a closing } at the end of the userDistanceRdd expression.

The working code:

type UserId = String
type UserData = Array[Double]

val users: List[UserId] = List("a" , "b")

val data: Map[UserId,UserData] = Map( ("a" , Array(3.0,4.0)),
("b" , Array(3.0,3.0)) )

def combinations[T](l: List[T]): List[(T,T)] = l match {
    case Nil => Nil
    case h::Nil => Nil
    case h::t => t.map(x=>(h,x)) ++ combinations(t)
}

val broadcastData = sc.broadcast(data)
val usersRdd = sc.parallelize(combinations(users))
val euclidDistance: (UserData, UserData) => Double = (x,y) => 
    math.sqrt((x zip y).map{case (a,b) => math.pow(a-b,2)}.sum)
val userDistanceRdd = usersRdd.map{ case (user1, user2) => {
        val data = broadcastData.value
        val distance = euclidDistance(data(user1), data(user2))
        ((user1, user2),distance)
    }
    }

userDistanceRdd.foreach(println)
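
As a side note (not from the answers): userDistanceRdd.foreach(println) prints on the executor threads, which is also why the output above came out interleaved with the log lines. To print the result on the driver instead, collect it first; that is fine here because the result is tiny:

userDistanceRdd.collect().foreach(println)
// prints ((a,b),1.0) for the sample data above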

First, I suggest that you move from storing your user model in lists to well-typed classes. Then you don't need to compute the distance between a user and itself, like (a-a) and (b-b), and there is no reason to compute the distance twice, (a-b) and (b-a). In code, that looks like this:

  val user1 = List("a", "1", "3", "2", "6", "9")
  val user2 = List("b", "1", "2", "2", "5", "9")

  case class User(name: String, features: Vector[Double])

  object User {
    def fromList(list: List[String]): User = list match {
      case h :: tail => User(h, tail.map(_.toDouble).toVector)
    }
  }

  def euclDistance(userA: User, userB: User) = {
    println(s"comparing ${userA.name} and ${userB.name}")
    val subElements = (userA.features zip userB.features) map {
      m => (m._1 - m._2) * (m._1 - m._2)
    }
    val summed = subElements.sum
    val sqRoot = Math.sqrt(summed)

    sqRoot
  }

  val all = List(User.fromList(user1), User.fromList(user2))

  val users: RDD[(User, User)] = sc.parallelize(all.combinations(2).toSeq.map {
    case l :: r :: Nil => (l, r)
  })

  users.foreach(t => euclDistance(t._1, t._2))


The actual solution will depend on the dimensions of the dataset. Assuming the original dataset fits in memory and you want to parallelize the computation of the Euclidean distances, I would proceed like this:

Assume users is the list of users by some id, and userData is the data to be processed for each user, indexed by id:

// sc is the Spark Context
type UserId = String
type UserData = Array[Double]

val users: List[UserId] = ???
val data: Map[UserId,UserData] = ???
// combinations generates the unique pairs of users for which distance makes sense:
// given that euclidDistance(a,b) = euclidDistance(b,a), only (a,b) is in this set
def combinations[T](l: List[T]): List[(T,T)] = l match {
    case Nil => Nil
    case h::Nil => Nil
    case h::t => t.map(x=>(h,x)) ++ combinations(t)
}

// broadcasts the data to all workers
val broadcastData = sc.broadcast(data)
val usersRdd = sc.parallelize(combinations(users))
val euclidDistance: (UserData, UserData) => Double = (x,y) => 
    math.sqrt((x zip y).map{case (a,b) => math.pow(a-b,2)}.sum)
val userDistanceRdd = usersRdd.map{ case (user1, user2) => {
        val data = broadcastData.value
        val distance = euclidDistance(data(user1), data(user2))
        ((user1, user2),distance)
    }
}

If the user data is too large, then instead of using a broadcast variable you would load it from external storage.
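
As a sketch of that variant (my illustration, not code from the answer; the file path and line format are hypothetical), the per-user data can live in an RDD keyed by user id and be joined against the pair RDD twice:

import org.apache.spark.rdd.RDD

// hypothetical input: one line per user, e.g. "a,3.0,4.0"
val dataRdd: RDD[(UserId, UserData)] =
  sc.textFile("hdfs:///path/to/userdata.txt").map { line =>
    val fields = line.split(",")
    (fields.head, fields.tail.map(_.toDouble))
  }

// usersRdd is the RDD of (user1, user2) pairs from above
val userDistanceRdd = usersRdd
  .join(dataRdd)                                       // (user1, (user2, data1))
  .map { case (user1, (user2, data1)) => (user2, (user1, data1)) }
  .join(dataRdd)                                       // (user2, ((user1, data1), data2))
  .map { case (user2, ((user1, data1), data2)) =>
    ((user1, user2), euclidDistance(data1, data2))
  }

Each join is a shuffle, so this trades the broadcast's memory cost for network cost.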

Comments:

- Thanks, see the updated question: I had to make a couple of adjustments to get this working. I'm using Spark v1.1.0 and running the code from the command line. Are you using the same?
- I ran it with Spark 1.1.0 and a local Spark context, so no serialization errors showed up. It looks like I didn't get what you needed the first time; I'll try to update my answer.
- all.combinations runs on the driver node; it would be cool to get the combinations out of the RDD, but I don't know yet how to do that efficiently. For your problem it is probably fine, though.
- Do you mean the number of passes? Does your code work as expected? I'm still thinking about how to solve this.
- @EugeneZhulenev using an RDD here doesn't make much sense, because the foreach will run on the driver and there is really no difference from doing the computation in plain Scala. I don't see why this needs to run on Spark as an RDD. Which dimension do you want to scale: features per user, or the number of users?
- @maasg possibly either, but more likely the number of users.
- @maasg isn't Spark meant for scaling? i.e. not only "scaling up".
- By "which dimension do you want to scale up" I meant the size exceeding the memory limits of a single machine.
- @maasg OK, are you saying this code is not suited to scaling up? See my question update: I tried your code but got an error.
- @blue sky oops, it should be broadcastData iso broadcast. I'll make sure the answer is complete.
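
On the comment about computing the combinations on the cluster rather than on the driver: one common pattern (a sketch of mine, not from the answers) is a filtered cartesian product, which works here because UserId is an ordered String:

val userIdsRdd = sc.parallelize(users)
val pairsRdd = userIdsRdd
  .cartesian(userIdsRdd)
  .filter { case (a, b) => a < b } // keep each unordered pair once; drop (x,x)

This shuffles O(n^2) candidate pairs, so it only pays off once the user list is too large to build the combinations on the driver.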