Scala: cannot add keys to a map in parallel

I have the following code:

  var res: GenMap[Point, GenSeq[Point]] = points.par.groupBy(point => findClosest(point, means))
  means.par.foreach(mean => if(!res.contains(mean)) {
    println("Map doesn't contain mean: " + mean)
    res += mean -> GenSeq.empty[Point]
    println("Map contains?: " + res.contains(mean))
  })
It uses this case class:

case class Point(val x: Double, val y: Double, val z: Double)
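Basically, the code groups the Point elements in points around the Point elements in means. The algorithm itself is not very important, though.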

My problem is that I get the following output:

Map doesn't contain mean: (0.44, 0.59, 0.73)
Map doesn't contain mean: (0.44, 0.59, 0.73)
Map doesn't contain mean: (0.1, 0.11, 0.11)
Map doesn't contain mean: (0.1, 0.11, 0.11)
Map contains?: true
Map contains?: true
Map contains?: false
Map contains?: true
Why do I get this line?

Map contains?: false
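I check whether the key is in the res map. If it is not, I add it. So how can it be missing from the map right afterwards?

Is this a problem with parallelism?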

There is a race condition in your code:

res += mean -> GenSeq.empty[Point]
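res += x is shorthand for res = res + x, which reads the map, builds a new one, and reassigns the var. That sequence is not atomic: multiple threads reassign res at the same time, so some entries can be lost.

This code fixes the problem by adding the missing keys sequentially: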

val closest = points.par.groupBy(point => findClosest(point, means))
val res = means.foldLeft(closest) {
  case (map, mean) =>
    if(map.contains(mean))
      map
    else
      map + (mean -> GenSeq.empty[Point])
}
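
If the insertions really have to run concurrently, one alternative is a concurrent map with an atomic insert. The following is only a sketch under assumptions: the object name SafeParallelInsert and the helper fillMissing are made up for illustration, it uses Futures rather than parallel collections, and it defines its own Point so that it is self-contained. TrieMap.putIfAbsent is atomic, so concurrent callers cannot overwrite each other's entries.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SafeParallelInsert {
  // Local copy of the question's case class, so the sketch compiles on its own.
  final case class Point(x: Double, y: Double, z: Double)

  // Hypothetical helper: adds an empty group for every mean that has no entry yet.
  def fillMissing(grouped: Map[Point, Seq[Point]], means: Seq[Point]): Map[Point, Seq[Point]] = {
    // Copy the grouped result into a thread-safe map.
    val res = TrieMap.empty[Point, Seq[Point]]
    res ++= grouped

    // One task per mean; putIfAbsent only inserts when the key is missing,
    // and it does so atomically.
    val tasks = means.map(mean => Future(res.putIfAbsent(mean, Seq.empty[Point])))
    Await.result(Future.sequence(tasks), 10.seconds)

    // Return an immutable snapshot once all tasks have finished.
    res.toMap
  }
}
```

For a small means collection the sequential foldLeft above is simpler and probably just as fast; the concurrent map only pays off when there are many keys to insert.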

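Processing a point changes the means, and the result is sensitive to the order in which points are processed, so the algorithm does not lend itself to parallel execution. If parallel execution is important enough to justify changing the algorithm, it may be possible to find one that can be applied in parallel.

Using a known set of grouping points, such as grid square centres, means that the points can be assigned to their grouping points in parallel, and then grouped by those grouping points in parallel: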

import scala.annotation.tailrec
import scala.collection.parallel.ParMap
import scala.collection.{GenMap, GenSeq, Map}
import scala.math._
import scala.util.Random

class ParallelPoint {
  val rng = new Random(0)

  val groups: Map[Point, Point] = (for {
                i <- 0 to 100
                j <- 0 to 100
                k <- 0 to 100
              }
              yield {
                val p = Point(10.0 * i, 10.0 * j, 10.0 * k)
                p -> p
              }
    ).toMap

  val points: Array[Point] = (1 to 10000000).map(aaa => Point(rng.nextDouble() * 1000.0, rng.nextDouble() * 1000.0, rng.nextDouble() * 1000.0)).toArray

  def findClosest(point: Point, groups: GenMap[Point, Point]): (Point, Point) = {
    val x: Double = rint(point.x / 10.0) * 10.0
    val y: Double = rint(point.y / 10.0) * 10.0
    val z: Double = rint(point.z / 10.0) * 10.0

    val mean: Point = groups(Point(x, y, z)) //.getOrElse(throw new Exception(s"$point out of range of mean ($x, $y, $z).") )

    (mean, point)
  }

  @tailrec
  private def total(points: GenSeq[Point]): Option[Point] = {
    points.size match {
      case 0 => None
      case 1 => Some(points(0))
      case _ => total((points(0) + points(1)) +: points.drop(2))
    }
  }

  def mean(points: GenSeq[Point]): Option[Point] = {
    total(points) match {
      case None => None
      case Some(p) => Some(p / points.size)
    }
  }

  val startTime = System.currentTimeMillis()

  println("starting test ...")

  val res: ParMap[Point, GenSeq[Point]] = points.par.map(p => findClosest(p, groups)).groupBy(pp => pp._1).map(kv => kv._1 -> kv._2.map(v => v._2))

  val groupTime = System.currentTimeMillis()
  println(s"... grouped result after ${groupTime - startTime}ms ...")

  points.par.foreach(p => if (! res(findClosest(p, groups)._1).exists(_ == p)) println(s"point $p not found"))

  val checkTime = System.currentTimeMillis()

  println(s"... checked grouped result after ${checkTime - startTime}ms ...")

  val means: ParMap[Point, GenSeq[Point]] = res.map{ kv => mean(kv._2).get -> kv._2 }

  val meansTime = System.currentTimeMillis()

  println(s"... means calculated after ${meansTime - startTime}ms.")
}

object ParallelPoint {
  def main(args: Array[String]): Unit = new ParallelPoint()
}

case class Point(x: Double, y: Double, z: Double) {
  def +(that: Point): Point = {
      Point(this.x + that.x, this.y + that.y, this.z + that.z)
  }

  def /(scale: Double): Point = Point(x/ scale, y / scale, z / scale)
}

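The final step replaces the grouping points with the computed means of the grouped points as the map keys. On my 2011 MBP this processes 10 million points in about 30 seconds.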
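
The core of that approach fits in a few lines. The sketch below is not the code above but a reduced illustration: the names GridGrouping, snap, centre, meansByGroup and the cell parameter are assumptions introduced here, and it uses sequential collections so that it runs without the separate parallel-collections module that Scala 2.13 and later require for .par. Because the centre of a point depends only on that point, the grouping step is the part that could be parallelized.

```scala
object GridGrouping {
  final case class Point(x: Double, y: Double, z: Double) {
    def +(that: Point): Point = Point(x + that.x, y + that.y, z + that.z)
    def /(scale: Double): Point = Point(x / scale, y / scale, z / scale)
  }

  // Snap a coordinate to the nearest multiple of the cell size.
  private def snap(v: Double, cell: Double): Double = math.rint(v / cell) * cell

  // The grid centre a point belongs to; it depends only on the point itself.
  def centre(p: Point, cell: Double): Point =
    Point(snap(p.x, cell), snap(p.y, cell), snap(p.z, cell))

  // Group points by grid centre, then re-key each group by the mean of its members.
  // groupBy never produces empty groups, so reduce is safe here.
  def meansByGroup(points: Seq[Point], cell: Double): Map[Point, Seq[Point]] =
    points.groupBy(centre(_, cell)).map { case (_, group) =>
      (group.reduce(_ + _) / group.size.toDouble) -> group
    }
}
```

Note that re-keying by the mean assumes no two groups end up with exactly the same mean; if they did, one entry would replace the other in the resulting map.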

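Does the problem occur without parallelization?

You are right, the second part is not executed in parallel. I assumed that the computationally expensive part is the execution of the findClosest() function. This is a simple solution that should be reasonably efficient in terms of parallelism (although it could be improved). I think that if "means" is not really large, then it will be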