What is the most efficient way to iterate a collection in parallel in Scala (TopN pattern)?

scala collections parallel-processing

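I am new to Scala and want to build a real-time application that matches people. For a given person, I want to get the 50 people with the highest matching score.

The idiom is the following: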

import scala.collection.mutable

val persons = new mutable.HashSet[Person]() // Collection of people
/* Feed omitted */
val personsPar = persons.par // Make it parallel
val person = ... // The given person

val res = personsPar
        .filter(...) // Some filters
        .map { p => (p, computeMatchingScoreAsFloat(person, p)) } // Pair each candidate with its score
        .toList
        .sortBy(-_._2) // Highest score first
        .take(50)
        .map(t => t._1 + "=" + t._2).mkString("\n")

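The sample code above uses a HashSet, but it could be any type of collection, since I am fairly sure this one is not optimal.

The problem is that persons contains more than 5M elements, and the computeMatchingScoreAsFloat method computes a kind of correlation value from 2 vectors of 200 floats. My computer has 6 cores, and the computation takes about 2 seconds.

My question is: what is the fastest way to do this in Scala?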

Sub-questions:
- Which collection implementation (or something else?) should I use?
- Should I use Futures?

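Note: this has to be computed in parallel. The pure computation of computeMatchingScoreAsFloat alone (no ranking/top N) takes more than one second, and < 200 ms when multi-threaded on my computer.

Edit: thanks to Guillaume, the computation time dropped from 2 seconds to 700 ms.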

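def top[B](n: Int, t: Traversable[B])(implicit ord: Ordering[B]): collection.mutable.PriorityQueue[B] = {

  // Reverse the ordering so that the head of the queue is the lowest (of the max)
  // or the greatest (of the min), i.e. the element to evict first.
  val starter = collection.mutable.PriorityQueue[B]()(ord.reverse)

  t.foldLeft(starter)(
    (myQueue, a) => {
      // Strict "<" so the queue holds at most n elements ("<=" would keep n + 1).
      if (myQueue.length < n) { myQueue.enqueue(a); myQueue }
      // Skip a when nothing should be kept, or when it ranks below the worst kept element.
      else if (n <= 0 || ord.compare(a, myQueue.head) < 0) myQueue
      else {
        myQueue.dequeue()
        myQueue.enqueue(a)
        myQueue
      }
    }
  )
}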

I would suggest a few modifications:
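1- I think the filter and map steps require traversing the collection twice. A lazy collection would reduce that to a single traversal. Either use a lazy collection (such as a Stream) or convert the existing one to a lazy one. For a list, for example: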
myList.view

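2- The sort step requires sorting all the elements. Instead, you can do a foldLeft with an accumulator that stores the top N records. See here for an implementation example. If you want maximum performance I would probably test a priority queue instead of a list (this really falls into its wheelhouse). For instance, something like this: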
  // Lazy stream of n random pairs, used as test data
  def IntStream(n: Int): Stream[(Int, Int)] =
    if (n == 0) Stream.empty else (util.Random.nextInt(), util.Random.nextInt()) #:: IntStream(n - 1)

  def top[B](n: Int, t: Traversable[B])(implicit ord: Ordering[B]): collection.mutable.PriorityQueue[B] = {

    // Reverse the ordering so that the head of the queue is the lowest (of the max)
    // or the greatest (of the min), i.e. the element to evict first.
    val starter = collection.mutable.PriorityQueue[B]()(ord.reverse)

    t.foldLeft(starter)(
      (myQueue, a) => {
        // Strict "<" so the queue holds at most n elements ("<=" would keep n + 1).
        if (myQueue.length < n) { myQueue.enqueue(a); myQueue }
        // Skip a when nothing should be kept, or when it ranks below the worst kept element.
        else if (n <= 0 || ord.compare(a, myQueue.head) < 0) myQueue
        else {
          myQueue.dequeue()
          myQueue.enqueue(a)
          myQueue
        }
      }
    )
  }

  def diff(t2: (Int, Int)) = t2._2
  top(10, IntStream(10000))(Ordering.by(diff)) // select top 10
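
To make the two suggestions concrete together, here is a minimal sketch that feeds a lazy view (filter and map fused into one traversal) into a bounded priority queue. The names LazyTopN, topN, score and demo, the sample data and the integer score are all hypothetical stand-ins chosen for illustration; in the question the score would come from computeMatchingScoreAsFloat.

```scala
import scala.collection.mutable

object LazyTopN {
  // Keep only the n largest elements (by `ord`) while consuming `items` once.
  def topN[B](n: Int, items: Iterator[B])(implicit ord: Ordering[B]): List[B] = {
    // Reversed ordering: the head of the queue is the smallest element kept so far.
    val heap = mutable.PriorityQueue[B]()(ord.reverse)
    items.foreach { a =>
      if (heap.size < n) heap.enqueue(a)
      else if (n > 0 && ord.gt(a, heap.head)) {
        heap.dequeue()
        heap.enqueue(a)
      }
    }
    heap.toList.sorted(ord.reverse) // best first
  }

  // Hypothetical stand-in for computeMatchingScoreAsFloat (an Int keeps the toy simple).
  def score(x: Int): Int = x * 3

  def demo(): List[(Int, Int)] = {
    val data = List(5, 12, 7, 30, 18, 3, 25)
    // Nothing is evaluated here: the view only records the filter and map steps.
    val candidates = data.view.filter(_ > 4).map(x => (x, score(x)))
    // Elements are filtered, scored and ranked as the iterator is consumed.
    topN(3, candidates.iterator)(Ordering.by[(Int, Int), Int](_._2))
  }
}
```

With the sample data, LazyTopN.demo() returns List((30,90), (25,75), (18,54)). The queue never holds more than n elements, so the ranking costs O(log n) per element instead of a full sort of all scored candidates.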

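Thanks for your help, I will test this and tell you the new computation time.

I can't seem to get a view of a parallel collection (a collection that is both parallel and lazy); I can have either a ParSeq or a SeqView, but not both. Do you know how I can do that? Is there a ParSeqView in Scala?

Have you tried running it on a non-parallel collection? I think parallel collections have a lot of overhead, which may not be justified (also, you have a "toList", which I suspect merges everything into a non-parallel collection). Otherwise, you could run a flatMap to combine the filter and the map (see here); that would be a lazy collection.

Thanks for your help. Same result. I have updated my original post. Yes, I tested the computeMatchingScoreAsFloat computation on its own: it takes 1 s on a non-par collection and 200 ms on a par collection. 2 s -> 700 ms. Any other suggestions?

I would try splitting manually: put the data in buckets and run the computation multi-threaded, without using any collection methods.

Have you benchmarked this? Is computeMatchingScoreAsFloat the most expensive part? Is it heavy enough to be worth parallelizing?

Yes, it has to be computed in parallel: the pure computation of computeMatchingScoreAsFloat (no ranking/top N) takes more than one second, and 40 ms when multi-threaded on my computer.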
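
The manual bucketing idea from the comments could be sketched roughly as follows, using Futures from the standard library: each bucket is scored and reduced to its own top N on a separate task, and the small partial results are merged at the end. The names ChunkedTopN, parallelTop and localTop, the buckets parameter and the use of the global execution context are assumptions made for this illustration, not something taken from the thread.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ChunkedTopN {
  // Sequential top-n over one bucket, using a bounded min-heap.
  private def localTop[B](n: Int, items: Iterator[B])(implicit ord: Ordering[B]): List[B] = {
    val heap = mutable.PriorityQueue[B]()(ord.reverse) // head = smallest kept element
    items.foreach { a =>
      if (heap.size < n) heap.enqueue(a)
      else if (n > 0 && ord.gt(a, heap.head)) { heap.dequeue(); heap.enqueue(a) }
    }
    heap.toList.sorted(ord.reverse) // best first
  }

  // Split `data` into roughly `buckets` slices, score and rank each slice in its own
  // Future, then merge the per-bucket winners (at most n * buckets elements).
  def parallelTop[A, S](n: Int, data: IndexedSeq[A], buckets: Int)(score: A => S)(
      implicit ord: Ordering[S]): List[(A, S)] = {
    require(buckets > 0, "buckets must be positive")
    val byScore: Ordering[(A, S)] = Ordering.by[(A, S), S](_._2)
    val bucketSize = math.max(1, (data.length + buckets - 1) / buckets)
    val partials: List[Future[List[(A, S)]]] =
      data.grouped(bucketSize).toList.map { bucket =>
        Future(localTop(n, bucket.iterator.map(a => (a, score(a))))(byScore))
      }
    // Blocking here is acceptable for a sketch; a real-time service would likely stay asynchronous.
    val merged = Await.result(Future.sequence(partials), Duration.Inf).flatten
    localTop(n, merged.iterator)(byScore)
  }
}
```

A call such as ChunkedTopN.parallelTop(50, persons.toVector, 6)(p => computeMatchingScoreAsFloat(person, p)) would mirror the question's setup, assuming persons fits in an indexed sequence. Each task scores and ranks in one pass, so no intermediate list of all 5M scored pairs is ever built; whether this beats parallel collections on a given machine would need to be measured.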