How can I optimize this short factorial function in Scala? (Creating 50000 BigInts)


Tags: function, scala, optimization, lazy-evaluation, factorial


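I compared the Scala version

(BigInt(1) to BigInt(50000)).reduce(_ * _)

with the Python version

reduce(lambda x,y: x*y, range(1,50000))

It turns out that the Scala version takes 10 times as long as the Python version.

My guess is that a big difference is that Python can use its native long type instead of creating a new BigInt object for every number. But is there a workaround in Scala?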

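The fact that your Scala code creates 50000 BigInt objects is unlikely to make much of a difference here. A bigger issue is the multiplication algorithm: Scala's BigInt merely wraps Java's BigInteger, which does not use the faster multiplication algorithm that Python's long does.

The simplest fix is probably to switch to a better arbitrary-precision math library, such as JScience's org.jscience.mathematics.number.LargeInteger.

This is faster than the Python solution on my machine.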

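Update: I wrote some quick benchmarking code in response to another answer and ran it on my (quad-core) machine.

I don't see the difference between reduce and fold that he did, but the moral is clear: if you can use Scala 2.9's parallel collections, they will give you a huge improvement, but switching to LargeInteger helps as well.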

Python on my machine:

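import time

def func():
  # Python 2 code, as measured: reduce is a builtin and print is a statement
  start = time.clock()
  reduce(lambda x,y: x*y, range(1,50000))
  end = time.clock()
  t = (end-start) * 1000  # elapsed time in milliseconds
  print t

func()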

gives 1219 ms.

Scala:

def timed[T](f: => T) = {
  val t0 = System.currentTimeMillis
  val r = f
  val t1 = System.currentTimeMillis
  println("Took: "+(t1 - t0)+" ms")
  r
}

timed { (BigInt(1) to BigInt(50000)).reduce(_ * _) }
4251 ms

timed { (BigInt(1) to BigInt(50000)).fold(BigInt(1))(_ * _) }
4224 ms

timed { (BigInt(1) to BigInt(50000)).par.reduce(_ * _) }
2083 ms

timed { (BigInt(1) to BigInt(50000)).par.fold(BigInt(1))(_ * _) }
689 ms

// using org.jscience.mathematics.number.LargeInteger from Travis's answer
timed { val a = (1 to 50000).foldLeft(LargeInteger.ONE)(_ times _) }
3327 ms

timed { val a = (1 to 50000).map(LargeInteger.valueOf(_)).par.fold(
                                          LargeInteger.ONE)(_ times _) }
361 ms

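The 689 ms and 361 ms figures were taken after a few warm-up runs. Both started at around 1000 ms, but they seem to warm up by different amounts. The parallel collections seem to benefit from warm-up much more than the non-parallel ones: the non-parallel operations did not get significantly faster after their first run.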
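To make the warm-up effect visible rather than anecdotal, the timing helper can run the block a few times untimed before measuring. The following is only a sketch: `timedWarm` and its parameters are hypothetical names, not part of the original answer, and the workload is kept small so it finishes quickly.

```scala
// Hypothetical helper (not from the original answers): run `f` a few times
// untimed so the JIT can compile the hot paths, then time the measured runs.
def timedWarm[T](warmups: Int, runs: Int)(f: => T): (T, Seq[Long]) = {
  require(runs > 0, "need at least one measured run")
  (1 to warmups).foreach(_ => f)            // warm-up runs, results discarded
  var last: Option[T] = None
  val millis = (1 to runs).map { _ =>
    val t0 = System.nanoTime()
    last = Some(f)
    (System.nanoTime() - t0) / 1000000L     // elapsed milliseconds
  }
  (last.get, millis)
}

// Small workload for illustration; use 50000 to mirror the question.
val (result, times) = timedWarm(warmups = 3, runs = 5) {
  (BigInt(1) to BigInt(2000)).reduce(_ * _)
}
println(s"runs (ms): ${times.mkString(", ")}; min = ${times.min}")
```

Reporting every measured run (or at least the minimum) makes it easier to see whether a number such as 689 ms is a steady state or still drifting.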

.par (meaning: use parallel collections) seems to speed up fold more than reduce. I only have 2 cores, but more cores should give a bigger performance gain.

So, experimentally, the way to optimize this function is:

a) use fold instead of reduce

b) use parallel collections
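On Scala 2.13 and later the parallel collections live in a separate module, so `.par` may not be available out of the box. As a rough stand-in that uses only the standard library, the same idea (split the range, multiply the pieces concurrently, combine the partial products) can be sketched with `Future`s. The name `parFactorial` and the default chunk count are assumptions made for this illustration; this is not what `.par` does internally.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical helper: multiply 1..n by splitting the range into about
// `chunks` contiguous pieces, multiplying each piece in its own Future,
// and then combining the partial products on the calling thread.
def parFactorial(n: Int, chunks: Int = 8): BigInt = {
  require(n >= 0 && chunks > 0)
  if (n < 2) BigInt(1)
  else {
    val size = math.max(1, (n + chunks - 1) / chunks)   // ceiling division
    val partials = (1 to n).grouped(size).toVector.map { piece =>
      Future(piece.foldLeft(BigInt(1))((acc, k) => acc * BigInt(k)))
    }
    val all = Future.sequence(partials)
    Await.result(all, 1.minute).foldLeft(BigInt(1))(_ * _)
  }
}

println(parFactorial(20))
```

The chunk count is worth tuning per machine; the thread above reports quite different speedups on 2 and 4 cores.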

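Update: Inspired by the observation that breaking the calculation into smaller chunks speeds things up, I managed to get a version that runs in 215 ms on my machine, which is a 40% improvement over the standard parallel algorithm. (With BigInt it takes 615 ms.) Also, it does not use parallel collections, yet it somehow uses 90% of the CPU (unlike with BigInt).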
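The code for that update is not included in this copy of the thread. Going by the comments further down, which mention `zipAll`, one plausible shape is to halve the sequence repeatedly and multiply the two halves element by element. The sketch below is a guess at that idea using plain `BigInt`; `pairwiseFactorial` is a hypothetical name, and the timings quoted above do not apply to it.

```scala
import scala.annotation.tailrec

// Hypothetical reconstruction, not the answer's original code: repeatedly
// split the sequence in half and multiply the halves element by element,
// so the sequence length roughly halves on every pass.
def pairwiseFactorial(n: Int): BigInt = {
  require(n >= 0)
  @tailrec
  def loop(seq: Vector[BigInt]): BigInt =
    if (seq.length == 1) seq.head
    else {
      val (a, b) = seq.splitAt(seq.length / 2)
      // zipAll pads the shorter half with 1, the multiplicative identity
      loop(a.zipAll(b, BigInt(1), BigInt(1)).map { case (x, y) => x * y })
    }
  if (n < 2) BigInt(1) else loop((1 to n).map(BigInt(_)).toVector)
}

println(pairwiseFactorial(20))
```

Each pass multiplies numbers of comparable size instead of one huge accumulator by one small factor, which is the property the update attributes its speedup to.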

Another trick here is to try both reduceLeft and reduceRight and see which is fastest. On your example, I get much faster execution with reduceRight:

scala> timed { (BigInt(1) to BigInt(50000)).reduceLeft(_ * _) }
Took: 4605 ms

scala> timed { (BigInt(1) to BigInt(50000)).reduceRight(_ * _) }
Took: 2004 ms

The same difference holds between foldLeft and foldRight. I guess it matters which side of the tree you start reducing from :)
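To see what "which side of the tree" means, it can help to build the expression as a string instead of evaluating it. This small illustration only shows the grouping each method produces; it makes no claim about why one order was faster in the measurement above.

```scala
// Build the expression as a string instead of evaluating it, so the
// grouping chosen by each method is visible.
val terms = List("1", "2", "3", "4")

val left  = terms.reduceLeft((a, b) => s"($a*$b)")
val right = terms.reduceRight((a, b) => s"($a*$b)")

println(left)   // (((1*2)*3)*4)
println(right)  // (1*(2*(3*4)))
```

With reduceLeft the running product is the left operand and grows from the small end of the range; with reduceRight it is the right operand and grows from the large end.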

The most efficient way to compute a factorial in Scala is to use a divide-and-conquer strategy:

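def fact(n: Int): BigInt = {
  require(n >= 0, "n must be non-negative")
  // 0! = 1; this guard also keeps rangeProduct(1, 0) from recursing forever
  if (n == 0) BigInt(1) else rangeProduct(1, n)
}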

private def rangeProduct(n1: Long, n2: Long): BigInt = n2 - n1 match {
  case 0 => BigInt(n1)
  case 1 => BigInt(n1 * n2)
  case 2 => BigInt(n1 * (n1 + 1)) * n2
  case 3 => BigInt(n1 * (n1 + 1)) * ((n2 - 1) * n2)
  case _ => 
    val nm = (n1 + n2) >> 1
    rangeProduct(n1, nm) * rangeProduct(nm + 1, n2)
}

Also, to get more speed, use the latest version of the JDK and the following JVM options:

-server -XX:+TieredCompilation

Below are the results for an Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz (max 3.50GHz), 12 GB DDR3-1333 RAM, Windows 7 SP1, Oracle JDK 1.8.0_25-b18 64-bit:

(BigInt(1) to BigInt(100000)).product took: 3,806 ms with 26.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduce(_ * _) took: 3,728 ms with 25.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceLeft(_ * _) took: 3,510 ms with 25.1 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceRight(_ * _) took: 4,056 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).fold(BigInt(1))(_ * _) took: 3,697 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.product took: 406 ms with 66.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduce(_ * _) took: 296 ms with 71.1 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceLeft(_ * _) took: 3,495 ms with 25.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceRight(_ * _) took: 3,900 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.fold(BigInt(1))(_ * _) took: 327 ms with 56.1 % of CPU usage
fact(100000) took: 203 ms with 28.3 % of CPU usage

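By the way, to make factorial calculation more efficient for numbers greater than 20000, use an implementation of the Schönhage-Strassen algorithm, or wait until it is merged into JDK 9 and Scala is able to use it.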

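Comments:

How long does the Scala version take? It takes about 7 seconds on my machine. I have also written it in plain Java, and that takes about 6 seconds. By your statement, Python should be an order of magnitude faster than Java?

I measured about 23 seconds by running the Scala version in sbt, and timed the Python version using time.time() differences in Python's REPL. I did make a mistake, but the difference is still obvious.

Does this show that Scala's syntax is misleading?

@PeterSchwede, I don't think it's misleading; it just shows that Java's BigInteger class is a bit slow compared to Python's, and that there are faster algorithms for computing factorials than the obvious one. The fact that Scala lets you tune the code so that it ends up 6 times faster than Python should be seen as a positive. Now, if there were a built-in factorial function that was dog-slow, that would be a cause for concern.

On a quad core, LargeIntegerReduce takes 11 times as long as LargeIntegerReducePar? I mean, in practice, with cache effects and so on, it is entirely possible to get slightly better than linear scaling, but a speedup of 11.6 on 4 cores seems a bit suspicious. Or am I missing something?

@Voo: I found that odd too, but it is at least conceivable that we would see better-than-linear scaling, because we reduce the size of the huge numbers involved by splitting the sequence, taking the products of the subsequences, and multiplying the results.

That could be true. It is still a huge improvement, but your benchmark code looks fine too. The last segment (assuming a simple split) would be (50k*3/4)! smaller, which is a huge number in itself. That is also the best hypothesis I can come up with. Assuming it is true, it also opens up room for single-threaded improvements. Interesting idea ;-)

@Voo: Yes, even something as simple as

(BigInt(1) to BigInt(50000)).grouped(100).map(_.product).grouped(100).map(_.product).product

gets you close to the performance of the parallel collections (550 ms in this case).

@Travis Interesting... I used that idea and shaved off another 40% by using zipAll, which is faster than grouped. See my answer.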
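The `grouped(100)` one-liner from the comments applies two fixed levels of grouping. A minimal sketch of how it might be generalized is to keep collapsing groups until a single value remains. `groupedFactorial` and its `group` parameter are hypothetical names introduced here, and no timing is claimed for this version.

```scala
import scala.annotation.tailrec

// Hypothetical generalization of the two-level `grouped(100)` one-liner:
// keep collapsing groups of `group` factors into their products until a
// single value is left.
def groupedFactorial(n: Int, group: Int = 100): BigInt = {
  require(n >= 0 && group >= 2)
  @tailrec
  def loop(xs: Vector[BigInt]): BigInt =
    if (xs.length <= 1) xs.headOption.getOrElse(BigInt(1))
    else loop(xs.grouped(group).map(_.product).toVector)
  loop((1 to n).map(BigInt(_)).toVector)
}

println(groupedFactorial(20))
```

Unlike the one-liner, this keeps the final multiplication step small for any `n`, since no level ever multiplies more than `group` values together.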
