Squeezing more speed out of Common Lisp / SBCL


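It has been claimed that a certain Lisp program can be made to run faster than its C equivalent. Trying to reproduce the result, I was able to get close (Lisp is 50% slower than C), but I would like to know whether anyone knows ways to squeeze more performance out of SBCL 1.3.1.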

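The target problem is adding a constant single float to every cell of an 800 x 800 array of single floats. The method is to write the program in both C and Common Lisp and compare the timings. The C code is as follows: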

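#include ...
(five #include directives; the header names and the rest of the C source are not preserved here)

The Lisp code is as follows. The comments indicate some things I tried that made no difference.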
;;; None of the commented-out declarations made any difference in speed. 

(declaim (optimize speed (safety 0)))

(defun image-add (to from val)
  (declare (type (simple-array single-float (*))
                 to from))
  (declare (type single-float val))
  (let ((size (array-dimension to 0)))
    ;(declare (type fixnum size))
    (dotimes (i size)
      ;(declare (type fixnum i))
      (setf (aref to i) (+ (aref from i) val)))))

(defparameter HORZ 800)
(defparameter VERT 800)

(defparameter PERF-REPS 1000)

(let ((to (make-array (* HORZ VERT) :element-type 'single-float))
      (fm (make-array (* HORZ VERT) :element-type 'single-float)))
  ;(declare (type fixnum HORZ))
  ;(declare (type fixnum VERT))
  (time (dotimes (_ PERF-REPS)
          ;(declare (type fixnum PERF-REPS))
          ;(declare (type fixnum _))
          ;(declare (inline image-add))
          (image-add to fm 42.0))))

I compile and run them as follows:
gcc -O3 image-add.c ./modules/tictoc/libtictoc.a && ./a.out

sbcl --script image-perf.lisp

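Typical running time for the Lisp version is 0.276 seconds. Not bad, but I want better. Of course, the whole point of the exercise is that the Lisp code is shorter, but does anyone know a way to make it as fast or faster?

SBCL gives a bunch of information

When I save your code (and wrap the final let example in a separate function) and compile it with SBCL, I actually get a bunch of diagnostic output that points out some places where better code could be generated. There is a lot of it, and on a quick skim all of it concerns test rather than image-add, so it may or may not help. But since the test code could be slowing things down, it is worth a shot.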
CL-USER> (compile-file ".../compile.lisp")
; compiling file ".../compile.lisp" (written 25 JAN 2016 01:53:23 PM):
; compiling (DECLAIM (OPTIMIZE SPEED ...))
; compiling (DEFUN IMAGE-ADD ...)
; compiling (DEFPARAMETER HORZ ...)
; compiling (DEFPARAMETER VERT ...)
; compiling (DEFPARAMETER PERF-REPS ...)
; compiling (DEFUN TEST ...)

; file: /home/taylorj/tmp/compile.lisp
; in: DEFUN TEST
;     (* HORZ VERT)
; 
; note: unable to
;   convert x*2^k to shift
; due to type uncertainty:
;   The first argument is a NUMBER, not a INTEGER.
;   The second argument is a NUMBER, not a INTEGER.
; 
; note: unable to
;   convert x*2^k to shift
; due to type uncertainty:
;   The first argument is a NUMBER, not a INTEGER.
;   The second argument is a NUMBER, not a INTEGER.

;     (DOTIMES (_ PERF-REPS) (IMAGE-ADD TO FM 42.0))
; --> DO BLOCK LET TAGBODY UNLESS IF >= IF 
; ==>
;   (< SB-C::X SB-C::Y)
; 
; note: forced to do GENERIC-< (cost 10)
;       unable to do inline fixnum comparison (cost 4) because:
;       The first argument is a UNSIGNED-BYTE, not a FIXNUM.
;       The second argument is a INTEGER, not a FIXNUM.

; --> DO BLOCK LET TAGBODY PSETQ PSETF LET* MULTIPLE-VALUE-BIND LET 1+ 
; ==>
;   (+ _ 1)
; 
; note: forced to do GENERIC-+ (cost 10)
;       unable to do inline fixnum arithmetic (cost 1) because:
;       The first argument is a UNSIGNED-BYTE, not a FIXNUM.
;       The result is a (VALUES (INTEGER 1) &OPTIONAL), not a (VALUES FIXNUM
;                                                                     &REST T).
;       unable to do inline fixnum arithmetic (cost 2) because:
;       The first argument is a UNSIGNED-BYTE, not a FIXNUM.
;       The result is a (VALUES (INTEGER 1) &OPTIONAL), not a (VALUES FIXNUM
;                                                                     &REST T).
;       etc.

;     (* HORZ VERT)
; 
; note: forced to do GENERIC-* (cost 30)
;       unable to do inline fixnum arithmetic (cost 4) because:
;       The first argument is a NUMBER, not a FIXNUM.
;       The second argument is a NUMBER, not a FIXNUM.
;       unable to do inline (signed-byte 64) arithmetic (cost 5) because:
;       The first argument is a NUMBER, not a (SIGNED-BYTE 64).
;       The second argument is a NUMBER, not a (SIGNED-BYTE 64).
;       etc.
; 
; note: forced to do GENERIC-* (cost 30)
;       unable to do inline fixnum arithmetic (cost 4) because:
;       The first argument is a NUMBER, not a FIXNUM.
;       The second argument is a NUMBER, not a FIXNUM.
;       unable to do inline (signed-byte 64) arithmetic (cost 5) because:
;       The first argument is a NUMBER, not a (SIGNED-BYTE 64).
;       The second argument is a NUMBER, not a (SIGNED-BYTE 64).
;       etc.
; 
; compilation unit finished
;   printed 6 notes

; .../compile.fasl written
; compilation finished in 0:00:00.009

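After that, you may want to look into possible loop unrolling, caching the array dimension, and the memory locality of the arrays. Those are fairly generic tips, though. I am not sure what specific things would help more here.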
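As a rough illustration of the first two tips, here is a minimal sketch of a hand-unrolled variant that reads the array length once. The name image-add-unrolled and the unroll factor of 4 are my own choices, not something from the question, and I have not measured whether it helps on SBCL 1.3.1. Safety is kept at 1 so that a bad argument signals an error instead of corrupting memory.

```lisp
;; Hypothetical variant of image-add: 4-way manual unrolling with the
;; array length read once. Not benchmarked; names are illustrative only.
(defun image-add-unrolled (to from val)
  (declare (optimize (speed 3) (safety 1))
           (type (simple-array single-float (*)) to from)
           (type single-float val))
  (let* ((size (length to))                 ; cached array dimension
         (main (- size (mod size 4))))      ; largest multiple of 4 <= size
    (declare (type fixnum size main))
    ;; Main loop: four elements per iteration.
    (do ((i 0 (+ i 4)))
        ((>= i main))
      (declare (type fixnum i))
      (setf (aref to i)       (+ (aref from i) val)
            (aref to (+ i 1)) (+ (aref from (+ i 1)) val)
            (aref to (+ i 2)) (+ (aref from (+ i 2)) val)
            (aref to (+ i 3)) (+ (aref from (+ i 3)) val)))
    ;; Tail loop: the remaining 0 to 3 elements.
    (do ((i main (1+ i)))
        ((>= i size))
      (declare (type fixnum i))
      (setf (aref to i) (+ (aref from i) val)))
    to))
```

It takes the same arguments as image-add, so it can be swapped into the timing loop directly, for example (image-add-unrolled to fm 42.0).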
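The standard answer about "making sure to use optimization and safety"

Edit: Shoot, I missed the top-level declaim that mentions speed and safety. It is still worth checking whether those are having the desired effect, but if they are, this answer is mostly redundant.

For the most part, a compiler is free to ignore declarations entirely. (The main exception is special declarations, which change the binding semantics of a variable; the standard also requires notinline, safety, and declaration specifiers to be honored.) So what a compiler does with them is up to the compiler. A declaration like type can be used in at least two different ways. If you are trying to compile very safe code, type declarations tell the compiler that additional checks can be added. That leads to slower code, of course, but it will be safer. On the other hand, if you are trying to generate very fast code, the compiler can take those type declarations as your guarantee that the values will always be of the right type, and generate faster code as a result.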
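To make the "declaration as a check" reading concrete, here is a small sketch. The names checked-identity and try-call are mine, and the behavior shown is what I expect from SBCL specifically; another implementation is allowed to ignore the type declaration and simply return the argument.

```lisp
;; Illustrative names only: checked-identity and try-call are not from
;; the original post. Behavior described is SBCL's, not the standard's.
(defun checked-identity (x)
  ;; At high safety SBCL treats the type declaration as an assertion
  ;; and checks the argument on entry.
  (declare (optimize (safety 3) (speed 0))
           (type fixnum x))
  x)

(defun try-call (fn arg)
  ;; Return the function's value, or :type-error if the call signals one.
  (handler-case (funcall fn arg)
    (type-error () :type-error)))

(print (try-call #'checked-identity 41))      ; a valid fixnum argument
(print (try-call #'checked-identity "oops"))  ; not a fixnum
```

With (safety 0) the same declaration becomes a promise instead: no check is emitted, and passing a non-fixnum is undefined behavior.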
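It looks like you are only adding type declarations. If you want faster (or safer) code, you need to add declarations that say so as well. In this case, you probably want (declare (optimize (speed 3) (safety 0))). For example, look at some disassemblies of a simple function that returns its fixnum argument. First, with only the type declared, the code comes to 18 bytes: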
(defun int-identity (x)
  (declare (type fixnum x))
  x)

INT-IDENTITY
CL-USER> (disassemble 'int-identity)
; disassembly for INT-IDENTITY
; Size: 18 bytes. Origin: #x100470619A
; 9A:       488BE5           MOV RSP, RBP                     ; no-arg-parsing entry point
; 9D:       F8               CLC
; 9E:       5D               POP RBP
; 9F:       C3               RET
; A0:       CC0A             BREAK 10                         ; error trap
; A2:       02               BYTE #X02
; A3:       19               BYTE #X19                        ; INVALID-ARG-COUNT-ERROR
; A4:       9A               BYTE #X9A                        ; RCX
; A5:       CC0A             BREAK 10                         ; error trap
; A7:       04               BYTE #X04
; A8:       08               BYTE #X08                        ; OBJECT-NOT-FIXNUM-ERROR
; A9:       FE1B01           BYTE #XFE, #X1B, #X01            ; RDX
NIL

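Now, if we also add a speed optimization, the code size goes up a little. (That is not necessarily a bad thing. Some speed optimizations, such as loop unrolling or function inlining, produce larger code.)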
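The speed-only variant described here would look roughly like this. The name int-identity-speed is hypothetical (the post redefines int-identity each time), and I am not reproducing a disassembly listing because the exact bytes depend on the SBCL version and platform.

```lisp
;; Hypothetical name; same function as int-identity, with a speed
;; optimization added and safety left at its default.
(defun int-identity-speed (x)
  (declare (type fixnum x)
           (optimize (speed 3)))
  x)

;; Inspect the generated code; sizes vary by SBCL version and platform.
(disassemble 'int-identity-speed)
```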
Finally, when we remove safety, we end up with some really short code, only 9 bytes:
CL-USER> 
(defun int-identity (x)
  (declare (type fixnum x)
           (optimize (speed 3)
                     (safety 0)))
  x)

STYLE-WARNING: redefining COMMON-LISP-USER::INT-IDENTITY in DEFUN
INT-IDENTITY
CL-USER> (disassemble 'int-identity)
; disassembly for INT-IDENTITY
; Size: 9 bytes. Origin: #x1004AFF3E2
; 2:       488BD1           MOV RDX, RCX                      ; no-arg-parsing entry point
; 5:       488BE5           MOV RSP, RBP
; 8:       F8               CLC
; 9:       5D               POP RBP
; A:       C3               RET

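Using @coredump's :lparallel suggestion, I was able to get a consistent running time of 0.125 seconds, noticeably faster than gcc -O3's 0.175 seconds. The various techniques suggested in the comments, compiling the file and inlining the image-add function, produced no appreciable speedup. Here is the fastest code so far: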
(load "~/quicklisp/setup.lisp")
(ql:quickload :lparallel)

(declaim (optimize (speed 3) (safety 0)))

(defparameter HORZ 800)
(defparameter VERT 800)

(defparameter PERF-REPS 1000)

(setf lparallel:*kernel* (lparallel:make-kernel 4))

(defun test ()
  (declare (type fixnum HORZ VERT PERF-REPS))
  (let ((to (make-array (* HORZ VERT) :element-type 'single-float))
        (fm (make-array (* HORZ VERT) :element-type 'single-float)))
    (time (dotimes (_ PERF-REPS)
            (lparallel:pmap-into to (lambda (x) (+ x 42f0)) fm)))))

(test)

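Edit: I will note that this is not really a fair comparison: I added explicit parallelism to the Lisp code but not to the C code. It is notable, though, how easy this was to do in Lisp. Since the main advantages of Lisp over C in this case are the brevity of the code and the relative ease of adding features such as parallelism, the trade-off goes in the right direction (it illustrates Lisp's relative flexibility). I suspect that parallel C code (if and when I get around to implementing it) would be faster than the Lisp code.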
For reference, here are the results with slightly modified versions.

C version

The C version takes 0.197 s on average.

Lisp version

Here is the output:
Evaluation took:                                                                                                 
  0.372 seconds of real time                                                                                     
  0.372024 seconds of total run time (0.368023 user, 0.004001 system)                                            
  100.00% CPU                                                                                                    
  965,075,988 processor cycles                                                                                   
  0 bytes consed  

Replacing map-into with lparallel:pmap-into, the shortest time is obtained with a kernel of 4 workers, which gives:
Evaluation took:                                                                                                 
 0.122 seconds of real time                                                                                     
 0.496031 seconds of total run time (0.492030 user, 0.004001 system)                                            
 406.56% CPU                                                                                                    
 316,445,789 processor cycles                                                                                   
 753,280 bytes consed

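Note the difference in memory usage.

Comments:

Even with the caveat I noted above, I did notice that SBCL gives a number of helpful hints when compiling the file, which I will go through now...

I rewrote the code with a defun test as the compiler notes suggested, and verified that, after adding the inner declare fixnum, it compiles without any notes. Average running time is still 0.275 seconds :)

@Reb.cab Yes, unfortunately it does not seem to make much difference. Some other things I have been looking at that you may want to consider: function inlining and passing the array size. I am not sure how SBCL stores the array size, but fetching it on every call could have an effect. Also, check the float type. Do you really want single floats rather than double floats (or long floats, etc.)? At this point, though, I think you may want to look at how the C code disassembles and compare it with what SBCL generates.

You should use the file compiler, compile-file, to actually enable more optimizations. Loading the source compiles the forms individually. The file compiler may do more.

If I add a (declaim (inline image-add)) before the function definition (and recompile all call sites), I see a speedup (from 0.560 s to 0.364 s).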
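Putting the inlining and "pass the array size" suggestions together might look like the sketch below. image-add-n and run-image-add are made-up names, and the extra size parameter is my reading of the suggestion, not code from the thread. The inline declamation has to come before the defun, and callers have to be compiled after it for the inlining to take effect.

```lisp
;; Hypothetical names; a sketch of the inlining and size-passing ideas.
(declaim (inline image-add-n))
(defun image-add-n (to from val size)
  (declare (optimize (speed 3) (safety 1))
           (type (simple-array single-float (*)) to from)
           (type single-float val)
           (type fixnum size))
  ;; The caller supplies SIZE, so the dimension is not fetched here.
  (dotimes (i size to)
    (setf (aref to i) (+ (aref from i) val))))

(defun run-image-add (n reps)
  (declare (type fixnum n reps))
  (let ((to (make-array n :element-type 'single-float :initial-element 0f0))
        (fm (make-array n :element-type 'single-float :initial-element 0f0)))
    ;; Compiled after the DECLAIM above, so this call can be inlined.
    (dotimes (r reps to)
      (image-add-n to fm 42f0 n))))

;; To go through the file compiler instead of loading the source:
;; (load (compile-file "image-perf.lisp"))
```

Wrapping the calls in time, as in (time (run-image-add (* 800 800) 1000)), gives a measurement comparable to the one in the question.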
Source of the Lisp version measured above:

(declaim (optimize (speed 3) (debug 0) (safety 0)))

(defconstant HORZ 800)
(defconstant VERT 800)

(defconstant PERF-REPS 1000)

(defun test ()
  (let ((target #1=(make-array (* HORZ VERT)
                               :element-type 'single-float
                               :initial-element 0f0))
        (source #1#))
    (declare (type (simple-array single-float (*)) target source))
    (time (dotimes (_ PERF-REPS)
            (map-into target
                      (lambda (x)
                        (declare (single-float x))
                        (the single-float (+ x 42f0)))
                      source)))))
