How to reduce the execution time difference between the C API and the Python executable?

python python-3.x performance

Running the same Python script with python3, or through an embedded interpreter linked against libpython3, gives different execution times:

$ time PYTHONPATH=. ./simple
real    0m6,201s
user    1m3,680s
sys     0m0,212s

$ time PYTHONPATH=. python3 -c 'import test; test.run()'
real    0m5,193s
user    0m53,349s
sys     0m0,164s

(Deleting the contents of __pycache__ between runs does not seem to make a difference.)

For now, invoking python3 on the script is faster; in my actual use case it is 1.5 times faster than the same script run from the embedded interpreter.
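To compare the two interpreters with the same yardstick, the measurement can also be taken from inside Python rather than with the shell's time command. The following is a minimal sketch and not part of the original post; measure is a hypothetical helper name. It relies on os.times(), which only counts child processes that have already been waited for, so the figures for a multiprocessing pool should be read as indicative.

```python
import os
import time


def measure(fn, *args, **kwargs):
    """Run fn once; return wall-clock and CPU seconds (hypothetical helper)."""
    cpu_start = os.times()
    wall_start = time.perf_counter()
    result = fn(*args, **kwargs)
    wall_end = time.perf_counter()
    cpu_end = os.times()
    return {
        # Elapsed wall-clock time, comparable to "real" in the shell output.
        "real": wall_end - wall_start,
        # CPU time of this process plus children that were waited for.
        "user": (cpu_end.user - cpu_start.user)
        + (cpu_end.children_user - cpu_start.children_user),
        "sys": (cpu_end.system - cpu_start.system)
        + (cpu_end.children_system - cpu_start.children_system),
        "result": result,
    }
```

Calling measure(test.run) in both the embedded interpreter and python3 would then report the same three quantities for each.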

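I would like to (1) understand where the difference comes from, and (2) know whether the same performance can be achieved with the embedded interpreter. (Using Cython, for example, is not an option at the moment.)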

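Code: simple.cpp, test.py, and mandel.py (a version adapted from another source). The build command and the Python sources are listed at the end of this post.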

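Apparently, the time difference comes from static versus dynamic linking against libpython. In the Makefile next to python.c (from the reference implementation), the following rule builds a statically linked version of the interpreter: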

snake: python.c
    g++ \
    -I/usr/include/python3.6m \
    -pthread \
    -specs=/usr/share/dpkg/no-pie-link.specs \
    -specs=/usr/share/dpkg/no-pie-compile.specs \
    \
    -Wall \
    -Wformat \
    -Werror=format-security \
    -Wno-unused-result \
    -Wsign-compare \
    -DNDEBUG \
    -g \
    -fwrapv \
    -fstack-protector \
    -O3 \
    \
    -Xlinker -export-dynamic \
    -Wl,-Bsymbolic-functions \
    -Wl,-z,relro \
    -Wl,-O1 \
    python.c \
    /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.a \
    -lexpat \
    -lpthread \
    -ldl \
    -lutil \
    -lexpat \
    -L/usr/lib \
    -lz \
    -lm \
    -o $@

Changing the line /usr/lib/../libpython3.6m.a to -lpython3.6m produces the slower version (this also requires adding -L/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu).
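To see which kind of linking a given process actually ended up with, the check can be made from inside the interpreter. This is a minimal sketch under stated assumptions, not something from the original answer: build_link_info and loaded_libpython_paths are hypothetical helper names, the second function is Linux-only because it reads /proc/self/maps, and the sysconfig values describe how the Python installation was configured rather than how a particular embedding binary was linked.

```python
import os
import sys
import sysconfig


def build_link_info():
    """Build-time facts about libpython recorded by this Python installation."""
    names = ("Py_ENABLE_SHARED", "LIBRARY", "LDLIBRARY", "LIBDIR")
    return {name: sysconfig.get_config_var(name) for name in names}


def loaded_libpython_paths(maps_path="/proc/self/maps"):
    """Return the libpython shared objects mapped into this process.

    Linux-only: returns None when the maps file cannot be read.
    An empty list suggests libpython was linked statically.
    """
    try:
        with open(maps_path) as maps:
            lines = maps.read().splitlines()
    except OSError:
        return None
    paths = set()
    for line in lines:
        # Columns: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if os.path.basename(path).startswith("libpython"):
                paths.add(path)
    return sorted(paths)


if __name__ == "__main__":
    print("executable:", sys.executable)
    for name, value in build_link_info().items():
        print(name, "=", value)
    print("mapped libpython:", loaded_libpython_paths())
```

Running the same lines through python3 and through the embedded interpreter should show whether only one of them maps a libpython shared object.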

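Epilogue

The speed difference is real, but it is not the whole answer to my original question. The "slower" interpreter was in fact being run in a particular LD_PRELOAD environment that changed how the system time functions behave, which confused cProfile.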
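Since the environment turned out to matter, one quick sanity check before profiling is to look at LD_PRELOAD from inside the interpreter and see whether two clocks agree over a short interval. The sketch below is an illustration only; preloaded_libraries and clock_disagreement are hypothetical helper names. The post does not say which library was preloaded or which time functions it altered, so a large disagreement is merely a hint, and agreement proves nothing.

```python
import os
import time


def preloaded_libraries(environ=None):
    """List the entries of LD_PRELOAD (separated by colons or whitespace)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("LD_PRELOAD", "")
    return raw.replace(":", " ").split()


def clock_disagreement(duration=0.2):
    """Seconds by which time.time() and time.monotonic() differ over one sleep."""
    wall_start, mono_start = time.time(), time.monotonic()
    time.sleep(duration)
    wall_end, mono_end = time.time(), time.monotonic()
    return (wall_end - wall_start) - (mono_end - mono_start)


if __name__ == "__main__":
    print("LD_PRELOAD entries:", preloaded_libraries())
    print("clock disagreement (s):", clock_disagreement())
```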

Build command for simple.cpp:

g++ -std=c++11 -fPIC $(python3-config --cflags) simple.cpp \
    $(python3-config --ldflags) -o simple

test.py:

import sys
sys.stdout = open('output.bin', 'bw')
import mandel
def run():
    mandel.mandelbrot(4096)

mandel.py:

from contextlib import closing
from itertools import islice
from os import cpu_count
from sys import stdout

def pixels(y, n, abs):
    range7 = bytearray(range(7))
    pixel_bits = bytearray(128 >> pos for pos in range(8))
    c1 = 2. / float(n)
    c0 = -1.5 + 1j * y * c1 - 1j
    x = 0
    while True:
        pixel = 0
        c = x * c1 + c0
        for pixel_bit in pixel_bits:
            z = c
            for _ in range7:
                for _ in range7:
                    z = z * z + c
                if abs(z) >= 2.: break
            else:
                pixel += pixel_bit
            c += c1
        yield pixel
        x += 8

def compute_row(p):
    y, n = p

    result = bytearray(islice(pixels(y, n, abs), (n + 7) // 8))
    result[-1] &= 0xff << (8 - n % 8)
    return y, result

def ordered_rows(rows, n):
    order = [None] * n
    i = 0
    j = n
    while i < len(order):
        if j > 0:
            row = next(rows)
            order[row[0]] = row
            j -= 1

        if order[i]:
            yield order[i]
            order[i] = None
            i += 1

def compute_rows(n, f):
    row_jobs = ((y, n) for y in range(n))

    if cpu_count() < 2:
        yield from map(f, row_jobs)
    else:
        from multiprocessing import Pool
        with Pool() as pool:
            unordered_rows = pool.imap_unordered(f, row_jobs)
            yield from ordered_rows(unordered_rows, n)

def mandelbrot(n):
    write = stdout.write

    with closing(compute_rows(n, compute_row)) as rows:
        write("P4\n{0} {0}\n".format(n).encode())
        for row in rows:
            write(row[1])
