Python: optimizing a Reed-Solomon encoder (polynomial division)

python numpy optimization

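I am trying to optimize a Reed-Solomon encoder, which is really just a polynomial division over the Galois field GF(2^8) (which simply means that every value is a byte in the range 0..255 and arithmetic never leaves that range). In fact, the code is very similar to an existing Go implementation (the docstring below points to http://research.swtch.com/field).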

The polynomial division algorithm used here is synthetic division (also known as Horner's method).

I have tried everything: numpy, pypy, cython. The best performance I obtained was with pypy and the following simple nested loop:

def rsenc(msg_in, nsym, gen):
    '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''
    msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
    lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))])

    for i in xrange(len(msg_in)):
        coef = msg_out[i]
        # coef = gf_mul(msg_out[i], gf_inverse(gen[0]))  // for general polynomial division (when polynomials are non-monic), we need to compute: coef = msg_out[i] / gen[0]
        if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
            lcoef = gf_log[coef] # precaching

            for j in xrange(1, len(gen)): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] += msg_out[i] * gen[j]

    # Recopy the original message bytes
    msg_out[:len(msg_in)] = msg_in
    return msg_out
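The encoder above relies on two global tables, `gf_exp` and `gf_log`, that the question does not show. The following is a minimal sketch of how such tables could be built, assuming the generator element 3 and the reducing polynomial 0x11b. Both are assumptions, chosen because they reproduce the first table entries listed in the Cython answer further down. The helper names `gf_mul_slow`, `build_gf_tables` and `gf_mul` are hypothetical, and the code is written for Python 3 (`range` instead of `xrange`).

```python
def gf_mul_slow(a, b, prim=0x11b):
    """Bitwise (carry-less) multiplication in GF(2^8), reduced modulo `prim`."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:       # degree 8 reached: reduce by the field polynomial
            a ^= prim
        b >>= 1
    return r


def build_gf_tables(generator=3, prim=0x11b):
    """Return (gf_exp, gf_log) with gf_exp[i] = generator**i and gf_log its inverse."""
    gf_exp = bytearray(512)  # doubled so gf_exp[log a + log b] needs no modulo 255
    gf_log = bytearray(256)  # gf_log[0] is left at 0: the log of 0 is undefined
    x = 1
    for i in range(255):
        gf_exp[i] = x
        gf_log[x] = i
        x = gf_mul_slow(x, generator, prim)
    for i in range(255, 512):
        gf_exp[i] = gf_exp[i - 255]
    return gf_exp, gf_log


gf_exp, gf_log = build_gf_tables()


def gf_mul(a, b):
    """Table-based multiplication: a * b = exp(log a + log b), with 0 handled apart."""
    if a == 0 or b == 0:
        return 0
    return gf_exp[gf_log[a] + gf_log[b]]


print(list(gf_exp[:8]))  # [1, 3, 5, 15, 17, 51, 85, 255]
```

The doubled `gf_exp` table is what lets the inner loop of `rsenc` index with `lcoef + lgen[j]` directly, since the sum of two logarithms can reach 508.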

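Can a Python optimization wizard give me some clues about how to get a speedup? My goal is at least a 3x speedup, but more would be awesome. Any approach or tool is acceptable, as long as it is cross-platform (it must work at least on Linux and Windows).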

Here is a small test script with some of the other alternatives I tried (the cython attempt is not included, since it was slower than native python!):

(Note: the alternatives should be close to correct; some indices are probably slightly off, but since they are slower I did not try to fix them.)

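Update and goal for the bounty: I found a very interesting optimization trick that is supposed to speed up the computation a lot: precomputing the multiplication table. I updated the code above with the new function rsenc_precomp(). However, there is no gain at all in my implementation; it is even a bit slower: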

rsenc : total time elapsed 0.107170 seconds.
rsenc_precomp : total time elapsed 0.108788 seconds.

How can array lookups cost more than operations such as addition or xor? Why does it work in ZFEC but not in Python?

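I will award the bounty to whoever can show me how to make this multiplication/addition lookup-table optimization work (faster than the xor and addition operations), or who can explain, with references or analysis, why this optimization cannot work here (using Python/PyPy/Cython/Numpy etc.; I tried them all).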
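For readers who want to experiment with the lookup-table idea, here is a rough sketch of what a precomputed-multiplication encoder could look like. The question names `rsenc_precomp()` but does not show its body, so the version below is a guess rather than the asker's code. `GF_MUL`, `gf_mul_slow`, `rs_generator_poly` and `poly_eval` are hypothetical helpers, the field parameters (generator 3, polynomial 0x11b) are assumptions, and the code targets Python 3.

```python
def gf_mul_slow(a, b, prim=0x11b):
    """Bitwise multiplication in GF(2^8); only used to fill the table."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= prim
        b >>= 1
    return r


# Full 256 x 256 product table (64 KiB): GF_MUL[a][b] == a * b in GF(2^8).
GF_MUL = [bytearray(gf_mul_slow(a, b) for b in range(256)) for a in range(256)]


def rs_generator_poly(nsym, alpha=3):
    """Monic generator polynomial prod (x + alpha^i), highest degree first."""
    g = [1]
    root = 1
    for _ in range(nsym):
        nxt = g + [0]                      # g(x) * x
        for k, c in enumerate(g):
            nxt[k + 1] ^= GF_MUL[c][root]  # plus root * g(x)
        g = nxt
        root = GF_MUL[root][alpha]
    return g


def poly_eval(p, x):
    """Evaluate a polynomial (highest degree first) at x with Horner's method."""
    y = 0
    for c in p:
        y = GF_MUL[y][x] ^ c
    return y


def rsenc_precomp(msg_in, gen):
    """Systematic RS encoding with one table lookup per inner-loop step."""
    msg_out = bytearray(msg_in) + bytearray(len(gen) - 1)
    for i in range(len(msg_in)):
        coef = msg_out[i]
        if coef != 0:
            row = GF_MUL[coef]             # every product coef * x, fetched once
            for j in range(1, len(gen)):
                msg_out[i + j] ^= row[gen[j]]
    msg_out[:len(msg_in)] = msg_in         # restore the original message bytes
    return msg_out


gen = rs_generator_poly(10)
codeword = rsenc_precomp(b"hello world", gen)
```

Compared with the log/exp version, each inner step swaps one addition plus a lookup in a small table for a single lookup in a 64 KiB table. Whether that pays off probably depends on interpreter overhead and cache behaviour, so it is worth timing both variants on the target runtime.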

On my machine, the following is about 3x faster than pypy (0.04 s vs 0.15 s). It uses Cython:

ctypedef unsigned char uint8_t # does not work with Microsoft's C Compiler: from libc.stdint cimport uint8_t
cimport cpython.array as array

cdef uint8_t[::1] gf_exp = bytearray([1, 3, 5, 15, 17, 51, 85, 255, 26, 46, 114, 150, 161, 248, 19,
   lots of numbers omitted for space reasons
   ...])

cdef uint8_t[::1] gf_log = bytearray([0, 0, 25, 1, 50, 2, 26, 198, 75, 199, 27, 104, 
    more numbers omitted for space reasons
    ...])

import cython

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
def rsenc(msg_in_r, nsym, gen_t):
    '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''

    cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can't make a memory view from a read only object
    cdef int[::1] gen = array.array('i',gen_t) # convert list to array

    cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
    cdef int j
    cdef uint8_t[::1] lgen = bytearray(gen.shape[0])
    for j in xrange(gen.shape[0]):
        lgen[j] = gf_log[gen[j]]

    cdef uint8_t coef,lcoef

    cdef int i
    for i in xrange(msg_in.shape[0]):
        coef = msg_out[i]
        if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
            lcoef = gf_log[coef] # precaching

            for j in xrange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j]

    # Recopy the original message bytes
    msg_out[:msg_in.shape[0]] = msg_in
    return msg_out

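This is pretty much your fastest version with static types added (and with the HTML output of `cython -a` checked until the loops were no longer highlighted in yellow).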

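A few brief notes:

Cython prefers `x.shape[0]` to `len(x)`.

Declaring the memoryviews as `[::1]` promises that they are contiguous in memory, which helps.

`initializedcheck(False)` is necessary to avoid lots of existence checks on the globally defined `gf_exp` and `gf_log`. (You might find that you can speed up the basic Python/PyPy code by creating a local variable reference to these and using that instead.)

I had to copy a couple of the input arguments. Cython cannot make a memoryview from a read-only object (in this case `msg_in`, a string; I could probably have just made it a `char*`, though). Also, `gen` (a list) needs to be in something with fast element access.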
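The local-reference tip from the notes above can be tried in plain Python roughly as follows. This is only a sketch: `rsenc_locals` and `build_gf_tables` are hypothetical names, the field parameters (generator 3, polynomial 0x11b) are assumptions, and the code targets Python 3. In CPython, local names are resolved faster than globals; under PyPy the JIT may already remove most of the difference, so the gain should be measured rather than assumed.

```python
def build_gf_tables(prim=0x11b):
    """Build exp/log tables for GF(2^8) with generator 3 (assumed parameters)."""
    exp = bytearray(512)   # doubled so exp[log a + log b] never needs a modulo
    log = bytearray(256)
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x ^= x << 1        # multiply by the generator 3, i.e. x * 2 + x
        if x & 0x100:
            x ^= prim      # reduce modulo the field polynomial
    for i in range(255, 512):
        exp[i] = exp[i - 255]
    return exp, log


gf_exp, gf_log = build_gf_tables()


def rsenc_locals(msg_in, gen):
    """Same algorithm as rsenc, with the global tables bound to local names."""
    exp, log = gf_exp, gf_log          # one global lookup each, then locals only
    n, glen = len(msg_in), len(gen)
    msg_out = bytearray(msg_in) + bytearray(glen - 1)
    # Like the original, this assumes no generator coefficient is 0.
    lgen = bytearray(log[g] for g in gen)
    for i in range(n):
        coef = msg_out[i]
        if coef:
            lcoef = log[coef]
            for j in range(1, glen):
                msg_out[i + j] ^= exp[lcoef + lgen[j]]
    msg_out[:n] = msg_in               # restore the original message bytes
    return msg_out
```

The lengths are also hoisted into `n` and `glen`, so that `len()` is not called again on every pass through the loops.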

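Beyond that, everything is fairly straightforward. (I have not tried any variations of it, since it was already faster.) I am really quite impressed with how well PyPy does.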

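Alternatively, if you know C, I would recommend rewriting this Python function in plain C and calling it (say, with CFFI). At least you then know that you reach top performance in the inner loops of your function, without needing to be aware of either PyPy or Cython tricks.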

See the CFFI documentation for details.

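Building on DavidW's answer, here is the implementation I am currently using, which is about 20% faster thanks to nogil and parallel computation: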

from cython.parallel import parallel, prange

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
cdef rsenc_cython(msg_in_r, nsym, gen_t) :
    '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''

    cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can't make a memory view from a read only object
    #cdef int[::1] gen = array.array('i',gen_t) # convert list to array
    cdef uint8_t[::1] gen = gen_t

    cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
    cdef int i, j
    cdef uint8_t[::1] lgen = bytearray(gen.shape[0])
    for j in xrange(gen.shape[0]):
        lgen[j] = gf_log_c[gen[j]]

    cdef uint8_t coef,lcoef
    with nogil:
        for i in xrange(msg_in.shape[0]):
            coef = msg_out[i]
            if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
                lcoef = gf_log_c[coef] # precaching

                for j in prange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                    msg_out[i + j] ^= gf_exp_c[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j]

    # Recopy the original message bytes
    msg_out[:msg_in.shape[0]] = msg_in
    return msg_out

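I would still like it to be faster (in a real implementation, data is encoded at about 6.4 MB/s with n=255, n being the size of the message + codeword).

The main lead towards faster implementations that I found is the LUT (lookup table) approach, which precomputes the multiplication and addition arrays. However, in my Python and Cython implementations, the LUT approach is slower than computing the XOR and addition operations.

There are other approaches to implementing a faster RS encoder, but I have neither the ability nor the time to try them. I will leave them here as references for other interested readers: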

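"Fast software implementation of finite field operations", Cheng Huang and Lihao Xu, Washington University in St. Louis, Tech. Rep. (2003), and a correct implementation of the code.
Luo, Jianqiang, et al. "Efficient software implementations of large finite fields GF(2^n) for secure storage applications." ACM Transactions on Storage (TOS) 8.1 (2012): 2.
"A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage", Plank, J. S., Luo, J., Schuman, C. D., Xu, L., and Wilcox-O'Hearn, Z., FAST, Vol. 9, 2009. Or the non-extended version: "A Performance Comparison of Open-Source Erasure Coding Libraries for Storage Applications", Plank and Schuman.
Source code of the ZFEC library, with the multiplication LUT optimization.
"Optimized arithmetic for Reed-Solomon encoders", Christof Paar (June 1997). In IEEE International Symposium on Information Theory (pp. 250-250). Institute of Electrical Engineers (IEEE).
"A Fast Algorithm for Encoding the (255,233) Reed-Solomon Code Over GF(2^8)", R. L. Miller, T. K. Truong and I. S. Reed.
"Optimizing Galois Field arithmetic for diverse processor architectures and applications", Greenan, Kevin M., Ethan L. Miller, and Thomas J. E. Schwarz. Modeling, Analysis and Simulation of Computers and Telecommunication Systems, 2008. MASCOTS 2008. IEEE International Symposium on. IEEE, 2008.
Anvin, H. Peter. "The mathematics of RAID-6." (2007).
A library that is one of the few implementations of Cauchy Reed-Solomon, said to be very fast.

"A logarithmic Boolean time algorithm for parallel polynomial division", Bini, D. and Pan, V. Y. (1987), Information Processing Letters, 24(4), 233-237. See also Bini, D., and V. Pan. "Fast ...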
