Python: optimizing a Reed-Solomon encoder (polynomial division)
I am trying to optimize a Reed-Solomon encoder, which is really just a polynomial division over the Galois field GF(2^8) (which simply means that values wrap around rather than exceed 255). The code is in fact very similar to the Go code explained at the link in the docstring below. The polynomial division algorithm used here is synthetic division (also known as Horner's method). I tried everything: numpy, pypy, cython. The best performance I get is by using pypy with the following simple nested loop:

    def rsenc(msg_in, nsym, gen):
        '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''
        msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
        lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))])

        for i in xrange(len(msg_in)):
            coef = msg_out[i]
            # coef = gf_mul(msg_out[i], gf_inverse(gen[0])) // for general polynomial division (when polynomials are non-monic), we need to compute: coef = msg_out[i] / gen[0]
            if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
                lcoef = gf_log[coef] # precaching
                for j in xrange(1, len(gen)): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                    msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] += msg_out[i] * gen[j]

        # Recopy the original message bytes
        msg_out[:len(msg_in)] = msg_in
        return msg_out

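The encoder above relies on two lookup tables, gf_exp and gf_log, that are not shown. As a rough sketch (Python 3), they could be generated as follows, assuming generator 3 and reducing polynomial 0x11b. That assumption is inferred from the table prefixes that appear in the Cython code later in this thread, and the helper names gf_mul_slow and init_tables are made up for illustration.

```python
def gf_mul_slow(a, b, prim=0x11b):
    """Carry-less 'Russian peasant' multiplication in GF(2^8), no tables."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:       # overflowed 8 bits: reduce by the field polynomial
            a ^= prim
        b >>= 1
    return r


def init_tables(prim=0x11b, generator=3):
    """Build exp/log tables (assumed field parameters, see lead-in)."""
    gf_exp = bytearray(512)   # doubled so log sums need no modulo 255
    gf_log = bytearray(256)   # gf_log[0] stays 0 by convention (log of 0 is undefined)
    x = 1
    for i in range(255):
        gf_exp[i] = x
        gf_log[x] = i
        x = gf_mul_slow(x, generator, prim)
    for i in range(255, 512):
        gf_exp[i] = gf_exp[i - 255]
    return gf_exp, gf_log


gf_exp, gf_log = init_tables()
print(list(gf_exp[:8]))
```

Storing gf_exp with 512 entries means that gf_exp[lcoef + lgen[j]] never needs a modulo 255 reduction, since the sum of two logarithms is at most 508.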
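Can a Python optimization wizard give me some clues on how to get a speedup? My goal is at least a 3x speedup, but more would be awesome. Any approach or tool is acceptable, as long as it is cross-platform (works at least on Linux and Windows).
Here is a small test script with some of the other alternatives I tried (the cython attempt is not included, since it was slower than native python!):
(Note: the alternatives are probably not fully correct; some of the indexes must be slightly off, but since they are slower anyway I did not try to fix them.)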
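Before timing any variant, it helps to check that it still produces valid codewords. The following is a minimal sketch of such a check in Python 3 (range instead of xrange), not the original test script. It assumes the field with generator 3 and reducing polynomial 0x11b, which is inferred from the table prefixes in the Cython answer, and all helper names other than rsenc are hypothetical. A valid codeword, read as a polynomial, must evaluate to zero at every root of the generator polynomial.

```python
def gf_mul_slow(a, b, prim=0x11b):
    """Table-free multiplication in GF(2^8), used only to build the tables."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= prim
        b >>= 1
    return r


def init_tables(prim=0x11b, generator=3):
    gf_exp, gf_log = bytearray(512), bytearray(256)
    x = 1
    for i in range(255):
        gf_exp[i], gf_log[x] = x, i
        x = gf_mul_slow(x, generator, prim)
    for i in range(255, 512):
        gf_exp[i] = gf_exp[i - 255]
    return gf_exp, gf_log


gf_exp, gf_log = init_tables()


def gf_mul(a, b):
    """Table-based multiplication; zero has no logarithm, so handle it first."""
    if a == 0 or b == 0:
        return 0
    return gf_exp[gf_log[a] + gf_log[b]]


def rs_generator_poly(nsym):
    """g(x) = (x - a^0)(x - a^1)...(x - a^(nsym-1)), highest degree first."""
    gen = [1]
    for i in range(nsym):
        root = gf_exp[i]
        nxt = [0] * (len(gen) + 1)
        for k, c in enumerate(gen):
            nxt[k] ^= c                    # c * x
            nxt[k + 1] ^= gf_mul(c, root)  # c * root (minus equals plus here)
        gen = nxt
    return gen


def rsenc(msg_in, nsym, gen):
    """Python 3 port of the encoder from the question."""
    msg_out = bytearray(msg_in) + bytearray(len(gen) - 1)
    lgen = bytearray(gf_log[g] for g in gen)
    for i in range(len(msg_in)):
        coef = msg_out[i]
        if coef != 0:
            lcoef = gf_log[coef]
            for j in range(1, len(gen)):
                msg_out[i + j] ^= gf_exp[lcoef + lgen[j]]
    msg_out[:len(msg_in)] = msg_in
    return msg_out


def poly_eval(poly, x):
    """Horner evaluation of a polynomial (highest degree first) at x."""
    y = 0
    for c in poly:
        y = gf_mul(y, x) ^ c
    return y


def check_codeword(codeword, nsym):
    """True if the codeword vanishes at all nsym roots of the generator."""
    return all(poly_eval(codeword, gf_exp[i]) == 0 for i in range(nsym))


nsym = 4
gen = rs_generator_poly(nsym)
codeword = rsenc(b"hello world", nsym, gen)
print(check_codeword(codeword, nsym))
```

One caveat with the log-based inner loop: because gf_log[0] is stored as 0, a generator coefficient equal to zero would be treated as 1. The nsym = 4 generator used here has no zero coefficient, so the check is not affected.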
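/UPDATE and goal of the bounty: I found a very interesting optimization trick that promises to speed up the computation a lot: precomputing the multiplication table. I updated the code above with the new function rsenc_precomp(). However, there is no gain at all in my implementation; it is even a bit slower: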
rsenc : total time elapsed 0.107170 seconds.
rsenc_precomp : total time elapsed 0.108788 seconds.
How can an array lookup cost more than operations like addition or xor? Why does this work in ZFEC and not in Python?
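For reference, here is a hypothetical sketch of what a multiplication-LUT encoder can look like in Python 3. It is not the rsenc_precomp() from the question, and the names gf_mul_table, rsenc_logexp and rsenc_lut are invented for illustration; the field parameters (generator 3, polynomial 0x11b) are inferred from the table prefixes in the Cython answer. Comparing the two inner loops: the log/exp version does one integer addition, one lookup in a 512-byte table and one XOR per step (lcoef and lgen are already precached), while the LUT version does one lookup in a 64 KiB table and one XOR. The LUT therefore only removes an addition, which may be part of why no gain shows up. This is an observation on the code, not a measured explanation.

```python
def gf_mul_slow(a, b, prim=0x11b):
    """Table-free multiplication in GF(2^8), used only to build the tables."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= prim
        b >>= 1
    return r


def init_tables(prim=0x11b, generator=3):
    gf_exp, gf_log = bytearray(512), bytearray(256)
    x = 1
    for i in range(255):
        gf_exp[i], gf_log[x] = x, i
        x = gf_mul_slow(x, generator, prim)
    for i in range(255, 512):
        gf_exp[i] = gf_exp[i - 255]
    return gf_exp, gf_log


gf_exp, gf_log = init_tables()

# Full 256 x 256 multiplication table: gf_mul_table[a][b] == a * b in GF(2^8).
# Rows and columns for 0 stay all-zero.
gf_mul_table = [bytearray(256) for _ in range(256)]
for a in range(1, 256):
    for b in range(1, 256):
        gf_mul_table[a][b] = gf_exp[gf_log[a] + gf_log[b]]


def rsenc_logexp(msg_in, nsym, gen):
    """Log/exp variant: add two logs, then one exp lookup, then XOR."""
    msg_out = bytearray(msg_in) + bytearray(len(gen) - 1)
    lgen = bytearray(gf_log[g] for g in gen)
    for i in range(len(msg_in)):
        coef = msg_out[i]
        if coef != 0:
            lcoef = gf_log[coef]
            for j in range(1, len(gen)):
                msg_out[i + j] ^= gf_exp[lcoef + lgen[j]]
    msg_out[:len(msg_in)] = msg_in
    return msg_out


def rsenc_lut(msg_in, nsym, gen):
    """LUT variant: one lookup in the row for coef, then XOR."""
    msg_out = bytearray(msg_in) + bytearray(len(gen) - 1)
    for i in range(len(msg_in)):
        coef = msg_out[i]
        if coef != 0:
            row = gf_mul_table[coef]  # fetch the row once per outer step
            for j in range(1, len(gen)):
                msg_out[i + j] ^= row[gen[j]]
    msg_out[:len(msg_in)] = msg_in
    return msg_out


# An arbitrary monic divisor with nonzero coefficients, for illustration only.
example_gen = [1, 8, 36, 120, 85]
example_msg = bytes(range(256))
print(rsenc_lut(example_msg, 4, example_gen) == rsenc_logexp(example_msg, 4, example_gen))
```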
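I will award the bounty to whoever can show me how to make this multiplication/addition LUT optimization work (faster than the xor and addition operations), or who can explain to me, with references or analysis, why this optimization cannot work here (using Python/PyPy/Cython/Numpy etc.; I tried them all).

On my machine, the following is 3x faster than pypy (0.04 s vs 0.15 s). Using Cython: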

    ctypedef unsigned char uint8_t # does not work with Microsoft's C Compiler: from libc.stdint cimport uint8_t
    cimport cpython.array as array

    cdef uint8_t[::1] gf_exp = bytearray([1, 3, 5, 15, 17, 51, 85, 255, 26, 46, 114, 150, 161, 248, 19,
        # lots of numbers omitted for space reasons
        ...])
    cdef uint8_t[::1] gf_log = bytearray([0, 0, 25, 1, 50, 2, 26, 198, 75, 199, 27, 104,
        # more numbers omitted for space reasons
        ...])

    import cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    def rsenc(msg_in_r, nsym, gen_t):
        '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''

        cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can't make a memory view from a read only object
        cdef int[::1] gen = array.array('i',gen_t) # convert list to array

        cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
        cdef int j
        cdef uint8_t[::1] lgen = bytearray(gen.shape[0])
        for j in xrange(gen.shape[0]):
            lgen[j] = gf_log[gen[j]]

        cdef uint8_t coef,lcoef

        cdef int i
        for i in xrange(msg_in.shape[0]):
            coef = msg_out[i]
            if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
                lcoef = gf_log[coef] # precaching
                for j in xrange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                    msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j]

        # Recopy the original message bytes
        msg_out[:msg_in.shape[0]] = msg_in
        return msg_out

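This is just the fastest version of what you were already doing, with static types included (and checking the HTML output of cython -a until the loops are no longer highlighted in yellow).
A few brief notes:
- Cython prefers x.shape[0] to len(x).
- Defining the memoryviews as [::1] promises that they are contiguous in memory, which helps.
- initializedcheck(False) is necessary to avoid lots of existence checks on the globally defined gf_exp and gf_log. (You might find that you can speed up your basic Python/PyPy code by creating a local variable reference for these and using that instead.)
- I had to copy a couple of the input arguments. Cython can't make a memoryview from a read-only object (in this case msg_in, a string; I could probably have just made it a char* though). Also, gen (a list) needs to be in something with fast element access.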
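The local-variable suggestion in the notes above can be sketched in plain Python 3 as follows. This is an illustration under assumptions, not a measured result: the field parameters (generator 3, polynomial 0x11b) are inferred from the table prefixes shown in the Cython code, and the names GF_EXP, GF_LOG, rsenc_globals, rsenc_locals and time_encoder are hypothetical. Whether the aliasing helps at all depends on the interpreter.

```python
import timeit


def gf_mul_slow(a, b, prim=0x11b):
    """Table-free multiplication in GF(2^8), used only to build the tables."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= prim
        b >>= 1
    return r


def init_tables(prim=0x11b, generator=3):
    gf_exp, gf_log = bytearray(512), bytearray(256)
    x = 1
    for i in range(255):
        gf_exp[i], gf_log[x] = x, i
        x = gf_mul_slow(x, generator, prim)
    for i in range(255, 512):
        gf_exp[i] = gf_exp[i - 255]
    return gf_exp, gf_log


GF_EXP, GF_LOG = init_tables()  # module-level (global) tables


def rsenc_globals(msg_in, nsym, gen):
    """Baseline: every table access goes through a global name lookup."""
    msg_out = bytearray(msg_in) + bytearray(len(gen) - 1)
    lgen = bytearray(GF_LOG[g] for g in gen)
    for i in range(len(msg_in)):
        coef = msg_out[i]
        if coef != 0:
            lcoef = GF_LOG[coef]
            for j in range(1, len(gen)):
                msg_out[i + j] ^= GF_EXP[lcoef + lgen[j]]
    msg_out[:len(msg_in)] = msg_in
    return msg_out


def rsenc_locals(msg_in, nsym, gen):
    """Same algorithm, but the tables are bound to local names first."""
    gf_exp, gf_log = GF_EXP, GF_LOG  # local aliases for the global tables
    msg_out = bytearray(msg_in) + bytearray(len(gen) - 1)
    lgen = bytearray(gf_log[g] for g in gen)
    for i in range(len(msg_in)):
        coef = msg_out[i]
        if coef != 0:
            lcoef = gf_log[coef]
            for j in range(1, len(gen)):
                msg_out[i + j] ^= gf_exp[lcoef + lgen[j]]
    msg_out[:len(msg_in)] = msg_in
    return msg_out


def time_encoder(fn, msg, nsym, gen, repeat=20):
    """Total wall time in seconds for `repeat` calls of fn."""
    return timeit.timeit(lambda: fn(msg, nsym, gen), number=repeat)


# An arbitrary monic divisor with nonzero coefficients, for illustration only.
example_gen = [1, 8, 36, 120, 85]
example_msg = bytes(range(1, 200))
print(rsenc_locals(example_msg, 4, example_gen) == rsenc_globals(example_msg, 4, example_gen))
```

Calling time_encoder(rsenc_globals, example_msg, 4, example_gen) and the same for rsenc_locals gives a quick side-by-side comparison on a given interpreter.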
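Other than that, it is all fairly straightforward. (I haven't tried any variations of it, having got it faster.) I'm really quite impressed by how well PyPy does.

Alternatively, if you know C, I would recommend rewriting this Python function in plain C and calling it (say with CFFI). At least you then know that you reach top performance in the inner loops of your function without needing to be aware of either PyPy or Cython tricks.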

Building on DavidW's answer, here is the implementation I am currently using, which is about 20% faster thanks to nogil and parallel computation:

    from cython.parallel import parallel, prange

    @cython.boundscheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cdef rsenc_cython(msg_in_r, nsym, gen_t) :
        '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''

        cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can't make a memory view from a read only object
        #cdef int[::1] gen = array.array('i',gen_t) # convert list to array
        cdef uint8_t[::1] gen = gen_t

        cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
        cdef int i, j
        cdef uint8_t[::1] lgen = bytearray(gen.shape[0])
        for j in xrange(gen.shape[0]):
            lgen[j] = gf_log_c[gen[j]]

        cdef uint8_t coef,lcoef
        with nogil:
            for i in xrange(msg_in.shape[0]):
                coef = msg_out[i]
                if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
                    lcoef = gf_log_c[coef] # precaching
                    for j in prange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                        msg_out[i + j] ^= gf_exp_c[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j]

        # Recopy the original message bytes
        msg_out[:msg_in.shape[0]] = msg_in
        return msg_out

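I would still like it to be faster (in the real implementation, data is encoded at about 6.4 MB/s with n=255, n being the size of the message plus codeword).
The main lead I found towards a faster implementation is the LUT (lookup table) approach, which precomputes the multiplication and addition arrays. However, in my Python and Cython implementations, the LUT approach is slower than computing the XOR and addition operations.
There are other approaches to implementing a faster RS encoder, but I have neither the ability nor the time to try them. I leave them here as references for other interested readers: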
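- "Fast software implementation of finite field operations", Cheng Huang and Lihao Xu, Washington University in St. Louis, Tech. Rep. (2003). And a corresponding code implementation.
- Luo, Jianqiang, et al. "Efficient software implementations of large finite fields GF(2^n) for secure storage applications." ACM Transactions on Storage (TOS) 8.1 (2012): 2.
- "A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage", Plank, J. S., Luo, J., Schuman, C. D., Xu, L., and Wilcox-O'Hearn, Z. FAST, Vol. 9, 2009. Or the non-extended version: "A Performance Comparison of Open-Source Erasure Coding Libraries for Storage Applications", Plank and Schuman.
- The source code of the ZFEC library, with its multiplication LUT optimization.
- "Optimized arithmetic for Reed-Solomon encoders", Christof Paar (June 1997). In IEEE International Symposium on Information Theory (pp. 250-250). IEEE.
- "A Fast Algorithm for Encoding the (255,223) Reed-Solomon Code Over GF(2^8)", R. L. Miller, T. K. Truong, and I. S. Reed.
- "Optimizing Galois Field arithmetic for diverse processor architectures and applications", Greenan, Kevin M., Ethan L. Miller, and Thomas J. E. Schwarz. Modeling, Analysis and Simulation of Computers and Telecommunication Systems, 2008. MASCOTS 2008. IEEE International Symposium on. IEEE, 2008.
- Anvin, H. Peter. "The mathematics of RAID-6." (2007).
- One of the only few implementations of Cauchy-Reed-Solomon, which is said to be very fast.
- "A logarithmic boolean time algorithm for parallel polynomial division", Bini, D. and Pan, V. Y. (1987), Information Processing Letters, 24(4), 233-237. See also Bini, D. and V. Pan, "Fast ...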