How to perform PyCUDA 4x4 matrix inversion with the same accuracy as the numpy linalg "inv" or "pinv" function
Tags: python, matrix, cuda, matrix-inverse, pycuda
I am facing an accuracy issue with my code, which performs a number (12825512) of 4x4 matrix inversions. When I use the original version, i.e. the numpy function np.linalg.inv or np.linalg.pinv, everything works fine.
Unfortunately, with the CUDA code below, I get nan and inf values in the inverted matrix.
To be more explicit, I take this matrix to invert:
2.120771107884677649e+09 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 3.557266600921528288e+27 3.557266600921528041e+07 3.557266600921528320e+17
0.000000000000000000e+00 3.557266600921528041e+07 3.557266600921528288e+27 3.557266600921528041e+07
0.000000000000000000e+00 3.557266600921528320e+17 3.557266600921528041e+07 1.778633300460764144e+27
If I use the classical numpy "inv", I get the following inverted 3x3 matrix:
4.715266047722758306e-10 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 2.811147187396482366e-28 -2.811147186834252285e-48 -5.622294374792964645e-38
0.000000000000000000e+00 -2.811147186834252285e-48 2.811147187396482366e-28 -5.622294374230735768e-48
0.000000000000000000e+00 -5.622294374792964645e-38 -5.622294374230735768e-48 5.622294374792964732e-28
To check the validity of this inverse matrix, I multiplied it by the original matrix, and the result is the identity matrix.
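The check described above can be sketched in NumPy as follows. This is a minimal illustration, not the original test script: the names M, M_inv and residual are chosen here for the example, and the matrix entries are copied from the listing above.

```python
import numpy as np

# Example matrix copied from the question (float64 by default).
M = np.array([
    [2.120771107884677649e+09, 0.0, 0.0, 0.0],
    [0.0, 3.557266600921528288e+27, 3.557266600921528041e+07, 3.557266600921528320e+17],
    [0.0, 3.557266600921528041e+07, 3.557266600921528288e+27, 3.557266600921528041e+07],
    [0.0, 3.557266600921528320e+17, 3.557266600921528041e+07, 1.778633300460764144e+27],
])

# Reference inverse computed on the CPU.
M_inv = np.linalg.inv(M)

# Multiplying the matrix by its inverse should give (approximately) the identity.
residual = M @ M_inv - np.eye(4)
print("largest deviation from identity:", np.abs(residual).max())
print("close to identity:", np.allclose(M @ M_inv, np.eye(4)))
```

The same residual can be computed for the GPU output to compare both results on equal terms.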
But with the CUDA GPU inversion, I get this matrix after inversion:
0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
-inf -inf -9.373764907941219970e-01 -inf
inf nan -inf nan
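So, I would like to increase the precision of my CUDA kernel or Python code to avoid these nan and inf values.
Here is the CUDA kernel code and the calling part of my main code (I have commented out the classical method with the numpy inv function):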
# Create arrayFullCross_vec array
arrayFullCross_vec = np.zeros((dimBlocks,dimBlocks,integ_prec,integ_prec))
# Create invCrossMatrix_gpu array
invCrossMatrix_gpu = np.zeros((dimBlocks*dimBlocks*integ_prec**2))
# Create invCrossMatrix array
invCrossMatrix = np.zeros((dimBlocks,dimBlocks,integ_prec,integ_prec))
# Build observables covariance matrix
arrayFullCross_vec = buildObsCovarianceMatrix4_vec(k_ref, mu_ref, ir)
"""
# Compute integrand from covariance matrix
for r_p in range(integ_prec):
    for s_p in range(integ_prec):
        # original version (without GPU)
        invCrossMatrix[:,:,r_p,s_p] = np.linalg.inv(arrayFullCross_vec[:,:,r_p,s_p])
"""
# GPU version
invCrossMatrix_gpu = gpuinv4x4(arrayFullCross_vec.flatten(),integ_prec**2)
invCrossMatrix = invCrossMatrix_gpu.reshape(dimBlocks,dimBlocks,integ_prec,integ_prec)
Here is the CUDA kernel code and the gpuinv4x4 function:
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

kernel = SourceModule("""
// Extract the next 4-bit field from the packed pattern word
__device__ unsigned getoff(unsigned &off){
  unsigned ret = off & 0x0F;
  off = off >> 4;
  return ret;
}
const int block_size = 256;
const unsigned tmsk = 0xFFFFFFFF;
// in-place is acceptable (i.e. out == in)
// T = float or double only
typedef double T;
__global__ void inv4x4(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){
  __shared__ T si[block_size];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n*16){
    si[threadIdx.x] = in[idx];
    unsigned lane = threadIdx.x & 15;
    unsigned sibase = threadIdx.x & 0x03F0;
    __syncwarp();
    unsigned off = pat[lane];
    T a,b;
    a = si[sibase + getoff(off)];
    a *= si[sibase + getoff(off)];
    a *= si[sibase + getoff(off)];
    if (!getoff(off)) a = -a;
    b = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    off = pat[lane+16];
    b = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    b = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    off = pat[lane+32];
    b = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    b = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    T det = si[sibase + (lane>>2)]*a;
    det += __shfl_down_sync(tmsk, det, 4, 16); // first add
    det += __shfl_down_sync(tmsk, det, 8, 16); // second add
    det = __shfl_sync(tmsk, det, 0, 16); // broadcast
    out[idx] = a / det;
  }
}
""")
# python function for inverting 4x4 matrices
# n should be an even number
def gpuinv4x4(inp, n):
    # internal constants not to be modified
    hpat = ( 0x0EB51FA5, 0x1EB10FA1, 0x0E711F61, 0x1A710B61, 0x1EB40FA4, 0x0EB01FA0, 0x1E700F60, 0x0A701B60, 0x0DB41F94, 0x1DB00F90, 0x0D701F50, 0x19700B50, 0x1DA40E94, 0x0DA01E90, 0x1D600E50, 0x09601A50, 0x1E790F69, 0x0E391F29, 0x1E350F25, 0x0A351B25, 0x0E781F68, 0x1E380F28, 0x0E341F24, 0x1A340B24, 0x1D780F58, 0x0D381F18, 0x1D340F14, 0x09341B14, 0x0D681E58, 0x1D280E18, 0x0D241E14, 0x19240A14, 0x0A7D1B6D, 0x1A3D0B2D, 0x063D172D, 0x16390729, 0x1A7C0B6C, 0x0A3C1B2C, 0x163C072C, 0x06381728, 0x097C1B5C, 0x193C0B1C, 0x053C171C, 0x15380718, 0x196C0A5C, 0x092C1A1C, 0x152C061C, 0x05281618)
    # Convert parameters into numpy array
    # float32
    """
    inpd = np.array(inp, dtype=np.float32)
    hpatd = np.array(hpat, dtype=np.uint32)
    output = np.empty((n*16), dtype= np.float32)
    """
    # float64
    """
    inpd = np.array(inp, dtype=np.float64)
    hpatd = np.array(hpat, dtype=np.uint32)
    output = np.empty((n*16), dtype= np.float64)
    """
    # float128
    inpd = np.array(inp, dtype=np.float128)
    hpatd = np.array(hpat, dtype=np.uint32)
    output = np.empty((n*16), dtype= np.float128)
    # Get kernel function
    matinv4x4 = kernel.get_function("inv4x4")
    # Define block, grid and compute
    blockDim = (256,1,1) # do not change
    gridDim = ((n//16)+1,1,1) # integer division so the grid size is an int
    # Kernel function
    matinv4x4 (
        cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
        block=blockDim, grid=gridDim)
    return output
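As you can see, I tried to increase the accuracy of the inversion by replacing np.float32 with np.float64 or np.float128, but the problem remains.
I also replaced typedef float T; with typedef double T;, but without success.
How can I invert these matrices correctly and avoid the "nan" and "inf" values as far as possible? I think this is a genuine precision issue, but I cannot find how to circumvent it.

This question follows on from earlier related questions. It is not clear to me why the title of the question refers to 3x3 and the bold text in the question refers to 3x3, while the problem presented is a 4x4 matrix inverse (and, as previously stated, this code can only be used for inverting 4x4 matrices). I will proceed on the assumption that the example case is the desired case.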
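According to my testing, the only thing necessary is to use double (or, in pycuda, float64) instead of float (or, in pycuda, float32). I think this should be evident, because the example matrix values exceed the range of a float32 type.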
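One way to see the range problem without a GPU is the NumPy sketch below. It is an illustration under stated assumptions, not the kernel itself: adjugate_inverse is a hypothetical helper written for this example, and it only mimics the general idea of forming the inverse as cofactors divided by the determinant. The individual entries near 3.6e27 are still representable in float32 (whose maximum is about 3.4e38), but the products of three entries that a cofactor expansion forms reach about 1e82, which overflows float32 and fits comfortably in float64.

```python
import numpy as np

# Example matrix copied from the question.
M = np.array([
    [2.120771107884677649e+09, 0.0, 0.0, 0.0],
    [0.0, 3.557266600921528288e+27, 3.557266600921528041e+07, 3.557266600921528320e+17],
    [0.0, 3.557266600921528041e+07, 3.557266600921528288e+27, 3.557266600921528041e+07],
    [0.0, 3.557266600921528320e+17, 3.557266600921528041e+07, 1.778633300460764144e+27],
])

def adjugate_inverse(m):
    """Hypothetical helper: invert a 4x4 matrix as adjugate / determinant,
    keeping every intermediate product in the dtype of m."""
    cof = np.empty_like(m)
    for i in range(4):
        for j in range(4):
            # 3x3 minor obtained by deleting row i and column j
            s = np.delete(np.delete(m, i, axis=0), j, axis=1)
            d = (s[0, 0] * (s[1, 1] * s[2, 2] - s[1, 2] * s[2, 1])
                 - s[0, 1] * (s[1, 0] * s[2, 2] - s[1, 2] * s[2, 0])
                 + s[0, 2] * (s[1, 0] * s[2, 1] - s[1, 1] * s[2, 0]))
            cof[i, j] = (-1) ** (i + j) * d
    det = np.sum(m[0, :] * cof[0, :])  # expansion along the first row
    return cof.T / det

# Overflow produces inf and nan, so the floating-point warnings are silenced here.
with np.errstate(all="ignore"):
    inv32 = adjugate_inverse(M.astype(np.float32))
    inv64 = adjugate_inverse(M.astype(np.float64))

print("float32 max:", np.finfo(np.float32).max)
print("float32 result finite:", bool(np.all(np.isfinite(inv32))))
print("float64 result finite:", bool(np.all(np.isfinite(inv64))))
print("float64 inverse times M is identity:", np.allclose(M @ inv64, np.eye(4)))
```

In this sketch the float32 run ends up with non-finite entries while the float64 run stays finite, which is consistent with the advice to use float64 on the Python side together with typedef double T; in the kernel.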