MATLAB produces different results than CUBLAS + kernel
I have the following MATLAB code:
[N, d] = size(X); % data size and dimensions
R = rand(d,dt); % Form a random matrix with elements in [0,1]
% Random projection
Y = X * R;
w=720; % hashing step
b = w * rand(dt,1);
% Compute the hash codes of the data
binId = floor( bsxfun(@plus, Y, b') / w);
I tried to parallelize it using CUBLAS and a kernel, as follows:
__global__ void compute(const int N,const int dt,const int w,const float *old, int *newt){
int col = blockDim.y * blockIdx.y + threadIdx.y;
int row = blockDim.x * blockIdx.x + threadIdx.x;
int id = row+N*col;
if(row<N && col<dt){
newt[id] = (int)floorf(old[id] / w); // bin index = floor(value / w), still in single precision
}
}
void gpu_blas_mmul(cublasHandle_t handle, const float *A, const float *B, float *C, const int m, const int k, const int n, const float bet) {
int lda=m,ldb=k,ldc=m;
const float alf = 1.0;
const float *alpha = &alf;
const float *beta = &bet;
// Do the actual multiplication and addition
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}
float *d_R, *d_RX, *d_B_row;
int *d_H;
thrust::device_vector<float> d_X(h_X, h_X + N * d);
cudaMalloc(&d_R,d * dt * sizeof(float));
cudaMemcpy(d_R,h_R,d * dt * sizeof(float),cudaMemcpyHostToDevice);
cudaMalloc(&d_B_row,dt * sizeof(float));
cudaMemcpy(d_B_row,h_B_row,dt * sizeof(float),cudaMemcpyHostToDevice);
cudaMalloc(&d_RX,N * dt * sizeof(float));
cudaMalloc(&d_H,N * dt * sizeof(int));
//-------------------------CuBLAS-----------------------
cublasHandle_t handle;
cublasCreate(&handle);
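// Rank-1 product: d_B_col is an N x 1 column filled with w, so every row of d_RX becomes w * d_B_row
thrust::device_vector<float> d_B_col(N, w);
gpu_blas_mmul(handle, thrust::raw_pointer_cast(&d_B_col[0]), d_B_row, d_RX, N, 1, dt,0.0);
// Accumulate the projection on top (beta = 1): d_RX = X * R + d_RX
gpu_blas_mmul(handle, thrust::raw_pointer_cast(&d_X[0]), d_R, d_RX, N, d, dt, 1.0);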
cublasDestroy(handle);
//-----------------------Kernel----------------------------
dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1);
int linGrid1 = (int)ceil(N/(float)BLOCK_SIZE);
int linGrid2 = (int)ceil(dt/(float)BLOCK_SIZE);
dim3 gridSize(linGrid1,linGrid2,1);
compute<<<gridSize, blockSize>>>(N, dt, w, d_RX, d_H);
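Your MATLAB code uses double precision, so its results are more accurate. The CUDA code you provided, by contrast, uses single precision (type float) and therefore produces less accurate results. In general, with single- versus double-precision discrepancies, the problem only gets worse as the size of the input data increases.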
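To make the effect of input size concrete, here is a minimal host-side C++ sketch. It is not the poster's code: the helpers `sum_float`, `sum_double`, and `bin_of` are hypothetical names introduced for illustration, and summing many ones sequentially is a deliberately extreme case. A real GEMM accumulates in a different order, so the size of the error there will differ.

```cpp
#include <cmath>

// Hypothetical helper: adds `value` to a single-precision accumulator n times.
float sum_float(long n, float value) {
    float acc = 0.0f;
    for (long i = 0; i < n; ++i) {
        acc += value;  // each addition is rounded to 24 significant bits
    }
    return acc;
}

// Same loop with a double-precision accumulator (53 significant bits).
double sum_double(long n, double value) {
    double acc = 0.0;
    for (long i = 0; i < n; ++i) {
        acc += value;
    }
    return acc;
}

// Same binning rule as the MATLAB code: floor(y / w).
long bin_of(double y, int w) {
    return static_cast<long>(std::floor(y / w));
}
```

On an IEEE 754 platform, `sum_float(20000000, 1.0f)` stalls at 16777216 (2^24), because adding 1 to that value rounds back to it, while `sum_double(20000000, 1.0)` is exactly 20000000. With `w = 720` the two sums land in bins 23301 and 27777 respectively, so a rounding error in the projection can change the hash code itself.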
The solution is to use type double instead of float.
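In the CUDA code, this would presumably mean allocating the buffers as `double`, calling `cublasDgemm` in place of `cublasSgemm`, and taking a `const double *` in the kernel. When checking such a port, a double-precision host reference can help separate precision effects from layout bugs. The sketch below is one possible reference, not the poster's code: `reference_bins` is a hypothetical name, and it assumes column-major storage (the layout used by both MATLAB and cuBLAS).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: binId = floor((X * R + b') / w), computed in double.
// X is N x d and R is d x dt, both column-major; b has dt entries.
// The result is N x dt, column-major, like d_H in the question.
std::vector<long> reference_bins(const std::vector<double>& X,
                                 const std::vector<double>& R,
                                 const std::vector<double>& b,
                                 int N, int d, int dt, int w) {
    const std::size_t n = static_cast<std::size_t>(N);
    const std::size_t dd = static_cast<std::size_t>(d);
    std::vector<long> bins(n * static_cast<std::size_t>(dt));
    for (std::size_t j = 0; j < static_cast<std::size_t>(dt); ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            double y = 0.0;
            for (std::size_t k = 0; k < dd; ++k) {
                y += X[i + n * k] * R[k + dd * j];  // (X * R)(i, j)
            }
            y += b[j];                              // add the offset for column j
            bins[i + n * j] = static_cast<long>(std::floor(y / w));
        }
    }
    return bins;
}
```

Copying `d_H` back to the host and comparing it element by element with this output shows how many bins differ. If mismatches persist after switching the GPU path to double, the cause is more likely indexing or data layout than precision.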