CUDA version fails while the serial version works

My minimal CUDA code below returns an incorrect result (all polygons end up with 0 vertices), whereas the same code run serially in C++ works fine. The problem is embarrassingly parallel: no communication, no __syncthreads, etc., and the CUDA memory allocations succeed. In the CUDA version, even the dummy variable that stores the content of the input array for debugging purposes comes back as 0. There are no out-of-bounds accesses, since my arrays are more than large enough. Replacing memcpy with a loop in the CUDA code changes nothing.

I really don't understand what is happening... any ideas? Thanks.

Tags: c++, cuda

CUDA code:
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <cuda.h>
class Point2D {
public:
__device__ Point2D(double xx=0, double yy=0):x(xx),y(yy){};
double x, y;
};
__device__ double dot(const Point2D &A, const Point2D &B) {
return A.x*B.x + A.y*B.y;
}
__device__ Point2D operator*(double a, const Point2D &P) {
return Point2D(a*P.x, a*P.y);
}
__device__ Point2D operator+(Point2D A, const Point2D &B) {
return Point2D(A.x + B.x, A.y + B.y);
}
__device__ Point2D operator-(Point2D A, const Point2D &B) {
return Point2D(A.x - B.x, A.y - B.y);
}
__device__ Point2D inter(const Point2D &A, const Point2D &B, const Point2D &C, const Point2D &D) { //intersects AB by *the mediator* of CD
Point2D M = 0.5*(C+D);
return A - (dot(A-M, D-C)/dot(B-A, D-C)) * (B-A);
}
class Polygon {
public:
__device__ Polygon():nbpts(0){};
__device__ void addPts(Point2D pt) {
pts[nbpts] = pt;
nbpts++;
};
__device__ Polygon& operator=(const Polygon& rhs) {
nbpts = rhs.nbpts;
dummy = rhs.dummy;
memcpy(pts, rhs.pts, nbpts*sizeof(Point2D));
return *this;
}
__device__ void cut(const Point2D &inside_pt, const Point2D &outside_pt) {
int new_nbpts = 0;
Point2D newpts[128];
Point2D AB(outside_pt-inside_pt);
Point2D M(0.5*(outside_pt+inside_pt));
double ABM = dot(AB, M);
Point2D S = pts[nbpts-1];
for (int i=0; i<nbpts; i++) {
Point2D E = pts[i];
double ddot = -ABM + dot(AB, E);
if (ddot<0) { // E inside clip edge
double ddot2 = -ABM + dot(AB, S);
if (ddot2>0) {
newpts[new_nbpts] = inter(S,E, inside_pt, outside_pt);
new_nbpts++;
}
newpts[new_nbpts] = E;
new_nbpts++;
} else {
double ddot2 = -ABM + dot(AB, S);
if (ddot2<0) {
newpts[new_nbpts] = inter(S,E, inside_pt, outside_pt);
new_nbpts++;
}
}
S = E;
}
memcpy(pts, newpts, min(128, new_nbpts)*sizeof(Point2D));
nbpts = new_nbpts;
}
//private:
Point2D pts[128];
int nbpts;
float dummy;
};
__global__ void cut_poly(float *a, Polygon* polygons, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx>=N/2) return;
Polygon pol;
pol.addPts(Point2D(0.,0.));
pol.addPts(Point2D(1.,0.));
pol.addPts(Point2D(1.,1.));
pol.addPts(Point2D(0.,1.));
Point2D curPt(a[2*idx], a[2*idx+1]);
for (int i=0; i<N/2; i++) {
Point2D other_pt(a[2*i], a[2*i+1]);
pol.cut(curPt, other_pt);
}
pol.dummy = a[idx];
polygons[idx] = pol;
}
int main(int argc, unsigned char* argv[])
{
const int N = 100;
float a_h[N], *a_d;
Polygon p_h[N/2], *p_d;
size_t size = N * sizeof(float);
size_t size_pol = N/2 * sizeof(Polygon);
cudaError_t err = cudaMalloc((void **) &a_d, size);
cudaError_t err2 = cudaMalloc((void **) &p_d, size_pol);
for (int i=0; i<N; i++) a_h[i] = (float)(rand()%1000)*0.001;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
cut_poly <<< n_blocks, block_size >>> (a_d, p_d, N);
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaMemcpy(p_h, p_d, sizeof(Polygon)*N/2, cudaMemcpyDeviceToHost);
for (int i=0; i<N/2; i++)
printf("%f \t %f \t %u\n", a_h[i], p_h[i].dummy, p_h[i].nbpts);
cudaFree(a_d);
cudaFree(p_d);
return 0;
}
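OK, I think you can disregard most of my comments. I was mistakenly working on a machine I had set up with CUDA 3.2, and it behaved differently with respect to the kernel launch failure. When I switched to CUDA 4.1 and CUDA 5.0, things started to make sense. Apologies for my confusion back there.

Anyway, after getting past that, I fairly quickly noticed that there is a difference between your CPU and GPU implementations.

For those issues, I suggest you post new questions.

Comments:

Have you narrowed this down at all before we go through the whole code?

Yes, I tried that yesterday in a separate question, but apparently narrowing the code down was a bad idea: the bug no longer appears, and the algorithm now behaves completely differently. :s

Before going through the code: check the return values of your cudaMemcpy() calls for errors. In fact, an unspecified launch failure is occurring in the kernel. It is always good practice to check all CUDA calls for errors, and I don't see any actual error checking in your code. I don't yet know what the unspecified launch failure is about. The next step might be to remove or comment out pieces of the kernel one at a time until the launch failure disappears, which is not unlike narrowing down a seg fault in CPU code. Commenting out this line of kernel code makes the launch error go away (and some additional data returned from the kernel is now printed): pol.cut(curPt, other_pt)

Arrrgg! Thanks a lot! I was debugging the serial and parallel implementations side by side, and I missed a correction I had made in the serial code...! Thank you very much for spotting the error and for the time you spent! In fact, this code still has a problem I cannot solve: the polygons returned to the host all have their vertex coordinates wrongly set to 0 (in pol[i].pts[...]) instead of their actual values (although the number of vertices, in the variable nbpoints, is now correct). :s I checked that my Polygon::operator= is called correctly. It is just that my array is not returned correctly to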
Serial C++ code:
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <string.h>   // memcpy
#include <algorithm>  // std::min
class Point2D {
public:
Point2D(double xx=0, double yy=0):x(xx),y(yy){};
double x, y;
};
double dot(const Point2D &A, const Point2D &B) {
return A.x*B.x + A.y*B.y;
}
Point2D operator*(double a, const Point2D &P) {
return Point2D(a*P.x, a*P.y);
}
Point2D operator+(Point2D A, const Point2D &B) {
return Point2D(A.x + B.x, A.y + B.y);
}
Point2D operator-(Point2D A, const Point2D &B) {
return Point2D(A.x - B.x, A.y - B.y);
}
Point2D inter(const Point2D &A, const Point2D &B, const Point2D &C, const Point2D &D) { //intersects AB by *the mediator* of CD
Point2D M = 0.5*(C+D);
return A - (dot(A-M, D-C)/dot(B-A, D-C)) * (B-A);
}
class Polygon {
public:
Polygon():nbpts(0){};
void addPts(Point2D pt) {
pts[nbpts] = pt;
nbpts++;
};
Polygon& operator=(const Polygon& rhs) {
nbpts = rhs.nbpts;
dummy = rhs.dummy;
memcpy(pts, rhs.pts, nbpts*sizeof(Point2D));
return *this;
}
void cut(const Point2D &inside_pt, const Point2D &outside_pt) {
int new_nbpts = 0;
Point2D newpts[128];
Point2D AB(outside_pt-inside_pt);
Point2D M(0.5*(outside_pt+inside_pt));
double ABM = dot(AB, M);
Point2D S = pts[nbpts-1];
for (int i=0; i<nbpts; i++) {
Point2D E = pts[i];
double ddot = -ABM + dot(AB, E);
if (ddot<0) { // E inside clip edge
double ddot2 = -ABM + dot(AB, S);
if (ddot2>0) {
newpts[new_nbpts] = inter(S,E, inside_pt, outside_pt);
new_nbpts++;
}
newpts[new_nbpts] = E;
new_nbpts++;
} else {
double ddot2 = -ABM + dot(AB, S);
if (ddot2<0) {
newpts[new_nbpts] = inter(S,E, inside_pt, outside_pt);
new_nbpts++;
}
}
S = E;
}
memcpy(pts, newpts, std::min(128, new_nbpts)*sizeof(Point2D));
/*for (int i=0; i<128; i++) {
pts[i] = newpts[i];
}*/
nbpts = new_nbpts;
}
//private:
Point2D pts[128];
int nbpts;
float dummy;
};
void cut_poly(int idx, float *a, Polygon* polygons, int N)
{
if (idx>=N/2) return;
Polygon pol;
pol.addPts(Point2D(0.,0.));
pol.addPts(Point2D(1.,0.));
pol.addPts(Point2D(1.,1.));
pol.addPts(Point2D(0.,1.));
Point2D curPt(a[2*idx], a[2*idx+1]);
for (int i=0; i<N/2; i++) {
if (idx==i) continue;
Point2D other_pt(a[2*i], a[2*i+1]);
pol.cut(curPt, other_pt);
}
pol.dummy = a[idx];
polygons[idx] = pol;
}
int main(int argc, unsigned char* argv[])
{
const int N = 100; // Number of elements in arrays
float a_h[N], *a_d; // Pointer to host & device arrays
Polygon p_h[N/2], *p_d;
for (int i=0; i<N; i++) a_h[i] = (float)(rand()%1000)*0.001;
for (int idx=0; idx<N; idx++)
cut_poly(idx, a_h, p_h, N);
for (int i=0; i<N/2; i++)
printf("%f \t %f \t %u\n", a_h[i], p_h[i].dummy, p_h[i].nbpts);
return 0;
}
The difference, looking at the CPU code:
void cut_poly(int idx, float *a, Polygon* polygons, int N)
{
if (idx>=N/2) return;
Polygon pol;
pol.addPts(Point2D(0.,0.));
pol.addPts(Point2D(1.,0.));
pol.addPts(Point2D(1.,1.));
pol.addPts(Point2D(0.,1.));
Point2D curPt(a[2*idx], a[2*idx+1]);
for (int i=0; i<N/2; i++) {
if (idx==i) continue; /* NOTE THIS LINE MISSING FROM YOUR GPU CODE */
Point2D other_pt(a[2*i], a[2*i+1]);
pol.cut(curPt, other_pt);
}
pol.dummy = a[idx];
polygons[idx] = pol;
}
CUDA code with the missing line added and error checking after each CUDA call:
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
// #include <cuda.h>
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
class Point2D {
public:
__host__ __device__ Point2D(double xx=0, double yy=0):x(xx),y(yy){};
double x, y;
};
__host__ __device__ double dot(const Point2D &A, const Point2D &B) {
return A.x*B.x + A.y*B.y;
}
__host__ __device__ Point2D operator*(double a, const Point2D &P) {
return Point2D(a*P.x, a*P.y);
}
__host__ __device__ Point2D operator+(Point2D A, const Point2D &B) {
return Point2D(A.x + B.x, A.y + B.y);
}
__host__ __device__ Point2D operator-(Point2D A, const Point2D &B) {
return Point2D(A.x - B.x, A.y - B.y);
}
__host__ __device__ Point2D inter(const Point2D &A, const Point2D &B, const Point2D &C, const Point2D &D) { //intersects AB by *the mediator* of CD
Point2D M = 0.5*(C+D);
return A - (dot(A-M, D-C)/dot(B-A, D-C)) * (B-A);
}
class Polygon {
public:
__host__ __device__ Polygon():nbpts(0){};
__host__ __device__ void addPts(Point2D pt) {
pts[nbpts] = pt;
nbpts++;
};
__host__ __device__ Polygon& operator=(const Polygon& rhs) {
nbpts = rhs.nbpts;
dummy = rhs.dummy;
memcpy(pts, rhs.pts, nbpts*sizeof(Point2D));
return *this;
}
__host__ __device__ Point2D getpoint(unsigned i){
if (i<128) return pts[i];
else return pts[0];
}
__host__ __device__ void cut(const Point2D &inside_pt, const Point2D &outside_pt) {
int new_nbpts = 0;
Point2D newpts[128];
Point2D AB(outside_pt-inside_pt);
Point2D M(0.5*(outside_pt+inside_pt));
double ABM = dot(AB, M);
Point2D S = pts[nbpts-1];
for (int i=0; i<nbpts; i++) {
Point2D E = pts[i];
double ddot = -ABM + dot(AB, E);
if (ddot<0) { // E inside clip edge
double ddot2 = -ABM + dot(AB, S);
if (ddot2>0) {
newpts[new_nbpts] = inter(S,E, inside_pt, outside_pt);
new_nbpts++;
}
newpts[new_nbpts] = E;
new_nbpts++;
} else {
double ddot2 = -ABM + dot(AB, S);
if (ddot2<0) {
newpts[new_nbpts] = inter(S,E, inside_pt, outside_pt);
new_nbpts++;
}
}
S = E;
}
memcpy(pts, newpts, min(128, new_nbpts)*sizeof(Point2D));
nbpts = new_nbpts;
}
//private:
Point2D pts[128];
int nbpts;
float dummy;
};
__global__ void cut_poly(float *a, Polygon* polygons, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx>=N/2) return;
Polygon pol;
pol.addPts(Point2D(0.,0.));
pol.addPts(Point2D(1.,0.));
pol.addPts(Point2D(1.,1.));
pol.addPts(Point2D(0.,1.));
Point2D curPt(a[2*idx], a[2*idx+1]);
for (int i=0; i<N/2; i++) {
if (idx==i) continue;
Point2D other_pt(a[2*i], a[2*i+1]);
pol.cut(curPt, other_pt);
}
pol.dummy = pol.getpoint(0).x;
polygons[idx] = pol;
}
int main(int argc, unsigned char* argv[])
{
const int N = 100;
float a_h[N], *a_d;
Polygon p_h[N/2], *p_d;
size_t size = N * sizeof(float);
size_t size_pol = N/2 * sizeof(Polygon);
cudaMalloc((void **) &a_d, size);
cudaCheckErrors("cm1");
cudaMalloc((void **) &p_d, size_pol);
cudaCheckErrors("cm2");
for (int i=0; i<N; i++) a_h[i] = (float)(rand()%1000)*0.001;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cudaCheckErrors("cmcp1");
int block_size = 128;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
cut_poly <<< n_blocks, block_size >>> (a_d, p_d, N);
cudaCheckErrors("kernel");
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaCheckErrors("cmcp2");
cudaMemcpy(p_h, p_d, sizeof(Polygon)*N/2, cudaMemcpyDeviceToHost);
cudaCheckErrors("cmcp3");
for (int i=0; i<N/2; i++)
printf("%f \t %f \t %f \t %u\n", a_h[i], p_h[i].dummy, p_h[i].getpoint(0).x, p_h[i].nbpts);
cudaFree(a_d);
cudaFree(p_d);
return 0;
}