Optimizing BVH traversal in a C++ ray tracer

c++ c performance optimization

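In my CPU ray tracer (well, path tracer), most of the CPU time is spent in the BVH traversal function. According to my profiler, 75% of the ray tracing time is spent in this function and the functions it calls, and 35% in the function itself. The other 40% is in the various intersection tests it calls.

Basically, the code traverses all the bounding boxes and triangles that the ray intersects. It uses a statically allocated array on the stack to hold the nodes still to be explored (BVHSTACKSIZE is set to 32; most of that space is never needed), so no memory is allocated dynamically. Still, it seems crazy to me that 35% of the time is spent here. I have spent a while optimizing the code, and it is currently the fastest I have been able to make it, but it is still the largest single bottleneck in my program.

Does anyone have tips for optimizing this further? I already have a decent BVH construction algorithm, so I don't think a different BVH would give any speedup. Does anyone also know how best to do line-by-line profiling on a Mac?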

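For reference, this code on an example scene went from [...]

// [...] the start of the traversal function is missing here
    if (cur->bb.intersect(ray)) { // Does not modify the ray
      if (cur->isLeaf()) {
        for (const auto& primitive : cur->primitives) {
          hit |= primitive->intersect(ray); // Modifies the ray!
        }
      } else {
        to_intersect[head++] = cur->r;
        to_intersect[head++] = cur->l;
      }
    }
  }

  return hit;
}

bool BBox::intersect(const Ray& r) const {
  double txmin = (min.x - r.o.x) * r.inv_d.x;
  double txmax = (max.x - r.o.x) * r.inv_d.x;
  double tymin = (min.y - r.o.y) * r.inv_d.y;
  double tymax = (max.y - r.o.y) * r.inv_d.y;
  double tzmin = (min.z - r.o.z) * r.inv_d.z;
  double tzmax = (max.z - r.o.z) * r.inv_d.z;

  ascending(txmin, txmax);
  ascending(tymin, tymax);
  ascending(tzmin, tzmax);

  double t0 = std::max(txmin, std::max(tymin, tzmin));
  double t1 = std::min(txmax, std::min(tymax, tzmax));

  // Only fragments of this comparison survive; the usual slab-test rejection is assumed.
  if (t1 < t0 || t0 > r.max_t || t1 < r.min_t) {
    return false;
  }

  return true;
}

// Only the tail of this helper survives; it is assumed to order a and b so that a <= b.
void ascending(double& a, double& b) {
  if (a > b) {
    std::swap(a, b);
  }
}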

Your code seems to have at least one problem: copying primitives may be an expensive operation.

bool BVHAccel::intersect(Ray ray) const {
  bool hit = false;

  BVHNode* to_intersect[BVHSTACKSIZE];
  int head = 0;
  to_intersect[head++] = root;

  while (head != 0) {
    assert(head < BVHSTACKSIZE);
    BVHNode* cur = to_intersect[--head];

    if (cur->bb.intersect(ray)) { // Does not modify the ray
      if (cur->isLeaf()) {
        for (const auto& primitive : cur->primitives) { // this code made a copy of primitives on every call!
          hit |= primitive->intersect(ray); // Modifies the ray!
        }
      } else {
        to_intersect[head++] = cur->r;
        to_intersect[head++] = cur->l;
      }
    }
  }

  return hit;
}
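
To make the copying concern concrete, here is a minimal sketch. `CountedHandle`, `g_copies` and the two helper functions are hypothetical and not part of the question's code; they only illustrate that a range-for loop binding by value copy-constructs every element, while binding by `const auto&` copies nothing.

```cpp
#include <cassert>
#include <vector>

static int g_copies = 0;  // incremented by every copy construction

// Hypothetical stand-in for whatever a leaf stores per primitive.
struct CountedHandle {
    CountedHandle() = default;
    CountedHandle(const CountedHandle&) { ++g_copies; }
};

// Binds each element by value: one copy construction per element.
inline int copies_by_value(const std::vector<CountedHandle>& v) {
    g_copies = 0;
    for (auto h : v) { (void)h; }
    return g_copies;
}

// Binds each element by const reference: no copies.
inline int copies_by_reference(const std::vector<CountedHandle>& v) {
    g_copies = 0;
    for (const auto& h : v) { (void)h; }
    return g_copies;
}
```

If the elements are raw pointers the copy is cheap either way; it starts to matter when they are objects or reference-counted handles.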

Why do you need to modify a copy of the ray?

Edit 1: Can we assume that BVHNode looks like this?

constexpr auto BVHSTACKSIZE = 32;

struct Primitive;

struct BVHNode {
    std::vector<Primitive> primitives;
    AABB        bb;   
    BVHNode*    r = nullptr;
    BVHNode*    l = nullptr;

    bool isLeaf() const { return r == nullptr && l == nullptr; }
};

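I think you can make three improvements.

The first big one (and a hard one) is that your code has a lot of conditional branches, which will certainly slow the CPU down because it cannot predict the code path well (the same goes for the compiler). For example, I see that you intersect first, then test whether the node is a leaf, and then intersect all the primitives. Could you test whether it is a leaf first and then do the appropriate intersection? That would reduce the branching slightly.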

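Second, what is the memory layout of your BVH? Can you optimize it to be cache friendly? Try looking at the number of cache misses that occur during traversal; that is a good indicator of whether your memory has the right layout. Although it is not directly related, it is good to know your platform and the underlying hardware. I recommend reading [...]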
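
As one possible illustration of a cache-friendly layout, the sketch below copies a pointer-based tree into a contiguous array of 32-byte nodes in which the two children of a node sit next to each other. `FlatNode`, `PtrNode`, `flatten` and `flatten_into` are hypothetical names, the bounds are stored as `float`, and leaves are assumed to hold at least one primitive; none of this comes from the question's code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical compact node: 32 bytes, so two nodes fit in a 64-byte cache line.
struct FlatNode {
    float bmin[3];
    float bmax[3];
    // Interior: index of the left child (the right child is first + 1).
    // Leaf: index of the first primitive in a separate primitive array.
    std::uint32_t first;
    std::uint32_t prim_count;  // 0 marks an interior node

    bool is_leaf() const { return prim_count != 0; }
};
static_assert(sizeof(FlatNode) == 32, "FlatNode is expected to stay at 32 bytes");

// Hypothetical pointer-based node used as the input of the conversion.
struct PtrNode {
    float bmin[3] = {0.0f, 0.0f, 0.0f};
    float bmax[3] = {0.0f, 0.0f, 0.0f};
    PtrNode* l = nullptr;
    PtrNode* r = nullptr;
    std::uint32_t first_prim = 0;
    std::uint32_t prim_count = 0;
};

// Writes node n into out[slot] and places its two children in adjacent slots.
inline void flatten_into(const PtrNode* n, std::uint32_t slot, std::vector<FlatNode>& out) {
    FlatNode node{};
    for (int a = 0; a < 3; ++a) {
        node.bmin[a] = n->bmin[a];
        node.bmax[a] = n->bmax[a];
    }
    if (!n->l && !n->r) {
        assert(n->prim_count > 0);  // assumption: no empty leaves
        node.first = n->first_prim;
        node.prim_count = n->prim_count;
        out[slot] = node;
        return;
    }
    assert(n->l && n->r);  // assumption: interior nodes have both children
    const std::uint32_t left = static_cast<std::uint32_t>(out.size());
    out.resize(out.size() + 2);  // reserve two adjacent slots for the children
    node.first = left;
    node.prim_count = 0;
    out[slot] = node;  // index into out only after the resize
    flatten_into(n->l, left, out);
    flatten_into(n->r, left + 1, out);
}

inline std::vector<FlatNode> flatten(const PtrNode* root) {
    std::vector<FlatNode> out(1);  // slot 0 is the root
    flatten_into(root, 0, out);
    return out;
}
```

Traversal can then keep 32-bit indices on its stack instead of pointers, and fetching one child usually brings its sibling into the cache as well.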

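Finally, and this is where you will have the biggest impact on performance: use SSE/AVX! With a little refactoring of your intersection code you could intersect four bounding boxes at once, which should give your application a nice boost. You could take a look at what the Intel tracer does, especially in its math library.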
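
A minimal sketch of what testing four boxes at once can look like with plain SSE is shown here. `Box4`, `RayF` and `intersect4` are hypothetical names, the data is assumed to be `float` and stored as a structure of arrays (one register per bound, one lane per box), and x86 is assumed. It is the same slab test as in the question, applied lane-wise, and not code from any particular library.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical structure-of-arrays bounds: lane i of each register belongs to box i.
struct Box4 {
    __m128 min_x, min_y, min_z;
    __m128 max_x, max_y, max_z;
};

// Hypothetical float ray with a precomputed reciprocal direction.
struct RayF {
    float ox, oy, oz;
    float inv_dx, inv_dy, inv_dz;
    float min_t, max_t;
};

// Returns a 4-bit mask; bit i is set when [min_t, max_t] overlaps box i along the ray.
inline int intersect4(const Box4& b, const RayF& r) {
    const __m128 ox = _mm_set1_ps(r.ox), oy = _mm_set1_ps(r.oy), oz = _mm_set1_ps(r.oz);
    const __m128 ix = _mm_set1_ps(r.inv_dx);
    const __m128 iy = _mm_set1_ps(r.inv_dy);
    const __m128 iz = _mm_set1_ps(r.inv_dz);

    // Distances to the two planes of each slab, for all four boxes at once.
    const __m128 tx0 = _mm_mul_ps(_mm_sub_ps(b.min_x, ox), ix);
    const __m128 tx1 = _mm_mul_ps(_mm_sub_ps(b.max_x, ox), ix);
    const __m128 ty0 = _mm_mul_ps(_mm_sub_ps(b.min_y, oy), iy);
    const __m128 ty1 = _mm_mul_ps(_mm_sub_ps(b.max_y, oy), iy);
    const __m128 tz0 = _mm_mul_ps(_mm_sub_ps(b.min_z, oz), iz);
    const __m128 tz1 = _mm_mul_ps(_mm_sub_ps(b.max_z, oz), iz);

    // min/max per slab replaces the swap; the ray interval is folded in as a fourth slab.
    __m128 t0 = _mm_max_ps(_mm_min_ps(tx0, tx1), _mm_min_ps(ty0, ty1));
    t0 = _mm_max_ps(t0, _mm_max_ps(_mm_min_ps(tz0, tz1), _mm_set1_ps(r.min_t)));
    __m128 t1 = _mm_min_ps(_mm_max_ps(tx0, tx1), _mm_max_ps(ty0, ty1));
    t1 = _mm_min_ps(t1, _mm_min_ps(_mm_max_ps(tz0, tz1), _mm_set1_ps(r.max_t)));

    // A lane is a hit when its entry distance does not exceed its exit distance.
    return _mm_movemask_ps(_mm_cmple_ps(t0, t1));
}
```

Using this during traversal would require a BVH whose nodes store the bounds of four children together, which is a larger refactoring than the function itself. Rays with a zero direction component (infinite reciprocal) need the same care as in the scalar version.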

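Also, I just noticed that you are using double. Is there a reason for that? Our path tracer doesn't use double at all, since rendering never needs that precision.

Hope that helps.

Edit: I made an SSE version of your bbox intersection, if you want to try it. It is partly based on our code, but I'm not sure whether it works; you should benchmark and test it.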

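#include <immintrin.h> // SSE up to SSE4.1 (_mm_min_epi32, _mm_blendv_ps)
#include <cstddef>
#include <limits>

constexpr float pos_inf = std::numeric_limits<float>::max();
constexpr float neg_inf = std::numeric_limits<float>::lowest(); // min() would be the smallest positive float

// Index of the lowest set bit (GCC/Clang inline assembly, x86 only).
size_t bsf(size_t v)
{
  size_t r = 0; asm ("bsf %1,%0" : "=r"(r) : "r"(v));
  return r;
}

// The intrinsic names in mini and maxi did not survive; a lane-wise integer
// min/max on the float bit patterns is assumed from the function names.
__m128 mini(const __m128 a, const __m128 b)
{
  return _mm_castsi128_ps(_mm_min_epi32(_mm_castps_si128(a), _mm_castps_si128(b)));
}

__m128 maxi(const __m128 a, const __m128 b)
{
  return _mm_castsi128_ps(_mm_max_epi32(_mm_castps_si128(a), _mm_castps_si128(b)));
}

// Clears the sign bit of every lane.
__m128 abs(const __m128 a)
{
  return _mm_andnot_ps(_mm_set1_ps(-0.0f), a);
}

// Per lane: t where the mask's sign bit is set, f otherwise.
__m128 select(const __m128 mask, const __m128 t, const __m128 f)
{
  return _mm_blendv_ps(f, t, mask);
}

template<size_t i0, size_t i1, size_t i2, size_t i3>
__m128 shuffle(const __m128 b)
{
  return _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(b), _MM_SHUFFLE(i3, i2, i1, i0)));
}

__m128 min(const __m128 a, const __m128 b) { return _mm_min_ps(a, b); }
[...]
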
while stack not empty:
    cur = pop from top of stack;
    //we already know that we want to enter this node!
    if cur is leaf:
        intersect primitives
    else:
        t_left = intersect bbox of cur->l
        t_right = intersect bbox of cur->r
        if both intersected:
            if t_left < t_right:
                push cur->r, cur->l in that order (so that cur->l will be on top)
            else:
                push cur->l, cur->r in that order (so that cur->r will be on top)
        else if one intersected:
            push only that one
        else:
            push nothing
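
The pseudocode above could be written in C++ roughly as follows. This is a sketch under stated assumptions, not the question's code: `Ray`, `AABB`, `Node`, `kStackSize` and `traverse` are hypothetical, the box test is extended to return the entry distance, interior nodes are assumed to have both children, and a `leaf_id` recorded in visit order stands in for the primitive intersection.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Ray { float o[3]; float inv_d[3]; float min_t; float max_t; };

struct AABB {
    float lo[3], hi[3];
    // Slab test; on a hit, t_entry is the distance at which the ray enters the box.
    bool intersect(const Ray& r, float& t_entry) const {
        float t0 = r.min_t, t1 = r.max_t;
        for (int a = 0; a < 3; ++a) {
            float ta = (lo[a] - r.o[a]) * r.inv_d[a];
            float tb = (hi[a] - r.o[a]) * r.inv_d[a];
            if (ta > tb) std::swap(ta, tb);
            t0 = std::max(t0, ta);
            t1 = std::min(t1, tb);
        }
        t_entry = t0;
        return t0 <= t1;
    }
};

struct Node {
    AABB bb;
    Node* l = nullptr;
    Node* r = nullptr;
    int leaf_id = -1;  // hypothetical payload standing in for the primitive list
    bool isLeaf() const { return !l && !r; }
};

constexpr int kStackSize = 32;

// Appends the ids of all leaves whose boxes the ray hits, nearer child first.
inline void traverse(Node* root, const Ray& ray, std::vector<int>& visited) {
    float t = 0.0f;
    if (!root || !root->bb.intersect(ray, t)) return;
    Node* stack[kStackSize];
    int head = 0;
    stack[head++] = root;
    while (head != 0) {
        Node* cur = stack[--head];  // every node on the stack is already known to be hit
        if (cur->isLeaf()) {
            visited.push_back(cur->leaf_id);  // a real tracer intersects the primitives here
            continue;
        }
        float tl = 0.0f, tr = 0.0f;
        const bool hit_l = cur->l->bb.intersect(ray, tl);
        const bool hit_r = cur->r->bb.intersect(ray, tr);
        assert(head + 2 <= kStackSize);  // check before pushing, not after
        if (hit_l && hit_r) {
            if (tl < tr) { stack[head++] = cur->r; stack[head++] = cur->l; }  // l on top
            else         { stack[head++] = cur->l; stack[head++] = cur->r; }  // r on top
        } else if (hit_l) {
            stack[head++] = cur->l;
        } else if (hit_r) {
            stack[head++] = cur->r;
        }
    }
}
```

In a tracer where primitive hits shorten `max_t`, it may be worth storing the entry distance next to each pushed node and skipping the node on pop when that distance already exceeds the current `max_t`.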