Optimizing C++ code for performance (C++, iPhone, performance, optimization)


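Can you think of some ways to optimize this piece of code? It is meant to execute on an ARMv7 processor (iPhone 3GS):

All of this code is taken from the OpenSURF library. Here is the context of the function (some people asked for the context):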

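//! Calculate DoH responses for the supplied layer
void FastHessian::buildResponseLayer(ResponseLayer *rl)
{
  float *responses = rl->responses;         // response storage
  unsigned char *laplacian = rl->laplacian; // laplacian sign storage
  int step = rl->step;                      // step size for this filter
  int b = (rl->filter - 1) * 0.5 + 1;       // border for this filter
  int l = rl->filter / 3;                   // lobe for this filter (filter size / 3)
  int w = rl->filter;                       // filter size
  float inverse_area = 1.f / (w * w);       // normalisation factor
  float Dxx, Dyy, Dxy;

  for (int r, c, ar = 0, index = 0; ar < rl->height; ++ar)
  {
    for (int ac = 0; ac < rl->width; ++ac, index++)
    {
      // get the image coordinates
      r = ar * step;
      c = ac * step;

      // compute response components
      Dxx = BoxIntegral(img, r - l + 1, c - b, 2 * l - 1, w)
          - BoxIntegral(img, r - l + 1, c - l * 0.5, 2 * l - 1, l) * 3;
      Dyy = BoxIntegral(img, r - b, c - l + 1, w, 2 * l - 1)
          - BoxIntegral(img, r - l * 0.5, c - l + 1, l, 2 * l - 1) * 3;
      Dxy = + BoxIntegral(img, r - l, c + 1, l, l)
            + BoxIntegral(img, r + 1, c - l, l, l)
            - BoxIntegral(img, r - l, c - l, l, l)
            - BoxIntegral(img, r + 1, c + 1, l, l);

      // normalise the filter responses with respect to their size
      Dxx *= inverse_area;
      Dyy *= inverse_area;
      Dxy *= inverse_area;

      // get the determinant of hessian response & laplacian sign
      responses[index] = (Dxx * Dyy - 0.81f * Dxy * Dxy);
      laplacian[index] = (Dxx + Dyy >= 0 ? 1 : 0);

#ifdef RL_DEBUG
      // create list of the image coords for each response
      rl->coords.push_back(std::make_pair(r, c));
#endif
    }
  }
}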

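Some questions:
Is it a good idea for the function to be inline?

Would using inline assembly provide a significant speedup?

There are a few places where temporary values can be reused, but whether that improves performance has to be measured directly:

Change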

  if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1]; 
  if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2]; 
  if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1]; 
  if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];

to

If the if statements rarely evaluate to true, you may end up doing the temporary multiplications more often than you need them.
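The replacement snippet did not survive in this copy of the answer. A minimal sketch of the kind of reuse described, assuming the shared temporaries are the two row offsets r1 * step and r2 * step, might look like this. The names cornerSum, row1 and row2 are hypothetical, and a bare float array stands in for the image data.

```cpp
#include <algorithm>

// Hypothetical helper: the same four guarded reads, with the two row
// offsets computed once instead of once per read.
float cornerSum(const float *data, int step, int r1, int c1, int r2, int c2)
{
  const int row1 = r1 * step; // shared by A and B
  const int row2 = r2 * step; // shared by C and D

  float A = 0.0f, B = 0.0f, C = 0.0f, D = 0.0f;
  if (r1 >= 0 && c1 >= 0) A = data[row1 + c1];
  if (r1 >= 0 && c2 >= 0) B = data[row1 + c2];
  if (r2 >= 0 && c1 >= 0) C = data[row2 + c1];
  if (r2 >= 0 && c2 >= 0) D = data[row2 + c2];

  return std::max(0.0f, A - B - C + D);
}
```

An optimizing compiler may well perform this hoisting on its own, which is one more reason to measure before keeping the change.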

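Special-case the edges so that you don't need to check for them in every row and column. I'm assuming this call sits in a nested loop and is called a lot. The function would become: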

inline float BoxIntegralNonEdge(IplImage *img, int row, int col, int rows, int cols) 
{
  float *data = (float *) img->imageData;
  int step = img->widthStep/sizeof(float);

  // The subtraction by one for row/col is because row/col is inclusive.
  int r1 = row - 1;
  int c1 = col - 1;
  int r2 = row + rows - 1;
  int c2 = col + cols - 1;

  float A(data[r1 * step + c1]), B(data[r1 * step + c2]), C(data[r2 * step + c1]), D(data[r2 * step + c2]);

  return std::max(0.f, A - B - C + D);
}

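You get rid of a conditional and a branch for each min, and two conditionals and a branch for each if. You may only call this function when the conditions are already met; check that once in the caller for the whole row instead of for each pixel.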
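One way the per-row check could be arranged is sketched below. It assumes a small hypothetical Integral struct in place of IplImage so that the example is self-contained; boxChecked, boxNonEdge and boxRow are illustrative names, not OpenSURF functions.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the integral image (the original uses IplImage).
struct Integral {
  const float *data;
  int width, height, step; // step = floats per row
};

// Checked version: clamps to the image, out-of-range corners count as 0.
float boxChecked(const Integral &im, int row, int col, int rows, int cols)
{
  const int r1 = std::min(row, im.height) - 1;
  const int c1 = std::min(col, im.width) - 1;
  const int r2 = std::min(row + rows, im.height) - 1;
  const int c2 = std::min(col + cols, im.width) - 1;
  const float A = (r1 >= 0 && c1 >= 0) ? im.data[r1 * im.step + c1] : 0.0f;
  const float B = (r1 >= 0 && c2 >= 0) ? im.data[r1 * im.step + c2] : 0.0f;
  const float C = (r2 >= 0 && c1 >= 0) ? im.data[r2 * im.step + c1] : 0.0f;
  const float D = (r2 >= 0 && c2 >= 0) ? im.data[r2 * im.step + c2] : 0.0f;
  return std::max(0.0f, A - B - C + D);
}

// Unchecked version: only valid when the whole box lies inside the image.
float boxNonEdge(const Integral &im, int row, int col, int rows, int cols)
{
  const float *top = im.data + (row - 1) * im.step;
  const float *bottom = im.data + (row + rows - 1) * im.step;
  const int c1 = col - 1, c2 = col + cols - 1;
  return std::max(0.0f, top[c1] - top[c2] - bottom[c1] + bottom[c2]);
}

// One box sum per column of a row; the range test runs once per row.
std::vector<float> boxRow(const Integral &im, int row, int rows, int cols)
{
  std::vector<float> out(im.width);
  int first = 0, last = 0; // interior columns [first, last), empty by default
  if (rows >= 1 && cols >= 1 && row >= 1 &&
      row + rows <= im.height && cols <= im.width) {
    first = 1;
    last = im.width - cols + 1;
  }
  for (int col = 0; col < first; ++col)        // left edge
    out[col] = boxChecked(im, row, col, rows, cols);
  for (int col = first; col < last; ++col)     // interior, no checks
    out[col] = boxNonEdge(im, row, col, rows, cols);
  for (int col = last; col < im.width; ++col)  // right edge or edge row
    out[col] = boxChecked(im, row, col, rows, cols);
  return out;
}
```

Rows that touch the top or bottom of the image fall back to the checked version for every column, so the fast path is only taken where it is known to be safe.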

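I wrote up some tips for optimizing image processing when you have to do work on every pixel:

Other points from the blog:

1. You are recalculating a position in the image data with 2 multiplies (indexing is a multiplication); you should be incrementing a pointer instead.

2. Instead of passing in img, row, rows, col and cols, pass in pointers to the exact pixels to process, which you get by incrementing pointers rather than indexing.

3. If you don't do the above, step is the same for all pixels, so calculate it in the caller and pass it in. If you do 1 and 2, you won't need step at all.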
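The pointer-based approach in points 1 and 2 can be sketched as follows, assuming the boxes of one row all lie fully inside the image and consecutive boxes are one column apart. boxRowPointers and boxRowInterior are hypothetical names.

```cpp
#include <algorithm>

// Hypothetical row kernel: writes `count` box sums for consecutive columns.
// a, b, c, d point at the four corners of the first box (top-left,
// top-right, bottom-left, bottom-right); all reads must stay in bounds.
void boxRowPointers(const float *a, const float *b, const float *c,
                    const float *d, float *out, int count)
{
  for (int i = 0; i < count; ++i) {
    out[i] = std::max(0.0f, *a - *b - *c + *d);
    // One increment per corner replaces r * step + c for every read.
    ++a; ++b; ++c; ++d;
  }
}

// The caller pays for the multiplications once per row, not once per pixel.
void boxRowInterior(const float *data, int step, int row, int col,
                    int rows, int cols, float *out, int count)
{
  const float *top = data + (row - 1) * step + (col - 1);
  const float *bottom = top + rows * step;
  boxRowPointers(top, top + cols, bottom, bottom + cols, out, count);
}
```

The kernel itself never sees step, which matches the remark in point 3.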

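The compiler probably handles inlining automatically where it is appropriate.

Without knowing anything about the context: is the if (r1 >= 0 && c1 >= 0) check necessary?

Isn't it required that the row and col parameters are greater than 0?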

float BoxIntegral(IplImage *img, int row, int col, int rows, int cols) 
{
  assert(row > 0 && col > 0);
  float *data = (float*)img->imageData; // Don't use C-style casts
  int step = img->widthStep/sizeof(float);

  // Is the min check really necessary?
  int r1 = std::min(row,          img->height) - 1;
  int c1 = std::min(col,          img->width)  - 1;
  int r2 = std::min(row + rows,   img->height) - 1;
  int c2 = std::min(col + cols,   img->width)  - 1;

  int r1_step = r1 * step;
  int r2_step = r2 * step;

  float A = data[r1_step + c1];
  float B = data[r1_step + c2];
  float C = data[r2_step + c1];
  float D = data[r2_step + c2];

  return std::max(0.0f, A - B - C + D);
}

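I'm not sure whether your problem lends itself to this, but SIMD could allow you to perform multiple operations on your image at once and give you a good performance improvement. I'm assuming you are inlining and optimizing because you perform the operation many times. Take a look at:

The compiler does have some support for NEON if the correct flags are enabled, but you will probably need to roll some of it yourself.

Edit

To get compiler support for NEON you need to use the compiler flag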

-mfpu=neon
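A hand-rolled version might look like the sketch below. It uses NEON intrinsics from arm_neon.h when they are available and falls back to scalar code otherwise. It assumes consecutive outputs read consecutive columns (a layer step of 1) and that every corner is inside the image; with a larger step the loads would need to be strided instead. boxRowSimd is a hypothetical name.

```cpp
#include <algorithm>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical kernel: out[i] = max(0, a[i] - b[i] - c[i] + d[i]) for
// `count` adjacent interior boxes, where a..d point at the four corners
// of the first box.
void boxRowSimd(const float *a, const float *b, const float *c,
                const float *d, float *out, int count)
{
  int i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  const float32x4_t zero = vdupq_n_f32(0.0f);
  for (; i + 4 <= count; i += 4) {
    // Four boxes per iteration: load each corner for 4 adjacent columns.
    float32x4_t sum = vsubq_f32(vld1q_f32(a + i), vld1q_f32(b + i));
    sum = vsubq_f32(sum, vld1q_f32(c + i));
    sum = vaddq_f32(sum, vld1q_f32(d + i));
    vst1q_f32(out + i, vmaxq_f32(sum, zero));
  }
#endif
  // Scalar tail, and the whole loop when NEON is not available.
  for (; i < count; ++i)
    out[i] = std::max(0.0f, a[i] - b[i] - c[i] + d[i]);
}
```

Whether this beats what the compiler generates from the scalar loop is again something to measure on the device.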

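You are not really interested in the four variables A, B, C and D, only in the combination A - B - C + D.

Try accumulating straight into a single result (the snippet beginning float result(0.0f); further down).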

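Some of the examples initialize A, B, C and D directly and skip the zero initialization, but that is functionally different from your original code in some ways. I would do this instead: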

inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols)  {

    const float *data = (float *) img->imageData;
    const int step = img->widthStep/sizeof(float);

    // The subtraction by one for row/col is because row/col is inclusive.
    const int r1 = std::min(row,          img->height) - 1;
    const int r2 = std::min(row + rows,   img->height) - 1;
    const int c1 = std::min(col,          img->width)  - 1;
    const int c2 = std::min(col + cols,   img->width)  - 1;

    const float A = (r1 >= 0 && c1 >= 0) ? data[r1 * step + c1] : 0.0f;
    const float B = (r1 >= 0 && c2 >= 0) ? data[r1 * step + c2] : 0.0f;
    const float C = (r2 >= 0 && c1 >= 0) ? data[r2 * step + c1] : 0.0f;
    const float D = (r2 >= 0 && c2 >= 0) ? data[r2 * step + c2] : 0.0f;

    return std::max(0.f, A - B - C + D);
}

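Like your original code, this gives A, B, C and D a value from data[] if the condition is true, or 0.0f if it is false. I would also (as I have shown) use const wherever it is appropriate. Many compilers cannot improve code much based on const-ness, but it certainly does no harm to give the compiler more information about the data it is operating on. Finally, I reordered the r1, r2, c1 and c2 variables to encourage reuse of the fetched width and height.

Obviously you need to profile to determine whether any of this is an improvement.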

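The only correct answer to both questions is: measure. Yes, take a look at the recent C++ questions; there was one on the speed of vectors versus arrays, and its code shows how to profile using Boost timers. You could also look at graphics.stanford.edu/~seander/bithacks.html; many of the small tricks there can give you faster methods.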
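The answer above refers to Boost timers. As a minimal sketch of the same idea, the helper below swaps in std::chrono::steady_clock from the standard library, which needs a C++11 compiler. meanNanoseconds is a hypothetical name.

```cpp
#include <chrono>

// Hypothetical helper: runs `fn` `iterations` times and returns the mean
// wall-clock time per call in nanoseconds.
template <typename Fn>
double meanNanoseconds(Fn fn, int iterations)
{
  using clock = std::chrono::steady_clock;
  const clock::time_point start = clock::now();
  for (int i = 0; i < iterations; ++i)
    fn();
  const clock::time_point stop = clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / iterations;
}
```

When timing something like BoxIntegral, store each result somewhere the optimizer cannot discard (a volatile variable, for instance), otherwise the call being measured may be removed entirely. Run each variant many times on the target device and compare the means.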

Accumulating A - B - C + D directly into one variable:

float result(0.0f);
if (r1 >= 0 && c1 >= 0) result += data[r1 * step + c1];
if (r1 >= 0 && c2 >= 0) result -= data[r1 * step + c2];
if (r2 >= 0 && c1 >= 0) result -= data[r2 * step + c1];
if (r2 >= 0 && c2 >= 0) result += data[r2 * step + c2];

if (result > 0.0f) return result;
return 0.0f;
