C++ 发生在对齐的地址。如有必要,您可以用与上面代码中最后几个元素相同的方式处理前几个未对齐的元素

C++ 发生在对齐的地址。如有必要,您可以用与上面代码中最后几个元素相同的方式处理前几个未对齐的元素,c++,performance,optimization,C++,Performance,Optimization,另外要注意的是内存带宽。如果ClusterCentroid结构中有几个成员在循环过程中没有使用,那么您将从内存中读取比实际需要多得多的数据,因为内存是在缓存线大小的块中读取的,每个块都是64字节。太糟糕了,我们不能使用SSE4.1,但很好,SSE2就是这样。我没有对此进行测试,只是对其进行编译以查看是否存在语法错误,并查看程序集是否有意义(基本上没有问题,尽管GCC溢出了min_index,即使没有使用一些xmm寄存器,也不确定为什么会发生这种情况) 由此 min_index = _mm_ble

另外要注意的是内存带宽。如果ClusterCentroid结构中有几个成员在循环过程中没有使用,那么您将从内存中读取比实际需要多得多的数据,因为内存是在缓存线大小的块中读取的,每个块都是64字节。

太糟糕了,我们不能使用SSE4.1,但很好,SSE2就是这样。我没有对此进行测试,只是对其进行编译以查看是否存在语法错误,并查看程序集是否有意义(基本上没有问题,尽管GCC溢出了
min_index
,即使没有使用一些
xmm
寄存器,也不确定为什么会发生这种情况)

由此

min_index = _mm_blendv_epi8(min_index, index, mask);
这是一个asm版本,专为,经过测试(似乎有效)

在C++中:

extern "C" int find_closest(int n, float** points, float* reference_point);

一种可能的微观优化:将min_dist和min_索引存储在局部变量中。编译器可能需要以您编写的方式更频繁地写入内存;在某些体系结构上,这会对性能产生很大影响。另一个例子见

亚当斯建议同时进行4次比较,这也是一个很好的建议

然而,你最好的加速将来自减少你必须检查的质心的数量。理想情况下,围绕质心构建kd树(或类似树),然后查询该树以找到最近的点

如果您没有任何树构建代码,下面是我最喜欢的“穷人”最近点搜索:

Sort the points by one coordinate, e.g. cent.pos[0]
Pick a starting index for the query point (pt)
Iterate forwards through the candidate points until you reach the end, OR when abs(pt.pos[0] - cent.pos[0]) > min_dist
Repeat the previous step going the opposite direction.
搜索的额外停止条件意味着你应该跳过相当多的分数;你还可以保证不会跳过任何比你已经找到的最好成绩更接近的分数

对于您的代码,这看起来像

// sort centroid by x coordinate.
min_index = -1;
min_dist = numeric_limits<float>::max();

// pick the start index. This works well if the points are evenly distributed.
float min_x = centroids[0].pos[0];
float max_x = centroids[num_centroids-1].pos[0];
float cur_x = pt.pos[0];
float t = (max_x - cur_x) / (max_x - min_x);
// TODO clamp t between 0 and 1
int start_index = int(t * float(num_centroids))

// Forward search
for (int i=start_index ; i < num_centroids; ++i)
{
    const ClusterCentroid& cent = centroids[i];
    if (fabs(cent.pos[0] - pt.pos[0]) > min_i)
        // Everything to the right of this must be further min_dist, so break.
        // This is where the savings comes from!
        break; 
    const float dist = ...;
    if (dist < min_dist)
    {
        min_dist = dist;
        min_index = i;
    }
}

// Backwards search
for (int i=start_index ; i >= 0; --i)
{
    // same as above
}
pt.min_dist = min_dist
pt.min_index = min_index
//按x坐标对质心排序。
最小指数=-1;
最小距离=数值极限::最大();
//选择开始索引。如果点分布均匀,则效果良好。
float min_x=质心[0]。位置[0];
float max_x=质心[num_质心-1]。位置[0];
浮动电流x=pt.pos[0];
浮动t=(最大值x-当前值x)/(最大值x-最小值x);
//TODO卡箍t介于0和1之间
int start_index=int(t*float(num_质心))
//向前搜索
对于(int i=开始索引;imin_i)
//这右边的东西必须再远一点,所以要打破。
//这就是储蓄的来源!
打破
常量浮点距离=。。。;
如果(距离<最小距离)
{
最小距离=距离;
最小指数=i;
}
}
//反向搜索
对于(int i=start_index;i>=0;--i)
{
//同上
}
pt.min\U dist=min\U dist
pt.min\u索引=min\u索引
(请注意,这假定您正在计算点之间的距离,但您的部件表明这是距离的平方。相应地调整打断条件)


构建树或排序质心的开销很小,但这应该通过在更大的循环中加快计算(超过点数)来抵消。

看到有人请求性能帮助,说他们已经分析并找到了热点,这让人耳目一新。这将是微小的改进,但您可以将循环的第一次迭代提升出来,只需将min_索引和min_dist初始化为第一个质心。检查它没有意义;你知道答案是什么。@SimonAndréForsberg:当然,你必须至少添加整个函数体,包括距离计算、点和质心的定义,但为了对性能做出有意义的陈述,这无论如何都是非常充分的。你有多确定这是罪魁祸首?许多剖析者会指出“需要很长时间才能产生价值的消费者”是罪魁祸首,因为它将被长期搁置。无论如何,如果你发布了距离计算,我会为你写一个AVX版本(包括“分支”,因为它不是分支),你看得完全错了——你需要优化算法,而不是优化检查。微基准<算法。如果不天真地实现算法,您可以得到显著的提升——这里有两篇论文可以让您开始学习——它们还引用了许多其他好东西。还有-这是一个简单但有效的实现,你可以从@Ike阅读和学习:对不起,这并不能回答你的问题,但是a)你在什么机器上运行这个,b)你为什么要使用这样一个古老的编译器?我向你保证,与我们建议的大多数优化相比,仅仅切换到当前编译器对性能的影响更大,因为您的编译器根本不知道有什么机器指令。另外,请在问题中提及编译器、操作系统和硬件的类型。到目前为止,我认为我们正在处理一些最新的技术。如果你想衡量改进,请添加一个评论。我想我们都想知道它是怎么回事。我不明白
bitselect
dist
pt.min\u dist
不是
float
类型的吗?浮点数上的位操作(重新解释转换)是否定义了行为?实际上,不允许从float转换为int,我认为从
float*
转换为
int*
并通过该指针访问值是不正确的。但是,如果有人能告诉我,g++4.9 for x64是否会在存在其他优化的情况下“按预期”编译这样的代码(可能关闭了严格别名?),我会很满意。顺便说一句:浮点运算根本没有定义位运算符-这就是为什么我要问转换为int的原因。@Ike除了visual studio之外,我没有在其他任何东西上尝试过浮点运算版本,但它的int版本在unbuntu、android和windows(以及在这些处理器上:ARM、x86、x64)上运行良好这是真的。但是,编译器并不总是正确的。只有一定程度的复杂性
        +const bool found_closer = min_dist_squared > dist_squared;
        +pt.centroid = bitselect(found_closer, c, pt.centroid);
        +min_dist_squared = bitselect(found_closer, dist_squared, min_dist_squared);
    // New version of Centroids::find_nearest (from harold's solution):
    int find_nearest(const float* pt_xyz) const
    {
        __m128i min_index = _mm_set_epi32(3, 2, 1, 0);
        __m128 xdif = _mm_sub_ps(_mm_set1_ps(pt_xyz[0]), _mm_load_ps(cen_x));
        __m128 ydif = _mm_sub_ps(_mm_set1_ps(pt_xyz[1]), _mm_load_ps(cen_y));
        __m128 zdif = _mm_sub_ps(_mm_set1_ps(pt_xyz[2]), _mm_load_ps(cen_z));
        __m128 min_dist = _mm_add_ps(_mm_add_ps(_mm_mul_ps(xdif, xdif), 
                                                _mm_mul_ps(ydif, ydif)), 
                                                _mm_mul_ps(zdif, zdif));
        __m128i index = min_index;
        for (int i=4; i < num_centroids; i += 4) 
        {
            xdif = _mm_sub_ps(_mm_set1_ps(pt_xyz[0]), _mm_load_ps(cen_x + i));
            ydif = _mm_sub_ps(_mm_set1_ps(pt_xyz[1]), _mm_load_ps(cen_y + i));
            zdif = _mm_sub_ps(_mm_set1_ps(pt_xyz[2]), _mm_load_ps(cen_z + i));
            __m128 dist = _mm_add_ps(_mm_add_ps(_mm_mul_ps(xdif, xdif), 
                                                _mm_mul_ps(ydif, ydif)), 
                                                _mm_mul_ps(zdif, zdif));
            __m128i mask = _mm_castps_si128(_mm_cmplt_ps(dist, min_dist));
            min_dist = _mm_min_ps(min_dist, dist);
            min_index = _mm_or_si128(_mm_and_si128(index, mask), 
                                     _mm_andnot_si128(mask, min_index));
            index = _mm_add_epi32(index, _mm_set1_epi32(4));
        }

        ALIGN16 float mdist[4];
        ALIGN16 uint32_t mindex[4];
        _mm_store_ps(mdist, min_dist);
        _mm_store_si128((__m128i*)mindex, min_index);

        float closest = mdist[0];
        int closest_i = mindex[0];
        for (int i=1; i < 4; i++)
        {
            if (mdist[i] < closest) 
            {
                closest = mdist[i];
                closest_i = mindex[i];
            }
        }
        return closest_i;
    }
int find_nearest_simd(const float* pt_xyz) const
{
    __m128i min_index = _mm_set_epi32(3, 2, 1, 0);
    __m128 pt_xxxx = _mm_set1_ps(pt_xyz[0]);
    __m128 pt_yyyy = _mm_set1_ps(pt_xyz[1]);
    __m128 pt_zzzz = _mm_set1_ps(pt_xyz[2]);

    __m128 xdif = _mm_sub_ps(pt_xxxx, _mm_load_ps(cen_x));
    __m128 ydif = _mm_sub_ps(pt_yyyy, _mm_load_ps(cen_y));
    __m128 zdif = _mm_sub_ps(pt_zzzz, _mm_load_ps(cen_z));
    __m128 min_dist = _mm_add_ps(_mm_add_ps(_mm_mul_ps(xdif, xdif), 
                                            _mm_mul_ps(ydif, ydif)), 
                                            _mm_mul_ps(zdif, zdif));
    __m128i index = min_index;
    for (int i=4; i < num_centroids; i += 4) 
    {
        xdif = _mm_sub_ps(pt_xxxx, _mm_load_ps(cen_x + i));
        ydif = _mm_sub_ps(pt_yyyy, _mm_load_ps(cen_y + i));
        zdif = _mm_sub_ps(pt_zzzz, _mm_load_ps(cen_z + i));
        __m128 dist = _mm_add_ps(_mm_add_ps(_mm_mul_ps(xdif, xdif), 
                                            _mm_mul_ps(ydif, ydif)), 
                                            _mm_mul_ps(zdif, zdif));
        index = _mm_add_epi32(index, _mm_set1_epi32(4));
        __m128i mask = _mm_castps_si128(_mm_cmplt_ps(dist, min_dist));
        min_dist = _mm_min_ps(min_dist, dist);
        min_index = _mm_or_si128(_mm_and_si128(index, mask), 
                                 _mm_andnot_si128(mask, min_index));
    }

    ALIGN16 float mdist[4];
    ALIGN16 uint32_t mindex[4];
    _mm_store_ps(mdist, min_dist);
    _mm_store_si128((__m128i*)mindex, min_index);

    float closest = mdist[0];
    int closest_i = mindex[0];
    for (int i=1; i < 4; i++)
    {
        if (mdist[i] < closest) 
        {
            closest = mdist[i];
            closest_i = mindex[i];
        }
    }
    return closest_i;
}
inline static int bitselect(int condition, int truereturnvalue, int falsereturnvalue)
{
    return (truereturnvalue & -condition) | (falsereturnvalue & ~(-condition)); //a when TRUE and b when FALSE
}

inline static float bitselect(int condition, float truereturnvalue, float falsereturnvalue)
{
    //Reinterpret floats. Would work because it's just a bit select, no matter the actual value
    int& at = reinterpret_cast<int&>(truereturnvalue);
    int& af = reinterpret_cast<int&>(falsereturnvalue);
    int res = (at & -condition) | (af & ~(-condition)); //a when TRUE and b when FALSE
    return  reinterpret_cast<float&>(res);
}
for (int i=0; i < num_centroids; ++i)
{
  const ClusterCentroid& cent = centroids[i];
  const float dist = ...;
  bool isSmaeller = dist < pt.min_dist;

  //use same value if not smaller
  pt.min_index = bitselect(isSmaeller, i, pt.min_index);
  pt.min_dist = bitselect(isSmaeller, dist, pt.min_dist);
}
std::vector<float> centDists(num_centroids); //<-- one for each thread. 
for (size_t p=0; p<num_points; ++p) {
    Point& pt = points[p];
    for (size_t c=0; c<num_centroids; ++c) {
        const float dist = ...;
        centDists[c]=dist;
    }
    pt.min_idx it= min_element(centDists.begin(),centDists.end())-centDists.begin();    
}
int g(int, int);

int f(const int *arr)
{
    int min = 10000, minIndex = -1;
    for ( int i = 0; i < 1000; ++i )
    {
        if ( arr[i] < min )
        {
            min = arr[i];
            minIndex = i;
        }
    }
    return g(min, minIndex);
}
movl    (%rdi,%rax,4), %ecx
movl    %edx, %r8d
cmpl    %edx, %ecx
cmovle  %ecx, %r8d
cmovl   %eax, %esi
addq    $1, %rax
  float MinimumDistance(const float *values, int count)
  {
    __m128 min = _mm_set_ps(FLT_MAX, FLT_MAX, FLT_MAX, FLT_MAX);
    int i=0;
    for (; i < count - 3; i+=4)
    {
        __m128 distances = _mm_loadu_ps(&values[i]);
        min = _mm_min_ps(min, distances);
    }
    // Combine the four separate minimums to a single value
    min = _mm_min_ps(min, _mm_shuffle_ps(min, min, _MM_SHUFFLE(2, 3, 0, 1)));
    min = _mm_min_ps(min, _mm_shuffle_ps(min, min, _MM_SHUFFLE(1, 0, 3, 2)));

    // Deal with the last 0-3 elements the slow way
    float result = FLT_MAX;
    if (count > 3) _mm_store_ss(&result, min);
    for (; i < count; i++)
    {
        result = min(values[i], result);
    }

    return result;
  }
int find_closest(float *x, float *y, float *z,
                 float pt_x, float pt_y, float pt_z, int n) {
    __m128i min_index = _mm_set_epi32(3, 2, 1, 0);
    __m128 xdif = _mm_sub_ps(_mm_set1_ps(pt_x), _mm_load_ps(x));
    __m128 ydif = _mm_sub_ps(_mm_set1_ps(pt_y), _mm_load_ps(y));
    __m128 zdif = _mm_sub_ps(_mm_set1_ps(pt_z), _mm_load_ps(z));
    __m128 min_dist = _mm_add_ps(_mm_add_ps(_mm_mul_ps(xdif, xdif), 
                                            _mm_mul_ps(ydif, ydif)), 
                                            _mm_mul_ps(zdif, zdif));
    __m128i index = min_index;
    for (int i = 4; i < n; i += 4) {
        xdif = _mm_sub_ps(_mm_set1_ps(pt_x), _mm_load_ps(x + i));
        ydif = _mm_sub_ps(_mm_set1_ps(pt_y), _mm_load_ps(y + i));
        zdif = _mm_sub_ps(_mm_set1_ps(pt_z), _mm_load_ps(z + i));
        __m128 dist = _mm_add_ps(_mm_add_ps(_mm_mul_ps(xdif, xdif), 
                                            _mm_mul_ps(ydif, ydif)), 
                                            _mm_mul_ps(zdif, zdif));
        index = _mm_add_epi32(index, _mm_set1_epi32(4));
        __m128i mask = _mm_castps_si128(_mm_cmplt_ps(dist, min_dist));
        min_dist = _mm_min_ps(min_dist, dist);
        min_index = _mm_or_si128(_mm_and_si128(index, mask), 
                                 _mm_andnot_si128(mask, min_index));
    }
    float mdist[4];
    _mm_store_ps(mdist, min_dist);
    uint32_t mindex[4];
    _mm_store_si128((__m128i*)mindex, min_index);
    float closest = mdist[0];
    int closest_i = mindex[0];
    for (int i = 1; i < 4; i++) {
        if (mdist[i] < closest) {
            closest = mdist[i];
            closest_i = mindex[i];
        }
    }
    return closest_i;
}
min_index = _mm_or_si128(_mm_and_si128(index, mask), 
                         _mm_andnot_si128(mask, min_index));
min_index = _mm_blendv_epi8(min_index, index, mask);
bits 64

section .data

align 16
centroid_four:
    dd 4, 4, 4, 4
centroid_index:
    dd 0, 1, 2, 3

section .text

global find_closest

proc_frame find_closest
    ;
    ;   arguments:
    ;       ecx: number of points (multiple of 4 and at least 4)
    ;       rdx -> array of 3 pointers to floats (x, y, z) (the points)
    ;       r8 -> array of 3 floats (the reference point)
    ;
    alloc_stack 0x58
    save_xmm128 xmm6, 0
    save_xmm128 xmm7, 16
    save_xmm128 xmm8, 32
    save_xmm128 xmm9, 48
[endprolog]
    movss xmm0, [r8]
    shufps xmm0, xmm0, 0
    movss xmm1, [r8 + 4]
    shufps xmm1, xmm1, 0
    movss xmm2, [r8 + 8]
    shufps xmm2, xmm2, 0
    ; pointers to x, y, z in r8, r9, r10
    mov r8, [rdx]
    mov r9, [rdx + 8]
    mov r10, [rdx + 16]
    ; reference point is in xmm0, xmm1, xmm2 (x, y, z)
    movdqa xmm3, [rel centroid_index]   ; min_index
    movdqa xmm4, xmm3                   ; current index
    movdqa xmm9, [rel centroid_four]     ; index increment
    paddd xmm4, xmm9
    ; calculate initial min_dist, xmm5
    movaps xmm5, [r8]
    subps xmm5, xmm0
    movaps xmm7, [r9]
    subps xmm7, xmm1
    movaps xmm8, [r10]
    subps xmm8, xmm2
    mulps xmm5, xmm5
    mulps xmm7, xmm7
    mulps xmm8, xmm8
    addps xmm5, xmm7
    addps xmm5, xmm8
    add r8, 16
    add r9, 16
    add r10, 16
    sub ecx, 4
    jna _tail
_loop:
    movaps xmm6, [r8]
    subps xmm6, xmm0
    movaps xmm7, [r9]
    subps xmm7, xmm1
    movaps xmm8, [r10]
    subps xmm8, xmm2
    mulps xmm6, xmm6
    mulps xmm7, xmm7
    mulps xmm8, xmm8
    addps xmm6, xmm7
    addps xmm6, xmm8
    add r8, 16
    add r9, 16
    add r10, 16
    movaps xmm7, xmm6
    cmpps xmm6, xmm5, 1
    minps xmm5, xmm7
    movdqa xmm7, xmm6
    pand xmm6, xmm4
    pandn xmm7, xmm3
    por xmm6, xmm7
    movdqa xmm3, xmm6
    paddd xmm4, xmm9
    sub ecx, 4
    ja _loop
_tail:
    ; calculate horizontal minumum
    pshufd xmm0, xmm5, 0xB1
    minps xmm0, xmm5
    pshufd xmm1, xmm0, 0x4E
    minps xmm0, xmm1
    ; find index of the minimum
    cmpps xmm0, xmm5, 0
    movmskps eax, xmm0
    bsf eax, eax
    ; index into xmm3, sort of
    movaps [rsp + 64], xmm3
    mov eax, [rsp + 64 + rax * 4]
    movaps xmm9, [rsp + 48]
    movaps xmm8, [rsp + 32]
    movaps xmm7, [rsp + 16]
    movaps xmm6, [rsp]
    add rsp, 0x58
    ret
endproc_frame
extern "C" int find_closest(int n, float** points, float* reference_point);
Sort the points by one coordinate, e.g. cent.pos[0]
Pick a starting index for the query point (pt)
Iterate forwards through the candidate points until you reach the end, OR when abs(pt.pos[0] - cent.pos[0]) > min_dist
Repeat the previous step going the opposite direction.
// sort centroid by x coordinate.
min_index = -1;
min_dist = numeric_limits<float>::max();

// pick the start index. This works well if the points are evenly distributed.
float min_x = centroids[0].pos[0];
float max_x = centroids[num_centroids-1].pos[0];
float cur_x = pt.pos[0];
float t = (max_x - cur_x) / (max_x - min_x);
// TODO clamp t between 0 and 1
int start_index = int(t * float(num_centroids))

// Forward search
for (int i=start_index ; i < num_centroids; ++i)
{
    const ClusterCentroid& cent = centroids[i];
    if (fabs(cent.pos[0] - pt.pos[0]) > min_i)
        // Everything to the right of this must be further min_dist, so break.
        // This is where the savings comes from!
        break; 
    const float dist = ...;
    if (dist < min_dist)
    {
        min_dist = dist;
        min_index = i;
    }
}

// Backwards search
for (int i=start_index ; i >= 0; --i)
{
    // same as above
}
pt.min_dist = min_dist
pt.min_index = min_index