How to parallelize a nearest neighbor search with OpenMP in C++


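Basically, I have a collection, std::vector<std::pair<std::vector<float>, unsigned int>>, which contains pairs of templates (std::vector<float> of size 512, i.e. 2048 bytes) and their corresponding identifiers (unsigned int).

I am writing a function that is given a template and must return the identifier of the most similar template in the collection. I use the dot product to compute the similarity.

My naive implementation looks like this: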

// Should return false if no match is found (ie. similarity is 0 for all templates in collection)
bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    bool found = false;
    similarity = 0.f;

    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data();
        float consinSimilarity = getSimilarity(data, candidateTemplate, length); // computes the cosine similarity between two vectors; the implementation depends on the architecture

        if (consinSimilarity > similarity) {
            found = true;
            similarity = consinSimilarity;
            label = collection[i].second;
        }
    }

    return found;
}


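How can I use parallelization to speed this up? My collection can contain millions of templates. I have read that you can add #pragma omp parallel for reduction, but I am not entirely sure how to use it (or whether it is even the best option).

Another note: for my dot product implementation, I use an AVX and FMA implementation when the underlying architecture supports it. Since the number of SIMD registers is limited, will this affect performance once the search is parallelized?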

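Since we don't have access to an example that actually compiles (which would have been nice), I have not tried to compile the example below. Apart from possible minor typos, though, the general idea should be clear.

The task is to find the highest similarity value and the corresponding label. We can indeed use a reduction for this, but because we need the maximum of one value together with the label that belongs to it, we store both in a std::pair and implement the combination as a custom reduction in OpenMP.

I have slightly rewritten your code; the naming of the variable (temp) possibly makes things a bit harder to read. Basically, the search runs in parallel, so each thread finds its own best value; we then ask OpenMP to pick the best result across the threads (the reduction), and we are done.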

#include <utility> // std::pair

// Reduce by keeping the pair with the larger similarity, so the label travels with its value; this is why we use a std::pair.
void reduce_custom(std::pair<float, unsigned int>& output, const std::pair<float, unsigned int>& input) {
    if (input.first > output.first) output = input;
}
// Declare an OpenMP reduction over our pair type that uses the custom reduction function.
#pragma omp declare reduction(custom_reduction : \
    std::pair<float, unsigned int>: \
    reduce_custom(omp_out, omp_in)) \
    initializer(omp_priv(omp_orig))

bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    std::pair<float, unsigned int> temp(0.0, label); //Stores thread local similarity and corresponding best label. 

#pragma omp parallel for reduction(custom_reduction:temp)
    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data(); 
        float consinSimilarity = getSimilarity(data, candidateTemplate, length);

        if (consinSimilarity > temp.first) {
            temp.first = consinSimilarity;
            temp.second = collection[i].second;
        }
    }

    if (temp.first > 0.f) {
        similarity = temp.first;
        label = temp.second;
        return true;
    }

    return false;
}
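The same idea (each thread keeps its own best match, then the per-thread results are combined) can also be written by hand, which may help on toolchains that do not support declare reduction (it was introduced in OpenMP 4.0). The following is only a sketch under assumptions: Entry, dotProduct and identifyManual are hypothetical names, dotProduct is a plain scalar stand-in for the question's getSimilarity, and the collection is passed as a parameter rather than being a member or global as in the question.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Assumed element type of the collection: (template, identifier).
using Entry = std::pair<std::vector<float>, unsigned int>;

// Hypothetical scalar stand-in for getSimilarity from the question.
float dotProduct(const float* a, const float* b, unsigned int length) {
    float sum = 0.f;
    for (unsigned int i = 0; i < length; ++i) sum += a[i] * b[i];
    return sum;
}

// Returns false if no template has a similarity greater than 0.
bool identifyManual(const std::vector<Entry>& collection, const float* data,
                    unsigned int length, unsigned int& label, float& similarity) {
    assert(data != nullptr);

    float bestSim = 0.f;        // shared result, only written inside the critical section
    unsigned int bestLabel = 0;
    bool found = false;
    // Signed loop bound, hoisted so the loop has the canonical form OpenMP expects.
    const long long n = static_cast<long long>(collection.size());

    #pragma omp parallel
    {
        // Thread-private best match.
        float localSim = 0.f;
        unsigned int localLabel = 0;
        bool localFound = false;

        #pragma omp for nowait
        for (long long i = 0; i < n; ++i) {
            const float s = dotProduct(data, collection[i].first.data(), length);
            if (s > localSim) {
                localSim = s;
                localLabel = collection[i].second;
                localFound = true;
            }
        }

        // Combine step: one thread at a time merges its result into the shared one.
        #pragma omp critical
        {
            if (localFound && localSim > bestSim) {
                bestSim = localSim;
                bestLabel = localLabel;
                found = true;
            }
        }
    }

    similarity = bestSim;
    if (found) label = bestLabel;
    return found;
}
```

If the file is compiled without OpenMP enabled, the pragmas are ignored and the function simply runs serially. When several templates share the maximum similarity, which of their labels is returned can depend on the thread scheduling, both here and in the reduction version.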


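Regarding the limited number of SIMD registers: how many there are depends on the specific CPU you are using. To the best of my understanding, each core has its own set of vector registers, so as long as you were not using more registers than were available before, you should be fine now as well. Besides, AVX-512 for instance provides 32 vector registers per core, and such cores have up to two arithmetic units for vector operations, so it is not easy to exhaust the compute resources. You are more likely to suffer from poor memory locality (particularly in your case, with the vectors being stored all over the place). I might of course be wrong; if so, please feel free to correct me in the comments.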
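To illustrate the memory locality point: with a std::vector of std::vector<float>, every template lives in its own heap allocation. One possible alternative, sketched here under the assumption that all templates have the same length, is to store them back to back in a single buffer so the search walks memory sequentially. FlatCollection, add and row are hypothetical names, not part of the question's code, and whether this helps in practice should be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stores all templates contiguously (row-major) with a parallel array of labels.
struct FlatCollection {
    std::size_t dim;                   // length of every template, e.g. 512
    std::vector<float> values;         // values.size() == labels.size() * dim
    std::vector<unsigned int> labels;  // labels[i] belongs to row i

    explicit FlatCollection(std::size_t d) : dim(d) {}

    // Appends one template and its identifier.
    void add(const std::vector<float>& tmpl, unsigned int label) {
        assert(tmpl.size() == dim);
        values.insert(values.end(), tmpl.begin(), tmpl.end());
        labels.push_back(label);
    }

    std::size_t size() const { return labels.size(); }

    // Pointer to the first float of template i; valid until the next add().
    const float* row(std::size_t i) const {
        assert(i < size());
        return values.data() + i * dim;
    }
};
```

In the search loop, collection[i].first.data() would then become flat.row(i) and collection[i].second would become flat.labels[i], where flat is a FlatCollection instance.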
