Computing the sum of array values in parallel with Metal and Swift

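I am trying to calculate the sum of a large array in parallel with Metal and Swift.

Is there a good way to do it?

My plan is to divide the array into sub-arrays, calculate the sum of each sub-array in parallel, and then, once the parallel calculation has finished, calculate the sum of the sub-sums.

For example, if I have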

array = [a0,....an]

I divide the array into sub-arrays:

array_1 = [a_0,...a_i],
array_2 = [a_i+1,...a_2i],
....
array_n/i = [a_n-1, ... a_n]

The sums of these sub-arrays are calculated in parallel, and I get

sum_1, sum_2, sum_3, ... sum_n/i

Finally, I just calculate the sum of the sub-sums.
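The plan described above can be tried on the CPU before involving Metal at all. The following is a minimal sketch, assuming `Int` elements and using Grand Central Dispatch; the function name `chunkedSum` and the chunk size are hypothetical choices for illustration, not part of the original code.

```swift
import Foundation

/// Splits `values` into chunks of `chunkSize` elements, sums each chunk
/// concurrently, then adds up the partial sums sequentially.
func chunkedSum(_ values: [Int], chunkSize: Int) -> Int {
    precondition(chunkSize > 0, "chunkSize must be positive")
    if values.isEmpty { return 0 }
    let chunkCount = (values.count + chunkSize - 1) / chunkSize  // rounded up
    var partialSums = [Int](repeating: 0, count: chunkCount)

    values.withUnsafeBufferPointer { input in
        partialSums.withUnsafeMutableBufferPointer { sums in
            // Each iteration writes only to its own slot, so no locking is needed.
            DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
                let start = chunk * chunkSize
                let end = min(start + chunkSize, input.count)  // last chunk may be shorter
                var sum = 0
                for i in start..<end { sum += input[i] }
                sums[chunk] = sum
            }
        }
    }
    return partialSums.reduce(0, +)
}

let total = chunkedSum(Array(1...100), chunkSize: 8)
print(total)  // 5050
```

The same structure carries over to a GPU version: one thread per chunk, one slot per partial sum, and a final pass over the partial sums.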

I created an application that runs my Metal shader, but there are some things I don't quite understand.

        var array:[[Float]] = [[1,2,3], [4,5,6], [7,8,9]]

        // get device
        let device: MTLDevice! = MTLCreateSystemDefaultDevice()

        // get library
        let defaultLibrary:MTLLibrary! = device.newDefaultLibrary()

        // queue
        let commandQueue:MTLCommandQueue! = device.newCommandQueue()

        // function
        let kernerFunction: MTLFunction! = defaultLibrary.newFunctionWithName("calculateSum")

        // pipeline with function
        let pipelineState: MTLComputePipelineState! = try device.newComputePipelineStateWithFunction(kernerFunction)

        // buffer for function
        let commandBuffer:MTLCommandBuffer! = commandQueue.commandBuffer()

        // encode function
        let commandEncoder:MTLComputeCommandEncoder = commandBuffer.computeCommandEncoder()

        // add function to encode
        commandEncoder.setComputePipelineState(pipelineState)

        // options
        let resourceOption = MTLResourceOptions()

        let arrayBiteLength = array.count * array[0].count * sizeofValue(array[0][0])

        let arrayBuffer = device.newBufferWithBytes(&array, length: arrayBiteLength, options: resourceOption)

        commandEncoder.setBuffer(arrayBuffer, offset: 0, atIndex: 0)

        var result:[Float] = [0,0,0]

        let resultBiteLenght = sizeofValue(result[0])

        let resultBuffer = device.newBufferWithBytes(&result, length: resultBiteLenght, options: resourceOption)

        commandEncoder.setBuffer(resultBuffer, offset: 0, atIndex: 1)

        let threadGroupSize = MTLSize(width: 1, height: 1, depth: 1)

        let threadGroups = MTLSize(width: (array.count), height: 1, depth: 1)

        commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)

        commandEncoder.endEncoding()

        commandBuffer.commit()

        commandBuffer.waitUntilCompleted()

        let data = NSData(bytesNoCopy: resultBuffer.contents(), length: sizeof(Float), freeWhenDone: false)

        data.getBytes(&result, length: result.count * sizeof(Float))

        print(result)

That is my Swift code.

My shader is:

kernel void calculateSum(const device float *inFloat [[buffer(0)]],
                     device float *result [[buffer(1)]],
                     uint id [[ thread_position_in_grid ]]) {


    float * f = inFloat[id];
    float sum = 0;
    for (int i = 0 ; i < 3 ; ++i) {
        sum = sum + f[i];
    }

    result = sum;
}

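I don't know how to declare inFloat as an array of arrays. I don't know exactly what threadGroupSize and threadGroups are. I don't know what device and uint mean in the shader's parameters.

Is this the right approach?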
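On the array-of-arrays point: a Swift `[[Float]]` is not one contiguous block of floats, while a GPU buffer is, so a common approach is to flatten the rows and compute the element position from the row index. The sketch below runs on the CPU only and shows the indexing a thread with index `id` would use; the names `flat`, `rowLength` and `rowSum` are hypothetical and do not appear in the question's code.

```swift
let rows: [[Float]] = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
let rowLength = rows[0].count

// Nested arrays are flattened row by row before being copied into a buffer.
let flat: [Float] = rows.flatMap { $0 }

// Byte length to pass when creating a buffer from `flat`.
let byteLength = flat.count * MemoryLayout<Float>.stride

// What one thread would do for row `id`: element i of that row
// lives at index id * rowLength + i in the flat array.
func rowSum(_ flat: [Float], rowLength: Int, id: Int) -> Float {
    var sum: Float = 0
    for i in 0..<rowLength { sum += flat[id * rowLength + i] }
    return sum
}

let sums = (0..<rows.count).map { rowSum(flat, rowLength: rowLength, id: $0) }
print(byteLength, sums)  // 36 [6.0, 15.0, 24.0]
```

Note that three row sums need a result buffer of three floats, whereas the question's `resultBiteLenght` covers only one.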

I took the time to create a fully working example of this problem with Metal. The explanation is in the comments:

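let count = 10_000_000
let elementsPerSum = 10_000
// Data type, has to be the same as in the shader
typealias DataType = CInt

let device = MTLCreateSystemDefaultDevice()!
let library = self.library(device: device) // helper of the enclosing class, not shown here
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)

// Our data, randomly generated:
// NOTE: everything from here to "end of reconstructed part" was lost when this page was
// extracted; it is reconstructed from the kernel signature and the names used further down.
var data = (0..<count).map { _ in DataType.random(in: 0..<100) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of partial sums = count / elementsPerSum (rounded up)
let resultCount = (count + elementsPerSum - 1) / elementsPerSum
// Input data copied into a buffer, plus a zero-initialized buffer for the partial sums
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultCount, options: [])!
// The partial sums in a convenient form for adding them up later
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultCount)

let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
// Buffer indices match the [[ buffer(n) ]] attributes of the kernel
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
// (end of reconstructed part)

// One thread per partial sum => the number of threadgroups is `resultCount` / `threadExecutionWidth` (rounded up), because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
// Here we set that each threadgroup should process `threadExecutionWidth` threads; the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)

encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()

var start, end: UInt64
var result: DataType = 0

start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
    result += elem
}
end = mach_absolute_time()
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
result = 0

start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
    for elem in buffer {
        result += elem
    }
}
end = mach_absolute_time()
print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")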
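The two dispatch comments in the code above boil down to two rounded-up integer divisions. Here is that arithmetic worked through as a small sketch; the value 32 for `threadExecutionWidth` is an assumption for illustration only, since the real value comes from `pipeline.threadExecutionWidth` and depends on the GPU.

```swift
let count = 10_000_000
let elementsPerSum = 10_000
// Assumed value for illustration; query pipeline.threadExecutionWidth in real code.
let threadExecutionWidth = 32

// One partial sum, and therefore one thread, per slice of the input.
let resultCount = (count + elementsPerSum - 1) / elementsPerSum

// Enough threadgroups to cover every partial sum, rounded up.
let threadgroupCount = (resultCount + threadExecutionWidth - 1) / threadExecutionWidth

// Threads launched in total; the surplus ones find no data left to sum.
let launchedThreads = threadgroupCount * threadExecutionWidth
let idleThreads = launchedThreads - resultCount

print(resultCount, threadgroupCount, launchedThreads, idleThreads)
// 1000 32 1024 24
```

Because the dispatch is rounded up, a kernel used this way has to tolerate thread indices beyond the last partial sum.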

I tested it on my Mac, but it should work just fine on iOS.

Output:

Metal result: 494936505, time: 0.024611456
CPU result: 494936505, time: 0.163341018
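A caveat on how such timings are taken: `mach_absolute_time()` returns ticks in a machine-dependent unit, so dividing the difference by `NSEC_PER_SEC` yields seconds only on hardware where one tick equals one nanosecond (otherwise the ticks need converting with `mach_timebase_info`). A minimal sketch that avoids the conversion, assuming `DispatchTime` is acceptable as a clock; the helper name `measure` is hypothetical:

```swift
import Dispatch

/// Runs `work` once and returns its result together with the elapsed seconds.
func measure<T>(_ work: () -> T) -> (result: T, seconds: Double) {
    let start = DispatchTime.now().uptimeNanoseconds  // already in nanoseconds
    let result = work()
    let end = DispatchTime.now().uptimeNanoseconds
    return (result, Double(end - start) / 1_000_000_000)
}

let timed = measure { (1...1_000_000).reduce(0, +) }
print("CPU result: \(timed.result), time: \(timed.seconds)")
```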

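The Metal version is about 7 times faster. I'm sure you can get more speed if you implement something like divide-and-conquer with a cutoff, or similar.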
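For readers unfamiliar with the term, divide-and-conquer with a cutoff means splitting the range in half recursively and switching to a plain loop once a piece is small enough. The sketch below shows only the structure, sequentially and on the CPU; the function name `dcSum` and the cutoff of 64 are hypothetical, and this is not the answerer's implementation.

```swift
/// Recursively sums values[range]: ranges of at most `cutoff` elements are
/// summed with a plain loop, larger ones are split in half.
func dcSum(_ values: [Int], _ range: Range<Int>, cutoff: Int) -> Int {
    precondition(cutoff >= 1, "cutoff must be at least 1 so the recursion terminates")
    if range.count <= cutoff {
        var sum = 0
        for i in range { sum += values[i] }
        return sum
    }
    let mid = range.lowerBound + range.count / 2
    // The two halves are independent of each other, so they could run in parallel.
    let left = dcSum(values, range.lowerBound..<mid, cutoff: cutoff)
    let right = dcSum(values, mid..<range.upperBound, cutoff: cutoff)
    return left + right
}

let values = Array(1...1_000)
print(dcSum(values, 0..<values.count, cutoff: 64))  // 500500
```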

I ran the app on a GT 740 (384 cores) against an i7-4790 with a multithreaded vector sum implementation. Here are my figures:

Metal lap time: 19.959092
cpu MT lap time: 4.353881

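That is a 5/1 ratio in favor of the CPU, so unless you have a powerful GPU, using shaders is not worth it.

I tested the same code on an i7-3610QM with the integrated Intel HD 4000 GPU, and surprisingly the results are much better for Metal: 2/1.

Edit: after tweaking the thread parameters, I finally improved the GPU performance; it is now up to 16x the CPU.

The accepted answer is annoyingly missing the kernel that was written for it. But here is the complete program and shader, which can be run as a Swift command-line application.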

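/*
 * Command line Metal compute shader for data processing
 */
import Metal
import Foundation
//------------------------------------------------------------------------------
let count = 10_000_000
let elementsPerSum = 10_000
//------------------------------------------------------------------------------
typealias DataType = CInt // Data type, has to be the same as in the shader
//------------------------------------------------------------------------------
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)
//------------------------------------------------------------------------------
// Our data, randomly generated:
// NOTE: everything from here to "end of reconstructed part" was lost when this page was
// extracted; it is reconstructed from the kernel signature and the names used further down.
var data = (0..<count).map { _ in DataType.random(in: 0..<100) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of partial sums = count / elementsPerSum (rounded up)
let resultCount = (count + elementsPerSum - 1) / elementsPerSum
// Input data copied into a buffer, plus a zero-initialized buffer for the partial sums
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultCount, options: [])!
// The partial sums in a convenient form for adding them up later
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultCount)
let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
// Buffer indices match the [[ buffer(n) ]] attributes of the kernel
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
// (end of reconstructed part)
// One thread per partial sum => the number of threadgroups is `resultCount` / `threadExecutionWidth` (rounded up), because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
// Here we set that each threadgroup should process `threadExecutionWidth` threads; the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
//------------------------------------------------------------------------------
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
//------------------------------------------------------------------------------
var start, end: UInt64
var result: DataType = 0
//------------------------------------------------------------------------------
start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
    result += elem
}
end = mach_absolute_time()
//------------------------------------------------------------------------------
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
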
#include <metal_stdlib>
using namespace metal;

typedef unsigned int uint;
typedef int DataType;

kernel void parsum(const device DataType* data [[ buffer(0) ]],
                   const device uint& dataLength [[ buffer(1) ]],
                   device DataType* sums [[ buffer(2) ]],
                   const device uint& elementsPerSum [[ buffer(3) ]],

                   const uint tgPos [[ threadgroup_position_in_grid ]],
                   const uint tPerTg [[ threads_per_threadgroup ]],
                   const uint tPos [[ thread_position_in_threadgroup ]]) {

    uint resultIndex = tgPos * tPerTg + tPos;

    uint dataIndex = resultIndex * elementsPerSum; // Where the summation should begin
    uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength; // The index where summation should end

    for (; dataIndex < endIndex; dataIndex++)
        sums[resultIndex] += data[dataIndex];
}
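To see why the clamped `endIndex` keeps surplus threads harmless, the kernel's index arithmetic can be replayed on the CPU. This is only an emulation sketch with tiny made-up inputs; `emulateThread` is a hypothetical name, and the loop over thread indices stands in for the GPU dispatch.

```swift
/// CPU emulation of what one GPU thread with index `resultIndex` does in `parsum`.
func emulateThread(resultIndex: Int, data: [Int32], elementsPerSum: Int, sums: inout [Int32]) {
    var dataIndex = resultIndex * elementsPerSum                // where this slice starts
    let endIndex = min(dataIndex + elementsPerSum, data.count)  // clamp the last slice
    while dataIndex < endIndex {
        sums[resultIndex] += data[dataIndex]
        dataIndex += 1
    }
}

let data: [Int32] = Array(1...10)  // 10 elements
let elementsPerSum = 4             // slices: 1...4, 5...8, 9...10
let resultCount = (data.count + elementsPerSum - 1) / elementsPerSum
var sums = [Int32](repeating: 0, count: resultCount)

// Launch 4 threads although only 3 partial sums exist, as a rounded-up dispatch would.
// Thread 3 starts at index 12, which is past the end, so its loop body never runs.
for thread in 0..<4 {
    emulateThread(resultIndex: thread, data: data, elementsPerSum: elementsPerSum, sums: &sums)
}
print(sums, sums.reduce(0, +))  // [10, 26, 19] 55
```

The accumulation with `+=` also shows why the results buffer has to start out zeroed.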