Python: why don't runtimes scale with FLOPs? Pointwise multiplication vs. 2D convolution


Background

Based on the convolution theorem of the Fourier transform, a convolution in the spatial domain is equivalent to a pointwise multiplication in the Fourier domain (and vice versa). I implemented a torch.nn.Conv2d that "runs" in the Fourier domain by performing a pointwise multiplication in PyTorch instead of a convolution, using a kernel transformed to the input size (as described here: )
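For context, here is a minimal sketch of what such a frequency-domain layer can look like. This is not my actual implementation: the helper name fft_conv2d, the torch.fft.rfft2/irfft2 calls and the einsum contraction are illustrative choices, and the result is a circular convolution, so it does not reproduce nn.Conv2d's boundary handling exactly.

import torch

def fft_conv2d(x, weight):
    # x:      (B, C_in, H, W) real input
    # weight: (C_out, C_in, kH, kW) real kernel with kH <= H, kW <= W
    B, C_in, H, W = x.shape
    X = torch.fft.rfft2(x, s=(H, W))        # (B, C_in, H, W//2 + 1), complex
    K = torch.fft.rfft2(weight, s=(H, W))   # (C_out, C_in, H, W//2 + 1), complex
    # Per-channel pointwise multiplication, then a sum over the input channels
    Y = torch.einsum('bihw,oihw->bohw', X, K)
    return torch.fft.irfft2(Y, s=(H, W))    # (B, C_out, H, W), circular convolution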

Expectation and results

I found that its performance is not good, similar to:

After extensive benchmarking, the pointwise multiplication appears to be the main bottleneck of the operation. During benchmarking I excluded the FFT steps in order to isolate the layer's operation (and used an appropriately sized, saved kernel).

This is confusing, because when considering the number of FLOPs required by a 2D convolution (stride = 1) versus an element-wise multiplication:

  • Conv2d FLOPs:
    Kernel_H * Kernel_W * C_in * C_out * H * W
  • Pointwise FLOPs:
    C_in * C_out * H * W
For example, given H = 32, W = 60, C_in = 64, C_out = 256 (a quick check of the arithmetic follows the list):

  • Conv2d FLOPs (k = 16):
    16 * 16 * 32 * 60 * 64 * 256 = 8053 MFLOPs
  • Pointwise FLOPs:
    64 * 256 * 32 * 60 = 31.46 MFLOPs
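A short snippet verifying these two estimates (counting one FLOP per multiplication, as in the formulas above; the variable names are mine):

H, W, C_in, C_out, k = 32, 60, 64, 256, 16

conv_flops = k * k * C_in * C_out * H * W   # 8_053_063_680 -> ~8053 MFLOPs
pw_flops = C_in * C_out * H * W             # 31_457_280    -> ~31.46 MFLOPs

print(conv_flops / pw_flops)                # 256.0, i.e. exactly k * k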
Given this huge difference in FLOPs, I expected the 2D convolution to have a much longer runtime (I have read that GPUs are well optimized for dot-product operations).

I created a simple script to benchmark the pointwise multiplication of a torch.Tensor against torch.nn.Conv2d, since the element-wise multiplication appears to perform comparably to, or even slower than, the 2D convolution.

Below is an overview of two such benchmark results on CPU and GPU (i9-9900K, with torch.set_num_threads(1)).

Results - CPU (i9-9900K)

Results - GPU (RTX Titan)

If I change H, W, or the number of channels, the results do not seem to change significantly. For smaller kernels, however, the pointwise multiplication appears much slower in comparison.

Question

Can someone tell me why the pointwise multiplication appears so slow when it requires at least two orders of magnitude fewer FLOPs, or whether there might be an error in my reasoning or my code?

Benchmark implementation

(# Kernel Size = 16)

Benchmark Overview (device = cuda:1):
    Number of test iterations: 1000
    Number of warm-up iterations: 5
    Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
    Conv2d(in_ch=256, out_ch=64, kernel_size=16): Conv2d([1, 256, 32, 60])
    FLOP Estimation:
        Conv2d:      8053.06368 MFlops
        Pointwise:   31.45728 MFlops

Benchmark Results (device = cuda:1)
    Pointwise:   0.698 +/- 0.031 ms
    Conv2d:      2.916 +/- 0.161 ms

------------------------------------

(# Kernel size = 3)

Benchmark Overview (device = cuda:1):
    Number of test iterations: 100
    Number of warm-up iterations: 5
    Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
    Conv2d(in_ch=256, out_ch=64, kernel_size=3): Conv2d([1, 256, 32, 60])
    FLOP Estimation:
        Conv2d:      283.11552 MFlops
        Pointwise:   31.45728 MFlops
        FreqConv:    62.91456 MFlops

Benchmark Results (device = cuda:1)
    Pointwise:   0.681 +/- 0.011 ms
    Conv2d:      0.126 +/- 0.034 ms


import torch
import numpy as np
from torch import nn
from time import time

torch.set_num_threads(1)

in_ch = 256
out_ch = 64
height = 32
width = 60
kernel_size = 16

warmup = 5
iters = 100

flops_pointwise = (out_ch * in_ch * height * width)
m_flops_conv = (flops_pointwise * kernel_size ** 2) / 1e6
m_flops_pw = (flops_pointwise) / 1e6
# Estimate for the frequency-domain conv; defined so the FreqConv line below prints
# (matches the 62.91 MFlops reported in the kernel_size=3 run, i.e. 2x the pointwise count)
m_flops_freq_conv = 2 * m_flops_pw

# Device to run benchmark on, e.g. 'cpu' or 'cuda:X'
device = 'cpu'

print(f'Benchmark Overview (device = {device}):')
print(f'\tNumber of test iterations: {iters}')
print(f'\tNumber of warm-up iterations: {warmup}')
print(f'\tPointwise: [1, {in_ch}, {height}, {width}] * [{out_ch}, {in_ch}, {height}, {width}]') 
print(f'\tConv2d(in_ch={in_ch}, out_ch={out_ch}, kernel_size={kernel_size}): Conv2d([1, {in_ch}, {height}, {width}])')

print('\tFLOP Estimation:')
print(f'\t\tConv2d:\t\t {m_flops_conv} MFlops')
print(f'\t\tPointwise:\t {m_flops_pw} MFlops')
print(f'\t\tFreqConv:\t {m_flops_freq_conv} MFlops')

print()

def benchmark(input_gen, operation, warmup=5, iters=1000):
    duration = []
    for i in range(iters + warmup):

        input = input_gen()

        start = time() # start timer
        with torch.no_grad():
            operation(input)

        # Sync if using cuda
        if device[:4] == 'cuda':
            torch.cuda.synchronize(device)
        end = time() # end timer

        if i < warmup:
            continue

        duration.append((end - start) * 1e3) # ms

    return np.array(duration)


def pointwise(input):
    x, y = input
    x * y

# Helper methods to generate new data
# for every iteration inside of the benchmark method

def _gen_pw_input(in_ch, out_ch, height, width):
    x = torch.rand(1, in_ch, height, width).to(device)
    k = torch.randn(out_ch, in_ch, height, width).to(device)
    return x, k

gen_pw_input = lambda : _gen_pw_input(in_ch, out_ch, height, width)

def _gen_conv_input(in_ch, out_ch, height, width):
    x = torch.rand(1, in_ch, height, width).to(device)
    return x

gen_conv_input = lambda : _gen_conv_input(in_ch, out_ch, height, width)



conv2d = nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size).to(device)

pw_res = benchmark(gen_pw_input, pointwise, warmup=warmup, iters=iters)
conv_res = benchmark(gen_conv_input, conv2d, warmup=warmup, iters=iters)

print(f'Benchmark Results (device = {device})')
print('\tPointwise:\t {:.3f} +/- {:.3f} ms'.format(pw_res.mean(), pw_res.std()))
print('\tConv2d:\t\t {:.3f} +/- {:.3f} ms'.format(conv_res.mean(), conv_res.std()))
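As a cross-check of the hand-rolled timing loop above, the same pointwise measurement can also be taken with torch.utils.benchmark, which handles warm-up and CUDA synchronization for you. This is a minimal sketch under the same tensor shapes, not part of the original benchmark:

import torch
import torch.utils.benchmark as benchmark

dev = 'cuda:0' if torch.cuda.is_available() else 'cpu'
x = torch.rand(1, 256, 32, 60, device=dev)
k = torch.randn(64, 256, 32, 60, device=dev)

timer = benchmark.Timer(
    stmt='x * k',                 # broadcasted element-wise multiplication
    globals={'x': x, 'k': k},
    num_threads=1,
)
print(timer.timeit(100))          # reports the measured time per run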
Eigen

I also implemented a basic benchmark of the element-wise multiplication in Eigen (C++) for comparison; the results were similar to (slightly slower than) what I observed in PyTorch, so the BLAS backend that PyTorch uses looks well optimized.