Python: why doesn't runtime scale with FLOPs? Pointwise multiplication vs. 2D convolution
Background
Based on the convolution theorem of the Fourier transform, convolution in the spatial domain is equivalent to pointwise multiplication in the Fourier domain (and vice versa). I implemented a version of torch.nn.Conv2d that "operates" in the Fourier domain by performing pointwise multiplication in PyTorch instead of convolution, using kernels transformed to the size of the input (as described here: )
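The equivalence stated above can be checked numerically. The following is a minimal, dependency-free sketch in 1D (the question concerns 2D tensors, where one would normally use an FFT library instead); the helper names `dft`, `circular_conv`, and `fourier_conv` are hypothetical and chosen only for this illustration. It assumes circular convolution with the kernel zero-padded to the signal length, which mirrors "kernels transformed to the size of the input".

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def circular_conv(x, h):
    """Direct circular convolution: y[i] = sum_j x[j] * h[(i - j) mod n]."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]

def fourier_conv(x, h):
    """Same result via the convolution theorem: IDFT(DFT(x) * DFT(h))."""
    X, H = dft(x), dft(h)
    product = [a * b for a, b in zip(X, H)]  # pointwise multiplication
    return [v.real for v in dft(product, inverse=True)]

signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.0, 0.0]  # kernel zero-padded to the signal length

direct = circular_conv(signal, kernel)
via_fourier = fourier_conv(signal, kernel)

# Both routes agree up to floating-point rounding.
assert all(abs(a - b) < 1e-9 for a, b in zip(direct, via_fourier))
```

In the Fourier route, the only per-element work between the two transforms is the pointwise product, which is the step the benchmark in this question isolates.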
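Expectation and results
I found that it performs poorly, similar to:
After many benchmarks, the pointwise multiplication appears to be the main bottleneck of the operation. During benchmarking I excluded the FFT step in order to isolate the layer's own operation (and used saved kernels of the appropriate size).
This is confusing when I consider the number of FLOPs required for a 2D convolution (stride = 1) and for an element-wise multiplication: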
- Conv2d FLOPs: Kernel_H * Kernel_W * C_in * C_out * H * W
- Pointwise FLOPs: C_in * C_out * H * W
With H = 32, W = 60, C_in = 64, C_out = 256:
- Conv2d FLOPs (k = 16): 16 * 16 * 32 * 60 * 64 * 256 = 8053 MFLOPs
- Pointwise FLOPs: 64 * 256 * 32 * 60 = 31.46 MFLOPs
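The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the estimate as stated in the question: it counts multiplications, ignores additions, and assumes the output keeps the spatial size H x W. The function names are hypothetical.

```python
def conv2d_flops(kernel_h, kernel_w, c_in, c_out, h, w):
    """Multiplications for a stride-1 Conv2d: one per kernel tap per output value."""
    return kernel_h * kernel_w * c_in * c_out * h * w

def pointwise_flops(c_in, c_out, h, w):
    """Multiplications for the element-wise product, one per element."""
    return c_in * c_out * h * w

H, W, C_IN, C_OUT = 32, 60, 64, 256

conv_k16 = conv2d_flops(16, 16, C_IN, C_OUT, H, W)
conv_k3 = conv2d_flops(3, 3, C_IN, C_OUT, H, W)
pw = pointwise_flops(C_IN, C_OUT, H, W)

print(f"Conv2d (k=16): {conv_k16 / 1e6:.2f} MFLOPs")
print(f"Conv2d (k=3):  {conv_k3 / 1e6:.2f} MFLOPs")
print(f"Pointwise:     {pw / 1e6:.2f} MFLOPs")
print(f"Conv2d (k=16) / pointwise: {conv_k16 // pw}x")
```

Under this estimate the ratio between the two operations is exactly Kernel_H * Kernel_W, so 256x for k = 16 and 9x for k = 3.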
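I therefore benchmarked the pointwise multiplication of torch.Tensor against torch.nn.Conv2d, because the element-wise multiplication seemed to perform comparably to, or even slower than, the 2D convolution.
Below is an overview of two such benchmark results on CPU and GPU (i9900k, with torch.set_num_threads(1)).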
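Results - CPU (i9900k)
Results - GPU (RTX Titan)
If I change H, W, or the number of channels, the results do not seem to change significantly. For smaller kernels, however, the pointwise multiplication comes out much slower than the convolution.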
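Question
Can anyone tell me why the pointwise multiplication appears so slow when the convolution needs at least 2 orders of magnitude more FLOPs, or whether there is an error in my thinking or in my code?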
GPU benchmark output (device = cuda:1):

(# Kernel Size = 16)
Benchmark Overview (device = cuda:1):
    Number of test iterations: 1000
    Number of warm-up iterations: 5
    Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
    Conv2d(in_ch=256, out_ch=64, kernel_size=16): Conv2d([1, 256, 32, 60)
    FLOP Estimation:
        Conv2d:     8053.06368 MFlops
        Pointwise:  31.45728 MFlops
Benchmark Results (device = cuda:1)
    Pointwise:  0.698 +/- 0.031 ms
    Conv2d:     2.916 +/- 0.161 ms
------------------------------------
(# Kernel size = 3)
Benchmark Overview (device = cuda:1):
    Number of test iterations: 100
    Number of warm-up iterations: 5
    Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
    Conv2d(in_ch=256, out_ch=64, kernel_size=3): Conv2d([1, 256, 32, 60)
    FLOP Estimation:
        Conv2d:     283.11552 MFlops
        Pointwise:  31.45728 MFlops
        FreqConv:   62.91456 MFlops
Benchmark Results (device = cuda:1)
    Pointwise:  0.681 +/- 0.011 ms
    Conv2d:     0.126 +/- 0.034 ms

Benchmark implementation
import torch
import numpy as np
from torch import nn
from time import time

torch.set_num_threads(1)

in_ch = 256
out_ch = 64
height = 32
width = 60
kernel_size = 16
warmup = 5
iters = 100

# FLOP estimates (multiplications only)
flops_pointwise = out_ch * in_ch * height * width
m_flops_conv = (flops_pointwise * kernel_size ** 2) / 1e6
m_flops_pw = flops_pointwise / 1e6
# Not defined in the original listing (it raised a NameError below).
# Assumption: 2x the pointwise estimate, inferred from the "FreqConv"
# value shown in the reported output.
m_flops_freq_conv = 2 * m_flops_pw

# Device to run benchmark on, e.g. 'cpu' or 'cuda:X'
device = 'cpu'

print(f'Benchmark Overview (device = {device}):')
print(f'\tNumber of test iterations: {iters}')
print(f'\tNumber of warm-up iterations: {warmup}')
print(f'\tPointwise: [1, {in_ch}, {height}, {width}] * [{out_ch}, {in_ch}, {height}, {width}]')
print(f'\tConv2d(in_ch={in_ch}, out_ch={out_ch}, kernel_size={kernel_size}): Conv2d([1, {in_ch}, {height}, {width})')
print('\tFLOP Estimation:')
print(f'\t\tConv2d:\t\t {m_flops_conv} MFlops')
print(f'\t\tPointwise:\t {m_flops_pw} MFlops')
print(f'\t\tFreqConv:\t {m_flops_freq_conv} MFlops')
print()


def benchmark(input_gen, operation, warmup=5, iters=1000):
    """Time `operation` on freshly generated inputs; returns durations in ms."""
    duration = []
    for i in range(iters + warmup):
        inputs = input_gen()  # input generation is not timed

        start = time()  # start timer
        with torch.no_grad():
            operation(inputs)
            # CUDA kernels launch asynchronously, so wait for completion
            if device.startswith('cuda'):
                torch.cuda.synchronize(device)
        end = time()  # end timer

        if i < warmup:
            continue  # discard warm-up iterations
        duration.append((end - start) * 1e3)  # ms
    return np.array(duration)


def pointwise(inputs):
    # Broadcast multiply: [1, C_in, H, W] * [C_out, C_in, H, W]
    x, y = inputs
    x * y


# Helper methods to generate new data
# for every iteration inside of the benchmark method
def _gen_pw_input(in_ch, out_ch, height, width):
    x = torch.rand(1, in_ch, height, width).to(device)
    k = torch.randn(out_ch, in_ch, height, width).to(device)
    return x, k


gen_pw_input = lambda: _gen_pw_input(in_ch, out_ch, height, width)


def _gen_conv_input(in_ch, out_ch, height, width):
    x = torch.rand(1, in_ch, height, width).to(device)
    return x


gen_conv_input = lambda: _gen_conv_input(in_ch, out_ch, height, width)

conv2d = nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size).to(device)

pw_res = benchmark(gen_pw_input, pointwise, warmup=warmup, iters=iters)
conv_res = benchmark(gen_conv_input, conv2d, warmup=warmup, iters=iters)

print(f'Benchmark Results (device = {device})')
print('\tPointwise:\t {:.3f} +/- {:.3f} ms'.format(pw_res.mean(), pw_res.std()))
print('\tConv2d:\t\t {:.3f} +/- {:.3f} ms'.format(conv_res.mean(), conv_res.std()))
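To put the reported GPU timings next to the FLOP estimates, the sketch below converts them into achieved throughput. It also tallies a quantity the FLOP estimate leaves out: how many bytes the pointwise product has to read and write. The byte count rests on two assumptions that are not stated in the question: tensors are float32 (the default dtype of torch.rand), and each operand is read once and the output written once.

```python
BYTES_PER_FLOAT32 = 4

# (estimated MFLOPs, mean runtime in ms), copied from the reported GPU output.
reported = {
    "Pointwise (k=16 run)": (31.45728, 0.698),
    "Conv2d (k=16)": (8053.06368, 2.916),
    "Pointwise (k=3 run)": (31.45728, 0.681),
    "Conv2d (k=3)": (283.11552, 0.126),
}

def achieved_gflops(mflops, runtime_ms):
    """1 MFLOP per millisecond equals 1 GFLOP per second."""
    return mflops / runtime_ms

throughput = {name: achieved_gflops(*v) for name, v in reported.items()}
for name, gflops in throughput.items():
    print(f"{name:22s} {gflops:9.1f} GFLOP/s")

# Memory touched by x * y for x: [1, 256, 32, 60] and y: [64, 256, 32, 60].
x_elems = 1 * 256 * 32 * 60
y_elems = 64 * 256 * 32 * 60
out_elems = y_elems  # broadcasting yields an output shaped like y
pointwise_bytes = (x_elems + y_elems + out_elems) * BYTES_PER_FLOAT32
pointwise_gb_per_s = pointwise_bytes / 0.698e-3 / 1e9

print(f"Pointwise moves about {pointwise_bytes / 1e6:.1f} MB, "
      f"roughly {pointwise_gb_per_s:.0f} GB/s at 0.698 ms")
```

Under these assumptions the pointwise product performs one multiplication for roughly every 8 bytes it moves, so its runtime may be governed more by memory traffic than by arithmetic.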
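I also implemented a basic benchmark in Eigen (C++) to compare the element-wise multiplication, and it performed similarly to (slightly slower than) what I observed in PyTorch; the BLAS backend used by PyTorch appears to be well optimized.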