Python: parallelizing nested for loops in PyTorch

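I am working with PyTorch-based code. Part of my code has four nested for loops. Essentially, it extracts patches from an image and estimates the similarity between pairs of patches. Because the arguments are torch tensors, Python libraries such as joblib do not work. I am new to PyCUDA and would appreciate help parallelizing this code. At the moment the computation is slow and takes more than 1.5 seconds. Here is the relevant part of my code: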

import torch
import numpy as np
import cv2
import time 

def mul_val(a,b,ax=None):
    return torch.mean(((b-a)/(b+0.01))**2)

def np_to_torch(img_np):
    return torch.from_numpy(img_np)[None,None, :]


def main_fn(a,alpha = 0.25):
    start = time.time()
    final_u = torch.zeros(w,h)
    frame1 = a
    for y1 in range(w):
        i = y1*grid_size
        for x1 in range(h):
            
            j = x1*grid_size  ## batch, channel, width, height 
            block1 = frame1[:,:,i:i+grid_size, j:j+grid_size]
            corr_list = []
            for y2 in range(y1-radius, y1+radius+1):
                # y2 is a block-row index like y1, so it is bounded by w (not h)
                if not (0 <= y2 < w):
                    continue
                i2 = y2*grid_size
                for x2 in range(x1-radius, x1+radius+1):
                    # x2 is a block-column index like x1, so it is bounded by h (not w)
                    if not (0 <= x2 < h):
                        continue
                    j2 = x2*grid_size
                    block2 = frame1[:,:,i2:i2+grid_size, j2:j2+grid_size]
                    if not (block1.shape == block2.shape):
                        continue
                    corr = mul_val(block1, block2)
                    corr_list.append(corr)
                        
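            # Keep the 20 largest scores. The sum is always divided by 20,
            # even for border blocks that have fewer than 20 neighbours.
            corr_list=sorted(corr_list, reverse = True)
            del corr_list[20:]
            uncorr = 1.0 - (sum(corr_list)/20.0)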
            final_u[y1,x1] = torch.mul(a[:,:,i,j],uncorr)

    new_val = torch.norm(torch.sub(1.0,final_u))
    final_value = alpha*new_val
    print("Time taken is ..", time.time()-start)
    return final_value
 

grid_size = 9  
radius = 3  
im1 = cv2.imread("1.png", 0)
img = np.asarray(im1)
img = img.astype(np.float64) 
img_torch = np_to_torch(img)   ### made this as a torch element intentionally.
frame_width = img.shape[0]
frame_height = img.shape[1]
h = int(frame_height//grid_size)
w = int(frame_width//grid_size)
main_fn(img_torch)
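
One possible way to remove the two outer loops is to stay inside PyTorch and vectorize over all blocks at once, rather than reaching for joblib or PyCUDA. The sketch below is not a definitive implementation: the function name `patch_dissimilarity` and its parameters are hypothetical. It assumes a non-negative image of shape (1, 1, rows, cols), assumes that neighbour row indices are bounded by the number of block rows and column indices by the number of block columns, and it keeps the input dtype instead of writing into a float32 buffer. It cuts the image into non-overlapping blocks with `torch.nn.functional.unfold`, then loops only over the (2*radius+1)^2 neighbour offsets and scores every block against its shifted neighbour in one tensor operation.

```python
import torch
import torch.nn.functional as F


def patch_dissimilarity(img, grid_size=9, radius=3, top_k=20, alpha=0.25):
    """img: float tensor of shape (1, 1, rows, cols) with non-negative values."""
    rows, cols = img.shape[-2:]
    n_y, n_x = rows // grid_size, cols // grid_size
    # Drop the partial blocks at the bottom and right edges.
    img = img[:, :, :n_y * grid_size, :n_x * grid_size]

    # Non-overlapping blocks: (1, g*g, n_y*n_x) -> (n_y, n_x, g*g).
    blocks = F.unfold(img, kernel_size=grid_size, stride=grid_size)
    blocks = blocks.transpose(1, 2).reshape(n_y, n_x, grid_size * grid_size)

    ys = torch.arange(n_y, device=img.device).view(-1, 1)
    xs = torch.arange(n_x, device=img.device).view(1, -1)

    scores = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # neighbour[y, x] == blocks[y + dy, x + dx], wrapping at the border.
            neighbour = torch.roll(blocks, shifts=(-dy, -dx), dims=(0, 1))
            score = (((neighbour - blocks) / (neighbour + 0.01)) ** 2).mean(dim=-1)
            # Mask out neighbours that wrapped around the image border.
            valid = ((ys + dy >= 0) & (ys + dy < n_y)
                     & (xs + dx >= 0) & (xs + dx < n_x))
            # Scores are >= 0, so a 0 here adds nothing to the top-k sum,
            # which matches skipping the neighbour in the loop version.
            scores.append(torch.where(valid, score, torch.zeros_like(score)))

    scores = torch.stack(scores, dim=-1)            # (n_y, n_x, (2*radius+1)**2)
    k = min(top_k, scores.shape[-1])
    top_sum = scores.topk(k, dim=-1).values.sum(dim=-1)
    uncorr = 1.0 - top_sum / top_k                  # always divides by top_k
    corner = img[0, 0, ::grid_size, ::grid_size]    # top-left pixel of each block
    final_u = corner * uncorr
    return alpha * torch.norm(1.0 - final_u)
```

Calling `patch_dissimilarity(img_torch)` would take the place of `main_fn(img_torch)`. Because only tensor operations are involved, moving the input to a GPU with `img_torch.cuda()` should also work when CUDA is available, though any speed-up there has not been measured.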