PyTorch: measuring the running time of a for loop on GPU and CPU

machine-learning deep-learning neural-network pytorch

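I'm really new to PyTorch. I've spent the whole day confused, trying to figure out why my network runs slower on the GPU than on the CPU. What I don't understand is that, when I measure running time with time.time(), the time for the whole loop differs a lot from the sum of the times of the individual iterations. Here is part of my code. Can anyone help me? Thank you.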

    time_out = 0
    time_in = 0

    for epoch in tqdm(range(self.n_epoch)):

        running_loss = 0
        running_error = 0
        running_acc = 0

        if self.cuda:
            torch.cuda.synchronize()                #time_out_start
        epst1 = time.time()


        for step, (batch_x, batch_y) in enumerate(self.normal_loader):

            if self.cuda:
                torch.cuda.synchronize()                        #time_in_start
            t1 = time.time()

            batch_x, batch_y = batch_x.to(self.device), batch_y.to(self.device)

            b_x = Variable(batch_x)
            b_y = Variable(batch_y)
            
            pred_y = self.model(b_x)
            #print (pred_y)
            
            loss = self.criterion(pred_y, b_y)

            error = mae(pred_y.detach().cpu().numpy(),b_y.detach().cpu().numpy())
            acc = r2(b_y.detach().cpu().numpy(),pred_y.detach().cpu().numpy())

            #print (loss)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

            running_acc += acc
            running_loss += loss.item()
            running_error += error

            if self.cuda:
                torch.cuda.synchronize()                        #time_in_end
            t6 = time.time()

            time_in += t6-t1

        if self.cuda:
            torch.cuda.synchronize()                    #time_out_end         
        eped1 = time.time()

        time_out += eped1-epst1

    print ('loop time(out)',time_out)
    print ('loop time(in)',time_in)

The results are:

CPU:
epoch 10: out: 1.283 s, in: 0.695 s
epoch 50: out: 6.43 s, in: 3.288 s
epoch 100: out: 12.646 s, in: 6.386 s

GPU:
epoch 10: out: 3.92 s, in: 1.471 s
epoch 50: out: 9.35 s, in: 3.04 s
epoch 100: out: 18.418 s, in: 5.655 s

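I know that transferring data from the CPU to the GPU takes some time, so I expected the GPU time to drop below the CPU time as the number of epochs grows. My questions are:

Why is the time I record outside the loop so different from the time I record inside the loop? Is there a step I am missing when recording the running time?

Why does the GPU take more outer time than the CPU, even though its inner time is lower than the CPU's?

The network is very simple: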

import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self, n_input, n_nodes1, n_nodes2):
        super(Model, self).__init__()

        self.n_input = n_input
        self.n_nodes1 = n_nodes1
        self.n_nodes2 = n_nodes2

        # Two hidden layers followed by a single regression output.
        self.l1 = nn.Linear(self.n_input, self.n_nodes1)
        self.l2 = nn.Linear(self.n_nodes1, self.n_nodes2)
        self.l3 = nn.Linear(self.n_nodes2, 1)

    def forward(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        h = self.l3(h2)

        return h

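The training data is loaded by the `load_train_normal` method listed at the end of this page (this is a regression problem: the input x is a descriptor and y is the target value).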

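> Why is the time I record outside the loop so different from the time I record inside the loop? Is there a step I am missing when recording the running time?

`self.normal_loader` is not just a plain dictionary, vector, or anything that simple. Iterating over it takes a considerable amount of time. The inner timer only starts after the loader has already produced the batch, so that cost is counted in the outer time but not in the inner time.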
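To make the point about iteration cost concrete, here is a minimal sketch in plain Python. It does not use PyTorch at all: `SlowLoader` is a hypothetical stand-in for a `DataLoader` whose batch fetch is expensive, and the delays are made-up values. It only shows that time spent inside the iterator is counted by the outer timer and missed by the inner one.

```python
import time


class SlowLoader:
    """Hypothetical stand-in for a DataLoader with a costly batch fetch."""

    def __init__(self, n_batches, fetch_delay):
        self.n_batches = n_batches
        self.fetch_delay = fetch_delay

    def __iter__(self):
        for i in range(self.n_batches):
            time.sleep(self.fetch_delay)  # simulated collation / worker hand-off
            yield i


def timed_epoch(loader, step_delay):
    """Return (outer time, inner time, fetch time) for one pass over loader."""
    time_in = 0.0
    time_fetch = 0.0
    start_out = time.perf_counter()
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            next(it)  # this is what `for ... in loader` does implicitly
        except StopIteration:
            break
        t1 = time.perf_counter()
        time.sleep(step_delay)  # simulated training step
        t2 = time.perf_counter()
        time_fetch += t1 - t0  # invisible to a timer started inside the loop body
        time_in += t2 - t1     # what the question's inner timer measures
    time_out = time.perf_counter() - start_out
    return time_out, time_in, time_fetch


time_out, time_in, time_fetch = timed_epoch(SlowLoader(5, 0.02), 0.01)
print(f"out={time_out:.3f}s in={time_in:.3f}s fetch={time_fetch:.3f}s")
```

Timing `next(it)` explicitly like this in the real training loop would show how much of the gap between the two numbers is spent in the loader.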

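> Why does the GPU take more outer time than the CPU, even though its inner time is lower than the CPU's?

`torch.cuda.synchronize()` is a heavy operation, even when it is not doing anything useful, as in this case, where `pred_y.detach().cpu()` has already forced a synchronization.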

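As for how to make it faster? Drop the `synchronize()` calls; they do you no good.

Then defer the processing of `pred_y` until later. Much later. You want to call the model at least 2 or 3 more times before triggering the first download of results. The simpler the model and the smaller the data, the more iterations you should wait.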
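One way to defer the processing of `pred_y` might look like the sketch below. The model, data, and hyperparameters are hypothetical stand-ins, not the asker's real ones. The idea is to keep the running loss and the predictions on the device during the loop and copy them to the CPU once per epoch. The example computes only the MAE at the end; with equally sized batches this equals the mean of the per-batch MAEs.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the asker's model, loss, optimizer and loader.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1)).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(100, 8), torch.randn(100, 1)) for _ in range(5)]


def train_one_epoch():
    running_loss = torch.zeros((), device=device)  # accumulated on the device
    preds, targets = [], []
    for batch_x, batch_y in batches:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)

        pred_y = model(batch_x)
        loss = criterion(pred_y, batch_y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # No .item() or .cpu() here: nothing in the loop asks for a result
        # to be downloaded from the device.
        running_loss += loss.detach()
        preds.append(pred_y.detach())
        targets.append(batch_y)

    # One download per epoch instead of several per batch.
    all_pred = torch.cat(preds).cpu()
    all_true = torch.cat(targets).cpu()
    mae = (all_pred - all_true).abs().mean().item()
    return running_loss.item() / len(batches), mae


mean_loss, epoch_mae = train_one_epoch()
print(mean_loss, epoch_mae)
```

Holding a whole epoch of predictions on the device costs memory, so for large datasets a compromise such as downloading every few dozen batches may be more practical.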

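That is because transfers to and from the GPU don't just "take time"; they imply synchronization. Without synchronization, the execution model on the GPU mostly "lags behind": uploads to the GPU are already asynchronous behind the scenes, and the actual execution is merely queued up after them. As long as you don't synchronize, accidentally or explicitly, the workloads start to overlap, and things (upload, execution, CPU work) begin to run in parallel. Your effective execution time then approaches `max(upload, download, GPU execution, CPU execution)`.

If you synchronize, no tasks can overlap, and no batches are formed from tasks of the same type. Upload, execution, download, CPU part: it all happens sequentially. Your execution time ends up at `upload + download + GPU execution + CPU execution`, plus some extra overhead for breaking up the batches at the driver level. So it is easily 5-10 times slower than it should be.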
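The difference between the two regimes can be put into numbers with a small sketch. The per-batch stage costs below are invented for illustration, not measured, and the overlapped case is an idealized pipeline that ignores fill and drain time as well as the driver overhead mentioned above.

```python
# Hypothetical per-batch stage costs in microseconds (illustrative only).
stages = {"upload": 400, "gpu_exec": 300, "download": 200, "cpu_work": 500}


def synchronized_time(stages, n_batches):
    """Every stage waits for the previous one: the costs add up."""
    return n_batches * sum(stages.values())


def overlapped_time(stages, n_batches):
    """Idealized pipeline: the slowest stage sets the pace."""
    return n_batches * max(stages.values())


sync_us = synchronized_time(stages, 100)
overlap_us = overlapped_time(stages, 100)
print(f"synchronized: {sync_us} us, overlapped: {overlap_us} us, "
      f"ratio: {sync_us / overlap_us:.1f}x")
```

With these made-up costs the ratio is modest; the larger slowdowns described above come from the additional driver-level overhead, which this sketch does not model.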

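It would be best to give us a complete example that reproduces the problem. The running time can sometimes depend on the size of the data, which is not reflected in the code.

Hi Ext3h, thanks for the great explanation. I've figured it out. Thanks!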

def load_train_normal(self, x, y, batch_size=100):
    if batch_size:
        self.batch_size = batch_size

    # Convert the NumPy arrays to float tensors.
    self.x_train_n = Variable(torch.from_numpy(x).float())
    self.y_train_n = Variable(torch.from_numpy(y).float())

    self.dataset = Data.TensorDataset(self.x_train_n, self.y_train_n)
    self.normal_loader = Data.DataLoader(
        dataset=self.dataset,
        batch_size=self.batch_size,
        shuffle=True,
        num_workers=2,
    )