Python PyTorch's DataParallel doesn't split the data memory, it copies it

I am running a batched LSTM model (batch size of 32) in PyTorch on a machine with 8x V100 GPUs, each with 16 GB of memory.

I have an LSTM model with the following structure:

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F

# Class containing the LSTM model initialization and feed-forward logic
class LSTMClassifier(nn.Module):
    # LSTM initialization
    def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size, static_size):
        super(LSTMClassifier, self).__init__()

        # Setting the hidden layer dimension of the LSTM
        self.hidden_dim = hidden_dim
        # Initializing the embedding layer
        self.embeddings = nn.Embedding(vocab_size, embedding_dim-2)
        # Initializing the LSTM layer with one hidden layer 
        self.lstm = nn.LSTM(((embedding_dim*vocab_size)+static_size), hidden_dim, num_layers = 1, batch_first=True)
        # Initializing the linear layer that takes the hidden layer output
        self.hidden2label = nn.Linear(hidden_dim, label_size)


    # Defining the hidden state of the LSTM
    def init_hidden(self, batch_size):
        # the first is the hidden h
        # the second is the cell  c
        return [autograd.Variable(torch.zeros(batch_size, 1, self.hidden_dim).cuda()),
                autograd.Variable(torch.zeros(batch_size, 1, self.hidden_dim).cuda())]

    # Defining the feed forward logic of the LSTM. It contains:
    # 1. The embedding layer
    # 2. The LSTM layer with one hidden layer
    # 3. The softmax layer
    def forward(self, seq, freq, time, static):
        print(seq.size())

        # Grab the mini-batch length and max sequence length (pre-ordered)
        # (need to do this in the forward logic because of data parallelism and how the GPU's will split up the batch)
        sequence_length = seq.size()[1]
        batch_length = seq.size()[0]

        # reset the LSTM hidden state. 
        # Must be done before you run a new batch. Otherwise the LSTM will treat a new batch as a continuation of a sequence
        self.hidden = self.init_hidden(batch_length)

        # Permute the cell and hidden layers. This is because when using Batch_first = True on data parallel,
        # the hidden state will still expect an input of (nLayer, batch size, hidden dim), but we are feeding it (batch size, nLayer, hidden dim)
        # Thus, to fix it, we need to swap the first and second dimensions before feeding the hidden state to the LSTM
        self.hidden[0] = self.hidden[0].permute(1, 0, 2).contiguous()
        self.hidden[1] = self.hidden[1].permute(1, 0, 2).contiguous()

        # This is the pass to the embedding layer. 
        # The sequence is of dimension N and the output is N x Demb
        embeds = self.embeddings(seq)

        # Concatenate the embedding output with the time and frequency vectors
        embeds = torch.cat((embeds,freq), dim=3)
        embeds = torch.cat((embeds,time), dim=3)

        # Flatten the tensor
        x = embeds.view(batch_length, sequence_length, -1) 

        # Concatenate the static information
        x = torch.cat((x, static), dim=2)

        # Grab the list of lengths of sequences, for the purpose of packing the padded sequences
        seq_lengths = torch.LongTensor(list(map(len, seq)))

        # pack the padded sequence so that paddings are ignored
        packed_x = torch.nn.utils.rnn.pack_padded_sequence(x, seq_lengths, batch_first=True)

        # Feed to the LSTM layer
        self.lstm.flatten_parameters()
        lstm_out, self.hidden = self.lstm(packed_x, self.hidden)

        # Swap back the 1st and 2nd inputs to the hidden layer back to its original configuration
        self.hidden = list(self.hidden)
        self.hidden[0] = self.hidden[0].permute(1, 0, 2).contiguous()
        self.hidden[1] = self.hidden[1].permute(1, 0, 2).contiguous()

        # Unpack the packed padded sequence so that it is ready for prediction
        unpacked_lstm_out, input_sizes = torch.nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)

        # Feed the last layer of the LSTM into the linear layer
        y = self.hidden2label(unpacked_lstm_out[:,-1,:])

        # Produce the softmax probabilities
        log_probs = F.log_softmax(y, dim=1)

        return log_probs
I ran just a single iteration over the batched dataset (batch size of 32); a sketch of how the model is wrapped and called is shown below.
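Since the original post does not show the training loop, here is a minimal sketch of the nn.DataParallel wrapping this assumes; the tensor names seq_batch, freq_batch, time_batch, and static_batch are placeholders rather than names from the original code.

# Assumed setup (not shown in the original post): wrap the model in
# nn.DataParallel so that dim 0 of every input tensor is scattered across the GPUs.
model = LSTMClassifier(embedding_dim, hidden_dim, vocab_size, label_size, static_size)
model = nn.DataParallel(model).cuda()

# One forward pass over a single mini-batch of 32 sequences.
# With 8 GPUs, each replica receives a slice of 4 rows from each input tensor.
log_probs = model(seq_batch.cuda(), freq_batch.cuda(),
                  time_batch.cuda(), static_batch.cuda())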

This is my nvidia-smi output before the batch is fed into the model:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   48C    P0    64W / 300W |   2438MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   42C    P0    44W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   41C    P0    46W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1A.0 Off |                    0 |
| N/A   44C    P0    44W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   44C    P0    42W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   44C    P0    45W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   42C    P0    44W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   45C    P0    47W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     93165      C   ...r/anaconda3/envs/pytorch_p36/bin/python  2427MiB |
+-----------------------------------------------------------------------------+
After loading the model, I call nvidia-smi at the start of the forward pass, and the nvidia-smi output is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   50C    P0    59W / 300W |   3040MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   44C    P0    58W / 300W |   1342MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   44C    P0    62W / 300W |   1342MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1A.0 Off |                    0 |
| N/A   47C    P0    57W / 300W |   1342MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   47C    P0    57W / 300W |   1342MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   47C    P0    59W / 300W |   1342MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   44C    P0    59W / 300W |   1342MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   48C    P0    62W / 300W |   1342MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
I expected the memory usage across the 8 GPUs to be more even and smaller, since the batched data I pass in only takes up about 2 GB. Am I doing something wrong?
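To see how much of that footprint is PyTorch tensor allocations versus per-process CUDA context, a small diagnostic loop like the one below can be used; this is a sketch added for illustration, not code from the original run.

# Diagnostic sketch: report how much memory PyTorch's allocator holds on each GPU.
# memory_allocated() counts only tensors allocated by PyTorch, so it reads lower
# than nvidia-smi, which also includes the CUDA context each DataParallel replica creates.
for d in range(torch.cuda.device_count()):
    mb = torch.cuda.memory_allocated(d) / 1024 ** 2
    print("GPU {}: {:.0f} MiB allocated by PyTorch".format(d, mb))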
