Keras DQN agent does not converge


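I am new to deep learning and I am trying to build a DQN agent for a custom environment I created. The state is a combination of the agent's position and a 2D matrix of 0s and 1s that tells the agent whether each user is requesting a service. The state matrix has 3 rows (because we have 3 users) and 3 columns (because we have 3 types of users).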

For example:

(
 (0, 1, 0),
 (0, 0, 0),
 (1, 0, 0)
)

This means that user0 and user2 (row[0] and row[2]) are requesting a service, while user1 is not. (Each row can contain at most one column with the value 1.)
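To make the state encoding concrete, here is a minimal sketch that reads the example matrix above with NumPy. The helper names `requesting_users` and `is_valid_state` are hypothetical and are not part of the environment described in the question.

```python
import numpy as np

# Example request matrix from the question: rows are users, columns are user types.
state_matrix = np.array([
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
])

def requesting_users(matrix):
    """Return the indices of the rows (users) that contain a 1."""
    return [int(i) for i in np.flatnonzero(np.asarray(matrix).any(axis=1))]

def is_valid_state(matrix):
    """Check that every entry is 0 or 1 and that each row has at most one 1."""
    matrix = np.asarray(matrix)
    binary = np.isin(matrix, (0, 1)).all()
    return bool(binary and (matrix.sum(axis=1) <= 1).all())

print(requesting_users(state_matrix))  # [0, 2]
print(is_valid_state(state_matrix))    # True
```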

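The action is a combination of 3 things, but for now I am focusing on the second element of the action. It is a 2D matrix of 0s and 1s (or rather "O" for 1 and "Z" for 0) with 4 rows (equal to the number of resources the agent has available to give to users) and 3 columns (equal to the number of users).

Example: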

(
 ('O', 'Z', 'Z'),
 ('Z', 'Z', 'Z'),
 ('Z', 'Z', 'Z'),
 ('Z', 'Z', 'O')
)

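This means that user0 and user2 were given resource0 and resource3, respectively.

Each resource can be assigned to one user, and each user can be assigned one resource per time step.

The environment's step() method returns a reward when a resource is assigned to a user who is requesting a service, and a penalty when a resource is assigned to a user who is not requesting a service (meaning that user's row in the state matrix looks like (0, 0, 0)).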
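The reward rule described above could be sketched roughly as follows. This is only an illustration: the function name `allocation_reward` is hypothetical, and the values `1.0` and `-1.0` are placeholders, since the question does not give the actual reward and penalty used by step().

```python
def allocation_reward(state_matrix, action_matrix, reward=1.0, penalty=-1.0):
    """Add `reward` for each resource given to a requesting user and
    `penalty` for each resource given to a user whose row is all zeros.
    The default values are placeholders, not taken from the question."""
    total = 0.0
    for resource_row in action_matrix:          # one row per resource
        for user, cell in enumerate(resource_row):
            if cell == 'O':                     # resource assigned to this user
                requesting = any(state_matrix[user])
                total += reward if requesting else penalty
    return total

state = ((0, 1, 0), (0, 0, 0), (1, 0, 0))
action = (
    ('O', 'Z', 'Z'),
    ('Z', 'Z', 'Z'),
    ('Z', 'Z', 'Z'),
    ('Z', 'Z', 'O'),
)
print(allocation_reward(state, action))  # 2.0
```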

I decided to create a convolutional neural network whose input is the 2D matrix with 3 rows and 3 columns.

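The idea is that the CNN scans each row and, using the MaxPooling layer, extracts a 1 if the row contains one and a 0 otherwise. Then, for each row of the input, it picks one of 2 output values (as in a classification example): the first value means that no resource will be assigned to the user, and the second value means that a resource will be assigned to the user.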
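The intended effect of the pooling step can be illustrated on the raw matrix with NumPy. This is only a sketch of the idea: in the actual network the pooling is applied to the convolution's feature maps, not directly to the input, and the helper name `row_max_pool` is hypothetical.

```python
import numpy as np

def row_max_pool(matrix):
    """Take the maximum over each row, i.e. pool with a (1, width) window."""
    return np.asarray(matrix).max(axis=1)

state_matrix = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
print(row_max_pool(state_matrix))  # [1 0 1]
```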

The output is then an array of 0s and 1s, and I have a method that selects a resource for each value in the array that equals 1.
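The question does not show how a resource is chosen for each selected user. As one possible sketch, the function below gives the lowest-numbered free resource to each user marked with 1. Both the name `build_allocation` and this selection policy are assumptions made for illustration.

```python
def build_allocation(decisions, n_resources=4):
    """Turn per-user 0/1 decisions into an n_resources x n_users matrix
    of 'O'/'Z'. Assumed policy: each selected user receives the
    lowest-numbered resource that is still free."""
    n_users = len(decisions)
    grid = [['Z'] * n_users for _ in range(n_resources)]
    next_resource = 0
    for user, selected in enumerate(decisions):
        if selected == 1 and next_resource < n_resources:
            grid[next_resource][user] = 'O'
            next_resource += 1
    return tuple(tuple(row) for row in grid)

for row in build_allocation([1, 0, 1]):
    print(row)
# ('O', 'Z', 'Z')
# ('Z', 'Z', 'O')
# ('Z', 'Z', 'Z')
# ('Z', 'Z', 'Z')
```

Because every resource and every user is used at most once, the result respects the constraint that each resource goes to one user and each user gets one resource per time step.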

Here is the code that creates the CNN:

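# Imports assumed by this snippet (tf.keras; the original post does not show them)
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Dense
from tensorflow.keras.models import Model

def create_cnn(self, width, height, depth=1, regress=False): # for the matrix
        # initialize the input shape and channel dimension
        inputShape = (height, width, depth)
        output_nodes = 2

        # define the model input
        input_layer = Input(shape=inputShape)

        # number of filters equals height; "same" padding keeps the (height, width) grid
        conv1 = Conv2D(height, kernel_size=(3, 3), padding="same", activation="relu", data_format='channels_last')(input_layer)

        # pool_size=(1, 3) pools across the 3 columns, leaving one position per row
        pool1 = MaxPooling2D(pool_size=(1, 3), strides=None, padding="valid")(conv1)

        # Dense acts on the last axis, so the output has shape (height, 1, 2)
        output_layer1 = Dense(output_nodes, activation="softmax")(pool1)

        model = Model(inputs=input_layer, outputs=output_layer1)

        # return the CNN
        return model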

Here is the method that builds the model:

def _build_model(self):
        # create the CNN models
        cnn = self.create_cnn(3, len(env.UEs))
        
        opt = Adam(lr=self.learning_rate, decay=self.epsilon_decay)
        
        cnn.compile(loss="mean_absolute_percentage_error", optimizer=opt)
        
        return cnn

Here is the replay method:

# Imports assumed by this snippet (the original post does not show them)
from collections.abc import Sequence
from random import sample
import numpy as np

def replay(self, batch_size, change_eps): # method that trains NN with experiences sampled from memory
        minibatch = sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            actions = sorted(env.getActionsAsDict(action[1]).items()) # returns list of tuples
            print("Actions in replay", actions)
            target = reward

            if not done: # if not done, then predict future discounted reward
                sub_next_state = self.addDims(next_state[1])
                predictions = self.model.predict(np.array(sub_next_state))[0]
                predictions_array = self.getPredictionsMaxArray(predictions)

                # reward is passed explicitly; it is not visible inside getTargetValues otherwise
                target = self.getTargetValues(reward, predictions_array)

            sub_state = self.addDims(state[1])
            target_f = self.model.predict(np.array(sub_state))

            # target is a list (one value per row) when not done, a scalar otherwise
            if isinstance(target, Sequence):
                target_f[0] = self.populateTargetFArrayVal(actions, target_f[0], target)
            else:
                target_f[0] = self.populateTargetFOneVal(actions, target_f[0], target)

            self.model.fit(np.array(sub_state), target_f, epochs=1, verbose=0)

        if change_eps:
            self.epsilon *= self.epsilon_decay


def populateTargetFOneVal(self, actions, target_f, target):
        for action in actions: # list of tuples
            target_f[action[0]][0][action[1]] = target

        return target_f

def populateTargetFArrayVal(self, actions, target_f, target):
        for action in actions: # list of tuples
            target_f[action[0]][0][action[1]] = target[action[0]]

        return target_f

def addDims(self, matrix):
        sub_state = env.copyMatrix(matrix) # state[1]
        sub_state = np.expand_dims(sub_state, axis=0) # add the batch dimension
        sub_state = np.expand_dims(sub_state, axis=3) # add the channel dimension

        return sub_state

def getTargetValues(self, reward, predictions_array):
        # one discounted target per row: reward + gamma * max Q(next_state)
        target = []
        for elt in predictions_array:
            target.append(reward + self.gamma * elt)

        return target


def getPredictionsMaxArray(self, predictions):
        # largest predicted value in each row of the network output
        predictions_array = []
        for row in predictions:
            predictions_array.append(np.amax(row))

        return predictions_array
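For reference, the target computation performed by getPredictionsMaxArray and getTargetValues can be written as a single vectorized step. The sketch below assumes the next-state predictions have been reshaped to (number of users, 2); the function name `td_targets` and the sample numbers are made up for illustration.

```python
import numpy as np

def td_targets(reward, next_q, gamma, done):
    """Per-row Q-learning target: reward + gamma * max_a Q(next_state, a).
    For a terminal transition the target is just the reward."""
    next_q = np.asarray(next_q, dtype=float)
    if done:
        return np.full(next_q.shape[0], float(reward))
    return reward + gamma * next_q.max(axis=1)

# Made-up predictions for 3 users with 2 values each.
next_q = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
print(td_targets(1.0, next_q, gamma=0.5, done=False))  # approximately [1.4, 1.25, 1.45]
print(td_targets(1.0, next_q, gamma=0.5, done=True))   # all entries equal to the reward
```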

Here is the CNN summary:

Here is the CNN architecture:

Here is the output (reward per episode):