Python: How can I improve the performance of my DQN?


I created a deep Q-network to play Snake. The code runs fine, except that the performance doesn't really improve over the training period. By the end, it is almost indistinguishable from an agent that takes random actions. Here is the training code:

def train(self):
        self.build_model()
        for episode in range(self.max_episodes):
            self.current_episode = episode
            env = SnakeEnv(self.screen)
            episode_reward = 0
            for timestep in range(self.max_steps):
                env.render(self.screen)
                state = env.get_state()
                action = None
                epsilon = self.current_eps
                if epsilon > random.random():
                    action = np.random.choice(env.action_space) #explore
                else:
                    values = self.policy_model.predict(env.get_state()) #exploit
                    action = np.argmax(values)
                experience = env.step(action)
                if(experience['done'] == True):
                    episode_reward += 5 * (len(env.snake.List) - 1)
                    episode_reward += experience['reward']
                    break
                episode_reward += experience['reward']
                if(len(self.memory) < self.memory_size):
                    self.memory.append(Experience(experience['state'], experience['action'], experience['reward'], experience['next_state']))
                else:
                    self.memory[self.push_count % self.memory_size] = Experience(experience['state'], experience['action'], experience['reward'], experience['next_state'])
                self.push_count += 1
                self.decay_epsilon(episode)
                if self.can_sample_memory():
                    memory_sample = self.sample_memory()
                    #q_pred = np.zeros((self.batch_size, 1))
                    #q_target = np.zeros((self.batch_size, 1))
                    #i = 0
                    for memory in memory_sample:
                        memstate = memory.state
                        action = memory.action
                        next_state = memory.next_state
                        reward = memory.reward
                        max_q = reward + self.discount_rate * self.replay_model.predict(next_state)
                        #q_pred[i] = q_value
                        #q_target[i] = max_q
                        #i += 1
                        self.policy_model.fit(memstate, max_q, epochs=1, verbose=0)
            print("Episode: ", episode, " Total Reward: ", episode_reward)
            if episode % self.target_update == 0:
                self.replay_model.set_weights(self.policy_model.get_weights())
        self.policy_model.save_weights('weights.hdf5')
        pygame.quit() 
Here is the network architecture:

model = models.Sequential()
model.add(Dense(500, activation = 'relu', kernel_initializer = 'random_uniform', bias_initializer = 'zeros', input_dim = 400))
model.add(Dense(500, activation = 'relu', kernel_initializer = 'random_uniform', bias_initializer = 'zeros'))
model.add(Dense(5, activation = 'tanh', kernel_initializer = 'random_uniform', bias_initializer = 'zeros')) #tanh for last layer because q value can be > 1
model.compile(loss='mean_squared_error', optimizer = 'adam')

For reference, the network outputs 5 values: one for each of the 4 directions the snake can move, plus an extra one for taking no action. Also, instead of feeding in a screenshot of the game like a traditional DQN, I pass in a 400-dimensional vector representing the 20 x 20 grid the game takes place on. The agent gets a reward of 1 for moving closer to the food or eating it, and a reward of -1 if it dies. How can I improve its performance?

I think the main problem is that your learning rate is too high. Try a value below 0.001; the Atari DQN used 0.00025.
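As a minimal sketch (assuming the standalone Keras API that models.Sequential() in your code suggests), the learning rate can be passed explicitly instead of using the string 'adam', which defaults to 0.001:

from keras import models
from keras.layers import Dense
from keras.optimizers import Adam  # in tf.keras: from tensorflow.keras.optimizers import Adam

# Same architecture as in the question, but compiled with an explicit, lower learning rate.
model = models.Sequential()
model.add(Dense(500, activation='relu', input_dim=400))
model.add(Dense(500, activation='relu'))
model.add(Dense(5, activation='tanh'))
model.compile(loss='mean_squared_error',
              optimizer=Adam(lr=0.00025))  # Atari DQN value; tf.keras uses learning_rate=0.00025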

Also set target_update higher than 10, for example 500 or more.

To see anything meaningful, the number of steps should be at least 10,000.

Lower the batch size to 32 or 64. Taken together, these settings could look like the sketch below.
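A rough sketch of those settings, using the attribute names that already appear in your train() loop (the class skeleton is hypothetical, the values are just the examples given above):

class DQNAgent:
    def __init__(self):
        # Hyperparameters reflecting the suggestions above (example values).
        self.target_update = 500   # sync the target network every 500 episodes instead of every 10
        self.max_steps = 10000     # allow at least 10,000 steps so learning has a chance to show
        self.batch_size = 32       # sample 32 (or 64) transitions per replay batch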

Have you considered implementing some of the other improvements, like PER (prioritized experience replay) or Dueling DQN?
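Dueling DQN, for example, only changes the network head: the Q-values are decomposed into a state value and per-action advantages. A hedged sketch with the same input/output sizes as your model, written with the Keras functional API (the layer arrangement here is mine, not taken from your code):

from keras import Model
from keras import backend as K
from keras.layers import Dense, Input, Lambda

inputs = Input(shape=(400,))
x = Dense(500, activation='relu')(inputs)
x = Dense(500, activation='relu')(x)

# Separate state-value and advantage streams, recombined as
# Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
value = Dense(1)(x)
advantage = Dense(5)(x)
q_values = Lambda(lambda va: va[0] + (va[1] - K.mean(va[1], axis=1, keepdims=True)))([value, advantage])

model = Model(inputs=inputs, outputs=q_values)
model.compile(loss='mean_squared_error', optimizer='adam')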

Maybe you don't want to reinvent the wheel; consider building on an existing implementation. Finally, you can look at similar projects for reference.
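The links from the original answer did not survive, but as one example of the "don't reinvent the wheel" route, a maintained library such as stable-baselines3 ships a ready-made DQN. The sketch below assumes the Snake game has been wrapped as a Gym environment (SnakeGymEnv is a hypothetical wrapper, not shown):

from stable_baselines3 import DQN

# SnakeGymEnv is a hypothetical gym.Env wrapper around the question's SnakeEnv
# (observation: the 400-dim grid vector, action space: Discrete(5)); it is not shown here.
env = SnakeGymEnv()

model = DQN(
    "MlpPolicy",
    env,
    learning_rate=0.00025,        # the Atari value suggested above
    buffer_size=100_000,
    batch_size=32,
    target_update_interval=500,
    verbose=1,
)
model.learn(total_timesteps=100_000)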
