Deep learning: testing a deep Q-network after training; training and test results do not match
Tags: deep-learning, reinforcement-learning, openai-gym

I am playing Atari Breakout. Some recent training results:
running reward: 10.19 at episode 19285, frame count 1900000
running reward: 9.95 at episode 19320, frame count 1910000
running reward: 9.12 at episode 19359, frame count 1920000
running reward: 8.89 at episode 19396, frame count 1930000
running reward: 8.26 at episode 19434, frame count 1940000
running reward: 8.71 at episode 19468, frame count 1950000
running reward: 8.04 at episode 19508, frame count 1960000
running reward: 8.17 at episode 19545, frame count 1970000
running reward: 8.10 at episode 19582, frame count 1980000
running reward: 8.66 at episode 19618, frame count 1990000
running reward: 8.42 at episode 19662, frame count 2000000
Solved at episode 19663!
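For context on the numbers above: this training log resembles the Keras DQN Breakout example, where "running reward" is typically the mean reward over the last 100 training episodes, collected while the agent still explores with epsilon-greedy actions. A minimal sketch of that computation, assuming a 100-episode window (an assumption, since the training script is not shown):

```python
from collections import deque

# Hypothetical sketch: keep the last 100 episode rewards and average them.
episode_reward_history = deque(maxlen=100)

def update_running_reward(episode_reward):
    """Append one episode's total reward and return the current running mean."""
    episode_reward_history.append(episode_reward)
    return sum(episode_reward_history) / len(episode_reward_history)

# e.g. after three episodes with total rewards 8, 10, and 9:
for r in [8.0, 10.0, 9.0]:
    running_reward = update_running_reward(r)
print(running_reward)  # 9.0
```

Because this average is taken over exploratory episodes, it need not match the return of a single greedy test episode, but a drop from ~8-10 to mostly 0 still points to a bug in the test loop.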
In my test, the rewards are:
Returns:[0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
What is going wrong?

Test code:
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
import numpy as np
import tensorflow as tf
from tensorflow import keras
import gym
seed = 42
model = keras.models.load_model('/content/drive/MyDrive/ai_games_assignment/model', compile=False)
env = make_atari("BreakoutNoFrameskip-v4")
env = wrap_deepmind(env, frame_stack=True, scale=True)
env.seed(seed)
env = gym.wrappers.Monitor(env, '/content/drive/MyDrive/ai_games_assignment/videosss', video_callable=lambda episode_id: True,force=True)
epsilon = 0
num_actions = env.action_space.n
n_episodes = 10
returns = []
for _ in range(n_episodes):
    ret = 0
    state = np.array(env.reset())
    done = False
    while not done:
        if epsilon > np.random.rand(1)[0]:
            action = np.random.choice(num_actions)
        else:
            # Predict action Q-values from the environment state
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            state_tensor = np.array(state_tensor)
            action_probs = model.predict(state_tensor)
            # Take the best action
            action = tf.argmax(action_probs[0]).numpy()
        # Apply the sampled action in our environment
        state_next, reward, done, _ = env.step(action)
        state_next = np.array(state_next)
        ret += reward
    returns.append(ret)
env.close()
print('Returns: {}'.format(returns))
You are not updating the state. Add the following at the end of your while loop:
state = state_next
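To illustrate why that one line matters: without it, the network is queried with the initial frame stack on every step, so the agent repeats the same action and rarely scores. A minimal sketch of the corrected evaluation loop, using a stand-in environment and Q-function (both hypothetical, since the fix is independent of Breakout and of the trained model):

```python
import numpy as np

class DummyEnv:
    """Stand-in for the wrapped Atari env: episode ends after 5 steps."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4)

    def step(self, action):
        self.t += 1
        obs = np.full(4, float(self.t))
        return obs, 1.0, self.t >= 5, {}

def greedy_q(state):
    # Stand-in for model.predict: two actions, action 0 always preferred.
    return np.array([1.0, 0.0])

env = DummyEnv()
state = np.array(env.reset())
done = False
ret = 0.0
while not done:
    action = int(np.argmax(greedy_q(state)))
    state_next, reward, done, _ = env.step(action)
    ret += reward
    state = np.array(state_next)  # the missing line: carry the new state forward
print(ret)  # 5.0
```

The same `state = state_next` (or `state = np.array(state_next)`) at the end of the while loop in the question's test code lets the greedy policy actually react to the game.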
Welcome to SO. Please take a moment to format your question properly and carefully choose the minimal amount of information to share, so that others can help you.