Deep learning: testing a deep Q-network after training; training and test results do not match
Tags: deep-learning, reinforcement-learning, openai-gym

I am playing Atari Breakout. Some recent training results:
running reward: 10.19 at episode 19285, frame count 1900000
running reward: 9.95 at episode 19320, frame count 1910000
running reward: 9.12 at episode 19359, frame count 1920000
running reward: 8.89 at episode 19396, frame count 1930000
running reward: 8.26 at episode 19434, frame count 1940000
running reward: 8.71 at episode 19468, frame count 1950000
running reward: 8.04 at episode 19508, frame count 1960000
running reward: 8.17 at episode 19545, frame count 1970000
running reward: 8.10 at episode 19582, frame count 1980000
running reward: 8.66 at episode 19618, frame count 1990000
running reward: 8.42 at episode 19662, frame count 2000000
Solved at episode 19663!
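For context on the numbers above: this training log resembles the Keras DQN Breakout example, where "running reward" is typically the mean reward over the last 100 training episodes, collected while the agent still explores with epsilon-greedy actions. A minimal sketch of that computation, assuming a 100-episode window (an assumption, since the training script is not shown):

```python
from collections import deque

# Hypothetical sketch: keep the last 100 episode rewards and average them.
episode_reward_history = deque(maxlen=100)

def update_running_reward(episode_reward):
    """Append one episode's total reward and return the current running mean."""
    episode_reward_history.append(episode_reward)
    return sum(episode_reward_history) / len(episode_reward_history)

# e.g. after three episodes with total rewards 8, 10, and 9:
for r in [8.0, 10.0, 9.0]:
    running_reward = update_running_reward(r)
print(running_reward)  # 9.0
```

Because this average is taken over exploratory episodes, it need not match the return of a single greedy test episode, but a drop from ~8-10 to mostly 0 still points to a bug in the test loop.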
In my test, the rewards are:
Returns:[0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
What is going wrong?

Test code:
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
import numpy as np
import tensorflow as tf
from tensorflow import keras
import gym
seed = 42
model = keras.models.load_model('/content/drive/MyDrive/ai_games_assignment/model', compile=False)
env = make_atari("BreakoutNoFrameskip-v4")
env = wrap_deepmind(env, frame_stack=True, scale=True)
env.seed(seed)
env = gym.wrappers.Monitor(env, '/content/drive/MyDrive/ai_games_assignment/videosss', video_callable=lambda episode_id: True,force=True)
epsilon = 0
num_actions = env.action_space.n
n_episodes = 10
returns = []
for _ in range(n_episodes):
    ret = 0
    state = np.array(env.reset())
    done = False
    while not done:
        if epsilon > np.random.rand(1)[0]:
            action = np.random.choice(num_actions)
        else:
            # Predict action Q-values from the environment state
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            state_tensor = np.array(state_tensor)
            action_probs = model.predict(state_tensor)
            # Take the best action
            action = tf.argmax(action_probs[0]).numpy()
        # Apply the sampled action in our environment
        state_next, reward, done, _ = env.step(action)
        state_next = np.array(state_next)
        ret += reward
    returns.append(ret)
env.close()
print('Returns: {}'.format(returns))
You are not updating the state. Add the following at the end of your while loop:
state = state_next
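To illustrate why that one line matters: without it, the network is queried with the initial frame stack on every step, so the agent repeats the same action and rarely scores. A minimal sketch of the corrected evaluation loop, using a stand-in environment and Q-function (both hypothetical, since the fix is independent of Breakout and of the trained model):

```python
import numpy as np

class DummyEnv:
    """Stand-in for the wrapped Atari env: episode ends after 5 steps."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4)

    def step(self, action):
        self.t += 1
        obs = np.full(4, float(self.t))
        return obs, 1.0, self.t >= 5, {}

def greedy_q(state):
    # Stand-in for model.predict: two actions, action 0 always preferred.
    return np.array([1.0, 0.0])

env = DummyEnv()
state = np.array(env.reset())
done = False
ret = 0.0
while not done:
    action = int(np.argmax(greedy_q(state)))
    state_next, reward, done, _ = env.step(action)
    ret += reward
    state = np.array(state_next)  # the missing line: carry the new state forward
print(ret)  # 5.0
```

The same `state = state_next` (or `state = np.array(state_next)`) at the end of the while loop in the question's test code lets the greedy policy actually react to the game.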
Welcome to SO. Please take a moment to format your question properly and carefully choose the minimal amount of information to share, so that others can help you.