Python DQN can't solve CartPole-v1 - what am I doing wrong?
I have been trying to solve CartPole-v1 by reaching an average reward of 475 over 100 consecutive episodes. This is the algorithm I need to run: a DQN with fixed Q-targets (a separate target network). I have tried many DQN architectures with fixed Q-targets. What am I doing wrong? These are my hyperparameters:
TOTAL_EPISODES = 5000
T = 500
LR = 0.01
GAMMA = 0.95
MIN_EPSILON = 0.01
EPSILON_DECAY_RATE = 0.9995
epsilon = 1.0 # moving epsilon
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
batch_size = 64
C = 8
reward_discount = 10
deque_size = 2000
experience_replay = deque(maxlen=deque_size)
I have tried LR values in [0.01, 0.02, 0.001], lowering the epsilon decay rate, batch_size = 32, and C = 4.
The implementation follows the figure above; I'll include the non-trivial parts below:
def train_on_batch(batch_size, memory, gamma, model, ddqn_target_model, losses):
    minibatch = random.sample(memory, batch_size)
    states = np.zeros((batch_size, 4))
    targets = np.zeros((batch_size, 2))
    for index, (state, action, reward, next_state, done) in enumerate(minibatch):
        states[index] = state.reshape(1, 4)
        model_target = model.predict(state.reshape(1, 4))
        target_pred = ddqn_target_model.predict(next_state.reshape(1, 4))
        if done:
            target = reward
        else:
            target = reward + gamma * np.amax(target_pred)
        model_target[0][action] = target
        targets[index] = model_target[0]
    history = model.fit(states, targets, batch_size=batch_size, epochs=1, verbose=0)
    losses.append(history.history['loss'][0])
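As a side note (not the actual bug): calling `predict` once per sample inside the loop is slow; the whole minibatch can be predicted in two batched calls. A minimal sketch of that idea, where `predict` is a stub standing in for `model.predict` / `ddqn_target_model.predict` on the Keras models above:

```python
import numpy as np

# Stub standing in for a Keras model's predict(): 2 Q-values per state.
def predict(states):
    return np.tile([1.0, 2.0], (len(states), 1))

def build_targets(minibatch, gamma):
    # Stack all states/next_states once, then predict in two batched calls
    # instead of 2 * batch_size separate predict() calls.
    states = np.vstack([s for s, a, r, ns, d in minibatch])
    next_states = np.vstack([ns for s, a, r, ns, d in minibatch])
    targets = predict(states)                   # current Q-values
    next_q = predict(next_states).max(axis=1)   # max target-network Q
    for i, (s, a, r, ns, done) in enumerate(minibatch):
        targets[i, a] = r if done else r + gamma * next_q[i]
    return states, targets
```

The returned `states`/`targets` arrays feed `model.fit` exactly as in the loop version.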
def build_model(state_size, action_size, learning_rate, layers_num=3):
    model = Sequential()
    if layers_num == 3:
        model.add(Dense(24, input_dim=state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(24, activation='relu'))
    else:
        model.add(Dense(18, input_dim=state_size, activation='relu'))
        model.add(Dense(18, activation='relu'))
        model.add(Dense(18, activation='relu'))
        model.add(Dense(18, activation='relu'))
        model.add(Dense(18, activation='relu'))
    model.add(Dense(action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(lr=learning_rate))
    return model
def sample_action(model, state, epsilon):
    if random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()
    else:
        action_pred = model.predict(state)
        action = np.argmax(action_pred[0])
    return action
First, I run ddqn_target_model.set_weights(model.get_weights()), and then inside the episode loop I execute:

if episode % C == 0:
    ddqn_target_model.set_weights(model.get_weights())
What am I missing?
Thanks. The problem was that I was updating the target model weights every C episodes instead of every C steps.
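A minimal sketch of that fix: the target network is synced every C environment *steps*, tracked by a global step counter, rather than every C episodes. A dummy loop stands in for the gym episode loop here so the scheduling logic itself is runnable; the comments mark where the real env interaction and the set_weights call would go:

```python
C = 8
sync_steps = []    # global step count at each target-network sync
step_count = 0

for episode in range(5):
    for t in range(20):             # pretend each episode lasts 20 steps
        step_count += 1
        # ... sample_action, env.step, experience_replay.append,
        # and train_on_batch(...) would go here ...
        if step_count % C == 0:     # per-step sync: the fix
            sync_steps.append(step_count)
            # ddqn_target_model.set_weights(model.get_weights())

print(sync_steps[:5])  # [8, 16, 24, 32, 40]
```

With the original per-episode check, the target network would sync far less often (and at episode rather than step granularity), so the bootstrap targets go stale.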