Python DQN - can't solve CartPole-v1 - what am I doing wrong?


I've been trying to solve CartPole-v1, i.e. reach an average reward of 475 over 100 consecutive episodes.
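(A minimal sketch of how that solved condition can be checked with a 100-episode window; the helper below is illustrative, not part of my actual code:)

    from collections import deque
    import numpy as np

    recent_returns = deque(maxlen=100)   # total reward of each of the last 100 episodes
    # after each episode: recent_returns.append(episode_return)  (episode_return is hypothetical)

    def solved(recent_returns):
        # CartPole-v1 counts as solved once the mean return over the
        # last 100 consecutive episodes reaches 475
        return len(recent_returns) == 100 and np.mean(recent_returns) >= 475.0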

This is the algorithm I need to run (DQN with a fixed target network; the pseudocode figure is not reproduced here):

I've tried many DQN architectures with fixed Q-targets. What am I doing wrong?

These are my hyperparameters:

    import gym
    from collections import deque

    env = gym.make('CartPole-v1')  # assumed environment setup

    TOTAL_EPISODES = 5000
    T = 500
    LR = 0.01
    GAMMA = 0.95
    MIN_EPSILON = 0.01
    EPSILON_DECAY_RATE = 0.9995
    epsilon = 1.0  # moving epsilon

    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    batch_size = 64
    C = 8
    reward_discount = 10
    deque_size = 2000
    experience_replay = deque(maxlen=deque_size)
I've tried LR values in [0.01, 0.02, 0.001], a slower epsilon decay rate, batch_size = 32, and C = 4.

The implementation is the same as in the figure; I'll put the non-trivial parts below:

import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


def train_on_batch(batch_size, memory, gamma, model, ddqn_target_model, losses):
    minibatch = random.sample(memory, batch_size)
    states = np.zeros((batch_size, 4))
    targets = np.zeros((batch_size, 2))

    for index, (state, action, reward, next_state, done) in enumerate(minibatch):
        states[index] = state.reshape(1, 4)
        model_target = model.predict(state.reshape(1, 4))
        target_pred = ddqn_target_model.predict(next_state.reshape(1, 4))
        if done:
            target = reward
        else:
            target = reward + gamma * (np.amax(target_pred))

        model_target[0][action] = target
        targets[index] = model_target[0]
    history = model.fit(states, targets, batch_size=batch_size, epochs=1, verbose=0)
    losses.append(history.history['loss'][0])


def build_model(state_size, action_size, learning_rate, layers_num=3):
    model = Sequential()
    if layers_num == 3:
        model.add(Dense(24, input_dim=state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(24, activation='relu'))
    else:
        model.add(Dense(18, input_dim=state_size, activation='relu'))
        model.add(Dense(18, activation='relu'))
        model.add(Dense(18, activation='relu'))
        model.add(Dense(18, activation='relu'))
        model.add(Dense(18, activation='relu'))
    model.add(Dense(action_size, activation='linear'))
    model.compile(loss='mse',
                  optimizer=Adam(lr=learning_rate))

    return model

def sample_action(model, state, epsilon):
    if random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()
    else:
        action_pred = model.predict(state)
        action = np.argmax(action_pred[0])

    return action
First, I execute ddqn_target_model.set_weights(model.get_weights()) once before training, and then inside the episode loop, if episode % C == 0, I execute:

    ddqn_target_model.set_weights(model.get_weights())
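So roughly, my loop is structured like this (a simplified sketch just to show where the sync happens - at most once per episode):

    for episode in range(TOTAL_EPISODES):
        if episode % C == 0:
            # target network is synced here, i.e. on an episode boundary
            ddqn_target_model.set_weights(model.get_weights())
        state = env.reset()
        for t in range(T):
            ...  # choose an action, step the env, store the transition, train on a minibatch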

What am I missing?


Thanks

The problem was that I was updating the target model weights every C episodes instead of every C steps.
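Roughly, the corrected loop looks like this (a sketch, not my exact code: the total_steps counter and reward bookkeeping are illustrative, and it assumes the same old-style gym reset/step API and the functions defined above):

    losses = []
    total_steps = 0
    for episode in range(TOTAL_EPISODES):
        state = env.reset().reshape(1, 4)
        for t in range(T):
            action = sample_action(model, state, epsilon)
            next_state, reward, done, _ = env.step(action)
            next_state = next_state.reshape(1, 4)
            experience_replay.append((state, action, reward, next_state, done))
            state = next_state
            total_steps += 1

            if len(experience_replay) >= batch_size:
                train_on_batch(batch_size, experience_replay, GAMMA,
                               model, ddqn_target_model, losses)

            # sync the target network every C environment steps, not every C episodes
            if total_steps % C == 0:
                ddqn_target_model.set_weights(model.get_weights())

            if done:
                break

        epsilon = max(MIN_EPSILON, epsilon * EPSILON_DECAY_RATE)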