Python DDQN fails to solve Tic Tac Toe

Tags: python, deep-learning, neural-network, reinforcement-learning, dqn

For quite some time now I have been trying to solve Tic-Tac-Toe with the DDQN approach. It took me a while to fill the gaps in my knowledge, but now my code seems fine. However, I am not sure how to train the agent, since this is a two-player game. Currently I let the agent play X and let O be played by a random player that makes random but legal moves, while the agent is also allowed to play illegal moves and receives a negative reward for them. The step function looks like this:

def step(self, action):
    reward = 0.
    info = None
    if self.state[action] != 0:  # illegal move
        reward = -1.
        self.done = True
        return self.state, reward, self.done, info
    self.state[action] = self.turn  # make move
    self.turn = -self.turn
    self.state[-1] = self.turn  # update last state, which refers to the turn
    if self.is_winner():  # check for win
        reward = 1.0
        self.done = True
    elif self.state.count(0) == 0:  # check for draw
        reward = 1.0
        self.done = True
        info = 'draw'
    elif self.state.count(0) == 1: # check for draw in final move of the opponent
        final_action = self.state.index(0)
        self.state[final_action] = self.turn  # fill the last empty cell with the opponent's mark
        if not self.is_winner():
            reward = 1.0
            info = 'draw'
            self.done = True
    return self.state, reward, self.done, info
So the agent gets a positive reward if it wins, if it draws, or if it plays a move that will lead to a draw by the random player on the next move.
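
For context, here is a minimal sketch of how one training episode against such a random-but-legal opponent could look. env is the environment above; env.reset, agent.act and agent.remember are hypothetical placeholders for the environment reset and the DDQN agent's policy and replay-memory methods, not code from the question:

import random

def play_episode(env, agent, epsilon):
    # The agent plays X; O is played by a random but legal player.
    state = list(env.reset())  # copy, since env.step mutates self.state in place
    done = False
    while not done:
        # The agent picks one of the 9 cells epsilon-greedily; illegal moves are
        # allowed here and the environment penalises them with -1.
        action = agent.act(state, epsilon)
        next_state, reward, done, info = env.step(action)
        next_state = list(next_state)
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            break
        # Random opponent: pick uniformly among the empty cells (value 0);
        # the last entry of the state is the turn marker, so exclude it.
        legal = [i for i, cell in enumerate(state[:-1]) if cell == 0]
        state, _, done, _ = env.step(random.choice(legal))
        state = list(state)
        # Note: if the opponent's move ends the game, how that outcome is credited
        # back to the agent's last transition is left out of this sketch.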

Unfortunately, the DDQN does not converge; I cannot get an average reward above 0.5. To track training progress, every 1000 training games I let the agent play 1000 evaluation games with the current parameters and an epsilon of 0.01. Sometimes, after finding a good policy, the average suddenly turns negative, so training also seems rather unstable.
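
A rough sketch of that evaluation step, reusing the hypothetical env and agent names from the episode sketch above; the reported score is simply the mean of the final reward returned on the agent's own moves:

import random

def evaluate(env, agent, n_games=1000, epsilon=0.01):
    # Play n_games nearly-greedy games against the random opponent and
    # return the average final reward from the agent's (X) point of view.
    total = 0.0
    for _ in range(n_games):
        state = list(env.reset())
        done, reward = False, 0.0
        while not done:
            action = agent.act(state, epsilon)
            state, reward, done, _ = env.step(action)
            state = list(state)
            if done:
                break
            legal = [i for i, cell in enumerate(state[:-1]) if cell == 0]
            state, _, done, _ = env.step(random.choice(legal))
            state = list(state)
        total += reward
    return total / n_games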

My hyperparameters are as follows:

lr < 0.001 (I trained many)
memory size = 100,000
target network update rate = 1000
epsilon start = 1.0, epsilon end = 0.1
batch size = 512
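
Since the question is specifically about Double DQN, here is a minimal sketch of the target computation these hyperparameters feed into, assuming PyTorch and hypothetical online_net / target_net modules (gamma is an assumption, the question does not state it). The defining detail is that the online network selects the greedy next action while the target network evaluates it; the target network is then synchronised with the online one at the update rate listed above:

import torch

def ddqn_targets(batch, online_net, target_net, gamma=0.99):
    # batch holds float tensors: states (B, 10), actions (B,), rewards (B,),
    # next_states (B, 10), dones (B,) with 1.0 marking terminal transitions.
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        # The online network selects the greedy next action ...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ... and the target network evaluates it (the Double-DQN decoupling).
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        targets = rewards + gamma * (1.0 - dones) * next_q
    # Q-values of the actions actually taken, to be regressed onto the targets.
    q_values = online_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    return q_values, targets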

Can anyone tell me what I could do better? How much training should be expected for a simple game like Tic Tac Toe?

Hi, I ran into the same problem, but after some attempts I was able to train a DDQN for 3x3 and 4x4, and I also observed some good results on a 5x5 board. Check my code here.


Please post an MRE (minimal reproducible example).