Solving GridWorld using Q-learning and function approximation


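I am working on the simple grid world problem (3x4, as described in Russell & Norvig, Ch. 21.2). I have already solved it using Q-learning and a Q-table, and now I would like to replace the table with a function approximator.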

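I am using MATLAB and have tried both neural networks and decision trees, but I am not getting the expected results: the policy that is found is wrong. I have read some papers on the topic, but most of them are theoretical and say little about actual implementation.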

I have been using offline learning because it is simpler. My approach is as follows:

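1. Initialize a decision tree (or NN) with 16 binary input units: one for each position in the grid, plus one for each of the 4 possible actions (up, down, left, right).

2. Run a large number of iterations, saving the qstate and the computed qvalue of each iteration in a training set.

3. Train the decision tree (or NN) on the training set.

4. Delete the training set and repeat from step 2, using the newly trained decision tree (or NN) to compute the Q-values.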
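To make the 16-unit input from the first step concrete: it can be read as a one-hot position (12 cells) concatenated with a one-hot action (4 units). The following is a minimal Python sketch of that encoding, not the MATLAB helper itself. The name `state_to_pair` only mirrors `statetopair` from the MATLAB code below; taking a (row, column) position as input, the row-major cell ordering and the action ordering are all assumptions made for illustration.

```python
ROWS, COLS = 3, 4
ACTIONS = ("up", "down", "left", "right")  # assumed ordering

def state_to_pair(row, col, action_index):
    """Encode (position, action) as a 16-element binary vector."""
    vec = [0] * (ROWS * COLS + len(ACTIONS))
    vec[row * COLS + col] = 1              # one-hot position, row-major (assumed)
    vec[ROWS * COLS + action_index] = 1    # one-hot action
    return vec

# Bottom-left cell, action "right"
print(state_to_pair(2, 0, 3))
```

Exactly two units are active in every vector, so a tree can split on single units while a network sees a sparse binary input.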

This seems too simple to be true, and indeed I am not getting the expected results. Here is some MATLAB code:

retrain = 1;
if (retrain)
    x = zeros(1, 16); % Training set: one row per state-action vector
    y = 0;            % Training targets (Q-values)
    t = 0;            % Global step counter
end
tree = fitrtree(x, y);
x = zeros(1, 16);
y = 0;
for i = 1:100
    % Get the initial game state as a 3x4 matrix
    gamestate = initialstate();
    % 'end' is a reserved keyword in MATLAB, so the flag is named 'done'
    done = 0;
    while (done == 0)
        t = t + 1; % Increase the step counter

        % Get the index of the action to take
        index = chooseaction(gamestate, tree);

        % Take the action and get the new game state and reward
        [newgamestate, reward] = makeaction(gamestate, index);

        % Get the state-action vector for the current game state and chosen action
        sa_pair = statetopair(gamestate, index);

        % Check for end of game
        if (isfinalstate(gamestate))
            done = 1;
            % Get the final reward
            reward = finalreward(gamestate);
        end

        % Add a sample to the training set (same call in both cases)
        x(end+1, :) = sa_pair;
        y(end+1, 1) = updateq(reward, gamestate, index, newgamestate, tree, t, done);

        % Update the game state
        gamestate = newgamestate;
    end
end

The agent picks a random action half of the time. The updateq function is:

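function [ q ] = updateq( reward, gamestate, index, newgamestate, tree, iteration, finalstate )

alfa = 1/iteration; % Learning rate, decays with the global step counter
gamma = 0.99;       % Discount factor

% Get the action with maximum Q-value in the new state s'
% (this is the maximum only if chooseaction returns the greedy action here)
amax = chooseaction(newgamestate, tree);

% Get the corresponding state-action vectors
newsa_pair = statetopair(newgamestate, amax);
sa_pair = statetopair(gamestate, index);

if (finalstate == 0)
    % Non-final state: bootstrap from the predicted value of the next state
    X = reward + gamma * predict(tree, newsa_pair);
else
    % Final state: the target is just the final reward
    X = reward;
end

% Blend the old prediction with the new target
q = (1 - alfa) * predict(tree, sa_pair) + alfa * X;

end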
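Written out, the value returned by `updateq` appears to be the usual Q-learning update with a decaying learning rate, where the current tree plays the role of $Q$. The symbols $s$, $a$, $r$, $s'$ and $t$ are shorthand introduced here for `gamestate`, `index`, `reward`, `newgamestate` and `iteration`. The $\max$ holds only under the assumption that `chooseaction` returns the greedy action when it is called inside `updateq`.

```latex
\begin{aligned}
X &= \begin{cases}
       r + \gamma \max_{a'} Q(s', a') & \text{if } s \text{ is not final} \\
       r & \text{otherwise}
     \end{cases} \\
q &= (1 - \alpha_t)\, Q(s, a) + \alpha_t X,
     \qquad \alpha_t = \frac{1}{t}, \quad \gamma = 0.99
\end{aligned}
```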

Any suggestions would be greatly appreciated.

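The problem is that in offline Q-learning you need to repeat the data-collection process at least n times, where n depends on the problem you are trying to model. If you look at the Q-values computed during each iteration and think it through, it becomes clear why this is necessary.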

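In the first iteration you learn only the final state, in the second iteration you also learn the second-to-last state, in the third iteration you also learn the third-to-last state, and so on. You are learning from the final state back to the initial state, propagating the Q-values backwards. In the GridWorld example, the minimum number of states visited to end the game is 6.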
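The backward propagation can be reproduced on a toy corridor. The Python sketch below is an illustration under stated assumptions, not the GridWorld itself: it uses 6 states in a row, the last one being the final state whose value is the final reward (mirroring how the question's code stores the final reward for the final state), a step reward of 0, and the target used directly instead of the `alfa` blend. The names `fit_round` and `N_STATES` are hypothetical.

```python
GAMMA = 0.99
N_STATES = 6  # assumed corridor length, matching the 6 visited states

def fit_round(q_old):
    """One offline round: compute targets with Q frozen, then replace Q."""
    q_new = []
    for s in range(N_STATES):
        if s == N_STATES - 1:
            q_new.append(1.0)                     # final state: final reward only
        else:
            q_new.append(GAMMA * q_old[s + 1])    # step reward 0 + discounted next value
    return q_new

q = [0.0] * N_STATES
rounds_until_start_learned = None
for k in range(1, 11):
    q = fit_round(q)
    known = [s for s in range(N_STATES) if q[s] != 0.0]
    print("round", k, "states with a learned value:", known)
    if rounds_until_start_learned is None and q[0] != 0.0:
        rounds_until_start_learned = k

print("rounds needed to reach the start state:", rounds_until_start_learned)
```

Because every target is computed from the values of the previous round, each round can push information back by only one state, so the start state gets a non-zero value only in round 6.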

In the end, the correct algorithm is:

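1. Initialize a decision tree (or NN) with 16 binary input units: one for each position in the grid, plus one for each of the 4 possible actions (up, down, left, right).

2. Play a number of games (30 games are enough for this GridWorld example), saving the qstates and the computed qvalues of each game in a training set.

3. Train the decision tree (or NN) on the training set.

4. Delete the training set.

5. Repeat from step 2 at least n times, using the newly trained decision tree (or NN) to compute the Q-values, where n depends on your problem. For this GridWorld example n is 6, but repeating the process 7-8 times gives better results for all states.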
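The whole loop above can be sketched end to end. The following is a Python stand-in rather than the MATLAB implementation, and it rests on several assumptions that are not in the original: moves are deterministic (the Russell & Norvig world is stochastic), the step reward is -0.04, the final reward is credited to the move that enters a final cell, the target is used directly instead of the `alfa` blend, and the "approximator" is a plain lookup table that averages targets per state-action pair in place of `fitrtree` or a network. All function names are hypothetical.

```python
import random

ROWS, COLS = 3, 4
WALL = (1, 1)
TERMINALS = {(0, 3): 1.0, (1, 3): -1.0}      # +1 and -1 exits
START = (2, 0)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
GAMMA, STEP_REWARD, EPSILON = 0.99, -0.04, 0.5

def step(state, action):
    """Deterministic move (assumption); bumping the wall or an edge stays in place."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) == WALL:
        r, c = state
    return (r, c)

def predict(model, state, action):
    """Stand-in for predict(tree, sa_pair); unseen pairs default to 0."""
    return model.get((state, action), 0.0)

def greedy(model, state):
    return max(range(4), key=lambda a: predict(model, state, a))

def target(model, state, action):
    """Next state and Q-learning target, using the frozen model of the previous round."""
    nxt = step(state, action)
    if nxt in TERMINALS:
        return nxt, TERMINALS[nxt]
    return nxt, STEP_REWARD + GAMMA * max(predict(model, nxt, a) for a in range(4))

def play_games(model, n_games, rng, max_steps=200):
    """Step 2: play games with the frozen model and record (pair, target) samples."""
    samples = []
    for _game in range(n_games):
        state = START
        for _move in range(max_steps):
            # Half of the time pick a random action, otherwise the greedy one
            a = rng.randrange(4) if rng.random() < EPSILON else greedy(model, state)
            nxt, y = target(model, state, a)
            samples.append(((state, a), y))
            if nxt in TERMINALS:
                break
            state = nxt
    return samples

def fit(samples):
    """Step 3: 'train' the approximator; here it only averages targets per pair."""
    sums = {}
    for key, y in samples:
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + y, count + 1)
    return {key: total / count for key, (total, count) in sums.items()}

def offline_q_learning(n_rounds=8, n_games=30, seed=0):
    rng = random.Random(seed)
    model = {}                                      # step 1: untrained approximator
    for _round in range(n_rounds):                  # step 5: repeat n times
        samples = play_games(model, n_games, rng)   # step 2
        model = fit(samples)                        # steps 3-4: refit, old samples dropped
    return model

model = offline_q_learning()
print("greedy action index at the start cell:", greedy(model, START))
```

The point of the structure is that `play_games` never modifies the model it is given: all targets within one round come from the same frozen approximator, which is why several rounds are needed before the values reach the start cell. A lookup table does not generalize to unseen pairs the way a tree or a network would, so this sketch shows the control flow only.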