How to use DeepQLearning in Julia for very large state spaces?

machine-learning julia

I would like to use the DeepQLearning.jl package. To do that, we have to do something like the following:

using DeepQLearning
using POMDPs
using Flux
using POMDPModels
using POMDPSimulators
using POMDPPolicies

# load MDP model from POMDPModels or define your own!
mdp = SimpleGridWorld();

# Define the Q network (see Flux.jl documentation)
# the gridworld state is represented by a 2 dimensional vector.
model = Chain(Dense(2, 32), Dense(32, length(actions(mdp))))

exploration = EpsGreedyPolicy(mdp, LinearDecaySchedule(start=1.0, stop=0.01, steps=10000/2))

solver = DeepQLearningSolver(qnetwork = model, max_steps=10000, 
                             exploration_policy = exploration,
                             learning_rate=0.005,log_freq=500,
                             recurrence=false,double_q=true, dueling=true, prioritized_replay=true)
policy = solve(solver, mdp)

sim = RolloutSimulator(max_steps=30)
r_tot = simulate(sim, mdp, policy)
println("Total discounted reward for 1 simulation: $r_tot")

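In the line mdp = SimpleGridWorld() we create the MDP. When I tried to create my own MDP, I ran into a problem with a very large state space. A state in my MDP is a vector in {1,2,…,m}^n for some m and n. So, when defining the function POMDPs.states(mdp::myMDP), I realized that I would have to iterate over all the states, and there are m^n of them, which is very large.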

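Am I using the package the wrong way? Or do we have to iterate over the states even when their number is exponential? If it is the latter, what is the point of using deep Q-learning? I thought deep Q-learning was supposed to help when the action and state spaces are very large.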

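Deep Q-learning does not require enumerating the state space and can handle problems with continuous state spaces. DeepQLearning.jl only uses the generative interface of POMDPs.jl. As a result, you do not need to implement the states function; you only need to implement gen and initialstate (see the link on how to implement the generative interface).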

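However, because DQN works with discrete actions, you also need POMDPs.actions(mdp::YourMDP), which should return an iterator over the action space.

With these modifications to your implementation, you should be able to use the solver.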
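As a rough illustration of the functions mentioned above, here is a minimal sketch of an MDP that is defined only through the generative interface. Everything problem-specific in it is made up for illustration: the type name BigStateMDP, the toy dynamics and the toy reward are hypothetical. It also assumes a recent POMDPs.jl where initialstate returns a distribution and ImplicitDistribution is available from POMDPTools; older versions may differ, and the solver may need further functions depending on its version (an actionindex mapping is included here as a guess at one of them).

```julia
using POMDPs
using POMDPTools: ImplicitDistribution  # assumption: POMDPTools is installed
using Random

# Hypothetical MDP: states are vectors in {1,...,m}^n,
# actions are 1:n ("increment coordinate a").
struct BigStateMDP <: MDP{Vector{Int}, Int}
    m::Int
    n::Int
end

# DQN needs a discrete action space it can iterate over.
POMDPs.actions(mdp::BigStateMDP) = 1:mdp.n
POMDPs.actionindex(mdp::BigStateMDP, a::Int) = a
POMDPs.discount(mdp::BigStateMDP) = 0.95

# Initial states are sampled on demand; the m^n states are never enumerated.
POMDPs.initialstate(mdp::BigStateMDP) =
    ImplicitDistribution(rng -> rand(rng, 1:mdp.m, mdp.n))

# Generative model: sample the next state and reward from (s, a).
function POMDPs.gen(mdp::BigStateMDP, s::Vector{Int}, a::Int, rng::AbstractRNG)
    sp = copy(s)
    sp[a] = min(sp[a] + 1, mdp.m)        # toy dynamics
    r = all(==(mdp.m), sp) ? 1.0 : 0.0   # toy reward for reaching (m,...,m)
    return (sp = sp, r = r)
end

POMDPs.isterminal(mdp::BigStateMDP, s::Vector{Int}) = all(==(mdp.m), s)
```

Note that POMDPs.states is not defined anywhere in this sketch; with m = 10 and n = 20 the state space has 10^20 elements, yet sampling an initial state and stepping the model only ever touch one n-dimensional vector at a time.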

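The neural network in DQN takes a vector representation of the state as input. If your state is an n-dimensional vector, the input size of the neural network will be n. The output size of the network will be equal to the number of actions in your model.

In the grid world example, the input size of the Flux model is 2 (the x, y position) and the output size is length(actions(mdp)) = 4.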
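To make the sizing rule concrete, here is a small sketch of a Q-network for an n-dimensional state. The sizes n_state = 20 and n_actions = 4 and the hidden width of 64 are arbitrary assumptions chosen for illustration, and the positional Dense(in, out) form follows the style of the example in the question.

```julia
using Flux

n_state = 20    # hypothetical state dimension n
n_actions = 4   # hypothetical number of discrete actions

# Input size is the state dimension, output size is the number of actions.
model = Chain(Dense(n_state, 64, relu), Dense(64, n_actions))

s = rand(Float32, n_state)  # one state as a vector
q = model(s)                # one Q-value per action
```

The network size grows with n, the length of the state vector, and not with m^n, the number of distinct states.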

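So, if the action space is large, say the actions are vectors in {1,2,…,m}^n, how should POMDPs.actions() be implemented? Should I use smaller m and n?

You can use Iterators.product([1:m for i=1:n]...) to implement your action space. Since the output of the network is equal to the size of the action space (one Q-value per action), you will definitely be limited to small m and n.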
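For reference, the Iterators.product suggestion can be tried out in plain Julia. The values m = 3 and n = 4 are assumptions picked to keep the enumeration small.

```julia
# Hypothetical sizes, small enough to enumerate.
m, n = 3, 4

# Each action is an n-tuple with entries in 1:m.
action_space = vec(collect(Iterators.product([1:m for i in 1:n]...)))

n_outputs = length(action_space)  # the Q-network needs one output per action
```

The number of network outputs is m^n, so it grows exponentially with n, which is why the action space has to stay small even though the state space does not.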
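
Do you know whether DeepQLearning.jl can be used when there are multiple agents? I mean, I have two agents. Each agent is modeled as an MDP, but the reward of one MDP depends not only on the action of its own agent but also on the action of the other agent in the other MDP. We could have two DQNs (one per agent) and train both agents with them. The only issue is using both agents' actions to compute each agent's reward.

Not at the moment. We plan to release a multi-agent package for POMDPs.jl at some point, but not very soon. However, it should be feasible to reuse most of the code in the current package and implement your own. That said, I don't think Stack Overflow is the right place for this discussion. Feel free to open a post for a more in-depth discussion.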