Artificial intelligence: the interest rate in the value iteration algorithm

In the chapter on the value iteration algorithm for computing optimal policies for MDPs, there is this algorithm:

function Value-Iteration(mdp,ε) returns a utility function
  inputs: mdp, an MDP with states S, actions A(s), transition model P(s'|s,a),
            rewards R(s), discount γ
          ε, the maximum error allowed in the utility of any state
  local variables: U, U', vectors of utilities for states in S, initially zero
                 δ, the maximum change in the utility of any state in an iteration

  repeat
     U ← U'; δ ← 0
     for each state s in S do
         U'[s] ← R(s) + γ max(a in A(s)) ∑ over s' (P(s'|s,a) U[s'])
         if |U'[s] - U[s]| > δ then δ ← |U'[s] - U[s]|
  until δ < ε(1-γ)/γ
  return U
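
For reference, here is a minimal Python sketch of the pseudocode above. It assumes the MDP is passed as plain dictionaries and that 0 < γ < 1. The name `value_iteration`, the dictionary layout, and the two-state toy MDP are made up for illustration and do not come from the book.

```python
def value_iteration(states, actions, P, R, gamma, epsilon):
    """Direct transcription of the pseudocode; assumes 0 < gamma < 1.

    P[s][a] maps each successor state s2 to the probability P(s2 | s, a).
    """
    U1 = {s: 0.0 for s in states}          # U' in the pseudocode
    while True:
        U = dict(U1)                       # U <- U'
        delta = 0.0
        for s in states:
            # Expected utility of the best action available in s
            best = max(
                sum(p * U[s2] for s2, p in P[s][a].items())
                for a in actions[s]
            )
            U1[s] = R[s] + gamma * best
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < epsilon * (1 - gamma) / gamma:
            return U


# Hypothetical toy MDP: state "b" pays reward 1, state "a" pays 0,
# and both actions are deterministic.
states = ["a", "b"]
actions = {"a": ["stay", "go"], "b": ["stay", "go"]}
P = {
    "a": {"stay": {"a": 1.0}, "go": {"b": 1.0}},
    "b": {"stay": {"b": 1.0}, "go": {"a": 1.0}},
}
R = {"a": 0.0, "b": 1.0}

U = value_iteration(states, actions, P, R, gamma=0.9, epsilon=0.01)
print(U)
```

In this toy MDP the exact utilities are 10 for `b` (solve U = 1 + 0.9 U) and 9 for `a`, so the printed values should land close to those.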


(Sorry for the formatting, but I need 10 reputation to post images, and $latex formatting$ doesn't seem to work here.)

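The previous chapter also contains this statement:

A discount factor of γ is equivalent to an interest rate of (1/γ) − 1.

Can someone explain what the interest rate (1/γ) − 1 means? How did they arrive at it? And why is it used in the termination condition of the algorithm above?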

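The reward at t-1 is regarded as discounted by the factor γ (written y below). That is, old = y * new. So new = (1/y) * old, and new - old = ((1/y) - 1) * old. There's your interest rate.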
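
A quick numeric check of that identity, as a sketch only. The function name `interest_rate` and the amounts are arbitrary choices for illustration.

```python
def interest_rate(gamma):
    """Interest rate equivalent to a discount factor gamma (0 < gamma <= 1)."""
    return 1.0 / gamma - 1.0


gamma = 0.9
new = 100.0              # amount received one step later
old = gamma * new        # what that amount is worth one step earlier
r = interest_rate(gamma)

# Growing `old` by the interest rate r for one step recovers `new`.
print(old, r, old * (1.0 + r))
```

So discounting by γ in one direction is the same as earning interest at rate (1/γ) - 1 in the other direction.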

I'm not sure why it is used in the termination condition. The value of ε is arbitrary anyway.

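In fact, I think this termination criterion is a rather poor one. It doesn't work when y = 1: the threshold ε(1-y)/y is then 0, so δ < 0 can never hold. When y = 0, the iteration should stop immediately, because a single iteration is enough to estimate the exact values. When y = 1, many iterations are needed.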
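
To see how the stopping threshold behaves, here is a small sketch that tabulates ε(1-y)/y for a few discount factors. The helper name `stopping_threshold` is made up for this example. Note that (1-y)/y equals (1/y) - 1, so the threshold is just ε times the interest rate. As far as I understand it, the usual justification is a contraction argument: for y < 1, if two successive utility vectors differ by less than this threshold, the newer one is within ε of the true utilities.

```python
def stopping_threshold(epsilon, gamma):
    """Right-hand side of the test `delta < epsilon * (1 - gamma) / gamma`."""
    return epsilon * (1.0 - gamma) / gamma


epsilon = 0.01
gammas = [0.1, 0.5, 0.9, 0.99, 1.0]
thresholds = [stopping_threshold(epsilon, g) for g in gammas]

for g, t in zip(gammas, thresholds):
    # The threshold shrinks toward 0 as gamma approaches 1.
    print(g, t)

# gamma = 0 is left out on purpose: the expression divides by zero there.
```

The threshold is loose for small y and tightens toward 0 as y approaches 1, where the test can no longer succeed.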