How can I get SARSA code for a gridworld model in R?
I have a problem with my case study. I am interested in reinforcement learning for a gridworld model. The model is a maze of 7x7 fields in which the agent moves. There are four directions: up, down, left and right (or N, E, S, W), so there is an upper bound on the number of policies. Many policies can be excluded by giving an immediate punishment when the agent bumps into a wall. If an inhibition-of-return principle is applied as well, even fewer actions are usually admissible. Many policies differ only in the part after the goal, or are equivalent.
▼ States: there are obstacles
▼ Reward: r=1 if s=G, r=0 for any allowed move, otherwise r=-100
▼ Initialisation: Q0(a,s)~N(0,0.01)
To solve this model I wrote some R code, but it does not work properly.
Model: 7x7, S: start state, G: terminal state, O: accessible state, X: wall
[O,O,G,X,O,O,S]
[O,X,O,X,O,X,X]
[O,X,O,X,O,O,O]
[O,X,O,X,O,X,O]
[O,X,O,O,O,X,O]
[O,X,O,X,O,X,O]
[O,O,O,X,O,O,O]
So I would like to know how to correct the code for this gridworld model (it does not have to be the code below), and how to solve this model with SARSA.
actions <- c("N", "S", "E", "W")
x <- 1:7
y <- 1:7
rewards <- matrix(rep(0, 49), nrow=7)
rewards[1, 1] <- 0
rewards[1, 2] <- 0
rewards[1, 3] <- 1
rewards[1, 4] <- -100
rewards[1, 5] <- 0
rewards[1, 6] <- 0
rewards[1, 7] <- 0
rewards[2, 1] <- 0
rewards[2, 2] <- -100
rewards[2, 3] <- 0
rewards[2, 4] <- -100
rewards[2, 5] <- 0
rewards[2, 6] <- -100
rewards[2, 7] <- -100
rewards[3, 1] <- 0
rewards[3, 2] <- -100
rewards[3, 3] <- 0
rewards[3, 4] <- -100
rewards[3, 5] <- 0
rewards[3, 6] <- 0
rewards[3, 7] <- 0
rewards[4, 1] <- 0
rewards[4, 2] <- -100
rewards[4, 3] <- 0
rewards[4, 4] <- -100
rewards[4, 5] <- 0
rewards[4, 6] <- -100
rewards[4, 7] <- 0
rewards[5, 1] <- 0
rewards[5, 2] <- -100
rewards[5, 3] <- 0
rewards[5, 4] <- 0
rewards[5, 5] <- 0
rewards[5, 6] <- -100
rewards[5, 7] <- 0
rewards[6, 1] <- 0
rewards[6, 2] <- -100
rewards[6, 3] <- 0
rewards[6, 4] <- -100
rewards[6, 5] <- 0
rewards[6, 6] <- -100
rewards[6, 7] <- 0
rewards[7, 1] <- 0
rewards[7, 2] <- 0
rewards[7, 3] <- 0
rewards[7, 4] <- -100
rewards[7, 5] <- 0
rewards[7, 6] <- 0
rewards[7, 7] <- 0
values <- rewards # initial values
states <- expand.grid(x=x, y=y)
# Transition probability
transition <- list("N" = c("N" = 0.8, "S" = 0, "E" = 0.1, "W" = 0.1),
                   "S" = c("S" = 0.8, "N" = 0, "E" = 0.1, "W" = 0.1),
                   "E" = c("E" = 0.8, "W" = 0, "S" = 0.1, "N" = 0.1),
                   "W" = c("W" = 0.8, "E" = 0, "S" = 0.1, "N" = 0.1))
# The value of an action (e.g. move north means y + 1)
action.values <- list("N" = c("x" = 0, "y" = 1),
                      "S" = c("x" = 0, "y" = -1),
                      "E" = c("x" = 1, "y" = 0),
                      "W" = c("x" = -1, "y" = 0))
# act() function serves to move the robot through states based on an action
act <- function(action, state) {
  action.value <- action.values[[action]]
  new.state <- state
  if(state["x"] == 1 && state["y"] == 7 || (state["x"] == 1 && state["y"] == 3))
    return(state)
  #
  new.x = state["x"] + action.value["x"]
  new.y = state["y"] + action.value["y"]
  # Constrained by edge of grid
  new.state["x"] <- min(x[length(x)], max(x[1], new.x))
  new.state["y"] <- min(y[length(y)], max(y[1], new.y))
  #
  if(is.na(rewards[new.state["y"], new.state["x"]]))
    new.state <- state
  #
  return(new.state)
}
rewards
bellman.update <- function(action, state, values, gamma=1) {
  state.transition.prob <- transition[[action]]
  q <- rep(0, length(state.transition.prob))
  for(i in 1:length(state.transition.prob)) {
    new.state <- act(names(state.transition.prob)[i], state)
    q[i] <- (state.transition.prob[i] * (rewards[state["y"], state["x"]] + (gamma * values[new.state["y"], new.state["x"]])))
  }
  sum(q)
}
value.iteration <- function(states, actions, rewards, values, gamma, niter, n) {
  for (j in 1:niter) {
    for (i in 1:nrow(states)) {
      state <- unlist(states[i,])
      if(i %in% c(7, 15)) next # terminal states
      q.values <- as.numeric(lapply(actions, bellman.update, state=state, values=values, gamma=gamma))
      values[state["y"], state["x"]] <- max(q.values)
    }
  }
  return(values)
}
final.values <- value.iteration(states=states, actions=actions, rewards=rewards, values=values, gamma=0.99, niter=100, n=10)
final.values
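The problem is that your penalties are far larger than your reward. The agent may prefer to throw itself into a wall rather than try to reach the reward. This happens because the state-action values converge to very low real numbers, even lower than -100, depending on the rewards of the actions.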
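One way to see how values can drop below -100 is to write out the backup that `bellman.update` in the question computes. The notation is an assumption made for illustration: $R(s)$ is the entry of `rewards` for the current cell, $P(a' \mid a)$ is the `transition` table, and $s'_{a'}$ is the cell returned by `act(a', s)`.

```latex
\begin{aligned}
Q(s,a) &= \sum_{a'} P(a' \mid a)\,\bigl[\, R(s) + \gamma\, V(s'_{a'}) \,\bigr]
        = R(s) + \gamma \sum_{a'} P(a' \mid a)\, V(s'_{a'}), \\
V(s)   &= \max_{a} Q(s,a).
\end{aligned}
```

The second equality holds because each row of `transition` sums to 1. For a cell with $R(s) = -100$ this gives $V(s) < -100$ whenever the best expected successor value is negative, so large penalties can spread to neighbouring cells through the $\gamma$ term.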
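Below is a model I made that simulates value iteration (it shows the values SARSA should converge to):
The value table shows the state values of the model in the picture, but it is reversed (because I have not fixed that yet).
In this example I set the rewards and penalties to values very similar to your model: -15 is a neutral state (a wall), 1.0 is the ball, and -100 is the block. The agent gets 0.0 for every action, and the transition probabilities are the same as well.
The agent has to reach the ball, but as you can see, the states converge to very small values. You can also see that the states adjacent to the ball have lower values. As a result, the agent would rather never reach its goal.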
To solve your problem, try reducing the penalties.
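As a rough illustration of that advice, here is a minimal tabular SARSA sketch for the 7x7 maze from the question. It is not the asker's code and makes several assumptions: moves are deterministic (the 0.8/0.1/0.1 noise is left out), bumping into a wall or the grid edge leaves the agent in place with a small penalty `wall.penalty` of -1 instead of -100, Q is initialised with standard deviation 0.1 (reading 0.01 as a variance), and the names `maze`, `moves`, `take.step`, `choose.action` and `sarsa`, along with all hyperparameters, are hypothetical. The update used is the standard SARSA rule Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)).

```r
set.seed(1)

# Maze layout from the question: S start, G goal, O free cell, X wall.
maze <- rbind(
  c("O", "O", "G", "X", "O", "O", "S"),
  c("O", "X", "O", "X", "O", "X", "X"),
  c("O", "X", "O", "X", "O", "O", "O"),
  c("O", "X", "O", "X", "O", "X", "O"),
  c("O", "X", "O", "O", "O", "X", "O"),
  c("O", "X", "O", "X", "O", "X", "O"),
  c("O", "O", "O", "X", "O", "O", "O"))

# (row, column) offsets; N moves towards row 1.
moves <- list(N = c(-1, 0), S = c(1, 0), E = c(0, 1), W = c(0, -1))
wall.penalty <- -1  # assumption: far smaller than the -100 in the question

# Apply one action; a blocked move keeps the agent in place.
take.step <- function(pos, a) {
  nxt <- pos + moves[[a]]
  blocked <- any(nxt < 1) || any(nxt > 7) || maze[nxt[1], nxt[2]] == "X"
  if (blocked) return(list(pos = pos, r = wall.penalty, done = FALSE))
  done <- maze[nxt[1], nxt[2]] == "G"
  list(pos = nxt, r = if (done) 1 else 0, done = done)
}

# Epsilon-greedy choice over the four action indices.
choose.action <- function(Q, pos, eps) {
  if (runif(1) < eps) sample(4, 1) else which.max(Q[pos[1], pos[2], ])
}

sarsa <- function(episodes = 300, alpha = 0.1, gamma = 0.95,
                  eps = 0.1, max.steps = 300) {
  Q <- array(rnorm(7 * 7 * 4, mean = 0, sd = 0.1), dim = c(7, 7, 4))
  start <- c(1, 7)  # position of S in the maze
  for (ep in seq_len(episodes)) {
    pos <- start
    a <- choose.action(Q, pos, eps)
    for (i in seq_len(max.steps)) {
      out <- take.step(pos, a)
      a2 <- choose.action(Q, out$pos, eps)
      # No bootstrapping from the terminal state.
      boot <- if (out$done) 0 else gamma * Q[out$pos[1], out$pos[2], a2]
      old <- Q[pos[1], pos[2], a]
      Q[pos[1], pos[2], a] <- old + alpha * (out$r + boot - old)
      if (out$done) break
      pos <- out$pos
      a <- a2
    }
  }
  Q
}

Q <- sarsa()
# Greedy action per cell (entries for wall cells are meaningless).
policy <- matrix(names(moves)[apply(Q, c(1, 2), which.max)], 7, 7)
```

Printing `policy` shows the greedy direction for each cell. Whether the learned path actually reaches G depends on the seed and the hyperparameters, so it is worth checking the result and raising `episodes` or `eps` if it does not.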