Reward does not increase when training a bipedal system with PyTorch

I am completely new to reinforcement learning and this is my first hands-on project. I am trying to train a bipedal system in an OpenAI Gym environment with a policy gradient algorithm. However, the reward does not change at all, whether at episode 0 or at episode 1000, and I cannot figure out what is going wrong. Can anyone help me? Thanks in advance. The code is below:
import torch
import torch.nn as nn
import numpy
from torch.autograd import Variable
import gym


def train(model, name):
    model.train()
    env = gym.make(name)
    env.reset()
    num_episodes = 20000
    max_steps = 10000
    optim = torch.optim.Adam(model.parameters(), lr=1)
    numsteps = []
    rewards = []
    avg_numsteps = []

    min_reward = -1000

    for episode in range(num_episodes):
        state = env.reset()
        probs = []
        rewards = []
        for steps in range(max_steps):
            action, log_prob = model.action(state)
            env.render()
            state, reward, finished, _ = env.step(action.squeeze(0).detach().numpy())
            env.render()
            probs.append(log_prob)
            rewards.append(reward)
            if finished:
                break

        if finished:
            Rewards = []
            for i in range(len(rewards)):
                G = 0
                p = 0
                for reward in rewards[i:]:
                    G = G + 0.9 * p * reward
                    p = p + 1
                Rewards.append(G)

            Rewards = torch.tensor(Rewards)
            discounted_reward = (Rewards - Rewards.mean()) / (Rewards.std() + 1e-9)

            gradients = []
            for log_prob, G in zip(log_prob, discounted_reward):
                gradients.append(-log_prob * G)

            optim.zero_grad()
            policy_gradient = Variable(torch.stack(gradients).sum(), requires_grad=True)
            policy_gradient.backward()
            optim.step()

            numsteps.append(steps)
            avg_numsteps.append(numpy.mean(numsteps[-10:]))
            rewards.append(numpy.sum(rewards))

        print("episode: {}, total reward: {}, average_reward: {}, length: {}\n".format(
            episode,
            numpy.sum(rewards),
            numpy.round(numpy.mean(rewards[-10:]), decimals=3),
            steps))

        if numpy.sum(rewards) > min_reward:
            torch.save(model.state_dict(), '/home/atharva/policyNet.pth')
            min_reward = numpy.sum(rewards)
def test(model, name):
    env = gym.make(name)
    model.eval()
    state = env.reset()
    with torch.no_grad():
        while True:
            action, log_prob = model(state)
            state, reward, finished, _ = env.step(action.squeeze(0).numpy())
            env.render()
            if finished:
                break
class PolicyNet(nn.Module):
    def __init__(self, inputs, actions, hidden_size):
        super(PolicyNet, self).__init__()
        self.num_actions = actions
        self.layer1 = nn.Linear(inputs, hidden_size)
        self.layer2 = nn.Linear(hidden_size, hidden_size)
        self.layer3 = nn.Linear(hidden_size, 2*hidden_size)
        self.layer4 = nn.Linear(2*hidden_size, hidden_size)
        self.layer5 = nn.Linear(hidden_size, actions)

    def forward(self, x):
        x = self.layer1(x)
        x = nn.functional.relu(x)
        x = self.layer2(x)
        x = nn.functional.relu(x)
        x = self.layer3(x)
        x = nn.functional.relu(x)
        x = self.layer4(x)
        x = nn.functional.relu(x)
        x = self.layer5(x)
        return x

    def action(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0)
        actions = self.forward(Variable(state))
        #prob = numpy.random.choice(self.num_actions, p=numpy.squeeze(actions.detach().numpy()))
        log_prob = torch.log(actions.squeeze(0))
        return actions, log_prob
The call looks like this:
from REINFORCE import PolicyNet, train
model = PolicyNet(24, 4, 256)
train(model, 'BipedalWalker-v3')
Your learning rate is way too high! A high learning rate can keep your model from ever converging (in fact, the loss may diverge), while a learning rate that is too low makes training take far too long. You have to find the balance yourself. Tuning the learning rate can have a huge impact on your model's performance; I would suggest taking the time to read a well-written blog post on the subject. To begin with, try learning rates in the range [0.01, 0.00001]. For example:
optim = torch.optim.Adam(model.parameters(), lr=0.001)
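Beyond a fixed value, one related option (a sketch of my own, not part of the answer above) is to decay the learning rate during training with one of PyTorch's built-in schedulers. A minimal sketch, assuming the train() loop from the question:

import torch

optim = torch.optim.Adam(model.parameters(), lr=0.001)
# Hypothetical schedule, chosen only for illustration: halve the
# learning rate every 5000 episodes.
scheduler = torch.optim.lr_scheduler.StepLR(optim, step_size=5000, gamma=0.5)

for episode in range(num_episodes):
    ...  # collect an episode and apply the policy-gradient update
    scheduler.step()  # advance the schedule once per episode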
The learning rate was between 1e-2 and 1e-5 before, and still nothing happened... so out of frustration I just left it at 1... I believe the model parameters are not being updated, which is why the reward, loss, and so on stay the same...

Not sure whether it makes a difference, but replace actions = self.forward(Variable(state)) with actions = self.__call__(torch.tensor(state)). By calling forward directly you skip some hooks that are used in the backward pass.

Deep reinforcement learning is notorious for its sensitivity to hyperparameters. It may well be a bug in your training, but it usually takes finding the right hyperparameters to get these methods to work. You can look at some useful example implementations for comparison.

Thank you very much, I will take a look at them.
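To illustrate that suggestion (a minimal sketch of my own, not code from the thread): invoking a module as model(x) goes through nn.Module.__call__, which runs any registered hooks around forward(), whereas calling forward() directly bypasses them.

import torch
import torch.nn as nn

net = nn.Linear(24, 4)   # stand-in for PolicyNet, just for illustration
x = torch.randn(1, 24)

y1 = net(x)              # goes through nn.Module.__call__, which runs any hooks
y2 = net.forward(x)      # skips the hook machinery; generally discouraged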
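The suspicion that the parameters are not being updated is consistent with the posted code: Variable(torch.stack(gradients).sum(), requires_grad=True) creates a fresh leaf tensor that is detached from the policy network, so backward() computes no gradients for model.parameters() and optim.step() changes nothing. Note also that the loop zips over log_prob (the tensor from the final step) rather than the collected probs list. A minimal sketch of the usual REINFORCE update, assuming probs and discounted_reward as built in train():

import torch

def reinforce_update(optim, probs, discounted_reward):
    # probs: list of per-step log-probability tensors collected during the
    # episode (still attached to the policy network's autograd graph).
    # discounted_reward: the normalised returns, one scalar per step.
    policy_loss = torch.stack(
        [-log_prob * g for log_prob, g in zip(probs, discounted_reward)]
    ).sum()

    optim.zero_grad()
    policy_loss.backward()  # gradients now flow back into the policy network
    optim.step()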