How can I use a TensorFlow optimizer without recomputing activations in a reinforcement learning program that returns control after each iteration? (Python)

Tags: python, tensorflow, machine-learning, reinforcement-learning, q-learning

Edit (1/3/16):

I am implementing a Q-learning agent with TensorFlow (Python interface) that trains a function approximator using stochastic gradient descent.

On each iteration of the experiment, a step function of the agent is called; it updates the approximator's parameters based on the new reward and the activations, and then chooses a new action to perform.

Here is the problem (in reinforcement-learning terms):

  • The agent computes its state-action value predictions in order to choose an action.
  • It then returns control to another program, which simulates a step in the environment.
  • Now the agent's step function is called for the next iteration. I want to use TensorFlow's Optimizer class to compute the gradients for me. However, that requires the state-action value predictions computed in the previous step together with the graph that produced them. So:
    • If I run the optimizer on the whole graph, it has to recompute the state-action value predictions.
    • But if I store the prediction (for the chosen action) as a variable and then feed it to the optimizer as a placeholder, it no longer has the graph it needs to compute the gradients.
    • I can't just run everything in the same sess.run() call, because I have to give up control and return the chosen action in order to obtain the next observation and reward (which goes into the target of the loss function). A minimal sketch of this control flow follows this list.
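
To make the recomputation concrete, here is a minimal, self-contained sketch of that control flow, assuming a toy linear Q-function and hypothetical names (written in TF 1.x compatibility style via tf.compat.v1 rather than the 0.6-era API used in the code further down). The forward pass for the previous observation is run once to pick the action and then again inside the training call, because the two happen in separate sess.run() calls:

    import numpy as np
    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()

    obs_ph = tf.placeholder(tf.float32, [1, 9])          # current observation
    prev_obs_ph = tf.placeholder(tf.float32, [1, 9])     # previous observation
    prev_act_ph = tf.placeholder(tf.int32, [])           # previously chosen action
    reward_ph = tf.placeholder(tf.float32, [])

    w = tf.Variable(tf.random_normal([9, 4]))

    def q(o):                                            # toy linear Q-function
        return tf.matmul(o, w)

    action = tf.argmax(q(obs_ph), axis=1)[0]             # used to act
    bootstrap = tf.reduce_max(q(obs_ph))                 # max_a' Q(s', a')
    prev_q = tf.gather(q(prev_obs_ph)[0], prev_act_ph)   # Q(s, a), recomputed
    loss = tf.square(reward_ph + 0.5 * tf.stop_gradient(bootstrap) - prev_q)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    obs = np.random.rand(1, 9).astype(np.float32)
    a = sess.run(action, feed_dict={obs_ph: obs})        # run 1: forward pass for obs
    # ... control returns to the environment, which produces (reward, next_obs) ...
    reward, next_obs = 1.0, np.random.rand(1, 9).astype(np.float32)
    sess.run(train_op, feed_dict={obs_ph: next_obs,      # run 2: the forward pass
                                  prev_obs_ph: obs,      # for obs is redone here,
                                  prev_act_ph: int(a),   # just to rebuild the loss
                                  reward_ph: reward})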
So, is there any way that I can (without all the reinforcement-learning jargon):

  • Compute part of the graph, returning value1.
  • Return value1 to the calling program, which uses it to compute value2.
  • In the next iteration, use value2 as part of the loss function for gradient descent, WITHOUT recomputing the part of the graph that computes value1?
Of course, I have considered the obvious solutions:

  • Just hardcode the gradients: this would be easy for the really simple approximators I'm using now, but it would be very inconvenient if I were experimenting with different filters and activation functions in a big convolutional network. I'd really like to use the Optimizer class if at all possible. (A rough sketch of what hand-coded gradients would look like follows this list.)

  • Call the environment simulation from inside the agent: this works, but it makes my environment more complicated and removes a lot of the modularity and structure. So I don't want to do this.
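
As an illustration of that first option, here is a rough NumPy sketch of a hand-derived SGD update for a purely linear approximator Q(s, a) = s · w[:, a], which is much simpler than the one-hidden-layer network in the code below. Everything here is hypothetical, and the derivation would have to be redone for every new architecture, which is exactly the inconvenience mentioned above:

    import numpy as np

    def manual_q_update(w, s, a, r, s_next, gamma=0.5, lr=0.01):
        """One hand-coded SGD step on the squared TD error for a linear Q-function."""
        target = r + gamma * np.max(s_next @ w)   # bootstrap target, treated as a constant
        td_error = target - (s @ w)[a]            # scalar prediction error
        # d(td_error^2)/d(w[:, a]) = -2 * td_error * s, so gradient descent adds it back:
        w[:, a] += lr * 2.0 * td_error * s
        return w

    # Example usage with random data (9 features, 4 actions):
    rng = np.random.default_rng(0)
    w = rng.normal(size=(9, 4))
    s, s_next = rng.random(9), rng.random(9)
    w = manual_q_update(w, s, a=2, r=1.0, s_next=s_next)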

I've read through the API and the whitepaper several times, but I can't seem to find a solution. I was trying to come up with some way of feeding the target into a graph to compute the gradients, but couldn't come up with a way to build that graph automatically.

If it turns out this just isn't possible in TensorFlow yet, do you think it would be very complicated to implement as a new operator? (I haven't used C++ in a couple of years, so the TensorFlow source looks a little intimidating.) Or would I be better off switching to something like Torch, which has imperative rather than symbolic differentiation? Thanks for taking the time to help me with this; I've tried to keep it as concise as I could.

Edit: After doing some more searching, I came across a similar question. It's a little different from mine (they're trying to avoid updating an LSTM network twice per iteration in Torch), and it doesn't have any answers yet.

Here is some of the code, if it helps:

    '''
    -Q-Learning agent for a grid-world environment.
    -Receives input as raw RGB pixel representation of the screen.
    -Uses an artificial neural network function approximator with one hidden layer
    
    2015 Jonathon Byrd
    '''
    
    import random
    import sys
    #import copy
    from rlglue.agent.Agent import Agent
    from rlglue.agent import AgentLoader as AgentLoader
    from rlglue.types import Action
    from rlglue.types import Observation
    
    import tensorflow as tf
    import numpy as np
    
    world_size = (3,3)
    total_spaces = world_size[0] * world_size[1]
    
    class simple_agent(Agent):
    
        #Contants
        discount_factor = tf.constant(0.5, name="discount_factor")
        learning_rate = tf.constant(0.01, name="learning_rate")
        exploration_rate = tf.Variable(0.2, name="exploration_rate")  # used to be a constant :P
        hidden_layer_size = 12
    
        #Network Parameters - weights and biases
        W = [tf.Variable(tf.truncated_normal([total_spaces * 3, hidden_layer_size], stddev=0.1), name="layer_1_weights"), 
        tf.Variable(tf.truncated_normal([hidden_layer_size,4], stddev=0.1), name="layer_2_weights")]
        b = [tf.Variable(tf.zeros([hidden_layer_size]), name="layer_1_biases"), tf.Variable(tf.zeros([4]), name="layer_2_biases")]
    
        #Input placeholders - observation and reward
        screen = tf.placeholder(tf.float32, shape=[1, total_spaces * 3], name="observation") #input pixel rgb values
        reward = tf.placeholder(tf.float32, shape=[], name="reward")
    
        #last step data
        last_obs = np.array([1, 2, 3], ndmin=4)
        last_act = -1
    
        #Last step placeholders
        last_screen = tf.placeholder(tf.float32, shape=[1, total_spaces * 3], name="previous_observation")
        last_move = tf.placeholder(tf.int32, shape = [], name="previous_action")
    
        next_prediction = tf.placeholder(tf.float32, shape = [], name="next_prediction")
    
        step_count = 0
    
        def __init__(self):
            #Initialize computational graphs
            self.q_preds = self.Q(self.screen)
            self.last_q_preds = self.Q(self.last_screen)
            self.action = self.choose_action(self.q_preds)
            self.next_pred = self.max_q(self.q_preds)
            self.last_pred = self.act_to_pred(self.last_move, self.last_q_preds) # inefficient recomputation
            self.loss = self.error(self.last_pred, self.reward, self.next_prediction)
            self.train = self.learn(self.loss)
            #Summaries and Statistics
            tf.scalar_summary(['loss'], self.loss)
            tf.scalar_summary('reward', self.reward)
            #w_hist = tf.histogram_summary("weights", self.W[0])
            self.summary_op = tf.merge_all_summaries()
            self.sess = tf.Session()
            self.summary_writer = tf.train.SummaryWriter('tensorlogs', graph_def=self.sess.graph_def)
    
    
        def agent_init(self,taskSpec):
            print("agent_init called")
            self.sess.run(tf.initialize_all_variables())
    
        def agent_start(self,observation):
            #print("agent_start called, observation = {0}".format(observation.intArray))
            o = np.divide(np.reshape(np.asarray(observation.intArray), (1,total_spaces * 3)), 255)
            return self.control(o)
    
        def agent_step(self,reward, observation):
            #print("agent_step called, observation = {0}".format(observation.intArray))
            print("step, reward: {0}".format(reward))
            o = np.divide(np.reshape(np.asarray(observation.intArray), (1,total_spaces * 3)), 255)
    
            next_prediction = self.sess.run([self.next_pred], feed_dict={self.screen:o})[0]
    
            if self.step_count % 10 == 0:
                summary_str = self.sess.run([self.summary_op, self.train], 
                    feed_dict={self.reward:reward, self.last_screen:self.last_obs, 
                    self.last_move:self.last_act, self.next_prediction:next_prediction})[0]
    
                self.summary_writer.add_summary(summary_str, global_step=self.step_count)
            else:
                self.sess.run([self.train], 
                    feed_dict={self.screen:o, self.reward:reward, self.last_screen:self.last_obs, 
                    self.last_move:self.last_act, self.next_prediction:next_prediction})
    
            return self.control(o)
    
        def control(self, observation):
            results = self.sess.run([self.action], feed_dict={self.screen:observation})
            action = results[0]
    
            self.last_act = action
            self.last_obs = observation
    
            if (action==0):  # convert action integer to direction character
                action = 'u'
            elif (action==1):
                action = 'l'
            elif (action==2):
                action = 'r'
            elif (action==3):
                action = 'd'
            returnAction=Action()
            returnAction.charArray=[action]
            #print("return action returned {0}".format(action))
            self.step_count += 1
            return returnAction
    
        def Q(self, obs):  #calculates state-action value prediction with feed-forward neural net
            with tf.name_scope('network_inference') as scope:
                h1 = tf.nn.relu(tf.matmul(obs, self.W[0]) + self.b[0])
                q_preds = tf.matmul(h1, self.W[1]) + self.b[1] #linear activation
                return tf.reshape(q_preds, shape=[4])
    
        def choose_action(self, q_preds):  #chooses action epsilon-greedily
            with tf.name_scope('action_choice') as scope:
                exploration_roll = tf.random_uniform([])
                #greedy_action = tf.argmax(q_preds, 0)  # gets the action with the highest predicted Q-value
                #random_action = tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)
    
                #exploration rate updates
                #if self.step_count % 10000 == 0:
                    #self.exploration_rate.assign(tf.div(self.exploration_rate, 2))
    
                return tf.select(tf.greater_equal(exploration_roll, self.exploration_rate), 
                    tf.argmax(q_preds, 0),   #greedy_action
                    tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64))  #random_action
    
            '''
            Why does this return NoneType?:
    
            flag = tf.select(tf.greater_equal(exploration_roll, self.exploration_rate), 'g', 'r')
            if flag == 'g':  #greedy
                return tf.argmax(q_preds, 0) # gets the action with the highest predicted Q-value
            elif flag == 'r':  #random
                return tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)
            '''
    
        def error(self, last_pred, r, next_pred):
            with tf.name_scope('loss_function') as scope:
                y = tf.add(r, tf.mul(self.discount_factor, next_pred)) #target
                return tf.square(tf.sub(y, last_pred)) #squared difference error
    
    
        def learn(self, loss): #Update parameters using stochastic gradient descent
            #TODO:  Either figure out how to avoid computing the q-prediction twice or just hardcode the gradients.
            with tf.name_scope('train') as scope:
                return tf.train.GradientDescentOptimizer(self.learning_rate).minimize(loss, var_list=[self.W[0], self.W[1], self.b[0], self.b[1]])
    
    
        def max_q(self, q_preds):
            with tf.name_scope('greedy_estimate') as scope:
                return tf.reduce_max(q_preds)  #best predicted action from current state
    
        def act_to_pred(self, a, preds): #get the value prediction for action a
            with tf.name_scope('get_prediction') as scope:
                return tf.slice(preds, tf.reshape(a, shape=[1]), [1])
    
    
        def agent_end(self,reward):
            pass
    
        def agent_cleanup(self):
            self.sess.close()
            pass
    
        def agent_message(self,inMessage):
            if inMessage=="what is your name?":
                return "my name is simple_agent";
            else:
                return "I don't know how to respond to your message";
    
    if __name__=="__main__":
        AgentLoader.loadAgent(simple_agent())
    

Right now, doing what you want is very difficult in TensorFlow (0.6). Your best option is to bite the bullet and call run multiple times, at the cost of recomputing the activations. However, we are very aware of this issue internally. A prototype "partial run" solution is in the works, but there is currently no timeline for its completion. Since a truly satisfactory answer might require modifying TensorFlow itself, you could also file a GitHub issue for this and see whether anyone else has anything to say about it there.

Edit: Experimental support for partial runs has now landed.
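
For reference, here is a minimal sketch of how the partial-run API (Session.partial_run_setup / Session.partial_run, written here in TF 1.x compatibility style via tf.compat.v1) can split one execution of a graph across two calls without recomputing the shared activations. The tensor names are illustrative rather than taken from the code above, and the interface has stayed experimental, so treat this as a sketch rather than a reference:

    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()

    x = tf.placeholder(tf.float32, shape=[1, 4], name="observation")
    w = tf.Variable(tf.random_normal([4, 2]))
    q_preds = tf.matmul(x, w)                            # computed once
    action = tf.argmax(q_preds, axis=1)[0]               # fetched in the first call

    reward = tf.placeholder(tf.float32, shape=[], name="reward")
    loss = tf.square(reward - tf.reduce_max(q_preds))    # schematic TD-style loss
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Declare up front everything this partial run may feed and fetch.
        handle = sess.partial_run_setup([action, loss, train_op], [x, reward])
        # First call: pick an action; q_preds is evaluated here.
        a = sess.partial_run(handle, action, feed_dict={x: [[1., 0., 0., 0.]]})
        # ... hand `a` to the environment and get the next reward back ...
        r = 1.0
        # Second call: q_preds is reused from the first call, not recomputed.
        sess.partial_run(handle, [loss, train_op], feed_dict={reward: r})

Note that within a single handle, each placeholder can only be fed once and each tensor fetched once, so all feeds and fetches have to be declared in partial_run_setup.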