Python: creating a softmax from a tf.distributions.Categorical output layer
I'm training an agent to perform actions in a discrete environment, using a tf.distributions.Categorical output layer that I then sample to create a softmax output, which determines the action to take. The policy network I've created looks like this:
pi_eval, _ = self._build_anet(self.state, 'pi', reuse=True)

def _build_anet(self, state_in, name, reuse=False):
    w_reg = tf.contrib.layers.l2_regularizer(L2_REG)
    with tf.variable_scope(name, reuse=reuse):
        layer_1 = tf.layers.dense(state_in, HIDDEN_LAYER_NEURONS, tf.nn.relu, kernel_regularizer=w_reg, name="pi_l1")
        layer_2 = tf.layers.dense(layer_1, HIDDEN_LAYER_NEURONS, tf.nn.relu, kernel_regularizer=w_reg, name="pi_l2")
        a_logits = tf.layers.dense(layer_2, self.a_dim, kernel_regularizer=w_reg, name="pi_logits")
        dist = tf.distributions.Categorical(logits=a_logits)
        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)
    return dist, params
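For context, a Categorical distribution built from logits already defines the softmax probabilities being estimated here: its per-action probabilities are just softmax(a_logits). A minimal NumPy sketch of that relationship (the logit values are invented for illustration):

```python
import numpy as np

def softmax(logits):
    # subtract the max before exponentiating for numerical stability
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

a_logits = np.array([1.0, 2.0, 0.5, -1.0])  # hypothetical logits for a_dim = 4
probs = softmax(a_logits)
print(probs)  # one probability per action, summing to 1
```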
I then sample the network, building up a categorical-distribution output as a softmax output, and run it like this:

softmax = self.sess.run([self.logits_action], {self.state: state[np.newaxis, :]})
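Estimating those probabilities by sampling is essentially a Monte Carlo frequency count over the drawn action indices. A small NumPy sketch of the idea (the probability values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
probs = np.array([0.1, 0.6, 0.1, 0.2])  # hypothetical true action probabilities
# draw many actions, then count how often each index appears
samples = rng.choice(len(probs), size=20000, p=probs)
estimate = np.bincount(samples, minlength=len(probs)) / len(samples)
print(estimate)  # close to probs for a large enough sample count
```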
But the output only ever has two non-zero entries:
[0.44329998 0. 0. 0.5567 ]
[0.92139995 0. 0. 0.0786 ]
[0.95699996 0. 0. 0.043 ]
[0.7051 0. 0. 0.2949]
My hunch is that this is related to value_range, whose documentation says:

value_range: A Shape [2] Tensor of the same dtype as values. values <= value_range[0] will be mapped to hist[0], values >= value_range[1] will be mapped to hist[-1].
But I'm not sure what value_range should be. Does anyone have any ideas?
As I suspected, it was indeed related to value_range: the upper bound needs to be set to the action dimension:

value_range=[0, self.a_dim]
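The clamping behaviour quoted above explains the two-non-zero-entry output: with too small a value_range, every action index at or above the upper bound is clamped into the last bin. The following NumPy sketch re-implements fixed-width bucketing with clamping to reproduce the symptom (illustrative only, not the TensorFlow source):

```python
import numpy as np

def fixed_width_hist(values, value_range, nbins):
    # values outside value_range are clamped into the edge bins,
    # mirroring the documented value_range behaviour
    lo, hi = value_range
    clipped = np.clip(values, lo, hi)
    idx = np.minimum(((clipped - lo) / (hi - lo) * nbins).astype(int), nbins - 1)
    return np.bincount(idx, minlength=nbins)

rng = np.random.default_rng(0)
a_dim = 4
samples = rng.integers(0, a_dim, size=10000)  # sampled action indices 0..3

bad = fixed_width_hist(samples, [0, 1], a_dim) / len(samples)
good = fixed_width_hist(samples, [0, a_dim], a_dim) / len(samples)
print(bad)   # only the first and last entries are non-zero
print(good)  # mass spread over all four actions
```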