TensorFlow Estimator keeps creating devices


I have a custom estimator for a two-layer neural network, and I want to extract the validation-set performance at each epoch in order to control early stopping. I am using TensorFlow 1.4.1 and running on three GTX 970 GPUs. My estimator looks like the following (I want to customize the architecture of my network, so I can't use one of the canned estimators):

import tensorflow as tf

def my_estimator(features, labels, mode, params):
    """ NN estimator function """

    # set up regularizer and dropout rate
    regularizer = tf.contrib.layers.l2_regularizer(scale=params["l2_lambda"])
    drop_rate = params["dropout"]

    # input layer will just be features
    input_layer = features

    # hidden fully connected layer with n_hidden_1 neurons
    # (n_hidden_1, n_hidden_2 and p are assumed to be defined at module level)
    layer_1 = tf.layers.dense(inputs=input_layer, units=n_hidden_1,
                              activation=tf.nn.relu, kernel_regularizer=regularizer)

    # dropout for hidden 1
    dropout_1 = tf.layers.dropout(inputs=layer_1, rate=drop_rate,
                                  training=mode == tf.estimator.ModeKeys.TRAIN)

    # hidden fully connected layer with n_hidden_2 neurons
    layer_2 = tf.layers.dense(inputs=dropout_1, units=n_hidden_2,
                              activation=tf.nn.relu, kernel_regularizer=regularizer)

    # dropout for hidden 2
    dropout_2 = tf.layers.dropout(inputs=layer_2, rate=drop_rate,
                                  training=mode == tf.estimator.ModeKeys.TRAIN)

    # output fully connected layer with p neurons
    output_layer = tf.layers.dense(inputs=dropout_2, units=p,
                                   activation=None, kernel_regularizer=regularizer)

    # reshape output layer to 1-dim tensor to return predictions
    predictions = tf.reshape(output_layer, [-1])

    predictions_dict = {"predictions": predictions}

    # if prediction mode, early return
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions=predictions_dict)

    # calculate loss using mean squared error plus the L2 regularization term
    objective_loss = tf.losses.mean_squared_error(tf.reshape(labels, [-1]), predictions)
    reg_losses = tf.cast(tf.losses.get_regularization_loss(), tf.float32)
    loss = objective_loss + reg_losses

    # add eval metric ops
    eval_metric_ops = {"rmse": tf.metrics.root_mean_squared_error(tf.reshape(labels, [-1]), predictions),
                       "mae": tf.metrics.mean_absolute_error(tf.reshape(labels, [-1]), predictions)}

    # build training op with an exponentially decaying learning rate
    learning_rate = tf.train.exponential_decay(params["learning_rate"], tf.train.get_global_step(),
                                               100, 0.96, staircase=True)

    gdoptim = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    train_op = gdoptim.minimize(loss=loss, global_step=tf.train.get_global_step())

    # gradients are recomputed here only so their norm can be summarized
    grads_and_vars = gdoptim.compute_gradients(loss)
    grad_norms = [tf.nn.l2_loss(g) for g, v in grads_and_vars]
    grad_norm = tf.add_n(grad_norms)

    # add some additional information to the summary writer
    tf.summary.scalar("cost", objective_loss)
    tf.summary.scalar("reg_cost", reg_losses)
    tf.summary.scalar("grad_norm", grad_norm)
    tf.summary.scalar("learningrate", learning_rate)
    summary_op = tf.summary.merge_all()  # the Estimator writes summaries itself, so this is unused

    return tf.estimator.EstimatorSpec(mode, predictions=predictions_dict, loss=loss,
                                      train_op=train_op, eval_metric_ops=eval_metric_ops)
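(Not part of the original post: a hypothetical sketch of the model_params dict and run_config that this model_fn expects; only the keys "l2_lambda", "dropout" and "learning_rate" are implied by the code above, and all concrete values and the model_dir path are placeholders.)

# Hypothetical params/config; adjust values and model_dir to your setup.
model_params = {"l2_lambda": 1e-3, "dropout": 0.5, "learning_rate": 0.01}
run_config = tf.estimator.RunConfig(model_dir="/tmp/nn_model",
                                    save_summary_steps=100)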
My training loop looks like this:

nn = tf.estimator.Estimator(model_fn=my_estimator, params=model_params, config=run_config)
# for each training epoch:
for i in range(training_epochs):
    nn.train(input_fn=train_input_fn, steps=num_batches)
    val_result = nn.evaluate(input_fn=eval_val_input_fn)
    # decide to break based on val_result
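As an aside, here is a hypothetical sketch of what a batched train_input_fn of the kind described below might look like (train_features, train_labels and batch_size are placeholder names, not from the original post):

# Hypothetical sketch of a batched training input_fn (TF 1.4-style tf.data):
def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((train_features, train_labels))
    dataset = dataset.shuffle(buffer_size=10000).batch(batch_size)
    features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels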
train_input_fn is a tf.data.Dataset-based iterator that steps through the training data in batches, while eval_val_input_fn iterates over the much smaller validation set in a single batch. The features are all continuous and the labels are scalars; this is a regression task. My problem is that my output is flooded with hundreds of device-creation messages:

2018-01-29 17:33:49.721025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 970, pci bus id: 0000:02:00.0, compute capability: 5.2)
2018-01-29 17:33:49.721047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: GeForce GTX 970, pci bus id: 0000:03:00.0, compute capability: 5.2)
2018-01-29 17:33:49.854761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2)
2018-01-29 17:33:49.854793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 970, pci bus id: 0000:02:00.0, compute capability: 5.2)
2018-01-29 17:33:49.854811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: GeForce GTX 970, pci bus id: 0000:03:00.0, compute capability: 5.2)
2018-01-29 17:33:50.260143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2)
2018-01-29 17:33:50.260177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 970, pci bus id: 0000:02:00.0, compute capability: 5.2)

The Estimator seems to hide a lot of the session and graph details from the user. I don't see why the devices need to be created this many times; surely this could all be done within a single session?

When you call estimator.train, it starts a new session and reloads the weights from the checkpoint. That is why it keeps creating new sessions and devices.
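If the goal is simply to quiet the repeated device-creation messages (an aside, not something from the original answer), the usual approach is to raise the C++-side log threshold before TensorFlow is imported:

import os
# Hide INFO-level C++ log messages such as "Creating TensorFlow device ..."
# (use "2" to also hide warnings); must be set before importing tensorflow.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"
import tensorflow as tf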

Thanks for the reply. Is there a different workflow I should use to avoid this? Can't I just force the session to persist? I don't think you can avoid it; it is better to implement the evaluation through a hook.
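As a sketch of the framework-driven alternative hinted at above (an assumption on my part; it requires TF >= 1.4, and the throttle_secs value is only an example), tf.estimator.train_and_evaluate lets the Estimator interleave training and periodic evaluation instead of the hand-written per-epoch loop:

# Minimal sketch; nn, train_input_fn, eval_val_input_fn, training_epochs
# and num_batches are the objects defined in the question.
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,
                                    max_steps=training_epochs * num_batches)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_val_input_fn,
                                  steps=None,        # evaluate the whole validation set
                                  throttle_secs=60)  # evaluate at most once a minute
tf.estimator.train_and_evaluate(nn, train_spec, eval_spec)

Note that in local mode this still alternates between training and evaluation sessions restored from checkpoints, so the device-creation messages do not disappear; it mainly removes the manual loop and ties evaluation to checkpoint saves.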