PAI tutorial example fails to run with '[ExitCode]: 177'
Tags: tensorflow, openpai


I am following the PAI job tutorial.

Here is my job config:

{
  "jobName": "yuan_tensorflow-distributed-jobguid",
  "image": "docker.io/openpai/pai.run.tensorflow",
  "dataDir": "hdfs://10.11.3.2:9000/yuan/sample/tensorflow",
  "outputDir": "$PAI_DEFAULT_FS_URI/yuan/tensorflow-distributed-jobguid/output",
  "codeDir": "$PAI_DEFAULT_FS_URI/path/tensorflow-distributed-jobguid/code",
  "virtualCluster": "default",
  "taskRoles": [
    {
      "name": "ps_server",
      "taskNumber": 2,
      "cpuNumber": 2,
      "memoryMB": 8192,
      "gpuNumber": 0,
      "portList": [
        {
          "label": "http",
          "beginAt": 0,
          "portNumber": 1
        },
        {
          "label": "ssh",
          "beginAt": 0,
          "portNumber": 1
        }
      ],
      "command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=ps --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX"
    },
    {
      "name": "worker",
      "taskNumber": 2,
      "cpuNumber": 2,
      "memoryMB": 16384,
      "gpuNumber": 4,
      "portList": [
        {
          "label": "http",
          "beginAt": 0,
          "portNumber": 1
        },
        {
          "label": "ssh",
          "beginAt": 0,
          "portNumber": 1
        }
      ],
      "command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=worker --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX"
    }
  ],
  "killAllOnCompletedTaskNumber": 2,
  "retryCount": 0
}
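The job was submitted successfully, but it failed after about 4 minutes.

Below is my "Application Summary":

Start Time: 6/15/2018, 8:18:01 PM

Finish Time: 6/15/2018, 8:22:31 PM

Exit Diagnostics: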

[ExitStatus]: LAUNCHER_EXIT_STATUS_UNDEFINED
[ExitCode]: 177
[ExitDiagnostics]: ExitStatus undefined in Launcher, maybe UserApplication itself failed.
[ExitType]: UNKNOWN
________________________________________________________________________________________________________________________________________________________________________________________________________
[ExitCustomizedDiagnostics]:
[ExitCode]: 1
[ExitDiagnostics]: Exception from container-launch.
Container id: container_1529064439409_0003_01_000005
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    at org.apache.hadoop.util.Shell.run(Shell.java:456)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Shell output: [ERROR] EXIT signal received in yarn container, exiting ...

Container exited with a non-zero exit code 1

________________________________________________________________________________________________________________________________________________________________________________________________________
[ExitCustomizedDiagnostics]:

: TASK_COMPLETED:
[TaskStatus]:
{
  "taskIndex": 1,
  "taskRoleName": "worker",
  "taskState": "TASK_COMPLETED",
  "taskRetryPolicyState": {
    "retriedCount": 0,
    "succeededRetriedCount": 0,
    "transientNormalRetriedCount": 0,
    "transientConflictRetriedCount": 0,
    "nonTransientRetriedCount": 0,
    "unKnownRetriedCount": 0
  },
  "taskCreatedTimestamp": 1529065083290,
  "taskCompletedTimestamp": 1529065346772,
  "taskServiceStatus": {
    "serviceVersion": 0
  },
  "containerId": "container_1529064439409_0003_01_000005",
  "containerHost": "10.11.1.9",
  "containerIp": "10.11.1.9",
  "containerPorts": "http:2938;ssh:2939;",
  "containerGpus": 15,
  "containerLogHttpAddress": "",
  "containerConnectionLostCount": 0,
  "containerIsDecommissioning": null,
  "containerLaunchedTimestamp": 1529065087200,
  "containerCompletedTimestamp": 1529065346768,
  "containerExitCode": 1,
  "containerExitDiagnostics": "Exception from container-launch.\nContainer id: container_1529064439409_0003_01_000005\nExit code: 1\nStack trace: ExitCodeException exitCode=1: \n\tat org.apache.hadoop.util.Shell.runCommand(Shell.java:545)\n\tat org.apache.hadoop.util.Shell.run(Shell.java:456)\n\tat org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)\n\tat org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n\nShell output: [ERROR] EXIT signal received in yarn container, exiting ...\n\n\nContainer exited with a non-zero exit code 1\n",
  "containerExitType": "UNKNOWN"
}
[ContainerDiagnostics]:
Container completed container_1529064439409_0003_01_000005 on HostName 10.11.1.9.
ContainerLogHttpAddress:
AppCacheNetworkPath: 10.11.1.9:/var/lib/hadoopdata/nm-local-dir/usercache/admin/appcache/application_1529064439409_0003
ContainerLogNetworkPath: 10.11.1.9:/var/lib/yarn/userlogs/application_1529064439409_0003/container_1529064439409_0003_01_000005
________________________________________________________________________________________________________________________________________________________________________________________________________
[AMStopReason]: Task Completed and KillAllOnAnyCompleted enabled.

I found more details in the log:

[INFO] hdfs_ssh_folder is hdfs://10.11.3.2:9000/Container/admin/yuan_tensorflow-distributed-2/ssh/application_1529064439409_0450
[INFO] task_role_no is 0
[INFO] PAI_TASK_INDEX is 1
[INFO] waitting for ssh key ready
[INFO] waitting for ssh key ready
[INFO] ssh key pair ready ...
[INFO] begin to download ssh key pair from hdfs ...
[INFO] start ssh service
 * Restarting OpenBSD Secure Shell server sshd       [ OK ]
[INFO] USER COMMAND START

Traceback (most recent call last):
  File "code/tf_cnn_benchmarks.py", line 38, in <module>
    import benchmark_storage
ImportError: No module named benchmark_storage
[DEBUG] EXIT signal received in docker container, exiting ...

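In general, you need to look at the logs of all workers, especially the container that exited first, to see what happened there. Any exited container causes the launcher to stop the job early, which is why you see the "EXIT signal received in yarn container" message in the application's diagnostics.

The logs of a failed job are not deleted. After the job completes, they are moved to HDFS.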
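As a rough illustration of "find the container that exited first", the sketch below picks the task with the earliest containerCompletedTimestamp from a list of task status records. The field names follow the TaskStatus record shown in the question; the function name and the two extra records are hypothetical and exist only for the example, and how you collect the records from your cluster is not covered here.

```python
from typing import Iterable, Optional


def first_exited_task(task_statuses: Iterable[dict]) -> Optional[dict]:
    """Return the task whose container completed earliest, or None.

    Tasks that are still running (no completion timestamp) are ignored.
    """
    completed = [
        t for t in task_statuses
        if t.get("containerCompletedTimestamp") is not None
    ]
    if not completed:
        return None
    return min(completed, key=lambda t: t["containerCompletedTimestamp"])


# The first record uses values from the question; the other two are
# made-up placeholders for illustration only.
statuses = [
    {"taskRoleName": "worker", "taskIndex": 1,
     "containerId": "container_1529064439409_0003_01_000005",
     "containerCompletedTimestamp": 1529065346768, "containerExitCode": 1},
    {"taskRoleName": "worker", "taskIndex": 0,
     "containerId": "container_hypothetical_a",
     "containerCompletedTimestamp": 1529065349000, "containerExitCode": 143},
    {"taskRoleName": "ps_server", "taskIndex": 0,
     "containerId": "container_hypothetical_b",
     "containerCompletedTimestamp": None, "containerExitCode": None},
]

first = first_exited_task(statuses)
print(first["taskRoleName"], first["taskIndex"], first["containerId"])
```

The container it reports is the one whose log is most likely to contain the real error, since later exits are usually a consequence of the launcher stopping the job.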


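From your log, it seems the code is missing some files. Please download the whole benchmarks folder rather than just one or two files (such as the cnn benchmark script).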
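One way to catch this kind of ImportError before submitting the job is to scan the entry script for imports that cannot be resolved. The sketch below is only an approximation: the function name is hypothetical, it looks for a matching file or package next to the script and otherwise asks the local Python environment, so a library that is installed in the Docker image but not on your machine will also be reported.

```python
import ast
import importlib.util
from pathlib import Path


def missing_local_modules(script_path):
    """List modules imported by the script that are neither files or
    packages next to it nor importable in the current environment."""
    script = Path(script_path)
    tree = ast.parse(script.read_text())

    # Collect the first component of every absolute import.
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])

    missing = []
    for name in sorted(names):
        beside_script = (
            (script.parent / f"{name}.py").exists()
            or (script.parent / name).is_dir()
        )
        if not beside_script and importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing
```

Calling missing_local_modules("code/tf_cnn_benchmarks.py") on a directory that contains only that one file would be expected to list sibling modules such as benchmark_storage, which is the module named in the traceback above.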

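Can you find anything in the container logs?

Thanks, but PAI seems to clean up the job containers immediately. I have to keep refreshing the log output.

You are right, some dependencies are missing. I have updated my question. Thanks.
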
{
  "jobName": "tensorflow-cifar10",
  "image": "openpai/pai.example.tensorflow",

  "dataDir": "/tmp/data",
  "outputDir": "/tmp/output",

  "taskRoles": [
    {
      "name": "cifar_train",
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 32768,
      "gpuNumber": 1,
      "command": "git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=$PAI_DATA_DIR && python train_image_classifier.py --batch_size=64 --model_name=inception_v3 --dataset_name=cifar10 --dataset_split_name=train --dataset_dir=$PAI_DATA_DIR --train_dir=$PAI_OUTPUT_DIR"
    }
  ]
}