
Python: Google Cloud Machine Learning out of memory

Tags: python, machine-learning, out-of-memory, google-cloud-platform, cloud

I run into an out-of-memory problem when I choose the following configuration (config.yaml):

    trainingInput:
      scaleTier: CUSTOM
      masterType: large_model
      workerType: complex_model_m
      parameterServerType: large_model
      workerCount: 10
      parameterServerCount: 10

I am following Google's tutorial on "criteo_tft":

That link says they were able to train on 1 TB of data! I was excited to give it a try.

My dataset is categorical, so it produces a fairly large matrix after one-hot encoding (a 2D numpy array of size 520,000 x 4,000). I can train on this dataset on a local machine with 32 GB of RAM, but I cannot do the same in the cloud.
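For a sense of scale, the dense matrix alone accounts for roughly half of that 32 GB. A quick back-of-the-envelope check (float64 cells and ~40 non-zeros per row are assumptions, since the post does not give the dtype or density):

    # Rough memory footprint of the one-hot matrix described above.
    rows, cols = 520_000, 4_000

    # Dense float64: 8 bytes per cell.
    dense_gb = rows * cols * 8 / 1e9
    print('dense float64: %.1f GB' % dense_gb)    # ~16.6 GB

    # A scipy.sparse CSR matrix stores only non-zero entries: an 8-byte value
    # and a 4-byte column index per entry, plus a 4-byte pointer per row.
    # Assuming ~40 active categories per row (hypothetical):
    nnz = rows * 40
    csr_gb = (nnz * (8 + 4) + (rows + 1) * 4) / 1e9
    print('CSR estimate:  %.2f GB' % csr_gb)      # ~0.25 GB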

Here are my errors:

    ERROR   2017-12-18 12:57:37 +1100   worker-replica-1   Using TensorFlow backend.
    ERROR   2017-12-18 12:57:37 +1100   worker-replica-4   Using TensorFlow backend.
    INFO    2017-12-18 12:57:37 +1100   worker-replica-0   Running command: python -m trainer.task --train-file gs://my_bucket/my_training_file.csv --job-dir gs://my_bucket/my_bucket_20171218_125645
    ERROR   2017-12-18 12:57:38 +1100   worker-replica-2   Using TensorFlow backend.
    ERROR   2017-12-18 12:57:40 +1100   worker-replica-0   Using TensorFlow backend.
    ERROR   2017-12-18 12:57:53 +1100   worker-replica-3   Command '['python', '-m', u'trainer.task', u'--train-file', u'gs://my_bucket/my_training_file.csv', '--job-dir', u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
    INFO    2017-12-18 12:57:53 +1100   worker-replica-3   Module completed; cleaning up.
    INFO    2017-12-18 12:57:53 +1100   worker-replica-3   Clean up finished.
    ERROR   2017-12-18 12:57:56 +1100   worker-replica-4   Command '['python', '-m', u'trainer.task', u'--train-file', u'gs://my_bucket/my_training_file.csv', '--job-dir', u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
    INFO    2017-12-18 12:57:56 +1100   worker-replica-4   Module completed; cleaning up.
    INFO    2017-12-18 12:57:56 +1100   worker-replica-4   Clean up finished.
    ERROR   2017-12-18 12:57:58 +1100   worker-replica-2   Command '['python', '-m', u'trainer.task', u'--train-file', u'gs://my_bucket/my_training_file.csv', '--job-dir', u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
    INFO    2017-12-18 12:57:58 +1100   worker-replica-2   Module completed; cleaning up.
    INFO    2017-12-18 12:57:58 +1100   worker-replica-2   Clean up finished.
    ERROR   2017-12-18 12:57:59 +1100   worker-replica-1   Command '['python', '-m', u'trainer.task', u'--train-file', u'gs://my_bucket/my_training_file.csv', '--job-dir', u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
    INFO    2017-12-18 12:57:59 +1100   worker-replica-1   Module completed; cleaning up.
    INFO    2017-12-18 12:57:59 +1100   worker-replica-1   Clean up finished.
    ERROR   2017-12-18 12:58:01 +1100   worker-replica-0   Command '['python', '-m', u'trainer.task', u'--train-file', u'gs://my_bucket/my_training_file.csv', '--job-dir', u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
    INFO    2017-12-18 12:58:01 +1100   worker-replica-0   Module completed; cleaning up.
    INFO    2017-12-18 12:58:01 +1100   worker-replica-0   Clean up finished.
    ERROR   2017-12-18 12:58:43 +1100   service   The replica worker 0 ran out-of-memory and exited with a non-zero status of 247. The replica worker 1 ran out-of-memory and exited with a non-zero status of 247. The replica worker 2 ran out-of-memory and exited with a non-zero status of 247. The replica worker 3 ran out-of-memory and exited with a non-zero status of 247. The replica worker 4 ran out-of-memory and exited with a non-zero status of 247. To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=a_project_id........(link to my cloud log)
    INFO    2017-12-18 12:58:44 +1100   ps-replica-0   Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-1   Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-0   Module completed; cleaning up.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-0   Clean up finished.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-1   Module completed; cleaning up.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-1   Clean up finished.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-2   Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-2   Module completed; cleaning up.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-2   Clean up finished.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-3   Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-5   Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-3   Module completed; cleaning up.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-3   Clean up finished.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-5   Module completed; cleaning up.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-5   Clean up finished.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-4   Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-4   Module completed; cleaning up.
    INFO    2017-12-18 12:58:44 +1100   ps-replica-4   Clean up finished.
    INFO    2017-12-18 12:59:28 +1100   service   Finished tearing down TensorFlow.
    INFO    2017-12-18 13:00:17 +1100   service   Job failed.

Please don't worry about the "Using TensorFlow backend" errors; I get those even for other, smaller datasets where the training job succeeds.


Can anyone explain what is causing the out-of-memory failure (error 247) and how I should write the config.yaml file to avoid it, so that I can train my data in the cloud?

I have solved the problem. (Exit status -9 means the worker processes were killed by SIGKILL, typically the Linux out-of-memory killer; the service reports that status as 247, i.e. 256 - 9.) I needed to do a few things:

  • Change the TensorFlow version and, in particular, the way I submit the training job to the cloud (a submission sketch follows this list)

  • Instead of one-hot encoding [which creates a new column for every new item added], switch to a more memory-friendly encoding (one possibility is sketched after this list)

  • It can now train a categorical dataset with 2.5 million rows and 4,200 encoded columns
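The answer does not include the exact submission command. One way to pin a job to a specific TensorFlow runtime is to submit it through the Cloud ML Engine REST API with google-api-python-client, as in the sketch below; the project ID, job ID, bucket paths, region, and runtime version here are all hypothetical, not taken from the post:

    from googleapiclient import discovery

    # Build a client for the Cloud ML Engine v1 API (uses application
    # default credentials).
    ml = discovery.build('ml', 'v1')

    job = {
        'jobId': 'my_job_20171218',          # hypothetical job ID
        'trainingInput': {
            # Same cluster shape as the config.yaml in the question.
            'scaleTier': 'CUSTOM',
            'masterType': 'large_model',
            'workerType': 'complex_model_m',
            'parameterServerType': 'large_model',
            'workerCount': 10,
            'parameterServerCount': 10,
            # Hypothetical package and arguments for trainer.task.
            'packageUris': ['gs://my_bucket/packages/trainer-0.1.tar.gz'],
            'pythonModule': 'trainer.task',
            'args': ['--train-file', 'gs://my_bucket/my_training_file.csv'],
            'region': 'us-central1',
            'runtimeVersion': '1.4',         # pin the TensorFlow version
        },
    }

    # projects.jobs.create starts the training job asynchronously.
    ml.projects().jobs().create(parent='projects/my_project', body=job).execute()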
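The answer does not say which encoding replaced one-hot. One memory-friendly option in TensorFlow is to hash the raw categories and feed them through an embedding, so the dense 520,000 x 4,000 matrix is never materialized. A minimal sketch, where the feature name, bucket count, and embedding size are assumptions:

    import tensorflow as tf

    # Hash each raw category into a fixed number of buckets instead of
    # materializing one dense 0/1 column per distinct value.
    category = tf.feature_column.categorical_column_with_hash_bucket(
        key='category',            # hypothetical input feature name
        hash_bucket_size=4200)     # mirrors the ~4,200 encoded columns above

    # The embedding gives the model a small dense input (16 floats per row)
    # no matter how many raw categories exist.
    category_emb = tf.feature_column.embedding_column(category, dimension=16)

    estimator = tf.estimator.DNNClassifier(
        feature_columns=[category_emb],
        hidden_units=[128, 64])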
