SLURM job fails when creating a Dask LocalCluster instance on an HPC cluster
I am queuing a job with the sbatch command and the following configuration:
#SBATCH --job-name=dask-test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=80G
#SBATCH --time=00:30:00
#SBATCH --tmp=10G
#SBATCH --partition=normal
#SBATCH --qos=normal
python ./dask-test.py
The Python script looks roughly like this:
import pandas as pd
import dask.dataframe as dd
import numpy as np
from dask.distributed import Client, LocalCluster

print("Generating LocalCluster...")
cluster = LocalCluster()

print("Generating Client...")
client = Client(cluster, processes=False)

print("Scaling client...")
client.scale(8)

data = dd.read_csv(
    BASE_DATA_SOURCE + '/Data-BIGDATFILES-*.csv',
    delimiter=';',
)

def get_min_dt():
    min_dt = data.datetime.min().compute()
    print("Min is {}".format(min_dt))

print("Getting min dt...")
get_min_dt()
The first problem is that the text "Generating LocalCluster..." gets printed 6 times, which makes me suspect the script is somehow being run multiple times concurrently.
Second, after several minutes of nothing being printed, I receive the following message:
/anaconda3/lib/python3.7/site-packages/distributed/node.py:155: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 37396 instead
http_address["port"], self.http_server.port
many times, and finally the following, also many times:
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /cluster/home/user/anaconda3/lib/python3.7/asyncio/tasks.py:592> exception=RuntimeError('\n An attempt has been made to start a new process before the\n current process has finished its bootstrapping phase.\n\n This probably means that you are not using fork to start your\n child processes and you have forgotten to use the proper idiom\n in the main module:\n\n if __name__ == \'__main__\':\n freeze_support()\n ...\n\n The "freeze_support()" line can be omitted if the program\n is not going to be frozen to produce an executable.')>
Traceback (most recent call last):
File "/cluster/home/user/anaconda3/lib/python3.7/asyncio/tasks.py", line 599, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/core.py", line 290, in _
await self.start()
File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 295, in start
response = await self.instantiate()
File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 378, in instantiate
result = await self.process.start()
File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 575, in start
await self.process.start()
File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 34, in _call_and_set_future
res = func(*args, **kwargs)
File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 202, in _start
process.start()
File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/spawn.py", line 143, in get_preparation_data
_check_not_importing_main()
File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
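The RuntimeError above is the generic safeguard of Python's multiprocessing "spawn" start method rather than anything Dask-specific: each spawned child re-imports the main module, so any process-starting code at module level runs again inside every child. A minimal stdlib sketch of the correct idiom (function and variable names here are just for illustration, no Dask involved):

```python
import multiprocessing as mp

def work(q):
    # Runs in the child process.
    q.put("child ran")

if __name__ == "__main__":
    # "spawn" re-imports this module in each child; the guard above keeps
    # the process-starting code from executing again during that re-import.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=work, args=(q,))
    p.start()
    p.join()
    result = q.get()
    print(result)  # prints "child ran"
```

Remove the `if __name__ == "__main__":` guard and this tiny script fails with exactly the same "bootstrapping phase" RuntimeError, once per child, which is also why the message appears so many times in the Dask run.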
I have already tried adding more cores and more memory, setting processes=False on the Client instantiation, and many other things, but I cannot figure out what the problem is.
The library/software versions in use are:
- Python 3.7
- Pandas 1.0.5
- Dask 2.19.0
- Slurm 17.11.7
Am I doing something wrong? Is my approach of using the LocalCluster and Client constructs correct?

After some research I found a solution. I am not entirely sure of the reason, but quite sure that it works: the instantiation of LocalCluster and Client, together with all the code after it (the code that will be executed in a distributed fashion), must not sit at the module level of the Python script. Instead, that code must live inside a method or inside the __main__ block, like this:
import pandas as pd
import dask.dataframe as dd
import numpy as np
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    print("Generating LocalCluster...")
    cluster = LocalCluster()

    print("Generating Client...")
    client = Client(cluster, processes=False)

    print("Scaling client...")
    client.scale(8)

    data = dd.read_csv(
        BASE_DATA_SOURCE + '/Data-BIGDATFILES-*.csv',
        delimiter=';',
    )

    def get_min_dt():
        min_dt = data.datetime.min().compute()
        print("Min is {}".format(min_dt))

    print("Getting min dt...")
    get_min_dt()
This simple change makes all the difference. I found the solution in this issue thread. You are an absolute champion!
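One small related note, based on my reading of the Dask API and worth double-checking: processes is a LocalCluster keyword, not a Client one, so Client(cluster, processes=False) most likely has no effect once an already-built cluster object is passed in. A hedged sketch of configuring the worker layout on the cluster itself (small sizes chosen here purely for illustration; the job above would use 8 workers):

```python
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Worker layout is decided when the cluster is built, not by Client:
    cluster = LocalCluster(
        n_workers=2,            # e.g. 8, to match client.scale(8) above
        threads_per_worker=2,
        processes=False,        # set the thread/process choice here, not on Client
    )
    client = Client(cluster)
    result = client.submit(sum, [1, 2, 3]).result()
    print(result)  # prints 6
    client.close()
    cluster.close()
```

With processes=False the workers are threads inside the main process, which also sidesteps the spawn/bootstrapping problem entirely, at the cost of being subject to the GIL for CPU-bound work.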