Python: unable to persist a dask dataframe after reading a SQL table


I am trying to read a database table into a dask dataframe and then persist that dataframe. I have tried a few variations, and they either run out of memory or raise errors.

I am using a Windows 10 laptop with 8 GB of RAM. The problem shows up when I try to read in large MySQL or Oracle database tables, and I can reproduce it with SQLite.

Below is the code that sets up a 700 MB SQLite table to reproduce the problem. (Please forgive any clumsiness in the Python code -- I have been a SAS data analyst for 10 years. I am looking for a cheaper alternative, so I am new to Python, numpy, pandas, and dask. Note that SAS can read the SQLite table, write it to disk, and create an index in 90 seconds without locking up the laptop.)
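
A rough sketch of such a setup script is shown below; the testTbl name and the 'a' index column come from the read code further down, while the row counts and the extra columns are purely illustrative.

    import numpy as np
    import pandas as pd
    from sqlalchemy import create_engine

    # Build the table in chunks so the generating script itself stays small in memory.
    # Adjust the chunk count / extra columns to reach the target on-disk size.
    engine = create_engine("sqlite:///C:\\temp2\\test.db")
    chunk = 1000000
    for i in range(10):
        block = pd.DataFrame({
            'a': np.arange(i * chunk, (i + 1) * chunk),  # column later used as index_col
            'b': np.random.random(chunk),
            'c': np.random.random(chunk),
        })
        block.to_sql('testTbl', engine, if_exists='append', index=False)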

I tried 4 variations of the dask scheduler:

  • Default scheduler -- this ran out of memory (OOM) and the laptop locked up

  • Local distributed scheduler with multiple processes -- this raised a tornado exception

  • Local distributed scheduler with one process -- this ran out of memory (OOM)

  • dask-scheduler and dask-worker started from the command line, with the worker memory limited to 3 GB. This variation ended with an error and the worker being killed

The code for each variation is shown below. How can I make this work?

  • Local distributed scheduler

    import dask.dataframe as ddf
    from dask.distributed import Client
    import dask
    import chest

    # spill-to-disk cache for the old dask cache API
    cache = chest.Chest(path='c:\\temp2', available_memory=8e9)
    dask.set_options(cache=cache)

    client = Client()  # local distributed scheduler with default (multi-process) workers
    dbPath = "C:\\temp2\\test.db"
    connString = "sqlite:///{}".format(dbPath)
    df = ddf.read_sql_table('testTbl', connString, index_col = 'a')
    df = client.persist(df)  # the failure occurs while persisting
    
  • The exception starts like this:

    tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:57522, threads: 1>>
    Traceback (most recent call last):
      File "C:\Program Files\Python36\lib\site-packages\psutil\_pswindows.py", line 635, in wrapper
        return fun(self, *args, **kwargs)
      File "C:\Program Files\Python36\lib\site-packages\psutil\_pswindows.py", line 821, in create_time
        return cext.proc_create_time(self.pid)
    ProcessLookupError: [Errno 3] No such process
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Program Files\Python36\lib\site-packages\psutil\__init__.py", line 368, in _init
        self.create_time()
      File "C:\Program Files\Python36\lib\site-packages\psutil\__init__.py", line 699, in create_time
        self._create_time = self._proc.create_time()
      File "C:\Program Files\Python36\lib\site-packages\psutil\_pswindows.py", line 640, in wrapper
        raise NoSuchProcess(self.pid, self._name)
    psutil._exceptions.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=14212)
    
  • dask-scheduler and dask-worker

  • One command line: c:>dask-scheduler --host 127.0.0.1

    The other command line: c:>dask-worker 127.0.0.1:8786 --nprocs 1 --nthreads 1 --name worker-1 --memory-limit 3GB --local-directory c:\temp2

    import dask.dataframe as ddf
    from dask.distributed import Client
    import dask
    import chest
    cache = chest.Chest(path='c:\\temp2', available_memory=8e9)
    dask.set_options(cache=cache)
    client = Client(address="127.0.0.1:8786")
    dbPath = "C:\\temp2\\test.db"
    connString = "sqlite:///{}".format(dbPath)
    df = ddf.read_sql_table('testTbl', connString, index_col = 'a')
    df = client.persist(df)
    
    The worker was killed over and over with these messages:

    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.12 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.16 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.24 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.31 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.39 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 2.46 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.47 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.54 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.61 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.66 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.73 GB -- Worker memory limit: 3.00 GB
    distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.81 GB -- Worker memory limit: 3.00 GB
    distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
    distributed.nanny - WARNING - Worker process 17916 was killed by signal 15
    distributed.nanny - WARNING - Restarting worker
    

    I don't believe that you have an index on the column 'a', which means that each partition access is probably using a lot of memory in sqlite while scanning the table. In any case, pandas access to DBs via sqlalchemy is not particularly memory-efficient, so I am not amazed that you get a memory spike during access.
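
    If the column really is unindexed, it may be worth adding an index before reading. A minimal sketch (the table and column names come from the question; the index name is made up):

        import sqlite3

        # Index the partitioning column so each partition's range query
        # (WHERE a >= ... AND a < ...) does not force a full table scan.
        con = sqlite3.connect("C:\\temp2\\test.db")
        con.execute("CREATE INDEX IF NOT EXISTS idx_testTbl_a ON testTbl (a)")
        con.commit()
        con.close()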

    You could, however, increase the number of partitions you use to access the data. For example:

    df = ddf.read_sql_table('testTbl', connString, index_col = 'a', npartitions=20)
    
    or decrease the number of threads/processes available, so that each one has more memory to work with.
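
    A minimal sketch of that option (the memory_limit value is just an illustrative figure):

        from dask.distributed import Client

        # threads-only local scheduler: no extra worker processes competing for RAM
        client = Client(processes=False)

        # or a single worker process with a single thread and an explicit limit
        # client = Client(n_workers=1, threads_per_worker=1, memory_limit="6GB")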


    Note that chest is not doing you any good here: it can only hold finished results, and the memory spike happens during the loading of the data (furthermore, distributed workers should spill to disk without being given a cache explicitly).

    I changed three things, and that helped prevent the memory spikes: I added an index on the 'a' column of the sqlite table, called Client with processes=False, and used npartitions=35 in read_sql_table. Thank you.

    If you found this information useful, you may wish to accept the answer.
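
    Putting those three changes together, the working version would look roughly like this (a sketch reconstructed from the comment above rather than code posted by the asker):

        import sqlite3
        import dask.dataframe as ddf
        from dask.distributed import Client

        dbPath = "C:\\temp2\\test.db"
        connString = "sqlite:///{}".format(dbPath)

        # 1. one-time step: index the partitioning column
        con = sqlite3.connect(dbPath)
        con.execute("CREATE INDEX IF NOT EXISTS idx_testTbl_a ON testTbl (a)")
        con.commit()
        con.close()

        # 2. threads-only local client instead of multiple worker processes
        client = Client(processes=False)

        # 3. more, smaller partitions so no single read exhausts memory
        df = ddf.read_sql_table('testTbl', connString, index_col='a', npartitions=35)
        df = client.persist(df)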