Downloading and decompressing gzip files in parallel with Python

I'm downloading tens of thousands of ~20 MB gzip files and reading the .csv contents into dataframes. The function I use to download them is below:

import io

import pandas as pd
import requests


def download(url, chunk_size=125000):
    r = requests.get(url, stream=True)
    gz = io.BytesIO(b'')
    for chunk in r.iter_content(chunk_size=chunk_size):
        gz.write(chunk)
    gz.seek(0)
    df = pd.read_csv(gz, compression='gzip')
    return df

I've tried using multiprocessing and zlib to download and decompress in parallel:

import multiprocessing as mp
import zlib

# global
d = zlib.decompressobj(16 + zlib.MAX_WBITS)


def decompress(q, chunk, gz):
    # decompress this chunk on its own and append it to the shared buffer
    chunk = zlib.decompress(chunk, 15 + 32)
    gz.write(chunk)
    q.put(gz)


def download(url, chunk_size=125000):
    q = mp.Queue()
    r = requests.get(url, stream=True)
    gz = io.BytesIO(b'')
    p = None
    for chunk in r.iter_content(chunk_size=chunk_size):
        # wait for the previous chunk's process to finish,
        # then hand the current chunk to a new one
        if p:
            gz = q.get()
            p.join()
        p = mp.Process(target=decompress, args=(q, chunk, gz))
        p.start()
    p.join()
    gz = q.get()
    gz.seek(0)
    df = pd.read_csv(gz)
    return df
When it tries to decompress the second chunk, this error is raised and the program hangs:

Traceback (most recent call last):
  File "C:\Users\Gabriel\anaconda3\envs\evedaytrading\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Users\Gabriel\anaconda3\envs\evedaytrading\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Gabriel\PycharmProjects\EVEDayTrading\get_orders.py", line 19, in decompress
    chunk = zlib.decompress(chunk)
zlib.error: Error -3 while decompressing data: incorrect header check
When I press ^C:

Traceback (most recent call last):
  File "get_orders.py", line 82, in <module>
    df1 = download_url(os1)
  File "get_orders.py", line 35, in download_url
    gz = q.get()
  File "C:\Users\Gabriel\anaconda3\envs\evedaytrading\lib\multiprocessing\queues.py", line 97, in get
    res = self._recv_bytes()
  File "C:\Users\Gabriel\anaconda3\envs\evedaytrading\lib\multiprocessing\connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "C:\Users\Gabriel\anaconda3\envs\evedaytrading\lib\multiprocessing\connection.py", line 305, in _recv_bytes
    waitres = _winapi.WaitForMultipleObjects(
KeyboardInterrupt
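
The header-check failure points at the core problem: only the first chunk of the response starts with a gzip header, and zlib's decompression state has to be carried from one chunk to the next, so decompressing each chunk independently cannot work. For comparison, here is a minimal single-process sketch (download_stateful is a hypothetical name, assuming the same kind of url as above) that threads one zlib.decompressobj through every chunk:

import io
import zlib

import pandas as pd
import requests


def download_stateful(url, chunk_size=125000):
    r = requests.get(url, stream=True)
    # 16 + MAX_WBITS tells zlib to expect a gzip header
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    buf = io.BytesIO()
    for chunk in r.iter_content(chunk_size=chunk_size):
        # the decompressor keeps its state between calls
        buf.write(d.decompress(chunk))
    buf.write(d.flush())
    buf.seek(0)
    return pd.read_csv(buf)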
Extra info in case it's useful: I'm not the one who compressed the files, and I don't know how it was done, other than that each one is a gzip file.


How can I get the code I wrote to work? Is there some other way to download and decompress in parallel? I'm open to suggestions, perhaps asyncio or threading.

One thing I could do is pass the stream object directly to gzip, which not only makes the code simpler but also makes it about 2 seconds, or roughly 25%, faster. Here is the old function I was using:

In [2]: import requests
   ...: import io
   ...: import pandas as pd
   ...:
   ...: def download_old_method(url, chunk_size=125000):
   ...:     r = requests.get(url, stream=True)
   ...:     gz = io.BytesIO(b'')
   ...:     for chunk in r.iter_content(chunk_size=chunk_size):
   ...:         gz.write(chunk)
   ...:     gz.seek(0)
   ...:     df = pd.read_csv(gz, compression='gzip')
   ...:     return df
   ...:

In [3]: %timeit df = download_old_method(url)
8.44 s ± 2.21 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
And now the simpler, improved function:

In [4]: import gzip
   ...:
   ...:
   ...: def download_new_method(url):
   ...:     r = requests.get(url, stream=True)
   ...:     gz = gzip.GzipFile(fileobj=r.raw)
   ...:     gz.seek(0)
   ...:     df = pd.read_csv(gz)
   ...:     return df
   ...:

In [5]: %timeit df = download_new_method(url)
6.11 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
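
Since the task involves tens of thousands of files, the per-file speedup can be combined with parallelism across files. A minimal sketch on top of download_new_method, assuming a list urls of download links (download_all and max_workers=8 are hypothetical names/values):

from concurrent.futures import ThreadPoolExecutor


def download_all(urls, max_workers=8):
    # each worker spends most of its time blocked on the network,
    # so threads overlap the downloads despite the GIL
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_new_method, urls))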

Comments:

- requests already decodes gzip-encoded content automatically; are you sure you aren't trying to decode it a second time?
- Why are you reading the whole content into memory in chunks?
- Please add the full error traceback to your question!
- @wim Good point, but that depends on the headers the server sends.
- @wim I'm reading in chunks because, for some reason, it's faster, and in my case requests is not decoding the gzip.
- @KlausD. Added the full error traceback.
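
As the comment exchange suggests, whether the body arrives already-decoded depends on the response headers. A minimal sketch for checking what actually came over the wire (is_gzip_payload is a hypothetical helper): requests only transparently decodes the body when the server sends Content-Encoding: gzip, and a gzip stream always begins with the magic bytes 0x1f 0x8b.

import requests


def is_gzip_payload(url):
    r = requests.get(url, stream=True)
    # requests/urllib3 decode iter_content() transparently only when
    # the server advertises Content-Encoding: gzip
    encoding = r.headers.get('Content-Encoding', '')
    # every gzip stream starts with the magic bytes 0x1f 0x8b
    first = next(r.iter_content(chunk_size=2))
    return encoding, first[:2] == b'\x1f\x8b'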