Downloading and decompressing gzip files in parallel with Python
I am downloading tens of thousands of ~20 MB gzip files and reading the .csv contents into dataframes. The working code I use to download them is below:
import io

import pandas as pd
import requests

def download(url, chunk_size=125000):
    r = requests.get(url, stream=True)
    gz = io.BytesIO()
    for chunk in r.iter_content(chunk_size=chunk_size):
        gz.write(chunk)
    gz.seek(0)
    df = pd.read_csv(gz, compression='gzip')
    return df
I have tried to download and decompress in parallel using multiprocessing and zlib:
import io
import multiprocessing as mp
import zlib

import pandas as pd
import requests

# global
d = zlib.decompressobj(16 + zlib.MAX_WBITS)

def decompress(q, chunk, gz):
    chunk = zlib.decompress(chunk, 15 + 32)
    gz.write(chunk)
    q.put(gz)

def download(url, chunk_size=125000):
    q = mp.Queue()
    r = requests.get(url, stream=True)
    gz = io.BytesIO()
    p = None
    for chunk in r.iter_content(chunk_size=chunk_size):
        if p:
            gz = q.get()
            p.join()
        p = mp.Process(target=decompress, args=(q, chunk, gz))
        p.start()
    p.join()
    gz = q.get()
    gz.seek(0)
    df = pd.read_csv(gz)
    return df
When it tries to decompress the second chunk, it raises this error and then hangs:
Traceback (most recent call last):
File "C:\Users\Gabriel\anaconda3\envs\evedaytrading\lib\multiprocessing\process.py", line 315, in _bootstrap
self.run()
File "C:\Users\Gabriel\anaconda3\envs\evedaytrading\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Gabriel\PycharmProjects\EVEDayTrading\get_orders.py", line 19, in decompress
chunk = zlib.decompress(chunk)
zlib.error: Error -3 while decompressing data: incorrect header check
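The root cause of that error: each HTTP chunk is an arbitrary slice of one continuous gzip stream, so only the first chunk begins with a gzip header. Calling `zlib.decompress()` independently on a later chunk therefore fails with "incorrect header check" (and the hang follows because the dead child never put anything on the queue, so `q.get()` blocks forever). Streaming decompression needs a single `zlib.decompressobj` whose state persists across chunks. A minimal, self-contained sketch, using an in-memory payload in place of the network response:

```python
import gzip
import io
import zlib

# In-memory gzip payload standing in for the HTTP body.
payload = gzip.compress(b"col\n" + b"1\n" * 100000)

# One decompressobj must live across all chunks: zlib keeps the
# inflate state (header, dictionary, partial blocks) inside it.
# 16 + MAX_WBITS tells zlib to expect a gzip header.
d = zlib.decompressobj(16 + zlib.MAX_WBITS)

out = io.BytesIO()
stream = io.BytesIO(payload)
# Feed arbitrary-sized slices, mimicking r.iter_content(chunk_size=...).
for chunk in iter(lambda: stream.read(125000), b""):
    out.write(d.decompress(chunk))
out.write(d.flush())  # emit any buffered tail bytes
```

Note this also means the chunks cannot be decompressed by independent worker processes: each chunk depends on the inflate state left behind by the previous one, so within a single file the decompression is inherently sequential.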
When I press Ctrl+C:
Traceback (most recent call last):
File "get_orders.py", line 82, in <module>
df1 = download_url(os1)
File "get_orders.py", line 35, in download_url
gz = q.get()
File "C:\Users\Gabriel\anaconda3\envs\evedaytrading\lib\multiprocessing\queues.py", line 97, in get
res = self._recv_bytes()
File "C:\Users\Gabriel\anaconda3\envs\evedaytrading\lib\multiprocessing\connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "C:\Users\Gabriel\anaconda3\envs\evedaytrading\lib\multiprocessing\connection.py", line 305, in _recv_bytes
waitres = _winapi.WaitForMultipleObjects(
KeyboardInterrupt
Extra information in case it is useful: I am not the one who compressed these files, and I don't know how they were produced, other than that each is a gzip file.

How can I make the code I wrote work? Is there another way to download and decompress in parallel? I'm open to suggestions, possibly asyncio or threading.
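Since there are tens of thousands of files, the easier win is to parallelize across files rather than within one file: downloads are I/O-bound, so a thread pool avoids the pickling and process-startup costs of multiprocessing. A hedged sketch of the pattern; the worker here gunzips in-memory payloads so the example is self-contained, but the `download(url)` function from the question slots into `pool.map` the same way:

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def fetch_and_decode(payload: bytes) -> bytes:
    # Stand-in for download(url): in real use this would call
    # requests.get(url, stream=True) and gunzip the body.
    return gzip.decompress(payload)

# Hypothetical work items; with real URLs, map download over them instead.
payloads = [gzip.compress(f"id,qty\n{i},{i * 2}\n".encode()) for i in range(8)]

# Threads overlap the network waiting; max_workers is a tuning knob
# (try a few values), not a magic number.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_and_decode, payloads))
```

With real downloads, reusing a single `requests.Session` across the threads would also avoid re-establishing a TCP connection per file.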
One thing I could do was pass the stream object directly to gzip, which not only made the code simpler but also sped it up by about 2 seconds, or roughly 25%. Here is the old function I was using:
In [2]: import requests
...: import io
...: import pandas as pd
...:
...: def download_old_method(url, chunk_size=125000):
...: r = requests.get(url, stream=True)
...: gz = io.BytesIO(b'')
...: for chunk in r.iter_content(chunk_size=chunk_size):
...: gz.write(chunk)
...: gz.seek(0)
...: df = pd.read_csv(gz, compression='gzip')
...: return df
...:
In [3]: %timeit df = download_old_method(url)
8.44 s ± 2.21 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
And here is the new, simpler function:
In [4]: import gzip
...:
...:
...: def download_new_method(url):
...: r = requests.get(url, stream=True)
...: gz = gzip.GzipFile(fileobj=r.raw)
...: gz.seek(0)
...: df = pd.read_csv(gz)
...: return df
...:
In [5]: %timeit df = download_new_method(url)
6.11 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
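The speedup plausibly comes from overlapping decompression with the transfer: `gzip.GzipFile(fileobj=...)` inflates lazily, pulling and decompressing bytes from the underlying stream only as the reader consumes them, instead of buffering the whole body first. The same wrapping works on any file-like object; a minimal demonstration with an in-memory stream standing in for `r.raw` (the `csv` module is used here only to keep the sketch free of a pandas dependency):

```python
import csv
import gzip
import io

# In real use, stream would be r.raw from requests.get(url, stream=True);
# here an in-memory stream stands in for the HTTP body.
stream = io.BytesIO(gzip.compress(b"name,qty\nwidget,3\ngadget,5\n"))

# GzipFile decompresses on demand as the consumer reads,
# so parsing can start before the full body has arrived.
gz = gzip.GzipFile(fileobj=stream)
rows = list(csv.reader(io.TextIOWrapper(gz, encoding="utf-8")))
```

One caveat for the real-network version: `r.raw` returns the raw socket bytes, so this only works when the server does not also apply `Content-Encoding` handling that requests would otherwise decode for you.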
Comments:
requests already decodes gzip-encoded content automatically — are you sure you aren't trying to decode it a second time?
Why read the whole content into memory in chunks?
Please add the full error traceback to your question!
@wim Good point, but that depends on the headers the server sends.
@wim I'm reading in chunks because for some reason it's faster, and in my case requests is not decoding the gzip.
@KlausD. Added the full error traceback.