Python: operations on a Dask dataframe fail when using snappy compression


I used pandas.DataFrame.to_parquet to partition a large dataset into a sequence of parquet files and save them to S3. I then read this data into a Dask dataframe on my cluster:

import dask.dataframe as dd
df = dd.read_parquet(
    's3://aleksey-emr-dask/data/2019-taxi-dataset/',
    storage_options={'key': 'secret', 'secret': 'secret'},
    engine='fastparquet'
)
By default, pandas uses snappy compression, and fastparquet can handle it as long as the python-snappy and snappy packages are installed. Since I am running on AWS EMR, I installed both packages from conda-forge using the --bootstrap-actions flag and its --conda-packages optional argument:

python3 -m pip list | grep snappy
python-snappy          0.5.4
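The exact cluster launch command is not shown in the question; a hypothetical sketch of what it might look like (the bucket, script name, release label, and instance settings here are all assumptions, not taken from the question) is:

```shell
# Hypothetical EMR launch: the bootstrap action runs on every node and
# forwards --conda-packages to a conda install from conda-forge.
aws emr create-cluster \
    --name dask-cluster \
    --release-label emr-5.29.0 \
    --instance-type m5.xlarge \
    --instance-count 4 \
    --bootstrap-actions \
        Path=s3://my-bucket/bootstrap-dask.sh,Args=[--conda-packages,fastparquet,python-snappy,snappy,s3fs]
```

The key point is that packages installed this way land on the EMR nodes; whatever environment the client notebook runs in is provisioned separately.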
This is enough for dd.read_parquet to succeed. However, certain operations fail with KeyError: 'snappy'. For example, this one fails:

passenger_counts = df.trip_distance.value_counts().compute()
I know this is not a problem with the cluster configuration, because other operations, such as the following, succeed:

vendors = df.VendorID.value_counts().compute()
> 2.0    53516733
> 1.0    30368157
> 4.0      267080
> Name: VendorID, dtype: int64
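One plausible explanation for why one value_counts succeeds while the other fails (an assumption based on how dask.distributed handles comm compression, not something stated in the question): distributed only compresses result frames above a small size cutoff. VendorID has a handful of distinct values, so its result is tiny and travels uncompressed; trip_distance has many distinct values, so its result exceeds the cutoff, gets snappy-compressed on the workers, and then cannot be decompressed by a process that lacks the codec. A toy sketch of that compress-on-send decision:

```python
import zlib  # stand-in codec here; the cluster in question used snappy

MIN_SIZE = 10_000  # assumed cutoff: smaller frames travel uncompressed

def maybe_compress(frame: bytes):
    """Toy version of the compress-on-send step: only frames above the
    cutoff are compressed, and the header records which codec was used."""
    if len(frame) < MIN_SIZE:
        return None, frame              # receiver needs no codec at all
    return "zlib", zlib.compress(frame)

small = b"x" * 100      # like VendorID.value_counts(): a handful of rows
large = b"x" * 100_000  # like trip_distance.value_counts(): many rows

assert maybe_compress(small)[0] is None    # no codec lookup on receipt
assert maybe_compress(large)[0] == "zlib"  # receiver must know the codec
```

Under this reading, small results never exercise the decompression path at all, which would explain why only the large value_counts trips the error.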
Which brings me to my question: does Dask not support snappy compression, even when its IO engine (fastparquet, in this case) does?

Here is the full text of the error message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<timed exec> in <module>

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/dask/base.py in compute(self, **kwargs)
    165         dask.base.compute
    166         """
--> 167         (result,) = compute(self, traverse=False, **kwargs)
    168         return result
    169 

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/dask/base.py in compute(*args, **kwargs)
    445         postcomputes.append(x.__dask_postcompute__())
    446 
--> 447     results = schedule(dsk, keys, **kwargs)
    448     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    449 

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2686                     should_rejoin = False
   2687             try:
-> 2688                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2689             finally:
   2690                 for f in futures.values():

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1986                 direct=direct,
   1987                 local_worker=local_worker,
-> 1988                 asynchronous=asynchronous,
   1989             )
   1990 

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    831         else:
    832             return sync(
--> 833                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    834             )
    835 

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    337     if error[0]:
    338         typ, exc, tb = error[0]
--> 339         raise exc.with_traceback(tb)
    340     else:
    341         return result[0]

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/utils.py in f()
    321             if callback_timeout is not None:
    322                 future = asyncio.wait_for(future, callback_timeout)
--> 323             result[0] = yield future
    324         except Exception as exc:
    325             error[0] = sys.exc_info()

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1874                 else:
   1875                     self._gather_future = future
-> 1876                 response = await future
   1877 
   1878             if response["status"] == "error":

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/client.py in _gather_remote(self, direct, local_worker)
   1925 
   1926             else:  # ask scheduler to gather data for us
-> 1927                 response = await retry_operation(self.scheduler.gather, keys=keys)
   1928 
   1929         return response

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/utils_comm.py in retry_operation(coro, operation, *args, **kwargs)
    388         delay_min=retry_delay_min,
    389         delay_max=retry_delay_max,
--> 390         operation=operation,
    391     )

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/utils_comm.py in retry(coro, count, delay_min, delay_max, jitter_fraction, retry_on_exceptions, operation)
    368                 delay *= 1 + random.random() * jitter_fraction
    369             await asyncio.sleep(delay)
--> 370     return await coro()
    371 
    372 

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/core.py in send_recv_from_rpc(**kwargs)
    859             name, comm.name = comm.name, "ConnectionPool." + key
    860             try:
--> 861                 result = await send_recv(comm=comm, op=key, **kwargs)
    862             finally:
    863                 self.pool.reuse(self.addr, comm)

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/core.py in send_recv(comm, reply, serializers, deserializers, **kwargs)
    642         await comm.write(msg, serializers=serializers, on_error="raise")
    643         if reply:
--> 644             response = await comm.read(deserializers=deserializers)
    645         else:
    646             response = None

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/comm/tcp.py in read(self, deserializers)
    204                     deserialize=self.deserialize,
    205                     deserializers=deserializers,
--> 206                     allow_offload=self.allow_offload,
    207                 )
    208             except EOFError:

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/comm/utils.py in from_frames(frames, deserialize, deserializers, allow_offload)
     85         res = await offload(_from_frames)
     86     else:
---> 87         res = _from_frames()
     88 
     89     return res

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/comm/utils.py in _from_frames()
     64         try:
     65             return protocol.loads(
---> 66                 frames, deserialize=deserialize, deserializers=deserializers
     67             )
     68         except EOFError:

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/protocol/core.py in loads(frames, deserialize, deserializers)
    126             if deserialize or key in bytestrings:
    127                 if "compression" in head:
--> 128                     fs = decompress(head, fs)
    129                 fs = merge_frames(head, fs)
    130                 value = _deserialize(head, fs, deserializers=deserializers)

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/protocol/compression.py in decompress(header, frames)
    214     return [
    215         compressions[c]["decompress"](frame)
--> 216         for c, frame in zip(header["compression"], frames)
    217     ]

~/opt/miniconda3/envs/dask-local-test-env/lib/python3.7/site-packages/distributed/protocol/compression.py in <listcomp>(.0)
    214     return [
    215         compressions[c]["decompress"](frame)
--> 216         for c, frame in zip(header["compression"], frames)
    217     ]

KeyError: 'snappy'
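The last frames of the traceback sit in distributed/protocol/compression.py, where frames are decompressed by looking the codec name up in a registry. A simplified sketch (an assumption about how that module behaves, not the actual source) shows why the lookup can fail on one machine even when the workers compress happily:

```python
# Codecs are registered only if their package imports in *this* process.
compressions = {None: {"decompress": lambda frame: frame}}

try:
    import snappy  # provided by the python-snappy package
    compressions["snappy"] = {"decompress": snappy.decompress}
except ImportError:
    pass  # missing package -> no "snappy" key is ever registered

def decompress(header, frames):
    # Mirrors the list comprehension at the bottom of the traceback.
    return [compressions[c]["decompress"](f)
            for c, f in zip(header["compression"], frames)]

# A process whose registry lacks the codec fails exactly as above:
bare_registry = {None: {"decompress": lambda frame: frame}}
try:
    [bare_registry[c]["decompress"](f)
     for c, f in zip(["snappy"], [b"\x00"])]
    raised = False
except KeyError:
    raised = True
assert raised  # KeyError: 'snappy'
```

So the error is not about fastparquet reading the files; it is about one side of a scheduler/client connection receiving snappy-compressed frames without having the codec registered locally.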
The traceback ends inside distributed's decompress step, which suggests the missing piece is on the client side: snappy and python-snappy must be installed in every environment that exchanges data with the cluster, including the machine the client runs on:

$ conda install -c conda-forge snappy python-snappy