Is it possible to read parquet metadata from Dask?
I have thousands of parquet files that I need to process. Before processing the files, I'm trying to get various information about them from the parquet metadata, such as the number of rows in each partition, the mins, the maxes, etc.

I tried reading the metadata using dask. I was hoping to distribute the metadata-gathering tasks across my cluster, but this appears to make dask unstable. See the sample code snippet and the node timeout error below.

Is there a way to read parquet metadata from Dask? I know Dask's "read_parquet" function has a "gather_statistics" option that you can set to false to speed up file reads. But if it is set to true, I don't see a way to access all of the parquet metadata/statistics.

Sample code:
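import dask
import fastparquet

@dask.delayed
def get_pf(item_to_read):
    # Open one parquet file and copy out the pieces of metadata of interest
    pf = fastparquet.ParquetFile(item_to_read)
    row_groups = pf.row_groups.copy()
    all_stats = pf.statistics.copy()
    col = pf.info['columns'].copy()
    return [row_groups, all_stats, col]

# Builds a lazy task; nothing runs until it is computed
stats_arr = get_pf(item_to_read)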
Sample error:
2019-10-03 01:43:51,202 - INFO - 192.168.0.167 - distributed.worker - ERROR - Worker stream died during communication: tcp://192.168.0.223:34623
2019-10-03 01:43:51,203 - INFO - 192.168.0.167 - Traceback (most recent call last):
2019-10-03 01:43:51,204 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 218, in connect
2019-10-03 01:43:51,206 - INFO - 192.168.0.167 - quiet_exceptions=EnvironmentError,
2019-10-03 01:43:51,207 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,210 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,211 - INFO - 192.168.0.167 - tornado.util.TimeoutError: Timeout
2019-10-03 01:43:51,212 - INFO - 192.168.0.167 -
2019-10-03 01:43:51,213 - INFO - 192.168.0.167 - During handling of the above exception, another exception occurred:
2019-10-03 01:43:51,214 - INFO - 192.168.0.167 -
2019-10-03 01:43:51,215 - INFO - 192.168.0.167 - Traceback (most recent call last):
2019-10-03 01:43:51,217 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/worker.py", line 1841, in gather_dep
2019-10-03 01:43:51,218 - INFO - 192.168.0.167 - self.rpc, deps, worker, who=self.address
2019-10-03 01:43:51,219 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,220 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,222 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,223 - INFO - 192.168.0.167 - yielded = self.gen.throw(*exc_info) # type: ignore
2019-10-03 01:43:51,224 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/worker.py", line 3029, in get_data_from_worker
2019-10-03 01:43:51,225 - INFO - 192.168.0.167 - comm = yield rpc.connect(worker)
2019-10-03 01:43:51,640 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,641 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,643 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,644 - INFO - 192.168.0.167 - yielded = self.gen.throw(*exc_info) # type: ignore
2019-10-03 01:43:51,645 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/core.py", line 866, in connect
2019-10-03 01:43:51,646 - INFO - 192.168.0.167 - connection_args=self.connection_args,
2019-10-03 01:43:51,647 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,649 - INFO - 192.168.0.167 - value = future.result()
2019-10-03 01:43:51,650 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,651 - INFO - 192.168.0.167 - yielded = self.gen.throw(*exc_info) # type: ignore
2019-10-03 01:43:51,652 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 230, in connect
2019-10-03 01:43:51,653 - INFO - 192.168.0.167 - _raise(error)
2019-10-03 01:43:51,654 - INFO - 192.168.0.167 - File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 207, in _raise
2019-10-03 01:43:51,656 - INFO - 192.168.0.167 - raise IOError(msg)
2019-10-03 01:43:51,657 - INFO - 192.168.0.167 - OSError: Timed out trying to connect to 'tcp://192.168.0.223:34623' after 10 s: connect() didn't finish in time
Does dd.read_parquet take a long time? If not, you can follow whatever strategy it uses internally and do the reading in the client.
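If the data has a single _metadata file in the root directory, then you can simply open it with fastparquet, which is exactly what Dask would do. It contains all the details of all of the data pieces.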
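As a rough illustration of opening the _metadata file directly, here is a minimal sketch. The helper names `summarize_metadata` and `open_dataset_metadata` and the dataset path are hypothetical; the sketch only relies on the `row_groups` and `statistics` attributes already used in the question, and it assumes the `statistics` dict carries 'min' and 'max' entries keyed by column.

```python
import os


def summarize_metadata(pf):
    """Return per-row-group row counts plus min/max statistics.

    `pf` is expected to behave like a fastparquet.ParquetFile: it needs a
    `row_groups` list whose items have `num_rows`, and a `statistics`
    dict (assumed here to have 'min' and 'max' keys).
    """
    stats = pf.statistics
    return {
        "rows_per_row_group": [rg.num_rows for rg in pf.row_groups],
        "min": dict(stats.get("min", {})),
        "max": dict(stats.get("max", {})),
    }


def open_dataset_metadata(root):
    """Open <root>/_metadata with fastparquet (hypothetical helper)."""
    # Imported here so summarize_metadata can be used without fastparquet.
    import fastparquet

    return fastparquet.ParquetFile(os.path.join(root, "_metadata"))


# Example usage (the path is a placeholder):
# summary = summarize_metadata(open_dataset_metadata("/path/to/dataset"))
# print(summary["rows_per_row_group"])
```

This runs entirely in the client, so no cluster communication is involved when a _metadata file exists.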
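There is no particular reason why distributing the metadata reads should be a problem, but you should be aware that in some cases the total of the metadata items can add up to a substantial size.

Thanks, @mdurant. dd.read_parquet itself is fast and gives me a lot of information directly (e.g. column names), but getting information such as the number of rows in each partition is much slower than reading the metadata directly, because you essentially have to persist/compute the entire dask read operation. So it sounds like delaying the fastparquet metadata reads and computing them, as I did, to distribute the work is the best approach. Maybe I have some other issue that is causing the instability.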
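For the distributed variant discussed above, one option is to have each task return only small plain-Python summaries and to group many files into one task. This is a sketch under those assumptions, not a confirmed fix for the timeouts; the names `chunked`, `read_batch`, `gather_metadata` and the default batch size are made up for illustration.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def read_batch(paths):
    """Runs on a worker: open each file and keep only small summaries."""
    import fastparquet  # imported on the worker that executes the task

    out = []
    for path in paths:
        pf = fastparquet.ParquetFile(path)
        stats = pf.statistics
        out.append({
            "path": path,
            "rows_per_row_group": [rg.num_rows for rg in pf.row_groups],
            "min": stats.get("min", {}),
            "max": stats.get("max", {}),
            "columns": list(pf.columns),
        })
    return out


def gather_metadata(paths, batch_size=100):
    """Build one delayed task per batch and compute them together."""
    import dask

    tasks = [dask.delayed(read_batch)(batch)
             for batch in chunked(paths, batch_size)]
    results = dask.compute(*tasks)  # uses the active distributed Client, if any
    # Flatten the per-batch lists into one list of per-file summaries.
    return [summary for batch in results for summary in batch]
```

Batching reduces the number of tasks the scheduler has to track, and returning dictionaries of row counts and min/max values instead of copies of the row-group objects keeps the results sent back to the client small.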