How do I let Dask know that the index is already sorted?


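According to another answer, a Dask dataframe can perform smart indexing if Dask knows that the dataframe's index is sorted.

If the index is sorted, how do I let Dask know?

In my specific case, I do something like this: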

import dask.dataframe

for source in sources:
    # This df has a DatetimeIndex that I know to be sorted
    df = load_pandas_df_from_some_source(source)
    ddf = dask.dataframe.from_pandas(df, chunksize=foo)
    ddf.to_hdf(some_unique_filename, '/data')

However, when I then do the following, indexing is very slow:

 dd = dask.dataframe.read_hdf(some_glob, '/data')
 print(dd.loc['2001-1-1':'2001-1-2'])

I assume Dask does not know that my dataframe is sorted. How do I let it know?

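When loading from HDF, the index values at the boundaries of each partition are not necessarily known. Those values are what Dask uses to construct the dataframe's divisions attribute, which is what speeds up lookups.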
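To illustrate why known divisions matter, here is a minimal pure-Python sketch of how a range lookup can skip partitions once the boundary values are known. This is not Dask's actual implementation; the function name partitions_for_range and the sample boundaries are hypothetical, and the sketch assumes the usual convention that partition i covers divisions[i] <= x < divisions[i + 1], with the last partition also including its upper bound.

```python
from bisect import bisect_right
from datetime import date


def partitions_for_range(divisions, lo, hi):
    """Return the partition numbers that may hold index values in [lo, hi].

    `divisions` has npartitions + 1 sorted entries (hypothetical helper,
    simplified from the idea behind Dask's divisions).
    """
    nparts = len(divisions) - 1
    # A range entirely outside the known boundaries touches no partition.
    if hi < divisions[0] or lo > divisions[-1]:
        return []
    # bisect_right - 1 gives the partition whose lower bound is <= value;
    # clamp so the inclusive upper bound maps to the last partition.
    first = min(max(bisect_right(divisions, lo) - 1, 0), nparts - 1)
    last = min(max(bisect_right(divisions, hi) - 1, 0), nparts - 1)
    return list(range(first, last + 1))


# Hypothetical boundaries: three partitions of one week each.
divisions = [date(2001, 1, 1), date(2001, 1, 8), date(2001, 1, 15), date(2001, 1, 22)]

# Only the first partition needs to be read for a two-day slice.
needed = partitions_for_range(divisions, date(2001, 1, 1), date(2001, 1, 2))
print(needed)
```

Without divisions, every partition would have to be loaded and filtered, which matches the slow .loc behavior described in the question.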

For a dataset like yours, you should be able to pass sorted_index=True and get the behavior you want.


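As @kuanb suggested, you may want to try storing the data in Parquet format, which is designed specifically for tabular data. Whether it performs better will depend on the nature of your data (HDF was written mainly for numerical data) and on your use case, so your mileage may vary. However, Parquet is generally good at keeping metadata statistics about the data values in each partition.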

As @mdurant suggested, using the keyword from that answer is ideal.

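More generally, you can use set_index to set the index on any dataframe, even one created by some other method. This function has new keywords that make it more efficient when the new index column is already sorted and when you already know the values that separate the partitions. Here is the current docstring. You may be interested in the last example: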

"""Set the DataFrame index (row labels) using an existing column

This realigns the dataset to be sorted by a new column.  This can have a
significant impact on performance, because joins, groupbys, lookups, etc.
are all much faster on that column.  However, this performance increase
comes with a cost, sorting a parallel dataset requires expensive shuffles.
Often we ``set_index`` once directly after data ingest and filtering and
then perform many cheap computations off of the sorted dataset.

This function operates exactly like ``pandas.set_index`` except with
different performance costs (it is much more expensive).  Under normal
operation this function does an initial pass over the index column to
compute approximate quantiles to serve as future divisions.  It then passes
over the data a second time, splitting up each input partition into several
pieces and sharing those pieces to all of the output partitions now in
sorted order.

In some cases we can alleviate those costs, for example if your dataset is
sorted already then we can avoid making many small pieces or if you know
good values to split the new index column then we can avoid the initial
pass over the data.  For example if your new index is a datetime index and
your data is already sorted by day then this entire operation can be done
for free.  You can control these options with the following parameters.

Parameters
----------
df: Dask DataFrame
index: string or Dask Series
npartitions: int, None, or 'auto'
    The ideal number of output partitions.   If None use the same as
    the input.  If 'auto' then decide by memory use.
shuffle: string, optional
    Either ``'disk'`` for single-node operation or ``'tasks'`` for
    distributed operation.  Will be inferred by your current scheduler.
sorted: bool, optional
    If the index column is already sorted in increasing order.
    Defaults to False
divisions: list, optional
    Known values on which to separate index values of the partitions.
    See http://dask.pydata.org/en/latest/dataframe-design.html#partitions
    Defaults to computing this with a single pass over the data. Note
    that if ``sorted=True``, specified divisions are assumed to match
    the existing partitions in the data. If this is untrue, you should
    leave divisions empty and call ``repartition`` after ``set_index``.
compute: bool
    Whether or not to trigger an immediate computation. Defaults to False.

Examples
--------
>>> df2 = df.set_index('x')  # doctest: +SKIP
>>> df2 = df.set_index(d.x)  # doctest: +SKIP
>>> df2 = df.set_index(d.timestamp, sorted=True)  # doctest: +SKIP

A common case is when we have a datetime column that we know to be
sorted and is cleanly divided by day.  We can set this index for free
by specifying both that the column is pre-sorted and the particular
divisions along which it is separated

>>> import pandas as pd
>>> divisions = pd.date_range('2000', '2010', freq='1D')
>>> df2 = df.set_index('timestamp', sorted=True, divisions=divisions)  # doctest: +SKIP
    """

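Doesn't Dask "prefer" Apache Parquet? That is, I thought the limitation of HDF5 is that it has to be stored as a single file, whereas Parquet files can be distributed. What may be happening is that Dask has to re-read the entire HDF5 file and convert it into a dataframe, and that is the part that takes so long, rather than .loc. Have you separated the two steps to see whether that might be the case?

But setting sorted_index=True forces npartitions=1. Is this the expected behavior?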