Python Dask DataFrame filter and repartition gives some empty partitions

python dataframe dask

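I am trying to filter a Dask DataFrame and then use map_partitions to apply a function to each partition. The function expects a pandas DataFrame with at least one row.

Here is the code that generates some dummy data as an MCVE (a pandas DataFrame, which is later converted to a Dask DataFrame)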

import numpy as np
import pandas as pd
import dask.dataframe as dd

def create_data(n):
    df = pd.DataFrame(np.random.rand(6 * n), columns=["A"])
    random_integers = np.random.default_rng().choice(14, size=n, replace=False)
    df.insert(0, 'store_id', [d for s in random_integers for d in [s] * 6])
    return df

df = create_data(n=10)
print(df.head(15))
>>>
    store_id         A
0         10  0.850730
1         10  0.581119
2         10  0.825802
3         10  0.657797
4         10  0.291961
5         10  0.864984
6          9  0.161334
7          9  0.397162
8          9  0.089300
9          9  0.435914
10         9  0.750741
11         9  0.920625
12         3  0.635727
13         3  0.425270
14         3  0.904043

Data structure: for each store_id, there are exactly 6 rows.

Now I create a list of some store_ids that I want to use to filter the data above

filtered_store_ids = df["store_id"].value_counts().index[:6].tolist()
print(filtered_store_ids)
>>> [13, 12, 11, 10, 9, 7]

Then I convert the data above (a pandas DataFrame) into a dask.DataFrame

ddf = dd.from_pandas(df, npartitions=10)

Now I print the number of rows in each partition of ddf

for p in range(ddf.npartitions):
    print(f"Partition Index={p}, Number of Rows={len(ddf.get_partition(p))}")
>>>
Partition Index=0, Number of Rows=6
Partition Index=1, Number of Rows=6
Partition Index=2, Number of Rows=6
Partition Index=3, Number of Rows=6
Partition Index=4, Number of Rows=6
Partition Index=5, Number of Rows=6
Partition Index=6, Number of Rows=6
Partition Index=7, Number of Rows=6
Partition Index=8, Number of Rows=6
Partition Index=9, Number of Rows=6

This is as expected. Each partition has 6 rows and a single (unique) store_id, so each partition contains the data for exactly one store_id.

Now I filter the Dask DataFrame using the list of store_ids above

ddf = ddf[ddf["store_id"].isin(filtered_store_ids)]

I print the partitions of the filtered ddf again

for p in range(ddf.npartitions):
    print(f"Partition Index={p}, Number of Rows={len(ddf.get_partition(p))}")
>>>
Partition Index=0, Number of Rows=0
Partition Index=1, Number of Rows=0
Partition Index=2, Number of Rows=6
Partition Index=3, Number of Rows=6
Partition Index=4, Number of Rows=0
Partition Index=5, Number of Rows=6
Partition Index=6, Number of Rows=6
Partition Index=7, Number of Rows=6
Partition Index=8, Number of Rows=0
Partition Index=9, Number of Rows=6

This is expected: each partition holds a single store_id, so filtering removes some partitions entirely and leaves them with zero rows.

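So now I repartition the filtered DataFrame, following the Dask DataFrame best practices.

I expected this repartition operation to produce only non-empty, evenly sized partitions. However, when I reprint the partitions, I get output similar to the previous one (unevenly sized partitions, some of them empty), as if the repartitioning had not happened
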
for p in range(ddf.npartitions):
    print(f"Partition Index={p}, Number of Rows={len(ddf.get_partition(p))}")
>>>
Partition Index=0, Number of Rows=0
Partition Index=1, Number of Rows=6
Partition Index=2, Number of Rows=6
Partition Index=3, Number of Rows=6
Partition Index=4, Number of Rows=12
Partition Index=5, Number of Rows=6

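My next step is to apply a function to each partition after filtering, but this does not work because some partitions (pandas DataFrames) cannot be processed, since they have no rows
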
def myadd(df):
    assert df.shape[0] > 0
    ...
    return ...

ddf.map_partitions(myadd)
>>> AssertionError                            Traceback (most recent call last)
.
.
.
AssertionError: 
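
If the per-partition function can be changed, one option is to let it return early on an empty partition instead of asserting. This is only a sketch: `myadd_safe` and the `A_plus_one` column are hypothetical, since the body of myadd is elided above.

```python
import pandas as pd

def myadd_safe(part: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant of myadd that tolerates empty partitions."""
    out = part.copy()
    if out.empty:
        # Keep the output schema identical to the non-empty case.
        out["A_plus_one"] = pd.Series(dtype="float64")
        return out
    out["A_plus_one"] = out["A"] + 1.0
    return out

# Behaviour on an empty and a non-empty pandas partition.
empty = pd.DataFrame({
    "store_id": pd.Series(dtype="int64"),
    "A": pd.Series(dtype="float64"),
})
full = pd.DataFrame({"store_id": [3, 3], "A": [0.25, 0.5]})
print(myadd_safe(empty).dtypes)
print(myadd_safe(full))
```

It would then be applied as `ddf.map_partitions(myadd_safe)`. This sidesteps the assertion but does not fix the uneven partition sizes.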

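The Dask documentation on repartitioning (the same best-practices page I linked above) makes this look straightforward, but after repartitioning I still get some partitions with zero rows, and map_partitions fails on them. I am sure I am missing something.
There are a few posts about repartitioning, but they do not deal with empty partitions.
Question
Is there a way to ensure that, after repartitioning, every partition has 6 rows and none is empty? In other words, is it possible to repartition a Dask DataFrame into equally sized (non-empty) partitions?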

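Edit
It looks like empty partitions currently cannot be handled in Dask, judging by existing issues. These may be related to the problem I am running into here.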
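I found two existing posts on SO:

One removes empty partitions (the cull_empty_partitions helper used below).
The other re-sizes the partitions (the _rebalance_ddf helper used below). Warning: this function requires a compute.

I used them to solve the problem. Starting from the original code in the question (no changes needed), I added

ddf = cull_empty_partitions(ddf)  # remove empties
ddf = _rebalance_ddf(ddf)         # re-size

When I now reprint the partition sizes, they are all even and none is empty
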
for p in range(ddf.npartitions):
    print(f"Partition Index={p}, Number of Rows={len(ddf.get_partition(p))}")
>>>
Partition Index=0, Number of Rows=6
Partition Index=1, Number of Rows=6
Partition Index=2, Number of Rows=6
Partition Index=3, Number of Rows=6
Partition Index=4, Number of Rows=6
Partition Index=5, Number of Rows=6