Python: Dask count/frequency of items in a column

python pandas dask

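I have a very large dataset (more than 10 million rows). Below is a small 5-row example on which I can get Pandas to count certain given terms in a column that holds lists of terms. On a single-core machine running Pandas, everything works fine and I get the expected result (10 rows). However, on the same small 5-row dataset shown here, when I experiment with Dask the count spits out more than 10 rows (depending on the number of partitions). Here is the code. I would appreciate it if someone could point out where I have misunderstood something or gone wrong.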
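
A minimal sketch of the kind of Pandas counting described above. Everything in it is an assumption made for illustration: the column name `terms`, the five sample rows, and the body of `compute_total` are hypothetical stand-ins, not the asker's code. It only mirrors the shape of the result discussed in the question, one row per (Term, Capability) pair.

```python
from collections import Counter

import pandas as pd

# Hypothetical 5-row dataset: each cell holds a list of terms.
df = pd.DataFrame({
    "terms": [
        ["channel"],
        ["channel", "findwindow"],
        ["findwindow", "topic"],
        ["topic", "privmsg"],
        ["printwindow", "privmsg", "topic"],
    ]
})
term_list = ["channel", "findwindow", "printwindow", "privmsg", "topic"]
cap_list = ["irc", "screenshot"]


def compute_total(frame, term_list, cap_list):
    """Hypothetical stand-in: count each term over all rows of `frame`."""
    counts = Counter(term for terms in frame["terms"] for term in terms)
    # One output row per (term, capability) pair; missing terms count as 0.
    rows = [(t, c, counts[t]) for t in term_list for c in cap_list]
    return pd.DataFrame(rows, columns=["Term", "Capability", "Count"])


count_df = compute_total(df, term_list, cap_list)
print(count_df)
print(count_df.shape)
```

Because the function always emits one row per (term, capability) pair, it returns 10 rows no matter how many input rows it sees, which matters once it is applied to each partition separately.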

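The Pandas output and the Dask output are listed further down.

Note: for npartitions I tried num_cores = 1 and got the expected result. If I change num_cores to any value greater than 1, I get results I don't understand. For example, with num_cores = 2 the resulting df has 20 rows (OK, I get that). With num_cores = 3 or 4 I still get 20 rows. With num_cores = 5 ... 16 I get 40 rows! I have not tried more.

Dask implementation: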

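import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import make_meta

# df, compute_total, term_list and cap_list come from the Pandas implementation
num_cores = 8
ddf = dd.from_pandas(df, npartitions=num_cores * 1)
# Empty frame describing the columns and dtypes each partition returns
meta = make_meta({'Term': 'U', 'Capability': 'U', 'Count': 'i8'}, index=pd.Index([], 'i8'))
# compute_total runs once per partition; the per-partition frames are concatenated
count_df = ddf.map_partitions(compute_total, term_list, cap_list, meta=meta).compute(scheduler='processes')
print(count_df)
print(count_df.shape)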

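Observation (on the Dask output): after looking at this rather long result dataframe, I figured I could run one final computation on it to get what I want: just group by Term and Capability and sum. That gives me the expected result (sort of).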

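However, I would like to know whether this can be done in a clean way with Dask. I know this is not an "embarrassingly parallel" problem: a global view of the whole dataset is needed to get the counts. So I have to handle it in a "map -> reduce" fashion, which is what I am doing now. Is there a cleaner way?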

I think you should use the result/gather function, which aggregates the results from each core.

I like this suggestion. I'll give it a try. If you have an example using result/gather, please share it. It would be very helpful.

Pandas output:

          Term  Capability  Count
0      channel         irc  2.0
1      channel  screenshot  2.0
2   findwindow         irc  2.0
3   findwindow  screenshot  2.0
4  printwindow         irc  1.0
5  printwindow  screenshot  1.0
6      privmsg         irc  2.0
7      privmsg  screenshot  2.0
8        topic         irc  3.0
9        topic  screenshot  3.0

Dask output:

          Term  Capability  Count
0      channel         irc    1.0
1      channel  screenshot    1.0
2   findwindow         irc    0.0
3   findwindow  screenshot    0.0
4  printwindow         irc    0.0
5  printwindow  screenshot    0.0
6      privmsg         irc    0.0
7      privmsg  screenshot    0.0
8        topic         irc    0.0
9        topic  screenshot    0.0
0      channel         irc    1.0
1      channel  screenshot    1.0
2   findwindow         irc    2.0
3   findwindow  screenshot    2.0
4  printwindow         irc    0.0
5  printwindow  screenshot    0.0
6      privmsg         irc    0.0
7      privmsg  screenshot    0.0
8        topic         irc    0.0
9        topic  screenshot    0.0
0      channel         irc    0.0
1      channel  screenshot    0.0
2   findwindow         irc    0.0
3   findwindow  screenshot    0.0
4  printwindow         irc    0.0
5  printwindow  screenshot    0.0
6      privmsg         irc    0.0
7      privmsg  screenshot    0.0
8        topic         irc    2.0
9        topic  screenshot    2.0
0      channel         irc    0.0
1      channel  screenshot    0.0
2   findwindow         irc    0.0
3   findwindow  screenshot    0.0
4  printwindow         irc    1.0
5  printwindow  screenshot    1.0
6      privmsg         irc    2.0
7      privmsg  screenshot    2.0
8        topic         irc    1.0
9        topic  screenshot    1.0
(40, 3)

# Combine the per-partition rows: one summed Count per (Term, Capability) pair
df1 = count_df.groupby(['Term', 'Capability'])['Count'].sum()