GroupBy.apply vs common aggregation when assigning to a dataframe (Python, Dask)

I have a sample dataframe like this:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import make_meta

df = pd.DataFrame({'http_user': ['user1'] * 10,
                   'dst': ['1111'] * 10,
                   'dst_port': [80] * 10,
                   'content': np.random.randint(0, 1024, size=10)})
ddf = dd.from_pandas(df, npartitions=5)
group = ddf.groupby(['http_user', 'dst', 'dst_port'])
meta_df = make_meta(('average', 'f8'))
meta_df.index = pd.MultiIndex(levels=[['user'], ['111'], [443]],
                              codes=[[]] * 3,
                              names=['http_user', 'dst', 'dst_port'])
with_apply = group.content.apply(lambda s: s.mean(), meta=meta_df)
without_apply = group.content.mean()
without_apply.to_frame('average').assign(average2=without_apply) # this works
without_apply.to_frame('average').assign(average2=with_apply) # This doesn't
The exception is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-120-190777069746> in <module>
----> 1 without_apply.to_frame('average').assign(average2=with_apply)
~/virtualenvs/statistics3/lib/python3.6/site-packages/dask/dataframe/core.py in assign(self, **kwargs)
3519 # Figure out columns of the output
   3520         df2 = self._meta_nonempty.assign(**_extract_meta(kwargs, nonempty=True))
-> 3521 return elemwise(methods.assign, self, *pairs, meta=df2)
3522
3523 @derived_from(pd.DataFrame, ua_args=["index"])
~/virtualenvs/statistics3/lib/python3.6/site-packages/dask/dataframe/core.py in elemwise(op, *args, **kwargs)
4273 from .multi import _maybe_align_partitions
4274
-> 4275 args = _maybe_align_partitions(args)
4276 dasks = [arg for arg in args if isinstance(arg, (_Frame, Scalar, Array))]
4277 dfs = [df for df in dasks if isinstance(df, _Frame)]
~/virtualenvs/statistics3/lib/python3.6/site-packages/dask/dataframe/multi.py in _maybe_align_partitions(args)
162 divisions = dfs[0].divisions
163 if not all(df.divisions == divisions for df in dfs):
--> 164 dfs2 = iter(align_partitions(*dfs)[0])
165 return [a if not isinstance(a, _Frame) else next(dfs2) for a in args]
166 return args
~/virtualenvs/statistics3/lib/python3.6/site-packages/dask/dataframe/multi.py in align_partitions(*dfs)
117 if not all(df.known_divisions for df in dfs1):
118 raise ValueError(
--> 119 "Not all divisions are known, can't align "
120 "partitions. Please use `set_index` "
121 "to set the index."
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
---------------------------------------------------------------------------
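For comparison (this is not part of the failing dask code, just a plain-pandas rewrite of the same assignment): in pandas both the `.mean()` and `.apply(...)` results carry the group MultiIndex, so `assign` aligns them and both variants succeed, which is why the dask behavior surprised me:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'http_user': ['user1'] * 10,
                   'dst': ['1111'] * 10,
                   'dst_port': [80] * 10,
                   'content': np.random.randint(0, 1024, size=10)})

g = df.groupby(['http_user', 'dst', 'dst_port'])
without_apply = g.content.mean()
with_apply = g.content.apply(lambda s: s.mean())

# Both Series share the same group MultiIndex, so assign aligns them.
out = without_apply.to_frame('average').assign(average2=with_apply)
```

Here `out` has one row (there is a single group) and identical `average` and `average2` columns.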
In my analysis, some of the aggregations can be done by building a custom Aggregation and using it on the group; the assignment then works. But some of them need extra data passed in, and as far as I can tell I can only use apply for those, since I cannot pass extra data to the aggregation callbacks (chunk, agg, finalize). From the documentation, using an Aggregation object seems to be the preferred approach. Is this behavior of apply expected?
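To illustrate what I mean by "passing extra data" (a minimal plain-pandas sketch, with a hypothetical `scaled_mean` function and `scale` parameter of my own invention): extra parameters can be bound into an aggregation callable with a closure or `functools.partial`, and part of my question is whether the same pattern is the intended way to parameterize dask's `Aggregation(chunk, agg, finalize)` callbacks:

```python
import functools

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'content': [1.0, 3.0, 10.0]})

# Hypothetical extra datum the aggregation needs: a scale factor.
def scaled_mean(s, scale):
    return s.mean() * scale

# Bind the extra argument with functools.partial, then aggregate per group.
result = df.groupby('key')['content'].agg(functools.partial(scaled_mean, scale=2.0))
# result['a'] -> (1.0 + 3.0) / 2 * 2.0 == 4.0
# result['b'] -> 10.0 * 2.0 == 20.0
```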