Python: how to count identical instances in a DataFrame column within a rolling window
I am trying to count the number of identical IDs within each sliding window for this data:
ID
DATE
2017-05-17 15:49:51 s_2
2017-05-17 15:49:52 s_5
2017-05-17 15:49:55 s_2
2017-05-17 15:49:56 s_3
2017-05-17 15:49:58 s_5
2017-05-17 15:49:59 s_5
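To make the question reproducible, the sample above can be rebuilt as follows. This is only a sketch: it assumes `DATE` is the index (a `DatetimeIndex`) and `ID` is the only column, which is how the printed frame reads.

```python
import pandas as pd

# Sample data copied from the question; DATE is assumed to be the index.
df = pd.DataFrame(
    {"ID": ["s_2", "s_5", "s_2", "s_3", "s_5", "s_5"]},
    index=pd.to_datetime(
        [
            "2017-05-17 15:49:51",
            "2017-05-17 15:49:52",
            "2017-05-17 15:49:55",
            "2017-05-17 15:49:56",
            "2017-05-17 15:49:58",
            "2017-05-17 15:49:59",
        ]
    ).rename("DATE"),
)
print(df)
```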
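I want to count how many times each ID occurs in overlapping rolling windows of size 3, where each window starts at the current row. The result should look like this: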
DATE ID s_2_count s_3_count s_5_count
2017-05-17 15:49:51 s_2 2 0 1
2017-05-17 15:49:52 s_5 1 1 1
2017-05-17 15:49:55 s_2 1 1 1
2017-05-17 15:49:56 s_3 0 1 2
2017-05-17 15:49:58 s_5 NaN NaN NaN
2017-05-17 15:49:59 s_5 NaN NaN NaN
Use str.get_dummies, rolling, sum, shift, and add_suffix:
df.ID.str.get_dummies().rolling(3).sum().shift(-2).add_suffix('_count')
Output:
s_2_count s_3_count s_5_count
DATE
2017-05-17 15:49:51 2.0 0.0 1.0
2017-05-17 15:49:52 1.0 1.0 1.0
2017-05-17 15:49:55 1.0 1.0 1.0
2017-05-17 15:49:56 0.0 1.0 2.0
2017-05-17 15:49:58 NaN NaN NaN
2017-05-17 15:49:59 NaN NaN NaN
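The one-liner is easier to follow when split into its steps. The sketch below rebuilds the sample frame (assuming `DATE` is the index) and names each intermediate result; the variable names are mine, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": ["s_2", "s_5", "s_2", "s_3", "s_5", "s_5"]},
    index=pd.to_datetime(
        [
            "2017-05-17 15:49:51",
            "2017-05-17 15:49:52",
            "2017-05-17 15:49:55",
            "2017-05-17 15:49:56",
            "2017-05-17 15:49:58",
            "2017-05-17 15:49:59",
        ]
    ).rename("DATE"),
)

# 1. One 0/1 indicator column per distinct ID.
dummies = df["ID"].str.get_dummies()

# 2. Sum over the current row and the two rows before it (trailing window).
trailing = dummies.rolling(3).sum()

# 3. Shift the sums up by two rows so each window starts at the current row.
#    The last two rows have no complete window and become NaN.
leading = trailing.shift(-2)

# 4. Rename the columns to s_2_count, s_3_count, s_5_count.
result = leading.add_suffix("_count")
print(result)
```

Because `rolling(3)` counts rows rather than seconds, the gaps between timestamps play no role here.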
Let's assign it back to the DataFrame:
df.assign(**df.ID.str.get_dummies().rolling(3).sum().shift(-2).add_suffix('_count'))
Or use join:
df.join(df.ID.str.get_dummies().rolling(3).sum().shift(-2).add_suffix('_count'))
Output:
ID s_2_count s_3_count s_5_count
DATE
2017-05-17 15:49:51 s_2 2.0 0.0 1.0
2017-05-17 15:49:52 s_5 1.0 1.0 1.0
2017-05-17 15:49:55 s_2 1.0 1.0 1.0
2017-05-17 15:49:56 s_3 0.0 1.0 2.0
2017-05-17 15:49:58 s_5 NaN NaN NaN
2017-05-17 15:49:59 s_5 NaN NaN NaN
Option 2: use pd.crosstab
df.assign(**pd.crosstab(df.index,df.ID).rolling(3).sum().shift(-2))
Or use join:
df.join(pd.crosstab(df.index,df.ID).rolling(3).sum().shift(-2))
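As a quick sanity check, the crosstab variant can be compared with the get_dummies variant on the sample data. This is a sketch under the assumption that `DATE` is the index and every timestamp is unique; since `pd.crosstab` groups by index value, I would expect rows with duplicate timestamps to be merged, in which case the two approaches would no longer agree.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"ID": ["s_2", "s_5", "s_2", "s_3", "s_5", "s_5"]},
    index=pd.to_datetime(
        [
            "2017-05-17 15:49:51",
            "2017-05-17 15:49:52",
            "2017-05-17 15:49:55",
            "2017-05-17 15:49:56",
            "2017-05-17 15:49:58",
            "2017-05-17 15:49:59",
        ]
    ).rename("DATE"),
)

# Option 1: indicator columns from the string accessor.
via_dummies = df["ID"].str.get_dummies().rolling(3).sum().shift(-2)

# Option 2: a timestamp-by-ID frequency table, then the same rolling logic.
via_crosstab = pd.crosstab(df.index, df["ID"]).rolling(3).sum().shift(-2)

# Compare raw values only; the two frames carry different index/column names.
same = np.array_equal(
    via_dummies.to_numpy(), via_crosstab.to_numpy(), equal_nan=True
)
print(same)
```

Note that Option 2 as written does not call `add_suffix`, so the new columns are named `s_2`, `s_3`, `s_5` rather than `s_2_count` and so on.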
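@Ali If you want everything in columns, you can reset the index.
Thank you very much! This is a very clever approach. May I ask another question: are **df.ID and **pd.crosstab pointers? Do you have any material on using **?
@Ali I don't think ** dictionary unpacking of a DataFrame is documented, so I updated this solution with the join option.
Thank you very much! I understand the code now, but if possible, could you briefly explain the ** notation?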
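On the ** question: it is not a pointer. In a function call, `**obj` unpacks a mapping into keyword arguments, so `f(**{"a": 1})` is the same as `f(a=1)`. A DataFrame behaves like a mapping from column label to column, which is why `df.assign(**other)` passes each column of `other` as `name=Series`. A minimal sketch (the helper `show` and the tiny frames are made up for illustration):

```python
import pandas as pd


def show(**kwargs):
    # kwargs arrives as a plain dict of the keyword arguments received.
    return sorted(kwargs)


counts = pd.DataFrame({"s_2_count": [2, 1], "s_3_count": [0, 1]})

# ** asks the object for .keys() and then looks up obj[key] for each key.
print(show(**{"a": 1, "b": 2}))
print(show(**counts))

base = pd.DataFrame({"ID": ["s_2", "s_5"]})

# These two calls are intended to be equivalent spellings.
explicit = base.assign(
    s_2_count=counts["s_2_count"], s_3_count=counts["s_3_count"]
)
unpacked = base.assign(**counts)
print(explicit.equals(unpacked))
```

This only works when the column labels are strings, since keyword argument names must be strings. `join` avoids that restriction.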