Python: how to count consecutive ordered values in a DataFrame

python pandas

I am trying to get the maximum count of consecutive 0 values per id from a pandas DataFrame with the columns id, date and value, as shown below:

id    date       value
354   2019-03-01 0
354   2019-03-02 0
354   2019-03-03 0
354   2019-03-04 5
354   2019-03-05 5 
354   2019-03-09 7
354   2019-03-10 0
357   2019-03-01 5
357   2019-03-02 5
357   2019-03-03 8
357   2019-03-04 0
357   2019-03-05 0
357   2019-03-06 7
357   2019-03-07 7
540   2019-03-02 7
540   2019-03-03 8
540   2019-03-04 9
540   2019-03-05 8
540   2019-03-06 7
540   2019-03-07 5
540   2019-03-08 2 
540   2019-03-09 3
540   2019-03-10 2

The desired result is grouped by id, like this:

id   max_consecutive_zeros
354  3
357  2
540  0

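I have already implemented what I want with a for loop, but it becomes very slow on large pandas DataFrames. I have found some similar solutions, but none of them solves my problem.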

Here is one approach: we need to create an additional key for `groupby`, and then simply `groupby` on this key and `id`.

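# Additional key: within each id the counter increases at every non-zero row, so consecutive zeros share one key
s = df.value.ne(0).groupby(df.id).cumsum()
# Size of each zero run, then the largest per id; ids without any zero are filled with 0
df[df.value == 0].groupby([df.id, s]).size().groupby(level=0).max().reindex(df.id.unique(), fill_value=0)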
Out[267]: 
id
354    3
357    2
540    0
dtype: int64
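To make the extra key easier to follow, here is a minimal step-by-step sketch of the same idea. It assumes the rows are already sorted by date within each id (the date column is left out for brevity), and the names `key`, `zeros` and `run_lengths` are only illustrative, not part of the answer above.

```python
import pandas as pd

# Sample values copied from the question (dates omitted, rows assumed sorted by date)
df = pd.DataFrame({
    "id": [354] * 7 + [357] * 7 + [540] * 9,
    "value": [0, 0, 0, 5, 5, 7, 0,
              5, 5, 8, 0, 0, 7, 7,
              7, 8, 9, 8, 7, 5, 2, 3, 2],
})

# Run key: within each id the counter goes up by one at every non-zero row,
# so all rows of one uninterrupted run of zeros end up with the same key.
key = df["value"].ne(0).groupby(df["id"]).cumsum().rename("run")

# Keep only the zero rows and count how many rows each (id, run) pair has.
zeros = df[df["value"].eq(0)]
run_lengths = zeros.groupby(["id", key.loc[zeros.index]]).size()

# Longest run per id; ids that never contain a zero are added back with 0.
result = (
    run_lengths.groupby(level="id").max()
    .reindex(df["id"].unique(), fill_value=0)
)
print(result)
```

Printing `key` next to `df` is probably the quickest way to see why grouping on it isolates each run of zeros.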

You can do:

df.groupby('id').value.apply(lambda x: ((x.diff() != 0).cumsum()).where(x == 0,
    np.nan).value_counts().max()).fillna(0)

Output

id
354    3.0
357    2.0
540    0.0
Name: value, dtype: float64
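The lambda in this answer can be unpacked for a single id. The sketch below uses the values of id 354 from the question; the intermediate names (`run_id`, `zero_runs`, `longest`) are only illustrative. It also hints at why the result is float and why `fillna(0)` is needed: a group without any zero has nothing to count.

```python
import numpy as np
import pandas as pd

x = pd.Series([0, 0, 0, 5, 5, 7, 0])  # values of id 354 from the question

# A new label starts whenever the value differs from the previous row.
run_id = (x.diff() != 0).cumsum()

# Keep the label only on rows where the value is 0; everything else becomes NaN.
zero_runs = run_id.where(x == 0, np.nan)

# value_counts ignores NaN, so this is the size of the longest run of zeros.
longest = zero_runs.value_counts().max()

# A group with no zeros at all (e.g. the start of id 540): nothing is left to count.
y = pd.Series([7, 8, 9])
no_zeros = (y.diff() != 0).cumsum().where(y == 0, np.nan).value_counts().max()

print(longest, no_zeros)
```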

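Create a groupID `m` for consecutive rows with the same value. Next, `groupby` on `id` and `m`, call `value_counts`, and use `.loc` on the MultiIndex to slice only the `0` value of the rightmost index level. Finally, filter out the duplicated index entries with `duplicated` on `id`, and reindex to create a 0 value for every `id` that has no `0` count.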

m = df.value.diff().ne(0).cumsum().rename('gid')
# Consecutive rows with the same value are assigned the same number by this command.
# That number identifies a group of consecutive rows with the same value, so I call it groupID.

df1 = df.groupby(['id', m]).value.value_counts().loc[:,:,0].droplevel(-1)
# This groupby puts consecutive rows of the same value, per id, into separate groups.
# Within each group, count each value and use `.loc` to pick only `0`,
# because we only care about the count of value `0`.

df1 = df1.sort_values(ascending=False)
df1[~df1.index.duplicated()].reindex(df.id.unique(), fill_value=0)
# There can be several groups of value `0` per `id`, and we want only the group with the highest count.
# Sorting the counts in descending order puts the longest group of each `id` first, so slicing
# with the True/False mask of `duplicated` keeps exactly that entry.
# Finally, `reindex` adds any `id` that has no value 0 in the original `df`.
# Note: `id` is the column `id` in `df`. It is different from the groupID `m` created for the groupby.

Out[315]:
id
354    3
357    2
540    0
Name: value, dtype: int64
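For readers wondering what `m` actually contains, the following sketch prints it next to a small made-up frame (the data here is hypothetical, chosen so that a run of zeros spans two ids). It suggests why the groupby uses both `id` and `m`: the groupID is computed over the whole column, so it does not restart at an id boundary on its own.

```python
import pandas as pd

# Hypothetical toy data: the zeros at positions 3 and 4 belong to different ids
df = pd.DataFrame({
    "id": [354, 354, 354, 354, 357, 357, 357],
    "value": [0, 0, 5, 0, 0, 7, 7],
})

# groupID: increases by one every time the value differs from the previous row
m = df["value"].diff().ne(0).cumsum().rename("gid")
print(df.assign(gid=m))

# Grouping on id and gid together splits the shared run at the id boundary
sizes = df.groupby(["id", m]).size()
print(sizes)
```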

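It's too abstract for me (I'm not used to lambda and the like). Could you explain what it does? I don't know what m is doing there; could you explain a bit more?

@Wel: Sorry, I didn't see your comment until StackOverflow notified me today. Do you understand the code above now?

Not completely, but I've tried it with the example above and it works the same as my loop, 10000000000 times faster. I just haven't understood what m is yet.

@Wel: I'm not good with words, but I've done my best to add some explanations in the code. Please check the update. You can also unchain each command above and run each one separately in the console to see each output and understand what it does.

Just one more question: what is the reason for droplevel(-1)? What is the motivation for using a negative value?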
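On the last question, a small sketch may help. `droplevel` accepts level positions as well as level names, and negative positions count from the end, as with Python list indexing, so `-1` refers to the last (rightmost) index level. The Series below is a made-up stand-in for the intermediate result, assuming its remaining levels are `id` and `gid`.

```python
import pandas as pd

# Hypothetical intermediate result: zero-run counts indexed by (id, gid)
idx = pd.MultiIndex.from_tuples([(354, 1), (354, 4), (357, 5)], names=["id", "gid"])
counts = pd.Series([3, 1, 2], index=idx)

# -1 means "the last level", here gid; only id remains in the index
by_position = counts.droplevel(-1)

# Dropping the same level by name gives the same result
by_name = counts.droplevel("gid")

print(by_position)
```

Using `-1` simply avoids having to know how many levels there are or what the last one is called.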