Grouping data into consecutive date ranges by id and var1 with pandas in Python


python pandas dataframe


I have some data that looks like this:

df_raw_dates = pd.DataFrame({"id": [102, 102, 102, 103, 103, 103, 104],
                             "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'],
                             "val": [9, 2, 4, 7, 6, 3, 2],
                             "dates": [pd.Timestamp(2020, 1, 1),
                                       pd.Timestamp(2020, 1, 1),
                                       pd.Timestamp(2020, 1, 2),
                                       pd.Timestamp(2020, 1, 2),
                                       pd.Timestamp(2020, 1, 3),
                                       pd.Timestamp(2020, 1, 5),
                                       pd.Timestamp(2020, 3, 12)]})

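I would like to group this data by id and var1 wherever the dates are consecutive; if a day is missed, I want to start a new record.

For example, the final output should be: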

df_end_result = pd.DataFrame({"id": [102, 102, 103, 103, 104],
                              "var1": ['a', 'b', 'b', 'a', 'c'],
                              "val": [13, 2, 13, 3, 2],
                              "start_date": [pd.Timestamp(2020, 1, 1),
                                             pd.Timestamp(2020, 1, 1),
                                             pd.Timestamp(2020, 1, 2),
                                             pd.Timestamp(2020, 1, 5),
                                             pd.Timestamp(2020, 3, 12)],
                              "end_date": [pd.Timestamp(2020, 1, 2),
                                           pd.Timestamp(2020, 1, 1),
                                           pd.Timestamp(2020, 1, 3),
                                           pd.Timestamp(2020, 1, 5),
                                           pd.Timestamp(2020, 3, 12)]})

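I have tried several approaches but keep failing. The length of time something can exist is unknown, and the number of possible var1 values can vary with each id and date window.

For example, I tried to identify consecutive days like this, but it always returns ['count_days'] == 0 (clearly something is wrong!). I then thought I could use date(min) and date(min) + count_days to get 'start_date' and 'end_date'.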

s = df_raw_dates.groupby(['id','var1']).dates.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['count_days'] = np.where(s1, s1.groupby(df.id).cumsum(), 0)

I also tried:

df = df_raw_dates.groupby(['id', 'var1']).agg({'val': 'sum', 'dates': ['first', 'last']}).reset_index()

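This gets me closer, but I don't think it deals with consecutive days at all; it just gives the earliest and latest day per group, which unfortunately isn't something I can build on.

Edit: adding more context

Another approach was: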

df = df_raw_dates.groupby(['id', 'dates']).size().reset_index().rename(columns={0: 'del'}).drop('del', axis=1)

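This gives a list of ids and dates, but I'm stuck on finding the min and max of each run of consecutive dates within that new window.

The extended example has a break in the date range of the (102, 'a') group.

Further example

This uses wwii's answer below: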

import pandas as pd
import collections

df_raw_dates1 = pd.DataFrame(
    {
        "id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
        "var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
        "val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
        "dates": [
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18)

        ],
    }
)

day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)

for k,g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(g.val.sum())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue

    for _,date_range in date_groups:
        start,end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.sum()
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))

>>>    id var1   val      start        end
0   100    a   0.0 2021-01-22 2021-01-22
1   100    b   0.0 2021-01-22 2021-01-22
2   100    d   0.0 2021-01-22 2021-01-22

3   105    a   0.0 2021-01-22 2021-01-22
4   105    a   1.0 2021-01-21 2021-01-21
5   105    a   0.0 2021-01-20 2021-01-20
6   105    a  10.0 2021-01-19 2021-01-19

7   105    b   2.0 2021-01-22 2021-01-22
8   105    b   9.0 2021-01-21 2021-01-21
9   105    b   5.0 2021-01-20 2021-01-20
10  105    b  13.0 2021-01-19 2021-01-19

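Based on the above, I would expect rows 3, 4, 5 and 6 to be grouped together, and likewise rows 7, 8, 9 and 10. I don't know why this example now fails.

I'm not sure how this example differs from the extended example above; why does this not seem to work?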
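
One likely cause, offered as a reading of the output rather than a confirmed diagnosis: in this example the dates within each group are in descending order, so neighbouring rows differ by -1 day and the test `diff() != day` marks every row as a break. The sketch below uses a hypothetical three-row frame `demo` (a miniature of the (105, 'a') group) to show the effect, and how sorting the dates in ascending order first changes it.

```python
import pandas as pd

day = pd.Timedelta('1d')

# Hypothetical miniature of the (105, 'a') group, dates descending as in the example
demo = pd.DataFrame({
    "id": [105, 105, 105],
    "var1": ["a", "a", "a"],
    "val": [0, 1, 0],
    "dates": [pd.Timestamp(2021, 1, 22),
              pd.Timestamp(2021, 1, 21),
              pd.Timestamp(2021, 1, 20)],
})

# Descending dates: every difference is -1 day (or NaT), so every row is a "break"
breaks_desc = demo["dates"].diff() != day
print(breaks_desc.cumsum().tolist())  # [1, 2, 3] -> three separate blocks

# Ascending dates: only the first row (NaT difference) starts a block
demo_sorted = demo.sort_values("dates")
breaks_asc = demo_sorted["dates"].diff() != day
print(breaks_asc.cumsum().tolist())  # [1, 1, 1] -> one block
```

If this is indeed the cause, sorting each group with `sort_values("dates")` before computing the differences should let the rows fall into a single range.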

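I don't have pandas superpowers, so I never try to do group-by work in a one-liner; maybe someday.

Adapted from the accepted answer to an SO question: first group by ['id', 'var1']; then, for each group, group by consecutive date ranges.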

import pandas as pd
sep = "************************************\n"
day = pd.Timedelta('1d')
# using the extended example in the question.
gb = df_raw_dates.groupby(['id', 'var1'])

for k,g in gb:
    print(g)
    dt = g['dates']
    # find difference in days between rows
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)

    # create a Series to identify consecutive ranges to group by
    # this cumsum trick can be found in many SO answers
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    # split into date ranges
    date_groups = g.groupby(groups)
    for _,date_range in date_groups:
        print(date_range)
    print(sep)

You can see that the (102, 'a') group has been split into two groups:

    id var1  val      dates
0  102    a    9 2020-01-01
2  102    a    4 2020-01-02
7  102    a    1 2020-01-03
     id var1  val      dates
8   102    a    2 2020-01-07
9   102    a    3 2020-01-08
10  102    a    4 2020-01-09
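
The cumsum trick used in the loop can be looked at in isolation. This is a minimal sketch on a hypothetical Series `s` of dates that is already sorted in ascending order: a row is flagged as a break when it does not follow the previous row by exactly one day, and the running total of those flags gives every run of consecutive days its own label.

```python
import pandas as pd

day = pd.Timedelta('1d')

# Hypothetical sorted dates: two runs, 1-3 Jan and 7-8 Jan
s = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                              "2020-01-07", "2020-01-08"]))

# True wherever a row does not follow the previous row by exactly one day
# (the first row has a NaT difference, so it is also flagged)
breaks = s.diff() != day

# The cumulative sum of the flags labels each run
labels = breaks.cumsum()
print(labels.tolist())  # [1, 1, 1, 2, 2]

# Grouping by the labels gives the first and last date of each run
ranges = s.groupby(labels).agg(["min", "max"])
print(ranges)
```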

Going further: during the iteration, build a dictionary and use it to create a new DataFrame.

import pandas as pd
import collections
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k,g in gb:
    # print(g)
    eyed,var = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(g.val.mean())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue

    for _,date_range in date_groups:
        start,end = date_range['dates'].min(),date_range['dates'].max()
        val = date_range.val.mean()
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)


print(pd.DataFrame(new_df))

>>>
    id var1        val      start        end
0  102    a   4.666667 2020-01-01 2020-01-03
1  102    a   3.000000 2020-01-07 2020-01-09
2  102    b   2.000000 2020-01-01 2020-01-01
3  103    a   3.000000 2020-01-05 2020-01-05
4  103    b   6.500000 2020-01-02 2020-01-03
5  104    c   2.000000 2020-03-12 2020-03-12
6  108    a  99.000000 2020-01-21 2020-01-25

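It looks tedious; maybe someone will offer a less verbose solution. Perhaps some of the operations could be put into functions and used with .apply, .transform or .pipe to make it a bit cleaner.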
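
One possible less verbose variant is sketched here; it is not a tested drop-in replacement for the loop. It assumes the small data set from the question with a `dates` column, at most one row per date within each ('id', 'var1') group, and it sums `val` as in the question's expected output (the loop above uses the mean). The helper column `block` is a name introduced only for this sketch.

```python
import pandas as pd

df_raw_dates = pd.DataFrame({
    "id": [102, 102, 102, 103, 103, 103, 104],
    "var1": ["a", "b", "a", "b", "b", "a", "c"],
    "val": [9, 2, 4, 7, 6, 3, 2],
    "dates": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02",
                             "2020-01-02", "2020-01-03", "2020-01-05",
                             "2020-03-12"]),
})

day = pd.Timedelta("1d")

# Sort so that diff() compares each date with the previous date of the same group
df = df_raw_dates.sort_values(["id", "var1", "dates"])

# A new block starts wherever the gap to the previous row of the group is not one day
# (the first row of each group has a NaT difference, which also counts as a break)
new_block = df.groupby(["id", "var1"])["dates"].diff() != day
df["block"] = new_block.cumsum()  # helper column, hypothetical name

# Aggregate each block into one record
result = (
    df.groupby(["id", "var1", "block"], as_index=False)
      .agg(val=("val", "sum"), start=("dates", "min"), end=("dates", "max"))
      .drop(columns="block")
)
print(result)
```

Because the frame is sorted before the differences are taken, the order of the input rows should not matter for this variant.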

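It does not account for ('id', 'var1') groups that have multiple dates but only a single date range. In the output above, the group below is reported as the single range 2020-01-21 to 2020-01-25 even though its two dates are four days apart, e.g.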

     id var1  val      dates
11  108    a   99 2020-01-21
12  108    a   99 2020-01-25

You may need to detect whether there are any gaps in the datetime series and use that fact to adjust.
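
A minimal sketch of such a gap check follows. The function name `has_gap` is hypothetical, and the check assumes that a "gap" means two neighbouring dates, after sorting, that are more than one day apart.

```python
import pandas as pd

day = pd.Timedelta("1d")

def has_gap(dates: pd.Series) -> bool:
    """Return True if any two neighbouring dates (after sorting) are more than one day apart."""
    diffs = dates.sort_values().diff().dropna()
    return bool((diffs > day).any())

# The (108, 'a') group from the example above: two dates four days apart
g108 = pd.Series(pd.to_datetime(["2020-01-21", "2020-01-25"]))
print(has_gap(g108))  # True

# A run of consecutive days has no gap
run = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]))
print(has_gap(run))  # False
```

A group for which `has_gap` returns True could then be split into one record per run instead of being reported as a single range.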

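It may help to mention or show your failed attempts and explain why the results don't work; you may be close, and one of them may just need a tweak. You are trying to do several things; which one is giving you the most trouble? For example: "How do I split a series of data into consecutive ranges?" We generally expect one question per question and discourage "please implement this for me" requests. Please read the linked pages and the other links on them.

Thanks @wwii, I added more context to the question. Sorry it isn't in the right format; this is my first time posting here :)

@wwii I added more context showing exactly where I am and what I have tried. There are a few approaches there; I don't know whether that helps or makes it more confusing. I hope it helps, thanks in advance.

Fifty thousand rows doesn't sound like much. If you have to do this for many DataFrames, multiple processes might help. I would have to study it to find efficiencies; I really didn't focus on that. With pandas you gain speed by operating on whole columns at once. I solved your problem by finding groups within groups, which doesn't sound great, but I was only focused on finding a solution. When I decided to build a dictionary first I didn't much like it, so there may be something to gain there. Also, I simply picked the first workable solution for finding the date ranges.

Thanks. No, I don't think it's too much either, but it feels rather slow. I also noticed that the example I just added to the question has a problem, and I can't work out why: the "extended" example you added works in every case except the one listed. I thought it was because val was 0, but that doesn't affect the solution.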
