Python: group IDs by consecutive days

This is a follow-up to a previous question:

I have a dataset of IDs and values per day, and I want to group all consecutive days together into a single row; if a day is missing, I want to start a new row.

This works for a simple example, but for the example below the grouping fails and I cannot figure out why:

import pandas as pd
import collections

df_raw_dates1 = pd.DataFrame(
    {
        "id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
        "var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
        "val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
        "dates": [
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18)

        ],
    }
)

day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)

for k,g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(g.val.sum())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue

    for _,date_range in date_groups:
        start,end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.sum()
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))

>>>    id var1   val      start        end
0   100    a   0.0 2021-01-22 2021-01-22
1   100    b   0.0 2021-01-22 2021-01-22
2   100    d   0.0 2021-01-22 2021-01-22

3   105    a   0.0 2021-01-22 2021-01-22
4   105    a   1.0 2021-01-21 2021-01-21
5   105    a   0.0 2021-01-20 2021-01-20
6   105    a  10.0 2021-01-19 2021-01-19

7   105    b   2.0 2021-01-22 2021-01-22
8   105    b   9.0 2021-01-21 2021-01-21
9   105    b   5.0 2021-01-20 2021-01-20
10  105    b  13.0 2021-01-19 2021-01-19
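
For reference, the grouping trick in this thread boils down to one idea: on a sorted date column, any day-to-day difference other than exactly one day marks the start of a new run, and a cumulative sum of those break points yields one label per run. A minimal sketch on a toy series (illustrative only, not taken from the question):

import pandas as pd

dates = pd.Series(pd.to_datetime(['2021-01-18', '2021-01-19', '2021-01-20', '2021-01-22']))
breaks = dates.diff() != pd.Timedelta('1d')  # True wherever a new run starts (first row included)
print(breaks.cumsum().tolist())              # [1, 1, 1, 2] -> runs 18-20 and 22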

Try the following solution:

# the answer works on the question's frame; defining df from df_raw_dates1 is assumed here, not part of the original answer
df = df_raw_dates1.copy()
# create temporary Group ID (to make the following steps more clear)
df['gid'] = df['var1'] + df['id'].astype(str)
# sort values
df.sort_values(['gid', 'dates'], inplace=True)
# calculate difference between days
# (there are some differences between different groups, but they are irrelevant)
df['diff'] = df.groupby(['gid'])['dates'].diff().dt.days

# walk through all gid groups to create groups of consecutive dates within gid 
for g in df.groupby(['gid']):
    e = 0
    for r in g[1].itertuples():
        # when diff > 1 then dates are not consecutive -> increment counter for new group
        if r.diff and r.diff > 1:
            e += 1
        df.loc[r.Index, 'gid'] = r.gid + str(e)
    
# end dates are max values within new gid groups
df['end_dates'] = df.groupby(['gid'])['dates'].transform(max)
# rename dates column
df.rename({'dates': 'start_dates'}, axis=1, inplace=True)
# sum up val column within new gid groups
df['val'] = df.groupby(['gid'])['val'].transform(sum)
# remove all duplicate rows - first row in each gid group contains 
# correct start and end dates, others are irrelevant
df.drop_duplicates(['gid'], keep='first', inplace=True)
# remove all temporary columns (if needed)
df.drop(['gid', 'diff'], axis=1, inplace=True)

df.sort_values(['var1'], inplace=True)
df.reset_index(inplace=True, drop=True)

Output:

id  var1    val start_dates end_dates
0   100 a   0   2021-01-22  2021-01-22
1   100 b   0   2021-01-21  2021-01-22
2   100 d   0   2021-01-22  2021-01-22
3   105 a   13  2021-01-18  2021-01-22
4   105 b   44  2021-01-18  2021-01-22
5   105 c   1   2021-01-18  2021-01-18
6   105 c   1   2021-01-20  2021-01-22
7   105 d   22  2021-01-18  2021-01-22
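
As a side note, the same grouping can be done without the explicit Python loop over gid groups: the gap test and the cumulative sum can be computed per (id, var1) pair with vectorized pandas operations. A rough sketch, assuming the df_raw_dates1 frame from the question (the block column and the out name are illustrative, not part of the answer above):

import pandas as pd

df2 = df_raw_dates1.sort_values(['id', 'var1', 'dates'])
# a gap of more than one day (or the first row of a pair) starts a new block
gap = df2.groupby(['id', 'var1'])['dates'].diff() != pd.Timedelta('1d')
df2['block'] = gap.cumsum()

out = (
    df2.groupby(['id', 'var1', 'block'], as_index=False)
       .agg(val=('val', 'sum'), start=('dates', 'min'), end=('dates', 'max'))
       .drop(columns='block')
)
print(out)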

First of all, the answer you adapted from your previous question relies on the dates within each (id, var1) pair being in increasing order. You can fix that with the following line:

df_raw_dates1 = df_raw_dates1.sort_values(by=['id', 'var1', 'dates'])

Once that sorting is done, this step works. You still have a problem with (id=105, var1=c), because those days are not all consecutive. I fixed that by removing the in_block filter. The start of the for loop now looks like this:

for k,g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    breaks = dt.diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    ...

With these modifications, I get the following output:

     id var1  val      start        end
0  100    a    0 2021-01-22 2021-01-22
1  100    b    0 2021-01-21 2021-01-22
2  100    d    0 2021-01-22 2021-01-22
3  105    a   13 2021-01-18 2021-01-22
4  105    b   44 2021-01-18 2021-01-22
5  105    c    1 2021-01-18 2021-01-18
6  105    c    1 2021-01-20 2021-01-22
7  105    d   22 2021-01-18 2021-01-22
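
Putting the two fixes together (sort first, then drop the in_block filter), the complete loop might look like the sketch below. It assumes the df_raw_dates1 frame defined in the question and is not verbatim from the answer:

import collections
import pandas as pd

day = pd.Timedelta('1d')
df_sorted = df_raw_dates1.sort_values(by=['id', 'var1', 'dates'])

new_df = collections.defaultdict(list)
for (eyed, var1), g in df_sorted.groupby(['id', 'var1']):
    # a difference other than one day (including each pair's first row) starts a new block
    breaks = g['dates'].diff() != day
    for _, date_range in g.groupby(breaks.cumsum()):
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(date_range['val'].sum())
        new_df['start'].append(date_range['dates'].min())
        new_df['end'].append(date_range['dates'].max())

print(pd.DataFrame(new_df))

With this version the special case for groups that contain only one date is no longer needed, because a lone date simply ends up in a block of its own.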

Does this answer your question? Please post the desired output.