Python: group by id over consecutive days
This is a follow-up to a previous question: I have a dataset with an ID and a value for each day, and I want to group all consecutive days together into a single row; whenever a day is missing, I want to start a new row. This works for a simple example, but for the example below the grouping fails and I cannot work out why.
import pandas as pd
import collections
df_raw_dates1 = pd.DataFrame(
    {
        "id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
        "var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
        "val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
        "dates": [
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18)
        ],
    }
)
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k,g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accommodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(g.val.sum())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue
    for _,date_range in date_groups:
        start,end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.sum()
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)
print(pd.DataFrame(new_df))
Output:
     id var1   val      start        end
0   100    a   0.0 2021-01-22 2021-01-22
1   100    b   0.0 2021-01-22 2021-01-22
2   100    d   0.0 2021-01-22 2021-01-22
3   105    a   0.0 2021-01-22 2021-01-22
4   105    a   1.0 2021-01-21 2021-01-21
5   105    a   0.0 2021-01-20 2021-01-20
6   105    a  10.0 2021-01-19 2021-01-19
7   105    b   2.0 2021-01-22 2021-01-22
8   105    b   9.0 2021-01-21 2021-01-21
9   105    b   5.0 2021-01-20 2021-01-20
10  105    b  13.0 2021-01-19 2021-01-19
Try the following solution:
# assumption: df is a copy of the question's df_raw_dates1
df = df_raw_dates1.copy()
# create temporary Group ID (to make following steps more clear)
df['gid'] = df['var1'] + df['id'].astype(str)
# sort values
df.sort_values(['gid', 'dates'], inplace=True)
# calculate difference between days
# (there are some differences between different groups, but they are irrelevant)
df['diff'] = df.groupby(['gid'])['dates'].diff().dt.days
# walk through all gid groups to create groups of consecutive dates within gid
for _, grp in df.groupby('gid'):
    e = 0
    for r in grp.itertuples():
        # when diff > 1 then dates are not consecutive -> increment counter for new group
        if r.diff and r.diff > 1:
            e += 1
        df.loc[r.Index, 'gid'] = r.gid + str(e)
# end dates are max values within new gid groups
# (string names avoid the warning newer pandas emits for the builtins max/sum)
df['end_dates'] = df.groupby(['gid'])['dates'].transform('max')
# rename dates column
df.rename({'dates': 'start_dates'}, axis=1, inplace=True)
# sum up val column within new gid groups
df['val'] = df.groupby(['gid'])['val'].transform('sum')
# remove all duplicate rows - first row in each gid group contains
# correct start and end dates, others are irrelevant
df.drop_duplicates(['gid'], keep='first', inplace=True)
# remove all temporary columns (if needed)
df.drop(['gid', 'diff'], axis=1, inplace=True)
# sort by id, var1 and start date so the rows come out in the order shown in the output
df.sort_values(['id', 'var1', 'start_dates'], inplace=True)
df.reset_index(inplace=True, drop=True)
Output:
    id var1  val start_dates  end_dates
0  100    a    0  2021-01-22 2021-01-22
1  100    b    0  2021-01-21 2021-01-22
2  100    d    0  2021-01-22 2021-01-22
3  105    a   13  2021-01-18 2021-01-22
4  105    b   44  2021-01-18 2021-01-22
5  105    c    1  2021-01-18 2021-01-18
6  105    c    1  2021-01-20 2021-01-22
7  105    d   22  2021-01-18 2021-01-22
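First, the answer to your previous question relied on the dates being in increasing order within each (id, var1) pair, which is not the case here. You can fix this by adding the following line before the groupby: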
df_raw_dates1=df_raw_dates1.sort_values(by=['id','var1','dates'])
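To see why the ordering matters, here is a minimal sketch. The three-date series is made up for illustration (it is not taken from the question's data), but it is in descending order like the question's dates. Without sorting, every difference is -1 day, so each row is treated as a break:

```python
import pandas as pd

day = pd.Timedelta("1d")

# Hypothetical mini-series in descending order, as in the question's data.
dates = pd.Series(pd.to_datetime(["2021-01-22", "2021-01-21", "2021-01-20"]))

# Unsorted: the diffs are NaT, -1 day, -1 day, so every row counts as a break.
unsorted_groups = (dates.diff() != day).cumsum()

# Sorted ascending: the diffs are NaT, +1 day, +1 day, so only the first row is a break.
sorted_groups = (dates.sort_values().diff() != day).cumsum()

print(unsorted_groups.tolist())  # [1, 2, 3] -> three separate groups
print(sorted_groups.tolist())    # [1, 1, 1] -> one group of consecutive days
```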
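With that in place, the ordering problem is solved. You still have a problem with (id=105, var1=c), because its days are not all consecutive. I fixed this by removing in_block and filt and computing the breaks directly on the sorted dates. The beginning of the for loop then looks like this: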
for k,g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    breaks = dt.diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    ...
With these modifications, I get the following output:
    id var1  val      start        end
0  100    a    0 2021-01-22 2021-01-22
1  100    b    0 2021-01-21 2021-01-22
2  100    d    0 2021-01-22 2021-01-22
3  105    a   13 2021-01-18 2021-01-22
4  105    b   44 2021-01-18 2021-01-22
5  105    c    1 2021-01-18 2021-01-18
6  105    c    1 2021-01-20 2021-01-22
7  105    d   22 2021-01-18 2021-01-22
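For reference, the same sort / diff / cumsum idea can probably be written without an explicit Python loop. The following is only a sketch: the function name collapse_consecutive_days and its parameters are my own, not from the question, and it assumes there is at most one row per (id, var1, day), since a repeated date would start a new run. The sample frame holds just the (id=105, var1=c) rows from the question.

```python
import pandas as pd


def collapse_consecutive_days(df, keys=("id", "var1"), date_col="dates", val_col="val"):
    """Collapse runs of consecutive days per key group into one row each (hypothetical helper)."""
    keys = list(keys)
    ordered = df.sort_values(keys + [date_col])
    # Gap to the previous row of the same group; the first row of each group gets NaT.
    gap = ordered.groupby(keys)[date_col].diff()
    # A new run starts wherever the gap is not exactly one day (NaT also counts as a start).
    run_id = (gap != pd.Timedelta(days=1)).cumsum().rename("run")
    return (
        ordered.groupby(keys + [run_id])
        .agg(val=(val_col, "sum"), start=(date_col, "min"), end=(date_col, "max"))
        .reset_index()
        .drop(columns="run")
    )


# The (id=105, var1=c) rows from the question: 2021-01-19 is missing.
sample = pd.DataFrame(
    {
        "id": [105, 105, 105, 105],
        "var1": ["c", "c", "c", "c"],
        "val": [0, 0, 1, 1],
        "dates": pd.to_datetime(["2021-01-22", "2021-01-21", "2021-01-20", "2021-01-18"]),
    }
)
result = collapse_consecutive_days(sample)
print(result)
```

On this sample it yields two rows, one for 2021-01-18 alone and one for 2021-01-20 to 2021-01-22, which matches rows 5 and 6 of the output above.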
Does this answer your question? Please post the desired output.