Python: keeping only consecutive days in a DataFrame


I want to keep only the rows of a DataFrame that fall on consecutive days for each person.

Suppose my DataFrame is defined as:

import pandas as pd

dic = {'name':['John','John','John','Susan','Susan','Susan','Susan','Mike',
               'Mike','Mike'],
       'worked':['2020-03-12','2020-03-13','2020-03-15','2020-03-16',
                 '2020-03-18','2020-03-19','2020-03-20','2020-03-31',
                 '2020-03-29','2020-04-01'],
       'paid':[100,200,300,400,500,100,200,300,400,500]}
df = pd.DataFrame(dic)
df['worked'] = pd.to_datetime(df['worked'])
print(df)
which prints:

    name     worked  paid
0   John 2020-03-12   100
1   John 2020-03-13   200
2   John 2020-03-15   300
3  Susan 2020-03-16   400
4  Susan 2020-03-18   500
5  Susan 2020-03-19   100
6  Susan 2020-03-20   200
7   Mike 2020-03-31   300
8   Mike 2020-03-29   400
9   Mike 2020-04-01   500
My expected output looks like this:

    name     worked  paid
0   John 2020-03-12   100
1   John 2020-03-13   200
2  Susan 2020-03-18   500
3  Susan 2020-03-19   100
4  Susan 2020-03-20   200
5   Mike 2020-03-31   300
6   Mike 2020-04-01   500
My approach:

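df['worked'] = pd.to_datetime(df['worked'])
# Sort so that each person's rows are in date order.
df = df.sort_values(['name','worked'])
period = pd.to_timedelta('1 day')

groups = df.groupby('name')
# Gap to the same person's previous row, and to their next row.
s1 = df['worked'] - groups['worked'].shift()
s2 = groups['worked'].shift(-1) - df['worked']

# Keep rows that are exactly one day after the previous or before the next row.
df[(s1==period)|(s2==period)].sort_index()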
Output:

    name     worked  paid
0   John 2020-03-12   100
1   John 2020-03-13   200
4  Susan 2020-03-18   500
5  Susan 2020-03-19   100
6  Susan 2020-03-20   200
7   Mike 2020-03-31   300
9   Mike 2020-04-01   500
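
For reuse, the shift-based approach above can be wrapped in a small function. The following is only a sketch: the function name `keep_consecutive_days` and its parameters `group_col` and `date_col` are hypothetical and do not come from the original post. It assumes the date column is already a datetime column and that each person has at most one row per day.

```python
import pandas as pd


def keep_consecutive_days(df, group_col="name", date_col="worked"):
    """Keep rows whose date is exactly one day after the previous row,
    or one day before the next row, of the same group.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # Sort so that each group's rows are in date order.
    ordered = df.sort_values([group_col, date_col])
    one_day = pd.Timedelta(days=1)
    dates = ordered.groupby(group_col)[date_col]
    # Comparisons against NaT (first/last row of a group) evaluate to False.
    after_prev = (ordered[date_col] - dates.shift()) == one_day
    before_next = (dates.shift(-1) - ordered[date_col]) == one_day
    # Restore the original row order.
    return ordered[after_prev | before_next].sort_index()


df = pd.DataFrame({
    "name": ["John", "John", "John", "Susan", "Susan", "Susan", "Susan",
             "Mike", "Mike", "Mike"],
    "worked": ["2020-03-12", "2020-03-13", "2020-03-15", "2020-03-16",
               "2020-03-18", "2020-03-19", "2020-03-20", "2020-03-31",
               "2020-03-29", "2020-04-01"],
    "paid": [100, 200, 300, 400, 500, 100, 200, 300, 400, 500],
})
df["worked"] = pd.to_datetime(df["worked"])

result = keep_consecutive_days(df)
print(result)
```

Because the function sorts a copy and selects rows from it, the caller's DataFrame is left unchanged and the original index labels are preserved in the result.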

Here is a solution, broken into several steps for clarity:

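# Sort by person and date so each person's rows are in date order.
df = df.sort_values(["name", "worked"])
# Previous working day of the same person (NaT for each person's first row).
df["prev_day"] = df.groupby("name")["worked"].shift()
df["days_diff"] = (df.worked - df.prev_day).dt.days
# Shift per person so a gap never leaks in from another person's row.
df["next_day_diff"] = df.groupby("name")["days_diff"].shift(-1)
# Keep rows one day after the previous row or one day before the next row.
df = df[(df.days_diff == 1) | (df.next_day_diff == 1)]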
The result keeps the intermediate columns (prev_day, days_diff, next_day_diff), again for clarity.


My two cents, using diff:

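df = df.sort_values(['name','worked'])
# True where a row is exactly one day after the same person's previous row.
c = df.groupby("name")['worked'].diff().dt.days.eq(1)
# Also keep the row just before each True; fill_value keeps the mask boolean.
df[c | c.shift(-1, fill_value=False)].sort_index()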


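Why are the two Mike rows in the expected output?
Because April 1 comes right after March 31.
I see that your data isn't sorted by date. I've updated my answer: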
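df = df.sort_values(['name','worked'])
# True where a row is exactly one day after the same person's previous row.
c = df.groupby("name")['worked'].diff().dt.days.eq(1)
# Also keep the row just before each True; fill_value keeps the mask boolean.
df[c | c.shift(-1, fill_value=False)].sort_index()
Output: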
    name     worked  paid
0   John 2020-03-12   100
1   John 2020-03-13   200
4  Susan 2020-03-18   500
5  Susan 2020-03-19   100
6  Susan 2020-03-20   200
7   Mike 2020-03-31   300
9   Mike 2020-04-01   500
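
If a minimum streak length is needed (say, at least three days in a row), the same diff idea can be extended by labelling each run of consecutive days and measuring its length. This is a sketch under assumptions, not something taken from the answers above: the name `keep_streaks` and the `min_run` parameter are hypothetical, and it assumes one row per person per day (a duplicated date would break a streak).

```python
import pandas as pd


def keep_streaks(df, min_run=2, group_col="name", date_col="worked"):
    """Keep rows belonging to a run of at least `min_run` consecutive days.

    Hypothetical helper for illustration; not part of the original answers.
    """
    ordered = df.sort_values([group_col, date_col])
    # Gap in days to the same person's previous row (NaN for their first row).
    gap = ordered.groupby(group_col)[date_col].diff().dt.days
    # A new streak starts wherever the gap is not exactly one day.
    # NaN != 1, so each person's first row always starts a new streak.
    streak_id = gap.ne(1).cumsum()
    # Number of rows in each streak, broadcast back onto every row.
    streak_len = ordered.groupby(streak_id)[date_col].transform("size")
    return ordered[streak_len >= min_run].sort_index()


df = pd.DataFrame({
    "name": ["John", "John", "John", "Susan", "Susan", "Susan", "Susan",
             "Mike", "Mike", "Mike"],
    "worked": ["2020-03-12", "2020-03-13", "2020-03-15", "2020-03-16",
               "2020-03-18", "2020-03-19", "2020-03-20", "2020-03-31",
               "2020-03-29", "2020-04-01"],
    "paid": [100, 200, 300, 400, 500, 100, 200, 300, 400, 500],
})
df["worked"] = pd.to_datetime(df["worked"])

pairs_or_longer = keep_streaks(df, min_run=2)
three_or_longer = keep_streaks(df, min_run=3)
print(pairs_or_longer)
print(three_or_longer)
```

With `min_run=2` this selects the same rows as the answers above; with `min_run=3` only Susan's three-day run (March 18 to 20) remains.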