Python: keep consecutive days in a DataFrame
I want to keep only the entries in a DataFrame that fall on consecutive days for each person. Suppose my DataFrame is defined as:
import pandas as pd

dic = {'name': ['John', 'John', 'John', 'Susan', 'Susan', 'Susan', 'Susan',
                'Mike', 'Mike', 'Mike'],
       'worked': ['2020-03-12', '2020-03-13', '2020-03-15', '2020-03-16',
                  '2020-03-18', '2020-03-19', '2020-03-20', '2020-03-31',
                  '2020-03-29', '2020-04-01'],
       'paid': [100, 200, 300, 400, 500, 100, 200, 300, 400, 500]}
df = pd.DataFrame(dic)
# Parse the date strings so that date arithmetic works
df['worked'] = pd.to_datetime(df['worked'])
print(df)
which prints:
    name     worked  paid
0   John 2020-03-12   100
1   John 2020-03-13   200
2   John 2020-03-15   300
3  Susan 2020-03-16   400
4  Susan 2020-03-18   500
5  Susan 2020-03-19   100
6  Susan 2020-03-20   200
7   Mike 2020-03-31   300
8   Mike 2020-03-29   400
9   Mike 2020-04-01   500
My expected output looks like this:
    name     worked  paid
0   John 2020-03-12   100
1   John 2020-03-13   200
2  Susan 2020-03-18   500
3  Susan 2020-03-19   100
4  Susan 2020-03-20   200
5   Mike 2020-03-31   300
6   Mike 2020-04-01   500
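My approach:
df['worked'] = pd.to_datetime(df['worked'])
# Sort within each person so that shift() compares neighbouring dates
df = df.sort_values(['name', 'worked'])
period = pd.to_timedelta('1 day')
groups = df.groupby('name')
# Gap to the same person's previous entry
s1 = df['worked'] - groups['worked'].shift()
# Gap to the same person's next entry
s2 = groups['worked'].shift(-1) - df['worked']
# Keep a row if either neighbour is exactly one day away
df[(s1 == period) | (s2 == period)].sort_index()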
Output:
    name     worked  paid
0   John 2020-03-12   100
1   John 2020-03-13   200
4  Susan 2020-03-18   500
5  Susan 2020-03-19   100
6  Susan 2020-03-20   200
7   Mike 2020-03-31   300
9   Mike 2020-04-01   500
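The filter above keeps every row that has a neighbour exactly one day away, which amounts to keeping runs of two or more consecutive days per person. If a longer minimum run were ever needed, one possible variation is to label each run and filter on its length. This is only a sketch of a different technique (run labelling with cumsum), not the approach used above; the function name keep_runs and the min_run parameter are hypothetical, and it assumes at most one row per person per day.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John"] * 3 + ["Susan"] * 4 + ["Mike"] * 3,
    "worked": pd.to_datetime([
        "2020-03-12", "2020-03-13", "2020-03-15", "2020-03-16", "2020-03-18",
        "2020-03-19", "2020-03-20", "2020-03-31", "2020-03-29", "2020-04-01",
    ]),
    "paid": [100, 200, 300, 400, 500, 100, 200, 300, 400, 500],
})


def keep_runs(frame, min_run=2):
    """Keep rows that belong to a per-name run of at least min_run consecutive days.

    Hypothetical helper; assumes one row per person per day.
    """
    out = frame.sort_values(["name", "worked"])
    # Gap in days to the same person's previous entry (NaN on their first row)
    gap = out.groupby("name")["worked"].diff().dt.days
    # A new run starts wherever the gap is not exactly one day
    run_id = gap.ne(1).cumsum()
    # Number of rows in the run that each row belongs to
    run_len = out.groupby(run_id)["worked"].transform("size")
    return out[run_len >= min_run].sort_index()


print(keep_runs(df))             # runs of 2+ days
print(keep_runs(df, min_run=3))  # runs of 3+ days
```

With min_run=2 this should select the same rows as the filter above; with min_run=3 only Susan's three-day run remains.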
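Here is a solution, split into several steps for clarity:
# Sort within each person so that shift() compares neighbouring dates
df = df.sort_values(["name", "worked"])
df["prev_day"] = df.groupby("name")["worked"].shift()
df["days_diff"] = (df.worked - df.prev_day).dt.days
# Take the next row's gap per person, so it never leaks across names
df["next_day_diff"] = df.groupby("name")["days_diff"].shift(-1)
df = df[(df.days_diff == 1) | (df.next_day_diff == 1)]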
The result keeps the intermediate columns (prev_day, days_diff, next_day_diff), again for clarity.
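To inspect the intermediate columns on the sample data, a self-contained sketch along these lines could be run. The variable names steps, by_name and kept are my own, and sorting by name before date as well as computing next_day_diff per person are assumptions made here so that rows of different people are never compared.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John"] * 3 + ["Susan"] * 4 + ["Mike"] * 3,
    "worked": pd.to_datetime([
        "2020-03-12", "2020-03-13", "2020-03-15", "2020-03-16", "2020-03-18",
        "2020-03-19", "2020-03-20", "2020-03-31", "2020-03-29", "2020-04-01",
    ]),
    "paid": [100, 200, 300, 400, 500, 100, 200, 300, 400, 500],
})

# Assumption: sort within each person so shifts compare neighbouring dates
steps = df.sort_values(["name", "worked"]).copy()
by_name = steps.groupby("name")["worked"]

# Previous working day of the same person (NaT on their first row)
steps["prev_day"] = by_name.shift()
# Gap in days to the previous entry
steps["days_diff"] = (steps["worked"] - steps["prev_day"]).dt.days
# Gap in days to the same person's next entry
steps["next_day_diff"] = (by_name.shift(-1) - steps["worked"]).dt.days

# Keep a row if either neighbour is exactly one day away
kept = steps[(steps["days_diff"] == 1) | (steps["next_day_diff"] == 1)].sort_index()
print(kept)
```

Printing steps instead of kept shows the helper columns for every row, including the ones that get filtered out.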
My 2 cents, using diff:
df = df.sort_values(['name','worked'])
c = df.groupby("name")['worked'].diff().dt.days.eq(1)
df[c|c.shift(-1)].sort_index()
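Why does the expected output contain two rows for Mike?
Because April 1 comes right after March 31. I noticed that your data is not sorted by date, so I updated my answer: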
df = df.sort_values(['name','worked'])
c = df.groupby("name")['worked'].diff().dt.days.eq(1)
df[c|c.shift(-1)].sort_index()
    name     worked  paid
0   John 2020-03-12   100
1   John 2020-03-13   200
4  Susan 2020-03-18   500
5  Susan 2020-03-19   100
6  Susan 2020-03-20   200
7   Mike 2020-03-31   300
9   Mike 2020-04-01   500
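To illustrate why the sort matters for unsorted input such as Mike's rows, here is a small sketch. The helper name one_day_after_previous is hypothetical and only wraps the diff-based check used above.

```python
import pandas as pd

# Mike's rows in their original (unsorted) order
mike = pd.DataFrame({
    "name": ["Mike", "Mike", "Mike"],
    "worked": pd.to_datetime(["2020-03-31", "2020-03-29", "2020-04-01"]),
})


def one_day_after_previous(frame):
    """True where a row is exactly one day after the previous row of the same name."""
    return frame.groupby("name")["worked"].diff().dt.days.eq(1)


# Without sorting, the gaps are NaN, -2 and +3 days, so nothing is flagged
unsorted_flags = one_day_after_previous(mike)
# After sorting, the gaps are NaN, 2 and 1 days, so 2020-04-01 is flagged
sorted_flags = one_day_after_previous(mike.sort_values(["name", "worked"])).sort_index()

print(unsorted_flags.tolist())  # [False, False, False]
print(sorted_flags.tolist())    # [False, False, True]
```

The 2020-03-31 row is then picked up by the shift(-1) half of the filter, because the row after it is flagged.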