Python: deleting rows and manipulating dates in pandas by identical column values
For a table of the following kind:
Name Date Score
James 02/2011 70
James 03/2011 72
James 10/2011 60
James 12/2011 50
James 01/2012 40
James 02/2012 60
James 03/2012 75
James 11/2012 70
James 12/2012 70
James 01/2013 30
James 02/2013 20
James 04/2013 60
James 06/2013 80
Jacob 01/2011 70
Jacob 02/2011 70
Jacob 03/2011 60
Jacob 04/2011 80
Jacob 05/2011 70
Jacob 06/2011 70
Jacob 08/2011 70
Jacob 10/2011 60
Jacob 11/2011 60
Jacob 12/2011 70
Jacob 02/2012 80
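I want to apply the following rules per name:
1) If the difference between consecutive dates is more than 6 months, delete the earlier row and all rows before it
So, for James, in the first 3 rows: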
James 02/2011 70
James 03/2011 72 (difference between 03/2011, 10/2011 is 7 months)
James 10/2011 60
Here we delete the first two rows, and the resulting rows are
James 10/2011 60
James 12/2011 50
James 01/2012 40
James 02/2012 60
James 03/2012 75
James 11/2012 70
James 12/2012 70
James 01/2013 30
James 02/2013 20
James 04/2013 60
James 06/2013 80
But then here
James 10/2011 60
James 12/2011 50
James 01/2012 40
James 02/2012 60
James 03/2012 75 (difference between 03/2012, 11/2012 is 8 months)
James 11/2012 70
James 12/2012 70
James 01/2013 30
James 02/2013 20
James 04/2013 60
James 06/2013 80
so we get rid of
James 10/2011 60
James 12/2011 50
James 01/2012 40
James 02/2012 60
James 03/2012 75
and are left with only
James 11/2012 70
James 12/2012 70
James 01/2013 30
James 02/2013 20
James 04/2013 60
James 06/2013 80
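Rule 1 can be sketched for James alone as follows. This is a minimal illustration rather than a definitive implementation: it assumes the rows are already sorted by date, and the names `month_no`, `segment` and `james_kept` are made up for the example. Gaps are counted in whole calendar months by turning each date into a month number.

```python
import pandas as pd

# James's rows from the sample table
james = pd.DataFrame({
    "Name": "James",
    "Date": ["02/2011", "03/2011", "10/2011", "12/2011", "01/2012",
             "02/2012", "03/2012", "11/2012", "12/2012", "01/2013",
             "02/2013", "04/2013", "06/2013"],
    "Score": [70, 72, 60, 50, 40, 60, 75, 70, 70, 30, 20, 60, 80],
})

dates = pd.to_datetime(james["Date"], format="%m/%Y")

# Month number (year * 12 + month), so differences are whole calendar months
month_no = dates.dt.year * 12 + dates.dt.month

# A new segment starts wherever the gap to the previous row exceeds 6 months
segment = month_no.diff().gt(6).cumsum()

# Rule 1 keeps only the last segment
james_kept = james[segment == segment.max()]
print(james_kept["Date"].tolist())
# ['11/2012', '12/2012', '01/2013', '02/2013', '04/2013', '06/2013']
```

Because each gap larger than 6 months bumps the running count, keeping the rows with the highest count is the same as repeatedly dropping everything before the most recent large gap.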
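2) After applying 1) to all names, change the dates so that the last date stays the same and the difference between consecutive dates is always 1 month
So, for Jacob, we originally had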
Jacob 01/2011 70
Jacob 02/2011 70
Jacob 03/2011 60
Jacob 04/2011 80
Jacob 05/2011 70
Jacob 06/2011 70
Jacob 08/2011 70
Jacob 10/2011 60
Jacob 11/2011 60
Jacob 12/2011 70
Jacob 02/2012 80
but the resulting rows would be
Jacob 04/2011 70
Jacob 05/2011 70
Jacob 06/2011 60
Jacob 07/2011 80
Jacob 08/2011 70
Jacob 09/2011 70
Jacob 10/2011 70
Jacob 11/2011 60
Jacob 12/2011 60
Jacob 01/2012 70
Jacob 02/2012 80
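Rule 2 for Jacob can be sketched like this, assuming the rows are already in date order. The idea is to build a monthly range that ends at the last original date and has as many entries as there are rows; `jacob_shifted` is a name chosen for the example only.

```python
import pandas as pd

# Jacob's rows from the sample table
jacob = pd.DataFrame({
    "Name": "Jacob",
    "Date": ["01/2011", "02/2011", "03/2011", "04/2011", "05/2011",
             "06/2011", "08/2011", "10/2011", "11/2011", "12/2011",
             "02/2012"],
    "Score": [70, 70, 60, 80, 70, 70, 70, 60, 60, 70, 80],
})

# The last date stays fixed; every earlier date is re-spaced backwards
last = pd.to_datetime(jacob["Date"], format="%m/%Y").max()

# One month-start timestamp per row, ending at the last date
new_dates = pd.date_range(end=last, periods=len(jacob), freq="MS")

# Scores keep their order; only the dates are replaced
jacob_shifted = jacob.assign(Date=new_dates.strftime("%m/%Y"))
print(jacob_shifted.to_string(index=False))
```

The scores are untouched, so the first score (70) now sits on 04/2011 and the last (80) stays on 02/2012.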
So the resulting table I want is:
Name Date Score
James 01/2013 70
James 02/2013 70
James 03/2013 30
James 04/2013 20
James 05/2013 60
James 06/2013 80
Jacob 04/2011 70
Jacob 05/2011 70
Jacob 06/2011 60
Jacob 07/2011 80
Jacob 08/2011 70
Jacob 09/2011 70
Jacob 10/2011 70
Jacob 11/2011 60
Jacob 12/2011 60
Jacob 01/2012 70
Jacob 02/2012 80
Any help would be greatly appreciated.
I have broken the solution down into steps.
import numpy as np
import pandas as pd

# convert to datetime
df.Date = pd.to_datetime(df.Date, format='%m/%Y')
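Just played around with some numbers, but it doesn't work for some dates.
@user98235 Hmm, then your sample data doesn't reproduce your problem.
It works for my sample data, but there are some problems with new data. I think your code gives me enough hints. I'll do my best to work it out. Thanks.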
# create a new key for the groupby; we then just need the last group per name
df['Newkey'] = df.groupby('Name').apply(lambda x: (x.Date.diff() / np.timedelta64(6, 'M')).gt(1).cumsum()).sort_index(level=1).values
# keep only the last subgroup for each name (copy so a new column can be added safely)
df1 = df.loc[df.Newkey == df.groupby('Name').Newkey.transform('max')].copy()
# build the new dates from the length of each group and that group's max date
df1['NewDate'] = np.concatenate(df1.groupby('Name', sort=False).apply(lambda x: pd.date_range(end=x['Date'].max(), periods=len(x['Date']), freq='MS').strftime('%m/%Y')).values)
df1
Out[281]:
Name Date Score Newkey NewDate
7 James 2012-11-01 70 2 01/2013
8 James 2012-12-01 70 2 02/2013
9 James 2013-01-01 30 2 03/2013
10 James 2013-02-01 20 2 04/2013
11 James 2013-04-01 60 2 05/2013
12 James 2013-06-01 80 2 06/2013
13 Jacob 2011-01-01 70 0 04/2011
14 Jacob 2011-02-01 70 0 05/2011
15 Jacob 2011-03-01 60 0 06/2011
16 Jacob 2011-04-01 80 0 07/2011
17 Jacob 2011-05-01 70 0 08/2011
18 Jacob 2011-06-01 70 0 09/2011
19 Jacob 2011-08-01 70 0 10/2011
20 Jacob 2011-10-01 60 0 11/2011
21 Jacob 2011-11-01 60 0 12/2011
22 Jacob 2011-12-01 70 0 01/2012
23 Jacob 2012-02-01 80 0 02/2012
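The comments mention that the approach fails for some dates. One possible reason is that `np.timedelta64(6, 'M')` treats a month as an average length rather than a calendar month, and newer pandas versions may not accept that unit in this division at all. The sketch below is an alternative that counts calendar months with integer month numbers instead. It is not the answerer's code: `trim_and_respace` and `max_gap_months` are hypothetical names, and it assumes the rows are sorted by date within each name.

```python
import pandas as pd

def trim_and_respace(df, max_gap_months=6):
    """Hypothetical helper: apply rule 1, then rule 2, per Name.

    Assumes rows are already sorted by date within each name.
    """
    out = df.copy()
    dates = pd.to_datetime(out["Date"], format="%m/%Y")
    # Month number (year * 12 + month): differences are calendar months
    month_no = dates.dt.year * 12 + dates.dt.month

    # Rule 1: count the large gaps seen so far within each name
    gap = month_no.groupby(out["Name"]).diff()
    big_gap = gap.gt(max_gap_months).astype(int)
    out["Newkey"] = big_gap.groupby(out["Name"]).cumsum()
    # ... and keep only the rows after the last large gap
    is_last = out["Newkey"] == out.groupby("Name")["Newkey"].transform("max")
    out = out[is_last].copy()

    # Rule 2: count back one month per row from each name's last date
    last = month_no.loc[out.index].groupby(out["Name"]).transform("max")
    rows_after = out.groupby("Name").cumcount(ascending=False)
    new_no = last - rows_after
    year = (new_no - 1) // 12
    month = (new_no - 1) % 12 + 1
    out["NewDate"] = month.astype(str).str.zfill(2) + "/" + year.astype(str)
    return out

# Sample data from the question
rows = """James 02/2011 70
James 03/2011 72
James 10/2011 60
James 12/2011 50
James 01/2012 40
James 02/2012 60
James 03/2012 75
James 11/2012 70
James 12/2012 70
James 01/2013 30
James 02/2013 20
James 04/2013 60
James 06/2013 80
Jacob 01/2011 70
Jacob 02/2011 70
Jacob 03/2011 60
Jacob 04/2011 80
Jacob 05/2011 70
Jacob 06/2011 70
Jacob 08/2011 70
Jacob 10/2011 60
Jacob 11/2011 60
Jacob 12/2011 70
Jacob 02/2012 80"""
df = pd.DataFrame([r.split() for r in rows.splitlines()],
                  columns=["Name", "Date", "Score"])
df["Score"] = df["Score"].astype(int)

result = trim_and_respace(df)
print(result[["Name", "NewDate", "Score"]].to_string(index=False))
```

On the sample data this gives the same `NewDate` column as the output above. If the real data is not sorted, sorting by name and parsed date first would be needed for the gap logic to make sense.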