Python: subset by group, computing a new column within each group via transform

I have a set of categories and activity dates. For each record, I want the latest prior date within that record's category. The following assigns the simple maximum to each group:
import pandas as pd

dates = pd.date_range('2013-02', '2013-03', freq='D').values[0:10]
df = pd.DataFrame({'category': ['foo','foo','foo','foo','foo',
'bar','bar','bar','bar','bar']
})
df['date'] = dates
df['latest'] = df.groupby(['category'])['date'].transform(max)
What I need is the maximum of the dates in the group that are strictly less than the record's date. I could do this easily in SQL or with ddply, but I haven't found a way to further subset a group in pandas. Thanks.
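As an aside, in recent pandas (an assumption about the reader's version) the per-category previous date asked for here can be sketched directly with `groupby().shift()`, which pulls the prior row's value within each group:

```python
import pandas as pd

# Example data matching the question: two categories, dates sorted within each
df = pd.DataFrame({
    'category': ['foo'] * 5 + ['bar'] * 5,
    'date': pd.date_range('2013-02-26', periods=10, freq='D'),
})

# shift(1) within each category takes the previous row's date;
# the first row of each group gets NaT (assumes rows are pre-sorted by date)
df['previous'] = df.groupby('category')['date'].shift(1)
```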
Edit: per the comments, the output I want looks like this:
category date previous
foo 2013-02-26 NA
foo 2013-02-27 2013-02-26
foo 2013-02-28 2013-02-27
foo 2013-03-01 2013-02-28
foo 2013-03-02 2013-03-01
bar 2013-03-03 NA
bar 2013-03-04 2013-03-03
bar 2013-03-05 2013-03-04
bar 2013-03-06 2013-03-05
etc.

I think you need an expanding_max function:
In [26]: df['latest'] = df.groupby(['category'])['date'].apply(pd.expanding_max)
In [27]: df
Out[27]:
category date latest
0 foo 2013-02-27 1.361923e+18
1 foo 2013-02-28 1.362010e+18
2 foo 2013-03-01 1.362096e+18
3 foo 2013-03-02 1.362182e+18
4 foo 2013-03-03 1.362269e+18
5 bar 2013-03-04 1.362355e+18
6 bar 2013-03-05 1.362442e+18
7 bar 2013-03-06 1.362528e+18
8 bar 2013-03-07 1.362614e+18
9 bar 2013-03-08 1.362701e+18
[10 rows x 3 columns]
and recast to datetime:

In [29]: df['latest'] = pd.to_datetime(df['latest'])
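Note that `pd.expanding_max` was removed in later pandas releases. A running group maximum that stays in datetime dtype (so no `to_datetime` recast is needed) can be sketched with `GroupBy.cummax` instead; a minimal example, assuming a recent pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['foo'] * 5 + ['bar'] * 5,
    'date': pd.date_range('2013-02-27', periods=10, freq='D'),
})

# cummax computes the running maximum within each category and keeps
# the datetime64 dtype, so no recasting is required afterwards
df['latest'] = df.groupby('category')['date'].cummax()
```

Because the example dates already increase within each category, the running maximum here simply equals each row's own date.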
That gives the maximum over dates less than or equal to the record's date, though.

This seems like the long way around, but here is what I came up with:
import numpy as np
import pandas as pd
from random import sample

dates = pd.date_range('2013-02', '2013-03', freq='D')
# create a random index (range replaces the Python 2 xrange)
rindex = np.array(sample(range(len(dates)), 10))
# get 10 random dates
dates = dates[rindex]
df = pd.DataFrame({'category': ['foo','foo','foo','foo','foo',
'bar','bar','bar','bar','bar']
})
df['date'] = dates
df = df.sort_values(['category', 'date'])
df['dateseq'] = df.groupby('category')['date'].rank().astype(int) - 1
df
# Increment each rank number by one to get the rank number of the next date
# in the group. Final-day records will get numbers that don't join, which is
# what we want.
prevdates = df[['category', 'date', 'dateseq']].copy()
prevdates['dateseq'] = prevdates['dateseq'] + 1
prevdates = prevdates.rename(columns={'date': 'prev_date'})
prevdates
# Use merge to join the two tables, with the category and sequence number as keys.
df = pd.merge(df, prevdates, how='left', on=['category', 'dateseq'], sort=True)
Can you provide the output you expect from the above data?
That's correct, but I need the maximum of the dates less than the row's date. I've looked at the documentation for expanding_max, but I can't see how to use it. Thanks.
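An alternative to the rank-and-merge approach above is `pd.merge_asof`, which performs exactly this kind of "latest value strictly before" lookup. A sketch on the example data (assumptions: both frames are sorted on their date key, and `allow_exact_matches=False` excludes each row's own date):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['foo'] * 5 + ['bar'] * 5,
    'date': pd.date_range('2013-02-26', periods=10, freq='D'),
})

# right-hand lookup table: same rows, with the date renamed to prev_date
lookup = df.rename(columns={'date': 'prev_date'})

# A backward asof-merge within each category finds, for every row, the
# largest prev_date strictly earlier than the row's date (NaT if none)
out = pd.merge_asof(
    df.sort_values('date'),
    lookup.sort_values('prev_date'),
    left_on='date', right_on='prev_date',
    by='category',
    allow_exact_matches=False,
    direction='backward',
)
```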