Python: subset by group, computing a new column within each group via transform

I have a set of categories and activity dates. For each record, I want the latest prior date within that record's category. The following assigns the simple maximum to each group:
import pandas as pd

dates = pd.date_range('2013-02', '2013-03', freq='D').values[0:10]
df = pd.DataFrame({'category': ['foo','foo','foo','foo','foo',
'bar','bar','bar','bar','bar']
})
df['date'] = dates
df['latest'] = df.groupby(['category'])['date'].transform(max)
What I need is the maximum of the dates in the group that are strictly less than the record's date. I could do this easily in SQL or with ddply, but I haven't found a way to further subset a group in pandas. Thanks.
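As an aside, in recent pandas (an assumption about the reader's version) the per-category previous date asked for here can be sketched directly with `groupby().shift()`, which pulls the prior row's value within each group:

```python
import pandas as pd

# Example data matching the question: two categories, dates sorted within each
df = pd.DataFrame({
    'category': ['foo'] * 5 + ['bar'] * 5,
    'date': pd.date_range('2013-02-26', periods=10, freq='D'),
})

# shift(1) within each category takes the previous row's date;
# the first row of each group gets NaT (assumes rows are pre-sorted by date)
df['previous'] = df.groupby('category')['date'].shift(1)
```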
Edit: per the comments, the output I want looks like this:
category date previous
foo 2013-02-26 NA
foo 2013-02-27 2013-02-26
foo 2013-02-28 2013-02-27
foo 2013-03-01 2013-02-28
foo 2013-03-02 2013-03-01
bar 2013-03-03 NA
bar 2013-03-04 2013-03-03
bar 2013-03-05 2013-03-04
bar 2013-03-06 2013-03-05
etc.

I think you need an expanding_max function:
In [26]: df['latest'] = df.groupby(['category'])['date'].apply(pd.expanding_max)
In [27]: df
Out[27]:
category date latest
0 foo 2013-02-27 1.361923e+18
1 foo 2013-02-28 1.362010e+18
2 foo 2013-03-01 1.362096e+18
3 foo 2013-03-02 1.362182e+18
4 foo 2013-03-03 1.362269e+18
5 bar 2013-03-04 1.362355e+18
6 bar 2013-03-05 1.362442e+18
7 bar 2013-03-06 1.362528e+18
8 bar 2013-03-07 1.362614e+18
9 bar 2013-03-08 1.362701e+18
[10 rows x 3 columns]
and recast to datetime:

In [29]: df['latest'] = pd.to_datetime(df['latest'])
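Note that `pd.expanding_max` was removed in later pandas releases. A running group maximum that stays in datetime dtype (so no `to_datetime` recast is needed) can be sketched with `GroupBy.cummax` instead; a minimal example, assuming a recent pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['foo'] * 5 + ['bar'] * 5,
    'date': pd.date_range('2013-02-27', periods=10, freq='D'),
})

# cummax computes the running maximum within each category and keeps
# the datetime64 dtype, so no recasting is required afterwards
df['latest'] = df.groupby('category')['date'].cummax()
```

Because the example dates already increase within each category, the running maximum here simply equals each row's own date.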
That gives the maximum over dates less than or equal to the record's date, though.

This seems like the long way around, but here is what I came up with:
import numpy as np
import pandas as pd
from random import sample

dates = pd.date_range('2013-02', '2013-03', freq='D')
# create a random index (range replaces the Python 2 xrange)
rindex = np.array(sample(range(len(dates)), 10))
# get 10 random dates
dates = dates[rindex]
df = pd.DataFrame({'category': ['foo','foo','foo','foo','foo',
'bar','bar','bar','bar','bar']
})
df['date'] = dates
df = df.sort_values(['category', 'date'])
df['dateseq'] = df.groupby('category')['date'].rank().astype(int) - 1
df
# Increment each rank number by one to get the rank number of the next date
# in the group. Final-day records will get numbers that don't join, which is
# what we want.
prevdates = df[['category', 'date', 'dateseq']].copy()
prevdates['dateseq'] = prevdates['dateseq'] + 1
prevdates = prevdates.rename(columns={'date': 'prev_date'})
prevdates
# Use merge to join the two tables, with the category and sequence number as keys.
df = pd.merge(df, prevdates, how='left', on=['category', 'dateseq'], sort=True)
Can you provide the output you expect from the above data?
That's correct, but I need the maximum of the dates less than the row's date. I've looked at the documentation for expanding_max, but I can't see how to use it. Thanks.
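An alternative to the rank-and-merge approach above is `pd.merge_asof`, which performs exactly this kind of "latest value strictly before" lookup. A sketch on the example data (assumptions: both frames are sorted on their date key, and `allow_exact_matches=False` excludes each row's own date):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['foo'] * 5 + ['bar'] * 5,
    'date': pd.date_range('2013-02-26', periods=10, freq='D'),
})

# right-hand lookup table: same rows, with the date renamed to prev_date
lookup = df.rename(columns={'date': 'prev_date'})

# A backward asof-merge within each category finds, for every row, the
# largest prev_date strictly earlier than the row's date (NaT if none)
out = pd.merge_asof(
    df.sort_values('date'),
    lookup.sort_values('prev_date'),
    left_on='date', right_on='prev_date',
    by='category',
    allow_exact_matches=False,
    direction='backward',
)
```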