Python: subsetting within a group and computing a new column in the group via transform (python, pandas, grouping)


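I have a set of categories and activity dates. For each record, I want to assign the previous date within that record's category.

This assigns the simple maximum to each group: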

import pandas as pd

dates = pd.date_range('2013-02', '2013-03', freq='D').values[0:10]
df = pd.DataFrame({'category': ['foo','foo','foo','foo','foo',
                            'bar','bar','bar','bar','bar']
                   })
df['date'] = dates

# broadcast each category's maximum date to every row of that category
df['latest'] = df.groupby(['category'])['date'].transform('max')
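What I need is the maximum date in the record's group that is less than the record's own date.

I can do this easily in SQL or ddply, but I haven't found a way to subset further within a group in pandas.

Thanks.

Edit: per the comments, my desired output looks like this: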

category    date                previous
foo         2013-02-26          NA
foo         2013-02-27          2013-02-26
foo         2013-02-28          2013-02-27
foo         2013-03-01          2013-02-28
foo         2013-03-02          2013-03-01
bar         2013-03-03          NA
bar         2013-03-04          2013-03-03
bar         2013-03-05          2013-03-04
bar         2013-03-06          2013-03-05

etc

I think you need the expanding_max function:

In [26]: df['latest'] = df.groupby(['category'])['date'].apply(pd.expanding_max)

In [27]: df
Out[27]: 
  category       date        latest
0      foo 2013-02-27  1.361923e+18
1      foo 2013-02-28  1.362010e+18
2      foo 2013-03-01  1.362096e+18
3      foo 2013-03-02  1.362182e+18
4      foo 2013-03-03  1.362269e+18
5      bar 2013-03-04  1.362355e+18
6      bar 2013-03-05  1.362442e+18
7      bar 2013-03-06  1.362528e+18
8      bar 2013-03-07  1.362614e+18
9      bar 2013-03-08  1.362701e+18

[10 rows x 3 columns]
And recast to datetime:

In [29]: df['latest'] = pd.to_datetime(df['latest'])


This gives the maximum of the group's dates that are less than or equal to the record's date.
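Note that pd.expanding_max was removed from later pandas releases. A minimal sketch of the same idea on a current pandas, assuming the dates are unique within each category: a per-group running maximum gives the "less than or equal" value, and shifting by one row within each sorted group gives the strictly earlier date the question asks for. The column name `previous` and the sample data are only illustrative.

```python
import pandas as pd

# Illustrative data: ten consecutive days split across two categories.
dates = pd.date_range('2013-02-01', periods=10, freq='D')
df = pd.DataFrame({'category': ['foo'] * 5 + ['bar'] * 5, 'date': dates})

# Sort so that each group's rows are in date order before accumulating.
grouped = df.sort_values(['category', 'date']).groupby('category')['date']

# Running maximum per group: the largest date <= the row's date.
df['latest'] = grouped.cummax()

# Previous row within the sorted group: the largest date < the row's date.
# Assumption: dates are unique within a category. The first row of each
# group has no earlier date and gets NaT.
df['previous'] = grouped.shift(1)
```

Both results keep the original row labels, so assigning them back to `df` aligns on the index even though the grouping was done on a sorted copy.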

This seems like the long way around, but here is what I came up with:

import numpy as np
import pandas as pd

dates = pd.date_range('2013-02', '2013-03', freq='D')
# create random index (10 distinct positions)
rindex = np.random.choice(len(dates), 10, replace=False)
# get 10 random dates
dates = dates[rindex]

df = pd.DataFrame({'category': ['foo','foo','foo','foo','foo',
                            'bar','bar','bar','bar','bar']
                   })
df['date'] = dates
df = df.sort_values(['category', 'date']).reset_index(drop=True)
# zero-based rank of each date within its category
df['dateseq'] = df.groupby('category')['date'].rank().astype(int) - 1
df
# Increment each rank number by one to get the rank number of the next date
# in the group. Final-day records will get numbers that don't join, which is
# what we want.
prevdates = df[['category', 'date', 'dateseq']].copy()
prevdates['dateseq'] = prevdates['dateseq'] + 1

# keep the date under a new name so it can be joined back as the previous date
prevdates = prevdates.rename(columns={'date': 'prev_date'})
prevdates
# Use merge to join the two tables, with the category and sequence number for keys.
df = pd.merge(df, prevdates, how='left', on=['category', 'dateseq'], sort=True)
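As a possible shorter alternative to the rank-and-merge approach above, pandas also provides `pd.merge_asof`, which can look up the nearest earlier key within a group. The sketch below is not from the original answer; the sample data and the `prev_date` column name are assumptions for illustration. With `allow_exact_matches=False`, a row is matched only to a strictly earlier date, so repeated dates in a category do not match themselves.

```python
import pandas as pd

# Illustrative data with unsorted and repeated dates.
df = pd.DataFrame({
    'category': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
    'date': pd.to_datetime(['2013-02-03', '2013-02-01', '2013-02-03',
                            '2013-02-10', '2013-02-07', '2013-02-08']),
})

# merge_asof requires both frames to be sorted by the "on" key.
left = df.sort_values('date')

# Lookup table: one row per (category, date), with the date copied into
# the column that will be carried over by the join.
right = (left[['category', 'date']]
         .drop_duplicates()
         .assign(prev_date=lambda d: d['date']))

# For each row, take the latest date in the same category that is
# strictly earlier than the row's own date (NaT when there is none).
result = pd.merge_asof(left, right, on='date', by='category',
                       allow_exact_matches=False)
```

The result is ordered by date rather than by the original row order, so sort or re-index it afterwards if the original order matters.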

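Can you provide the output you expect from the data above?

This is correct, but I need the maximum of the dates that are less than the row's date. I have looked at the documentation for expanding_max, but I can't figure out how to use it. Thanks.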