Python: random row selection per group in a DataFrame, subject to a boolean condition
Suppose I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'name':['Dave','Lisa','John','Lisa','Simon','Simon','Simon','Simon','Lisa','Dave','Dave','John','Lisa'],
'date': ['2015-01-31 07:14:39','2014-12-16 22:50:55','2015-04-12 23:29:11','2015-04-08 17:57:29','2015-01-30 03:51:12','2015-02-20 10:33:48','2014-12-15 23:54:03','2014-12-16 19:53:53','2014-12-18 00:15:02','2015-04-01 21:36:55','2015-04-13 23:25:55','2015-02-18 14:10:40','2015-02-27 04:56:33']})
DATAFRAME1
date name
0 2015-01-31 07:14:39 Dave
1 2014-12-16 22:50:55 Lisa
2 2015-04-12 23:29:11 John
3 2015-04-08 17:57:29 Lisa
4 2015-01-30 03:51:12 Simon
5 2015-02-20 10:33:48 Simon
6 2014-12-15 23:54:03 Simon
7 2014-12-16 19:53:53 Simon
8 2014-12-18 00:15:02 Lisa
9 2015-04-01 21:36:55 Dave
10 2015-04-13 23:25:55 Dave
11 2015-02-18 14:10:40 John
12 2015-02-27 04:56:33 Lisa
DATAFRAME2
name datemax
0 Dave 2015-04-13 23:25:55
1 John 2015-04-12 23:29:11
2 Lisa 2015-04-08 17:57:29
3 Simon 2015-02-20 10:33:48
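where the "date" and "datemax" columns are populated with datetime objects.
I need to group DATAFRAME1 by "name" and randomly select one of its dates, but the selected date must be earlier than the "datemax" for that name in the second dataframe (DATAFRAME2).
The real dataframes I am working with are much larger than the ones in this example, so I need a fast way to do this.

I would first filter out all the dates that do not meet that criterion (df2a below is presumably df2 indexed by name, e.g. df2.set_index("name")):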
In [11]: df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[11]:
0 2015-04-13 23:25:55
1 2015-04-08 17:57:29
2 2015-04-12 23:29:11
3 2015-04-08 17:57:29
4 2015-02-20 10:33:48
5 2015-02-20 10:33:48
6 2015-02-20 10:33:48
7 2015-02-20 10:33:48
8 2015-04-08 17:57:29
9 2015-04-13 23:25:55
10 2015-04-13 23:25:55
11 2015-04-12 23:29:11
12 2015-04-08 17:57:29
Name: date, dtype: datetime64[ns]
In [12]: df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[12]:
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 True
9 True
10 False
11 True
12 True
Name: date, dtype: bool
In [13]: df_old = df[df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])]
In [14]: df_old
Out[14]:
date name
0 2015-01-31 07:14:39 Dave
1 2014-12-16 22:50:55 Lisa
4 2015-01-30 03:51:12 Simon
6 2014-12-15 23:54:03 Simon
7 2014-12-16 19:53:53 Simon
8 2014-12-18 00:15:02 Lisa
9 2015-04-01 21:36:55 Dave
11 2015-02-18 14:10:40 John
12 2015-02-27 04:56:33 Lisa
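The same mask can probably be built without a per-group lambda, which may matter on large frames. This is a minimal sketch, assuming df2 holds exactly one row per name; the variable name cutoff is introduced here for illustration and does not appear in the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dave", "Lisa", "John", "Lisa", "Simon", "Simon", "Simon",
             "Simon", "Lisa", "Dave", "Dave", "John", "Lisa"],
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-01-30 03:51:12", "2015-02-20 10:33:48",
        "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
        "2015-04-01 21:36:55", "2015-04-13 23:25:55", "2015-02-18 14:10:40",
        "2015-02-27 04:56:33"]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa", "Simon"],
    "datemax": pd.to_datetime([
        "2015-04-13 23:25:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-02-20 10:33:48"]),
})

# Look up each row's cutoff by name; Series.map aligns on the index of the lookup Series.
cutoff = df["name"].map(df2.set_index("name")["datemax"])

# Keep only the rows whose date is strictly before that name's datemax.
df_old = df[df["date"] < cutoff]
```

Names that are missing from df2 would get NaT as their cutoff, so their rows compare False and are dropped.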

My suggestion is:
import random
import pandas as pd
df = pd.DataFrame({'name':['Dave','Lisa','John','Lisa','Simon','Simon','Simon','Simon','Lisa','Dave','Dave','John','Lisa'],'date': ['2015-01-31 07:14:39','2014-12-16 22:50:55','2015-04-12 23:29:11','2015-04-08 17:57:29','2015-01-30 03:51:12','2015-02-20 10:33:48','2014-12-15 23:54:03','2014-12-16 19:53:53','2014-12-18 00:15:02','2015-04-01 21:36:55','2015-04-13 23:25:55','2015-02-18 14:10:40','2015-02-27 04:56:33']})
df['date'] = pd.to_datetime(df['date'])
df2 = pd.DataFrame([['Dave','2015-04-13 23:25:55'],['John','2015-04-12 23:29:11'],['Lisa','2015-04-08 17:57:29'],['Simon','2015-02-20 10:33:48']], columns=['name','datemax'])
df2['datemax'] = pd.to_datetime(df2['datemax'])
# attach each name's datemax to its rows (merges on the shared 'name' column)
df = df.merge(df2,how='left')
grouped = df.groupby('name')
# per name, pick one random date among those strictly before datemax
grouped.apply(lambda x: random.choice([a for a in x['date'].values if a<x['datemax'].values[0]]))
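One thing to watch with random.choice is that it raises IndexError on an empty list, which would happen for any name that has no date before its datemax. The sketch below is one possible guard, not part of the suggestion above: pick_before and rng are hypothetical names, and the "Zed" row is made-up data added only to trigger the empty case.

```python
import random
import pandas as pd

df = pd.DataFrame({
    "name": ["Dave", "Dave", "Zed"],  # "Zed" is an invented name with no valid date
    "date": pd.to_datetime(["2015-01-31 07:14:39", "2015-04-13 23:25:55",
                            "2015-03-01 00:00:00"]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "Zed"],
    "datemax": pd.to_datetime(["2015-04-13 23:25:55", "2015-03-01 00:00:00"]),
})
merged = df.merge(df2, on="name", how="left")

def pick_before(group, rng):
    """Return one random date strictly before the group's datemax, or NaT if none exists."""
    valid = group.loc[group["date"] < group["datemax"], "date"]
    if valid.empty:
        return pd.NaT
    return valid.iloc[rng.randrange(len(valid))]

rng = random.Random(0)  # seeded only so the sketch is repeatable
result = merged.groupby("name")[["date", "datemax"]].apply(pick_before, rng=rng)
```

Here result is a Series indexed by name, with NaT marking the names for which nothing could be picked.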

You can use pd.DataFrame.sample like this:
In [697]: idx = df2.set_index('name').datemax
In [698]: (df1.groupby('name')
.apply(lambda x: x.loc[x.date < idx[x.name]].sample(1))
.reset_index(drop=True))
Out[698]:
date name
0 2015-04-01 21:36:55 Dave
1 2015-02-18 14:10:40 John
2 2014-12-18 00:15:02 Lisa
3 2014-12-16 19:53:53 Simon
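Since the real data is large, it may be worth filtering once and then sampling with the built-in group-wise sampler instead of a Python lambda per group. A minimal sketch, assuming pandas 1.1 or later (where DataFrameGroupBy.sample is available); cutoff and picked are illustrative names, and random_state is set only to make the sketch repeatable.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["Dave", "Lisa", "John", "Lisa", "Simon", "Simon", "Simon",
             "Simon", "Lisa", "Dave", "Dave", "John", "Lisa"],
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-01-30 03:51:12", "2015-02-20 10:33:48",
        "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
        "2015-04-01 21:36:55", "2015-04-13 23:25:55", "2015-02-18 14:10:40",
        "2015-02-27 04:56:33"]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa", "Simon"],
    "datemax": pd.to_datetime([
        "2015-04-13 23:25:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-02-20 10:33:48"]),
})

# Per-row cutoff looked up by name, then one vectorized comparison.
cutoff = df1["name"].map(df2.set_index("name")["datemax"])

picked = (
    df1[df1["date"] < cutoff]          # drop rows at or after the cutoff
    .groupby("name")
    .sample(n=1, random_state=0)       # one random row per remaining group
    .reset_index(drop=True)
)
```

A name with no date before its datemax disappears during the filter step, so it simply has no row in picked rather than raising an error.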

Does it need to be random, or just the first valid date?
It needs to be random :)