Python: random row selection per group in a DataFrame with a boolean condition


Suppose I have the following DataFrame:

df = pd.DataFrame({'name':['Dave','Lisa','John','Lisa','Simon','Simon','Simon','Simon','Lisa','Dave','Dave','John','Lisa'],
'date': ['2015-01-31 07:14:39','2014-12-16 22:50:55','2015-04-12 23:29:11','2015-04-08 17:57:29','2015-01-30 03:51:12','2015-02-20 10:33:48','2014-12-15 23:54:03','2014-12-16 19:53:53','2014-12-18 00:15:02','2015-04-01 21:36:55','2015-04-13 23:25:55','2015-02-18 14:10:40','2015-02-27 04:56:33']})
DataFrame 1

            date            name
0   2015-01-31 07:14:39     Dave
1   2014-12-16 22:50:55     Lisa
2   2015-04-12 23:29:11     John
3   2015-04-08 17:57:29     Lisa
4   2015-01-30 03:51:12     Simon
5   2015-02-20 10:33:48     Simon
6   2014-12-15 23:54:03     Simon
7   2014-12-16 19:53:53     Simon
8   2014-12-18 00:15:02     Lisa
9   2015-04-01 21:36:55     Dave
10  2015-04-13 23:25:55     Dave
11  2015-02-18 14:10:40     John
12  2015-02-27 04:56:33     Lisa
DataFrame 2

    name           datemax
0   Dave    2015-04-13 23:25:55
1   John    2015-04-12 23:29:11
2   Lisa    2015-04-08 17:57:29
3   Simon   2015-02-20 10:33:48
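where the 'date' and 'datemax' columns hold datetime objects.

I need to group DataFrame 1 by 'name' and randomly select one date for each name, but the selected date must be earlier than that name's 'datemax' in the second DataFrame (DataFrame 2).

The real DataFrames I am working with are much larger than the ones in this example, so I need a fast way to do this.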

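I would first cut out all the dates that do not meet that criterion (df2a is not defined in this answer; presumably it is DataFrame 2 indexed by name, e.g. df2.set_index("name")):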

In [11]: df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[11]:
0    2015-04-13 23:25:55
1    2015-04-08 17:57:29
2    2015-04-12 23:29:11
3    2015-04-08 17:57:29
4    2015-02-20 10:33:48
5    2015-02-20 10:33:48
6    2015-02-20 10:33:48
7    2015-02-20 10:33:48
8    2015-04-08 17:57:29
9    2015-04-13 23:25:55
10   2015-04-13 23:25:55
11   2015-04-12 23:29:11
12   2015-04-08 17:57:29
Name: date, dtype: datetime64[ns]

In [12]: df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])
Out[12]:
0      True
1      True
2     False
3     False
4      True
5     False
6      True
7      True
8      True
9      True
10    False
11     True
12     True
Name: date, dtype: bool

In [13]: df_old = df[df["date"] < df.groupby("name")["date"].transform(lambda x: df2a.loc[x.name, "datemax"])]

In [14]: df_old
Out[14]:
                  date   name
0  2015-01-31 07:14:39   Dave
1  2014-12-16 22:50:55   Lisa
4  2015-01-30 03:51:12  Simon
6  2014-12-15 23:54:03  Simon
7  2014-12-16 19:53:53  Simon
8  2014-12-18 00:15:02   Lisa
9  2015-04-01 21:36:55   Dave
11 2015-02-18 14:10:40   John
12 2015-02-27 04:56:33   Lisa
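
The filtering step above stops short of the random pick itself. A minimal sketch of one way to finish it, assuming pandas 1.1 or later (where `DataFrameGroupBy.sample` is available) and the example data from the question. The `cutoff` and `picked` names and the `random_state` value are illustrative, not part of the original answer, and `Series.map` is swapped in for the `transform` lambda as a vectorized lookup:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dave", "Lisa", "John", "Lisa", "Simon", "Simon", "Simon",
             "Simon", "Lisa", "Dave", "Dave", "John", "Lisa"],
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-01-30 03:51:12", "2015-02-20 10:33:48",
        "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
        "2015-04-01 21:36:55", "2015-04-13 23:25:55", "2015-02-18 14:10:40",
        "2015-02-27 04:56:33"]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa", "Simon"],
    "datemax": pd.to_datetime([
        "2015-04-13 23:25:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-02-20 10:33:48"]),
})

# Look up each row's cutoff by name (vectorized, no per-group lambda).
cutoff = df["name"].map(df2.set_index("name")["datemax"])

# Keep only rows whose date is strictly before that name's datemax.
df_old = df[df["date"] < cutoff]

# Draw one random row per name; random_state is only here for repeatability.
picked = df_old.groupby("name").sample(n=1, random_state=0)
print(picked)
```

Which row is drawn for each name depends on the seed; drop `random_state` to get a fresh draw on every run.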

My suggestion is:

import random
import pandas as pd

df = pd.DataFrame({'name':['Dave','Lisa','John','Lisa','Simon','Simon','Simon','Simon','Lisa','Dave','Dave','John','Lisa'],'date': ['2015-01-31 07:14:39','2014-12-16 22:50:55','2015-04-12 23:29:11','2015-04-08 17:57:29','2015-01-30 03:51:12','2015-02-20 10:33:48','2014-12-15 23:54:03','2014-12-16 19:53:53','2014-12-18 00:15:02','2015-04-01 21:36:55','2015-04-13 23:25:55','2015-02-18 14:10:40','2015-02-27 04:56:33']})

# Parse the date strings into datetime64 values
df['date'] = pd.to_datetime(df['date'])

df2 = pd.DataFrame([['Dave','2015-04-13 23:25:55'],['John','2015-04-12 23:29:11'],['Lisa','2015-04-08 17:57:29'],['Simon','2015-02-20 10:33:48']], columns=['name','datemax'])

df2['datemax'] = pd.to_datetime(df2['datemax'])

# Attach each name's datemax to every row of df
df = df.merge(df2, on='name', how='left')

grouped = df.groupby('name')

# For each name, pick one random date that is strictly before its datemax
grouped.apply(lambda x: random.choice([a for a in x['date'].values if a < x['datemax'].values[0]]))

You can use pd.DataFrame.sample, like this:

In [697]: idx = df2.set_index('name').datemax

In [698]: (df1.groupby('name')
              .apply(lambda x: x.loc[x.date < idx[x.name]].sample(1))
              .reset_index(drop=True))
Out[698]:
                 date   name
0 2015-04-01 21:36:55   Dave
1 2015-02-18 14:10:40   John
2 2014-12-18 00:15:02   Lisa
3 2014-12-16 19:53:53  Simon
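
Since the question asks for speed on much larger data, here is a sketch of a fully vectorized variant of the same idea: shuffle the valid rows once, then keep the first row seen for each name. Each valid row of a name is equally likely to come first after a uniform shuffle, so the pick stays uniform within each group. The names `valid` and `picked` and the `random_state` value are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["Dave", "Lisa", "John", "Lisa", "Simon", "Simon", "Simon",
             "Simon", "Lisa", "Dave", "Dave", "John", "Lisa"],
    "date": pd.to_datetime([
        "2015-01-31 07:14:39", "2014-12-16 22:50:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-01-30 03:51:12", "2015-02-20 10:33:48",
        "2014-12-15 23:54:03", "2014-12-16 19:53:53", "2014-12-18 00:15:02",
        "2015-04-01 21:36:55", "2015-04-13 23:25:55", "2015-02-18 14:10:40",
        "2015-02-27 04:56:33"]),
})
df2 = pd.DataFrame({
    "name": ["Dave", "John", "Lisa", "Simon"],
    "datemax": pd.to_datetime([
        "2015-04-13 23:25:55", "2015-04-12 23:29:11",
        "2015-04-08 17:57:29", "2015-02-20 10:33:48"]),
})

# Per-name cutoff, indexed by name, as in the answer above.
idx = df2.set_index("name")["datemax"]

# Rows whose date is strictly before their name's cutoff.
valid = df1[df1["date"] < df1["name"].map(idx)]

# Shuffle once, keep the first row per name, then tidy up the ordering.
picked = (valid.sample(frac=1, random_state=0)
               .drop_duplicates("name")
               .sort_values("name")
               .reset_index(drop=True))
print(picked)
```

A name with no valid date simply does not appear in the result, whereas calling `.sample(1)` on an empty group raises an error.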
Does it need to be random, or just the first valid date?
It needs to be random :)