Python pandas: daily groupby with a condition based on the first higher value

Question:

import pandas as pd

df = pd.DataFrame({
    'num_1':[1,2,3,4,5,6,7,8,9,10,11,12],
    'num_2':[1,2,10,5,5,6,7,8,100,101,102,15],
    'dates':pd.date_range('1/1/2011', periods=12, freq='8h')})

df

                 dates  num_1  num_2
0  2011-01-01 00:00:00      1      1
1  2011-01-01 08:00:00      2      2
2  2011-01-01 16:00:00      3     10
3  2011-01-02 00:00:00      4      5
4  2011-01-02 08:00:00      5      5
5  2011-01-02 16:00:00      6      6
6  2011-01-03 00:00:00      7      7
7  2011-01-03 08:00:00      8      8
8  2011-01-03 16:00:00      9    100
9  2011-01-04 00:00:00     10    101
10 2011-01-04 08:00:00     11    102
11 2011-01-04 16:00:00     12     15

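How do I find, for each day, the first time that num_2 is greater than num_1? In other words, a daily groupby with a condition based on the first higher value, as in the example below.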

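For this data, the condition num_2 > num_1 is True at rows 2, 3, 8, 9, 10 and 11 (these times were highlighted in the original post).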

Desired output:

A new column that shows 1 at the first row of each day where the condition is True, and 0 otherwise.

Solution:

In [85]: df['result'] = \
    ...:     df.dates.isin(
    ...:         df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False)
    ...:           .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates']).astype(int)
    ...:

In [86]: df
Out[86]:
                 dates  num_1  num_2  result
0  2011-01-01 00:00:00      1      1       0
1  2011-01-01 08:00:00      2      2       0
2  2011-01-01 16:00:00      3     10       1
3  2011-01-02 00:00:00      4      5       1
4  2011-01-02 08:00:00      5      5       0
5  2011-01-02 16:00:00      6      6       0
6  2011-01-03 00:00:00      7      7       0
7  2011-01-03 08:00:00      8      8       0
8  2011-01-03 16:00:00      9    100       1
9  2011-01-04 00:00:00     10    101       1
10 2011-01-04 08:00:00     11    102       0
11 2011-01-04 16:00:00     12     15       0
Explanation, step by step:

In [80]: df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False) \
    ...:   .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))
    ...:
Out[80]:
                  dates  num_1  num_2  result
0 2 2011-01-01 16:00:00      3     10       1
1 3 2011-01-02 00:00:00      4      5       1
2 8 2011-01-03 16:00:00      9    100       1
3 9 2011-01-04 00:00:00     10    101       1

In [81]: df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False) \
    ...:   .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates']
    ...:
Out[81]:
0  2   2011-01-01 16:00:00
1  3   2011-01-02 00:00:00
2  8   2011-01-03 16:00:00
3  9   2011-01-04 00:00:00
Name: dates, dtype: datetime64[ns]

In [82]: df.dates.isin(
    ...:     df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False)
    ...:       .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates'])
    ...:
Out[82]:
0     False
1     False
2      True
3      True
4     False
5     False
6     False
7     False
8      True
9      True
10    False
11    False
Name: dates, dtype: bool

In [83]: df.dates.isin(
    ...:     df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False)
    ...:       .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates']).astype(int)
    ...:
Out[83]:
0     0
1     0
2     1
3     1
4     0
5     0
6     0
7     0
8     1
9     1
10    0
11    0
Name: dates, dtype: int32
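
The same result can probably be reached without apply. The following is only a sketch of an alternative, not the answerer's method: it counts the hits within each day with a grouped cumulative sum and keeps the rows where the condition holds and the running count is exactly 1. The names mask and hits_so_far are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'num_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'num_2': [1, 2, 10, 5, 5, 6, 7, 8, 100, 101, 102, 15],
    'dates': pd.date_range('1/1/2011', periods=12, freq='8h')})

# Row-wise condition: True wherever num_2 exceeds num_1.
mask = df['num_2'] > df['num_1']

# Running count of hits within each calendar day
# (normalize() truncates every timestamp to midnight of its day).
hits_so_far = mask.astype(int).groupby(df['dates'].dt.normalize()).cumsum()

# A row is the first hit of its day when it is a hit and the running count is 1.
df['result'] = (mask & (hits_so_far == 1)).astype(int)
print(df)
```

Because everything here is a column-wise operation, this variant avoids calling a Python lambda once per group, which may matter on larger frames.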

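You can apply a lambda that evaluates the condition and uses idxmax to return the index label where the condition first occurs in each group, then set those rows to 1: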

In [36]:
# assign default value, this sets the dtype to int so we don't have to convert and fillna after the following line
df['result'] = 0
df.loc[df.groupby(df['dates'].dt.date).apply(lambda x: (x['num_2'] > x['num_1']).idxmax()),'result'] = 1
df

Out[36]:
                 dates  num_1  num_2  result
0  2011-01-01 00:00:00      1      1       0
1  2011-01-01 08:00:00      2      2       0
2  2011-01-01 16:00:00      3     10       1
3  2011-01-02 00:00:00      4      5       1
4  2011-01-02 08:00:00      5      5       0
5  2011-01-02 16:00:00      6      6       0
6  2011-01-03 00:00:00      7      7       0
7  2011-01-03 08:00:00      8      8       0
8  2011-01-03 16:00:00      9    100       1
9  2011-01-04 00:00:00     10    101       1
10 2011-01-04 08:00:00     11    102       0
11 2011-01-04 16:00:00     12     15       0
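
One caveat with the idxmax approach: on a boolean Series that contains no True at all, Series.idxmax returns the first index label, so a day on which the condition never holds would still get a 1. The sample data has a hit on every day, so the output above is unaffected. A minimal sketch of one way to guard against this, using hypothetical data in which 2011-01-02 has no hit, is to keep only the rows where the condition holds before grouping:

```python
import pandas as pd

# Hypothetical data: the condition is never True on 2011-01-02.
df = pd.DataFrame({
    'num_1': [1, 2, 3, 4, 5, 6],
    'num_2': [1, 5, 9, 4, 5, 6],
    'dates': pd.date_range('1/1/2011', periods=6, freq='8h')})

mask = df['num_2'] > df['num_1']
day = df['dates'].dt.date

# Index labels of the rows where the condition holds, grouped by day;
# first() then picks the earliest label within each day that has a hit.
first_hits = df.index.to_series()[mask].groupby(day[mask]).first()

df['result'] = 0
df.loc[first_hits.to_numpy(), 'result'] = 1
print(df)
```

Days without any hit simply produce no group, so none of their rows are marked.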

I would just need to filter the df before the groupby, for example:
df[df['dates'].dt.hour > 0].groupby(pd.Grouper(key='dates', freq='D'))
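
The filtering idea in this comment could be sketched as follows. It assumes the intent is to ignore the midnight rows (hour equal to 0) and then mark the first remaining hit of each day; that reading is an assumption, and the names hits and first_hits are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'num_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'num_2': [1, 2, 10, 5, 5, 6, 7, 8, 100, 101, 102, 15],
    'dates': pd.date_range('1/1/2011', periods=12, freq='8h')})

# Assumption: drop the midnight rows first, then apply the condition.
hits = df[(df['dates'].dt.hour > 0) & (df['num_2'] > df['num_1'])]

# head(1) keeps the first remaining row of each day and preserves
# the original index labels, so they can be used with df.loc.
first_hits = hits.groupby(hits['dates'].dt.date).head(1).index

df['result'] = 0
df.loc[first_hits, 'result'] = 1
print(df)
```

With the midnight rows excluded, 2011-01-02 no longer has any hit, and the first hit on 2011-01-04 moves to the 08:00 row.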