Python pandas: daily groupby with a condition based on the first higher value
Tags: python, pandas, dataframe, group-by
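Question:
How can I find the first time each day that num_2 is greater than num_1? I want a daily groupby with a condition based on the first higher value, as in the example below.

Data: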
import pandas as pd

df = pd.DataFrame({
'num_1':[1,2,3,4,5,6,7,8,9,10,11,12],
'num_2':[1,2,10,5,5,6,7,8,100,101,102,15],
'dates':pd.date_range('1/1/2011', periods=12, freq='8h')})
df
dates num_1 num_2
0 2011-01-01 00:00:00 1 1
1 2011-01-01 08:00:00 2 2
2 2011-01-01 16:00:00 3 10
3 2011-01-02 00:00:00 4 5
4 2011-01-02 08:00:00 5 5
5 2011-01-02 16:00:00 6 6
6 2011-01-03 00:00:00 7 7
7 2011-01-03 08:00:00 8 8
8 2011-01-03 16:00:00 9 100
9 2011-01-04 00:00:00 10 101
10 2011-01-04 08:00:00 11 102
11 2011-01-04 16:00:00 12 15
I have highlighted the times at which the condition is True for this data.

Desired output:
A new column that shows 1 when the condition is True and 0 otherwise.
Solution:
In [85]: df['result'] = \
...: df.dates.isin(
...: df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False)
...: .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates']).astype(int)
...:
In [86]: df
Out[86]:
dates num_1 num_2 result
0 2011-01-01 00:00:00 1 1 0
1 2011-01-01 08:00:00 2 2 0
2 2011-01-01 16:00:00 3 10 1
3 2011-01-02 00:00:00 4 5 1
4 2011-01-02 08:00:00 5 5 0
5 2011-01-02 16:00:00 6 6 0
6 2011-01-03 00:00:00 7 7 0
7 2011-01-03 08:00:00 8 8 0
8 2011-01-03 16:00:00 9 100 1
9 2011-01-04 00:00:00 10 101 1
10 2011-01-04 08:00:00 11 102 0
11 2011-01-04 16:00:00 12 15 0
Explanation, step by step:
In [80]: df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False) \
...: .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))
...:
Out[80]:
dates num_1 num_2 result
0 2 2011-01-01 16:00:00 3 10 1
1 3 2011-01-02 00:00:00 4 5 1
2 8 2011-01-03 16:00:00 9 100 1
3 9 2011-01-04 00:00:00 10 101 1
In [81]: df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False) \
...: .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates']
...:
Out[81]:
0 2 2011-01-01 16:00:00
1 3 2011-01-02 00:00:00
2 8 2011-01-03 16:00:00
3 9 2011-01-04 00:00:00
Name: dates, dtype: datetime64[ns]
In [82]: df.dates.isin(
...: df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False)
...: .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates'])
...:
Out[82]:
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 True
9 True
10 False
11 False
Name: dates, dtype: bool
In [83]: df.dates.isin(
...: df.groupby(pd.Grouper(key='dates', freq='D'), as_index=False)
...: .apply(lambda x: x.loc[x.num_2 > x.num_1].head(1))['dates']).astype(int)
...:
Out[83]:
0 0
1 0
2 1
3 1
4 0
5 0
6 0
7 0
8 1
9 1
10 0
11 0
Name: dates, dtype: int32
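The steps above rely on apply and on matching timestamps with isin, which assumes the timestamps are unique. As a rough alternative (not part of the original answer), the same flags can be sketched without apply by counting, within each day, how many times the condition has been True so far. The names cond and day are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'num_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'num_2': [1, 2, 10, 5, 5, 6, 7, 8, 100, 101, 102, 15],
    'dates': pd.date_range('1/1/2011', periods=12, freq='8h')})

# Row-wise condition from the question
cond = df['num_2'] > df['num_1']

# Calendar day of each row (timestamps truncated to midnight)
day = df['dates'].dt.normalize()

# Running count of True values within each day; the first hit is the
# row where the condition holds and the running count equals 1
running = cond.astype(int).groupby(day).cumsum()
df['result'] = (cond & (running == 1)).astype(int)

print(df)
```

On the sample data this should flag rows 2, 3, 8 and 9, the same rows as the solution above.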
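You can apply a lambda that evaluates the condition and uses idxmax to return the index label where the condition first occurs, then set those rows to 1: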
In [36]:
# assign default value, this sets the dtype to int so we don't have to convert and fillna after the following line
df['result'] = 0
df.loc[df.groupby(df['dates'].dt.date).apply(lambda x: (x['num_2'] > x['num_1']).idxmax()),'result'] = 1
df
Out[36]:
dates num_1 num_2 result
0 2011-01-01 00:00:00 1 1 0
1 2011-01-01 08:00:00 2 2 0
2 2011-01-01 16:00:00 3 10 1
3 2011-01-02 00:00:00 4 5 1
4 2011-01-02 08:00:00 5 5 0
5 2011-01-02 16:00:00 6 6 0
6 2011-01-03 00:00:00 7 7 0
7 2011-01-03 08:00:00 8 8 0
8 2011-01-03 16:00:00 9 100 1
9 2011-01-04 00:00:00 10 101 1
10 2011-01-04 08:00:00 11 102 0
11 2011-01-04 16:00:00 12 15 0
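One caveat worth checking with the idxmax approach: when a day has no row where the condition is True, idxmax still returns a label (the first row of that day), so that row would be flagged by mistake. The sample data has a hit on every day, so the problem does not show up there. The sketch below uses a smaller, made-up frame whose second day never satisfies the condition, and guards the assignment with any(); the names first_label and has_hit are illustrative.

```python
import pandas as pd

# Hypothetical data: on the second day num_2 never exceeds num_1
df = pd.DataFrame({
    'num_1': [1, 2, 3, 4, 5, 6],
    'num_2': [1, 2, 10, 4, 5, 6],
    'dates': pd.date_range('1/1/2011', periods=6, freq='8h')})

cond = df['num_2'] > df['num_1']
day = df['dates'].dt.date

# Label of the first maximum per day; for an all-False day this is
# simply the first row of that day, which is not a real hit
first_label = cond.astype(int).groupby(day).idxmax()

# Days on which the condition is True at least once
has_hit = cond.groupby(day).any()

df['result'] = 0
df.loc[first_label[has_hit].tolist(), 'result'] = 1

print(df)
```

Without the has_hit filter, row 3 (the first row of the second day) would also be set to 1.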
What if I only need to filter df before the groupby, for example
df.groupby((df['dates'].dt.hour>0) and (df['dates'].dt.date))
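Combining two Series with Python's and raises a ValueError (the truth value of a Series is ambiguous), so the expression in the comment would not run as written. One possible reading of the intent, sketched here as an assumption rather than a confirmed answer, is to drop the midnight rows first and then group the remaining rows by day. The names filtered, hits and first_idx are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'num_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'num_2': [1, 2, 10, 5, 5, 6, 7, 8, 100, 101, 102, 15],
    'dates': pd.date_range('1/1/2011', periods=12, freq='8h')})

# Filter first: keep only rows whose hour is after midnight
filtered = df[df['dates'].dt.hour > 0]

# Rows of the filtered frame where the condition holds
hits = filtered[filtered['num_2'] > filtered['num_1']]

# First such row per calendar day; the original index labels are kept
first_idx = hits.groupby(hits['dates'].dt.date).head(1).index

df['result'] = 0
df.loc[first_idx, 'result'] = 1

print(df)
```

With the midnight rows excluded, the second day has no qualifying row at all, and the fourth day's first hit moves from row 9 to row 10.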