Python: append a category to a column if the date falls between a start date and an end date
I'm sure this is simple, but I can't wrap my head around it. Basically I have two dataframes: a large df containing process data every six hours, and a small df containing a condition number, a start date, and an end date. I need to fill the condition column of the large dataframe with the condition number that corresponds to the date range, or leave it blank if the date does not fall within any of the date ranges in the small dataframe. So my two frames look like this:
Large df
Date P1 P2
7/1/2019 11:00 102 240
7/1/2019 17:00 102 247
7/1/2019 23:00 100 219
7/2/2019 5:00 107 213
7/2/2019 11:00 100 226
7/2/2019 17:00 104 239
7/2/2019 23:00 110 240
7/3/2019 5:00 110 232
7/3/2019 11:00 102 215
7/3/2019 17:00 103 219
7/3/2019 23:00 107 243
7/4/2019 5:00 107 246
7/4/2019 11:00 103 219
7/4/2019 17:00 105 220
7/4/2019 23:00 107 220
7/5/2019 5:00 107 227
7/5/2019 11:00 108 208
7/5/2019 17:00 110 248
7/5/2019 23:00 107 235
Small df
Condition Start Time End Time
A 7/1/2019 11:00 7/2/2019 5:00
B 7/3/2019 5:00 7/3/2019 23:00
C 7/4/2019 23:00 7/5/2019 17:00
I need a result like this:
Date P1 P2 Cond
7/1/2019 11:00 102 240 A
7/1/2019 17:00 102 247 A
7/1/2019 23:00 100 219 A
7/2/2019 5:00 107 213 A
7/2/2019 11:00 100 226
7/2/2019 17:00 104 239
7/2/2019 23:00 110 240
7/3/2019 5:00 110 232 B
7/3/2019 11:00 102 215 B
7/3/2019 17:00 103 219 B
7/3/2019 23:00 107 243 B
7/4/2019 5:00 107 246
7/4/2019 11:00 103 219
7/4/2019 17:00 105 220
7/4/2019 23:00 107 220 C
7/5/2019 5:00 107 227 C
7/5/2019 11:00 108 208 C
7/5/2019 17:00 110 248 C
7/5/2019 23:00 107 235
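For anyone who wants to reproduce this, here is a minimal sketch that builds the two frames shown above. The variable names `df` and `sdf` are an assumption chosen to match the first answer, and the dates are generated with a six-hour step rather than typed out, which gives the same 19 timestamps.

```python
import pandas as pd

# Large frame: one row every six hours, starting 7/1/2019 11:00 (19 rows).
df = pd.DataFrame({
    "Date": pd.date_range("2019-07-01 11:00", periods=19,
                          freq=pd.Timedelta(hours=6)),
    "P1": [102, 102, 100, 107, 100, 104, 110, 110, 102, 103,
           107, 107, 103, 105, 107, 107, 108, 110, 107],
    "P2": [240, 247, 219, 213, 226, 239, 240, 232, 215, 219,
           243, 246, 219, 220, 220, 227, 208, 248, 235],
})

# Small lookup frame: one row per condition with its date range.
sdf = pd.DataFrame({
    "Condition": ["A", "B", "C"],
    "Start Time": pd.to_datetime(["2019-07-01 11:00", "2019-07-03 05:00",
                                  "2019-07-04 23:00"]),
    "End Time": pd.to_datetime(["2019-07-02 05:00", "2019-07-03 23:00",
                                "2019-07-05 17:00"]),
})

print(df.head())
print(sdf)
```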
You need:
for i, row in sdf.iterrows():
    df.loc[df['Date'].between(row['Start Time'], row['End Time']), 'Cond'] = row['Condition']
Output:
Date P1 P2 Cond
0 2019-07-01 11:00:00 102 240 A
1 2019-07-01 17:00:00 102 247 A
2 2019-07-01 23:00:00 100 219 A
3 2019-07-02 05:00:00 107 213 A
4 2019-07-02 11:00:00 100 226 NaN
5 2019-07-02 17:00:00 104 239 NaN
6 2019-07-02 23:00:00 110 240 NaN
7 2019-07-03 05:00:00 110 232 B
8 2019-07-03 11:00:00 102 215 B
9 2019-07-03 17:00:00 103 219 B
10 2019-07-03 23:00:00 107 243 B
11 2019-07-04 05:00:00 107 246 NaN
12 2019-07-04 11:00:00 103 219 NaN
13 2019-07-04 17:00:00 105 220 NaN
14 2019-07-04 23:00:00 107 220 C
15 2019-07-05 05:00:00 107 227 C
16 2019-07-05 11:00:00 108 208 C
17 2019-07-05 17:00:00 110 248 C
18 2019-07-05 23:00:00 107 235 NaN
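The loop above only works if all three date columns are real datetimes, not strings. Here is a small self-contained sketch under that assumption: it parses the small frame's `M/D/YYYY H:MM` strings first, then runs the same `between` logic, iterating with `zip` instead of `iterrows` (my substitution, not part of the answer). Only the `Date` column of the large frame is rebuilt here to keep it short.

```python
import pandas as pd

# Large frame: just the Date column, every six hours (assumed setup).
df = pd.DataFrame({
    "Date": pd.date_range("2019-07-01 11:00", periods=19,
                          freq=pd.Timedelta(hours=6)),
})

# Small frame with the ranges still stored as text, as in the question.
sdf = pd.DataFrame({
    "Condition": ["A", "B", "C"],
    "Start Time": ["7/1/2019 11:00", "7/3/2019 5:00", "7/4/2019 23:00"],
    "End Time": ["7/2/2019 5:00", "7/3/2019 23:00", "7/5/2019 17:00"],
})

# Convert the text columns to datetimes before comparing.
for col in ["Start Time", "End Time"]:
    sdf[col] = pd.to_datetime(sdf[col], format="%m/%d/%Y %H:%M")

# Start with an empty column, then fill one condition at a time.
# Series.between is inclusive on both ends by default.
df["Cond"] = None
for cond, start, end in zip(sdf["Condition"], sdf["Start Time"], sdf["End Time"]):
    df.loc[df["Date"].between(start, end), "Cond"] = cond

print(df)
```

The loop runs once per row of the small frame, so its cost grows with the number of conditions, not with the size of the large frame.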
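You can do the following:
import io
import pandas as pd

# s1 and s2 are not shown here; they are assumed to hold the two sample
# tables as text, with columns separated by two or more spaces
df1 = pd.read_csv(io.StringIO(s1), sep=r'\s\s+', engine='python',
                  converters={'Date': pd.to_datetime})
df2 = pd.read_csv(io.StringIO(s2), sep=r'\s\s+', engine='python',
                  converters={'Start Time': pd.to_datetime, 'End Time': pd.to_datetime})
# one row per boundary: Condition, level_1 ('Start Time'/'End Time'), 0 (timestamp)
df2 = df2.set_index('Condition').stack().reset_index()
# attach the most recent boundary at or before each Date
df = pd.merge_asof(df1, df2, left_on='Date', right_on=0, direction='backward')
# blank the condition when that boundary is an end time already passed
df.loc[(df['level_1'].eq('End Time')) & (df['Date'] > df[0]), 'Condition'] = ''
print(df.iloc[:, :-2])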
Date P1 P2 Condition
0 2019-07-01 11:00:00 102 240 A
1 2019-07-01 17:00:00 102 247 A
2 2019-07-01 23:00:00 100 219 A
3 2019-07-02 05:00:00 107 213 A
4 2019-07-02 11:00:00 100 226
5 2019-07-02 17:00:00 104 239
6 2019-07-02 23:00:00 110 240
7 2019-07-03 05:00:00 110 232 B
8 2019-07-03 11:00:00 102 215 B
9 2019-07-03 17:00:00 103 219 B
10 2019-07-03 23:00:00 107 243 B
11 2019-07-04 05:00:00 107 246
12 2019-07-04 11:00:00 103 219
13 2019-07-04 17:00:00 105 220
14 2019-07-04 23:00:00 107 220 C
15 2019-07-05 05:00:00 107 227 C
16 2019-07-05 11:00:00 108 208 C
17 2019-07-05 17:00:00 110 248 C
18 2019-07-05 23:00:00 107 235
You can try pd.IntervalIndex and map, as follows:
inx = pd.IntervalIndex.from_arrays(df2['Start Time'], df2['End Time'], closed='both')
df2.index = inx
df1['cond'] = df1.Date.map(df2.Condition)
Out[423]:
Date P1 P2 cond
0 2019-07-01 11:00:00 102 240 A
1 2019-07-01 17:00:00 102 247 A
2 2019-07-01 23:00:00 100 219 A
3 2019-07-02 05:00:00 107 213 A
4 2019-07-02 11:00:00 100 226 NaN
5 2019-07-02 17:00:00 104 239 NaN
6 2019-07-02 23:00:00 110 240 NaN
7 2019-07-03 05:00:00 110 232 B
8 2019-07-03 11:00:00 102 215 B
9 2019-07-03 17:00:00 103 219 B
10 2019-07-03 23:00:00 107 243 B
11 2019-07-04 05:00:00 107 246 NaN
12 2019-07-04 11:00:00 103 219 NaN
13 2019-07-04 17:00:00 105 220 NaN
14 2019-07-04 23:00:00 107 220 C
15 2019-07-05 05:00:00 107 227 C
16 2019-07-05 11:00:00 108 208 C
17 2019-07-05 17:00:00 110 248 C
18 2019-07-05 23:00:00 107 235 NaN
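As a variation on the same idea, the lookup can also be written with `IntervalIndex.get_indexer`, which returns the position of the interval containing each date and `-1` where there is none. This is only a sketch: the frames are rebuilt here with the same names as above, and it assumes the ranges do not overlap (as far as I know `get_indexer` refuses overlapping intervals).

```python
import numpy as np
import pandas as pd

# Assumed setup: same dates and ranges as in the question.
df1 = pd.DataFrame({
    "Date": pd.date_range("2019-07-01 11:00", periods=19,
                          freq=pd.Timedelta(hours=6)),
})
df2 = pd.DataFrame({
    "Condition": ["A", "B", "C"],
    "Start Time": pd.to_datetime(["2019-07-01 11:00", "2019-07-03 05:00",
                                  "2019-07-04 23:00"]),
    "End Time": pd.to_datetime(["2019-07-02 05:00", "2019-07-03 23:00",
                                "2019-07-05 17:00"]),
})

# Closed on both sides so the start and end timestamps themselves match.
intervals = pd.IntervalIndex.from_arrays(df2["Start Time"], df2["End Time"],
                                         closed="both")

# Position of the matching interval for each date, or -1 if none matches.
pos = intervals.get_indexer(pd.Index(df1["Date"]))

# Look up the labels by position; rows with pos == -1 are masked to None.
labels = df2["Condition"].to_numpy(dtype=object)
df1["cond"] = np.where(pos >= 0, labels[pos], None)

print(df1)
```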
df1.insert(3, "Cond", [None] * len(df1))
for i in range(len(df2)):
    df1.loc[(df1["Date"] >= df2["Start Time"].loc[i]) * (df1["Date"] ...
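The last line of that snippet is cut off, so here is a sketch of what a complete boolean-mask loop could look like. The upper-bound comparison and the assignment are my assumption, picked so that rows sitting exactly on an end time get tagged as in the expected output. It uses `&` where the snippet uses `*`, and inserts at position 1 because only the `Date` column is rebuilt here.

```python
import pandas as pd

# Assumed setup: same dates and ranges as in the question.
df1 = pd.DataFrame({
    "Date": pd.date_range("2019-07-01 11:00", periods=19,
                          freq=pd.Timedelta(hours=6)),
})
df2 = pd.DataFrame({
    "Condition": ["A", "B", "C"],
    "Start Time": pd.to_datetime(["2019-07-01 11:00", "2019-07-03 05:00",
                                  "2019-07-04 23:00"]),
    "End Time": pd.to_datetime(["2019-07-02 05:00", "2019-07-03 23:00",
                                "2019-07-05 17:00"]),
})

# Empty column first, so unmatched rows stay None.
df1.insert(1, "Cond", [None] * len(df1))

for i in range(len(df2)):
    # Assumed completion: inclusive on both ends.
    in_range = (df1["Date"] >= df2["Start Time"].loc[i]) & \
               (df1["Date"] <= df2["End Time"].loc[i])
    df1.loc[in_range, "Cond"] = df2["Condition"].loc[i]

print(df1)
```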
Can you provide the code to create the example?
I have to disagree: iterrows is slow, does not scale, and should not be used.
The dataframe I am iterating over should be small and is only used for lookup. That is why.
@NaturalFrequency I understand, but without knowing the OP's setup, or whether they might later use this on a larger dataframe, I still don't think it is right to post it as an answer.
I see no problem with this solution.
@NaturalFrequency Let's agree to disagree.