Python pandas: filter by date range / exact id
I want to filter a large DataFrame (millions of rows) based on a much smaller DataFrame that has only three columns (id, start, end). Below is what I pieced together (it works), but it seems like groupby() or np.where might be faster.
Setup:
import pandas as pd
import io
csv = io.StringIO(u'''
time id num
2018-01-01 00:00:00 A 1
2018-01-01 01:00:00 A 2
2018-01-01 02:00:00 A 3
2018-01-01 03:00:00 A 4
2018-01-01 04:00:00 A 5
2018-01-01 05:00:00 A 6
2018-01-01 06:00:00 A 6
2018-01-03 07:00:00 B 10
2018-01-03 08:00:00 B 11
2018-01-03 09:00:00 B 12
2018-01-03 10:00:00 B 13
2018-01-03 11:00:00 B 14
2018-01-03 12:00:00 B 15
2018-01-03 13:00:00 B 16
2018-05-29 23:00:00 C 111
2018-05-30 00:00:00 C 122
2018-05-30 01:00:00 C 133
2018-05-30 02:00:00 C 144
2018-05-30 03:00:00 C 155
''')
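# Note: the sample data (here and in the filter below) is tab-delimited;
# keep the tabs intact when copying so that sep='\t' parses the columns.
df = pd.read_csv(csv, sep = '\t')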
df['time'] = pd.to_datetime(df['time'])
csv_filter = io.StringIO(u'''
id start end
A 2018-01-01 01:00:00 2018-01-01 02:00:00
B 2018-01-03 09:00:00 2018-01-03 12:00:00
C 2018-05-30 00:00:00 2018-05-30 08:00:00
''')
df_filter = pd.read_csv(csv_filter, sep = '\t')
df_filter['start'] = pd.to_datetime(df_filter['start'])
df_filter['end'] = pd.to_datetime(df_filter['end'])
Working code:
df = pd.merge_asof(df, df_filter, left_on = 'time', right_on = 'start', by = 'id').dropna(subset = ['start']).drop(['start','end'], axis = 1)
df = pd.merge_asof(df, df_filter, left_on = 'time', right_on = 'end', by = 'id', direction = 'forward').dropna(subset = ['end']).drop(['start','end'], axis = 1)
Output:
time id num
0 2018-01-01 01:00:00 A 2
1 2018-01-01 02:00:00 A 3
6 2018-01-03 09:00:00 B 12
7 2018-01-03 10:00:00 B 13
8 2018-01-03 11:00:00 B 14
9 2018-01-03 12:00:00 B 15
11 2018-05-30 00:00:00 C 122
12 2018-05-30 01:00:00 C 133
13 2018-05-30 02:00:00 C 144
14 2018-05-30 03:00:00 C 155
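Any ideas for a more elegant or faster solution?

Why not merge before filtering? Note that this will consume a lot of memory when the dataset is very large.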
newdf=df.merge(df_filter)
newdf=newdf.loc[newdf.time.between(newdf.start,newdf.end),df.columns.tolist()]
newdf
Out[480]:
time id num
1 2018-01-01 01:00:00 A 2
2 2018-01-01 02:00:00 A 3
9 2018-01-03 09:00:00 B 12
10 2018-01-03 10:00:00 B 13
11 2018-01-03 11:00:00 B 14
12 2018-01-03 12:00:00 B 15
15 2018-05-30 00:00:00 C 122
16 2018-05-30 01:00:00 C 133
17 2018-05-30 02:00:00 C 144
18 2018-05-30 03:00:00 C 155
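If the merged frame is too large to hold in memory, one lighter option may be to look up each row's bounds by id instead of merging. The following is only a minimal sketch, assuming every id appears at most once in df_filter (true for the sample data; with several ranges per id it would not work as written). The small frames are a hypothetical subset of the question's data, used so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical subset of the question's data, for illustration only.
df = pd.DataFrame({
    'time': pd.to_datetime(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
                            '2018-01-01 02:00:00', '2018-01-03 08:00:00',
                            '2018-01-03 09:00:00']),
    'id': ['A', 'A', 'A', 'B', 'B'],
    'num': [1, 2, 3, 11, 12],
})
df_filter = pd.DataFrame({
    'id': ['A', 'B'],
    'start': pd.to_datetime(['2018-01-01 01:00:00', '2018-01-03 09:00:00']),
    'end': pd.to_datetime(['2018-01-01 02:00:00', '2018-01-03 12:00:00']),
})

# Assumption: ids are unique in df_filter, so it can act as a lookup table.
bounds = df_filter.set_index('id')
start = df['id'].map(bounds['start'])
end = df['id'].map(bounds['end'])

# Rows whose id has no entry in df_filter get NaT bounds; comparisons
# against NaT are False, so those rows are dropped as well.
mask = (df['time'] >= start) & (df['time'] <= end)
filtered = df[mask]
print(filtered)
```

This only creates two extra columns' worth of data rather than a full merged copy, at the cost of the one-range-per-id restriction.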
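What are you merging on?
@elPastor id. By default, merge finds the key itself (the columns the two frames have in common).
I could tell that was the key. I would prefer to be a bit more explicit about which keys are merged on, but this is great. I knew there was a more elegant solution. Thanks.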
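For readers who, like the commenter, would rather name the join key than rely on the default, the merge in the answer can be written with `on='id'`. This is a sketch on a hypothetical subset of the question's data; the `merged` variable name is introduced here for illustration.

```python
import pandas as pd

# Hypothetical subset of the question's data, for illustration only.
df = pd.DataFrame({
    'time': pd.to_datetime(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
                            '2018-01-01 02:00:00', '2018-01-03 08:00:00',
                            '2018-01-03 09:00:00']),
    'id': ['A', 'A', 'A', 'B', 'B'],
    'num': [1, 2, 3, 11, 12],
})
df_filter = pd.DataFrame({
    'id': ['A', 'B'],
    'start': pd.to_datetime(['2018-01-01 01:00:00', '2018-01-03 09:00:00']),
    'end': pd.to_datetime(['2018-01-01 02:00:00', '2018-01-03 12:00:00']),
})

# Name the join key explicitly instead of relying on the shared-column default.
merged = df.merge(df_filter, on='id', how='inner')

# Keep rows whose time falls inside [start, end], then restore df's columns.
in_range = merged['time'].between(merged['start'], merged['end'])
newdf = merged.loc[in_range, df.columns.tolist()]
print(newdf)
```

Spelling out `on='id'` also guards against an accidental join on any other column name the two frames might later come to share.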