Pandas: How to group by datetime ranges taken from a different DataFrame


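I'm stuck and can't figure this out. I have two DataFrames: one holds datetime intervals, the other holds datetimes with values. I need to get the MIN() value for each datetime range.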

import pandas as pd

timeseries = pd.DataFrame(
    [
        ['2018-01-01T00:00:00.000000000', '2018-01-01T03:00:00.000000000'],
        ['2018-01-02T00:00:00.000000000', '2018-01-02T03:00:00.000000000'],
        ['2018-01-03T00:00:00.000000000', '2018-01-03T03:00:00.000000000'],
    ], dtype='datetime64[ns]', columns=['Start DT', 'End DT'])

values = pd.DataFrame(
    [
        ['2018-01-01T00:00:00.000000000', 1],
        ['2018-01-01T01:00:00.000000000', 2],
        ['2018-01-01T02:00:00.000000000', 0],
        ['2018-01-02T00:00:00.000000000', -1],
        ['2018-01-02T01:00:00.000000000', 3],
        ['2018-01-02T02:00:00.000000000', 10],
        ['2018-01-03T00:00:00.000000000', 7],
        ['2018-01-03T01:00:00.000000000', 11],
        ['2018-01-03T02:00:00.000000000', 2],
    ], columns=['DT', 'Value'])

Desired output:

    Start DT              End DT  Min
0 2018-01-01 2018-01-01 03:00:00    0
1 2018-01-02 2018-01-02 03:00:00   -1
2 2018-01-03 2018-01-03 03:00:00    2

Any ideas?

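Use an IntervalIndex created from the timeseries columns, look up the position of each datetime with get_indexer, aggregate with min, and finally join the result to timeseries as a new column: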

s = pd.IntervalIndex.from_arrays(timeseries['Start DT'], 
                                 timeseries['End DT'], 
                                 closed='both')

values['DT'] = pd.to_datetime(values['DT'])  # DT must be datetime64 for the interval lookup
values['new'] = timeseries.index[s.get_indexer(values['DT'])]
print (values)
                   DT  Value  new
0 2018-01-01 00:00:00      1    0
1 2018-01-01 01:00:00      2    0
2 2018-01-01 02:00:00      0    0
3 2018-01-02 00:00:00     -1    1
4 2018-01-02 01:00:00      3    1
5 2018-01-02 02:00:00     10    1
6 2018-01-03 00:00:00      7    2
7 2018-01-03 01:00:00     11    2
8 2018-01-03 02:00:00      2    2

df = timeseries.join(values.groupby('new')['Value'].min().rename('Min'))
print (df)
    Start DT              End DT  Min
0 2018-01-01 2018-01-01 03:00:00    0
1 2018-01-02 2018-01-02 03:00:00   -1
2 2018-01-03 2018-01-03 03:00:00    2
EDIT: If a datetime matches no interval, get_indexer returns the missing-value marker -1, so the last index value (here 2) gets selected:

timeseries = pd.DataFrame(
    [
        ['2018-01-01T00:00:00.000000000', '2018-01-01T03:00:00.000000000'],
        ['2018-01-02T00:00:00.000000000', '2018-01-02T03:00:00.000000000'],
        ['2018-01-03T00:00:00.000000000', '2018-01-03T03:00:00.000000000'],
    ], dtype='datetime64[ns]', columns=['Start DT', 'End DT'])

values = pd.DataFrame(
    [   ['2017-12-31T00:00:00.000000000', -10],
        ['2018-01-01T00:00:00.000000000', 1],
        ['2018-01-01T01:00:00.000000000', 2],
        ['2018-01-01T02:00:00.000000000', 0],
        ['2018-01-02T00:00:00.000000000', -1],
        ['2018-01-02T01:00:00.000000000', 3],
        ['2018-01-02T02:00:00.000000000', 10],
        ['2018-01-03T00:00:00.000000000', 7],
        ['2018-01-03T01:00:00.000000000', 11],
        ['2018-01-03T02:00:00.000000000', 2],
    ], columns=['DT', 'Value']) 

values['DT'] = pd.to_datetime(values['DT'])
print (values)
                   DT  Value
0 2017-12-31 00:00:00    -10
1 2018-01-01 00:00:00      1
2 2018-01-01 01:00:00      2
3 2018-01-01 02:00:00      0
4 2018-01-02 00:00:00     -1
5 2018-01-02 01:00:00      3
6 2018-01-02 02:00:00     10
7 2018-01-03 00:00:00      7
8 2018-01-03 01:00:00     11
9 2018-01-03 02:00:00      2
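
One way to guard against the -1 positions described above is to drop the unmatched rows before indexing. The following is only a minimal sketch, assuming rows that fall outside every interval should simply be ignored; the names pos and matched are illustrative and not part of the original answer.

```python
import pandas as pd

timeseries = pd.DataFrame(
    [
        ['2018-01-01T00:00:00', '2018-01-01T03:00:00'],
        ['2018-01-02T00:00:00', '2018-01-02T03:00:00'],
        ['2018-01-03T00:00:00', '2018-01-03T03:00:00'],
    ], columns=['Start DT', 'End DT']).apply(pd.to_datetime)

values = pd.DataFrame(
    [
        ['2017-12-31T00:00:00', -10],  # lies outside every interval
        ['2018-01-01T00:00:00', 1],
        ['2018-01-01T01:00:00', 2],
        ['2018-01-01T02:00:00', 0],
        ['2018-01-02T00:00:00', -1],
        ['2018-01-02T01:00:00', 3],
        ['2018-01-02T02:00:00', 10],
        ['2018-01-03T00:00:00', 7],
        ['2018-01-03T01:00:00', 11],
        ['2018-01-03T02:00:00', 2],
    ], columns=['DT', 'Value'])
values['DT'] = pd.to_datetime(values['DT'])

s = pd.IntervalIndex.from_arrays(timeseries['Start DT'],
                                 timeseries['End DT'],
                                 closed='both')

# get_indexer returns -1 for timestamps that no interval contains
pos = s.get_indexer(values['DT'])

# keep only matched rows, so -1 is never used as a positional index
matched = values[pos >= 0].assign(new=timeseries.index[pos[pos >= 0]])

df = timeseries.join(matched.groupby('new')['Value'].min().rename('Min'))
print(df)
```

With this filter the 2017-12-31 row no longer leaks into the last interval. If unmatched rows should be reported rather than dropped, values[pos < 0] selects them.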


One possible solution is to create a variable (key) on which to join the two datasets:

# create 'key' variable
timeseries['key'] = timeseries['Start DT'].astype(str)
values['key'] = pd.to_datetime(values['DT'].str.replace('T', ' '), format='%Y-%m-%d %H:%M:%S.%f').dt.date.astype(str)

# create dataset with minima
mins = values.groupby('key').agg({'Value': 'min'}).reset_index()

# join
timeseries.merge(mins, on='key').drop(columns=['key'])

    Start DT              End DT  Value
0 2018-01-01 2018-01-01 03:00:00      0
1 2018-01-02 2018-01-02 03:00:00     -1
2 2018-01-03 2018-01-03 03:00:00      2
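
The key-based join above works because every interval here starts at midnight and stays within one calendar day. For intervals that do not line up with calendar days, one alternative is a cross join followed by a range filter. This is a hedged sketch rather than part of the original answer: it assumes pandas 1.2 or later (for how='cross') and data small enough that the full cross product fits in memory; pairs, inside and result are illustrative names.

```python
import pandas as pd

timeseries = pd.DataFrame(
    [
        ['2018-01-01T00:00:00', '2018-01-01T03:00:00'],
        ['2018-01-02T00:00:00', '2018-01-02T03:00:00'],
        ['2018-01-03T00:00:00', '2018-01-03T03:00:00'],
    ], columns=['Start DT', 'End DT']).apply(pd.to_datetime)

values = pd.DataFrame(
    [
        ['2017-12-31T00:00:00', -10],  # lies outside every interval
        ['2018-01-01T00:00:00', 1],
        ['2018-01-01T02:00:00', 0],
        ['2018-01-02T00:00:00', -1],
        ['2018-01-02T02:00:00', 10],
        ['2018-01-03T00:00:00', 7],
        ['2018-01-03T02:00:00', 2],
    ], columns=['DT', 'Value'])
values['DT'] = pd.to_datetime(values['DT'])

# pair every interval with every observation
pairs = timeseries.merge(values, how='cross')

# keep only observations inside their interval (both ends inclusive)
inside = pairs[pairs['DT'].between(pairs['Start DT'], pairs['End DT'])]

# minimum per interval
mins = (inside.groupby(['Start DT', 'End DT'], as_index=False)['Value']
              .min()
              .rename(columns={'Value': 'Min'}))

# left merge keeps intervals with no observations (Min would be NaN)
result = timeseries.merge(mins, on=['Start DT', 'End DT'], how='left')
print(result)
```

The cross join costs rows(timeseries) × rows(values) in memory, so for large inputs the IntervalIndex lookup is likely the better fit.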

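This is a nice approach, but it is not quite accurate. If values contains dates that are not covered by timeseries, they are indexed as -1, and timeseries.index[s.get_indexer(values['DT'])] simply moves them to the last value... P.S. Sorry, I accidentally hit save earlier.
@aero - can you be more specific?
Comments are limited here, so I can't paste the whole code. Just add ['2017-12-31T00:00:00.000000000', -10] as the first row of the values array and the result will be wrong. It will show
2018-01-03 2018-01-03 03:00:00 -10
Works.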
values['DT'] = values['DT'].astype(str)  # convert to string
s = values['DT'].str.split(' ')  # split on space
values['day'] = s.str[0]  # take the day part
df4 = values.groupby(by='day').min()  # group by day and take the min value
df4.reset_index(inplace=True)  # reset index
df4['day'] = pd.to_datetime(df4['day'])  # convert back to datetime for merging
final = pd.merge(timeseries, df4, left_on='Start DT', right_on='day', how='inner')  # merge
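
The same day-based grouping can be written without the string round-trip by truncating the timestamps with dt.normalize(). This is only a sketch under the same assumption as the code above, namely that every 'Start DT' falls exactly on midnight and each interval stays within that day; daily_min is an illustrative name.

```python
import pandas as pd

timeseries = pd.DataFrame(
    [
        ['2018-01-01T00:00:00', '2018-01-01T03:00:00'],
        ['2018-01-02T00:00:00', '2018-01-02T03:00:00'],
        ['2018-01-03T00:00:00', '2018-01-03T03:00:00'],
    ], columns=['Start DT', 'End DT']).apply(pd.to_datetime)

values = pd.DataFrame(
    [
        ['2017-12-31T00:00:00', -10],  # no matching day in timeseries
        ['2018-01-01T00:00:00', 1],
        ['2018-01-01T02:00:00', 0],
        ['2018-01-02T00:00:00', -1],
        ['2018-01-02T02:00:00', 10],
        ['2018-01-03T00:00:00', 7],
        ['2018-01-03T02:00:00', 2],
    ], columns=['DT', 'Value'])
values['DT'] = pd.to_datetime(values['DT'])

# truncate each timestamp to midnight of its day (stays datetime64)
values['day'] = values['DT'].dt.normalize()

# minimum value per calendar day
daily_min = values.groupby('day', as_index=False)['Value'].min()

# the inner merge drops days that have no interval in timeseries
final = (timeseries.merge(daily_min, left_on='Start DT', right_on='day', how='inner')
                   .drop(columns='day'))
print(final)
```

Note that this version ignores 'End DT' entirely, so a value recorded later the same day but after the interval ends would still be counted.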