Python: how to group by an interval index when there is no exact match

I have two dataframes. They look like this:

df_a
     Framecount                                        probability
0           0.0  [0.00019486549333333332, 4.883635666666667e-06...
1           1.0  [0.00104359155, 3.9232405e-05, 0.0015722045000...
2           2.0  [0.00048501002666666667, 1.668179e-05, 0.00052...
3           3.0  [4.994969500000001e-05, 4.0931635e-07, 0.00011...
4           4.0  [0.0004808829, 5.389742e-05, 0.002522127933333...
..          ...                                                ...
906       906.0  [1.677140566666667e-05, 1.1745095666666665e-06...
907       907.0  [1.5164155000000002e-05, 7.66629575e-07, 0.000...
908       908.0  [8.1334184e-05, 0.00012675669636333335, 0.0028...
909       909.0  [0.00014893802999999998, 1.0407592500000001e-0...
910       910.0  [4.178489e-05, 2.17477925e-06, 0.02094931, 0.0...

df_b
    start    stop                                              probs
0   12.12   12.47  [61, 83, 62, 72, 25, 32, 82, 35, 43, 10, 30, 5...
1   13.44   20.82  [49.285714285714285, 57.142857142857146, 51.42...
2   20.88   29.63  [42.666666666666664, 42.55555555555556, 46.0, ...
3   31.61   33.33  [87.5, 49.0, 46.5, 54.5, 75.0, 47.0, 24.0, 40....
4   33.44   42.21  [48.55555555555556, 66.22222222222223, 45.7777...
5  880.44  887.92  [51.857142857142854, 50.57142857142857, 63.714...
6  888.63  892.07  [45.25, 23.5, 67.25, 68.0, 38.25, 47.25, 50.25...
7  892.13  895.30  [61.333333333333336, 44.0, 43.333333333333336,...
8  895.31  900.99  [68.2, 44.6, 50.8, 35.2, 53.2, 40.4, 34.8, 77....
9  907.58  908.35  [17.0, 78.0, 24.0, 33.0, 88.0, 3.0, 43.0, 2.0,...
When df_a.Framecount falls between df_b.start and df_b.stop, I want to merge df_a.probability into df_b. The aggregate statistic for df_a.probability should be the mean, but I get errors because each df_a.probability value is an np.array.
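
For illustration, one way to take a mean over a column whose cells are arrays is to stack the cells into a 2-D array and average along the first axis. This is only a sketch on made-up values; the `toy` frame and its numbers are assumptions, not the data from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_a (values are invented for the example).
toy = pd.DataFrame({
    "Framecount": [0.0, 1.0, 2.0],
    "probability": [
        np.array([1.0, 1.0, 1.0]),
        np.array([2.0, 0.0, 1.0]),
        np.array([3.0, 2.0, 4.0]),
    ],
})

# Stack the per-row arrays into shape (n_rows, n_values),
# then average over rows to get one mean per array position.
stacked = np.stack(toy["probability"].tolist())
elementwise_mean = stacked.mean(axis=0)

print(elementwise_mean)
```

This assumes every array in the column has the same length; `np.stack` raises an error otherwise.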

I am using the code provided by @TrentonMcKinney:

import pandas as pd
import numpy as np

# setup df with start and stop ranges
data = {'start': [12.12, 13.44, 20.88, 31.61, 33.44, 880.44, 888.63, 892.13, 895.31, 907.58], 'stop': [12.47, 20.82, 29.63, 33.33, 42.21, 887.92, 892.07, 895.3, 900.99, 908.35]}
df = pd.DataFrame(data)

# setup sample df_a with Framecount as fc, and probability as prob
np.random.seed(365)
df_a = pd.DataFrame({'fc': range(911), 'prob': np.random.randint(1, 100, (911, 14)).tolist()})

# this will convert the column to np.arrays instead of lists; the remainder of the code works regardless
# df_a.prob = df_a.prob.map(np.array)

# create an IntervalIndex from df start and stop
idx = pd.IntervalIndex.from_arrays(df.start, df.stop, closed='both')
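
To make the edge case concrete, here is a small sketch of how an IntervalIndex assigns frame counts to intervals. It uses only the first three start/stop pairs from the data above; the names `idx_demo`, `frame_counts`, and `positions` are made up for the example.

```python
import numpy as np
import pandas as pd

# First three start/stop pairs from the question.
starts = [12.12, 13.44, 20.88]
stops = [12.47, 20.82, 29.63]
idx_demo = pd.IntervalIndex.from_arrays(starts, stops, closed="both")

# Whole-number frame counts stored as floats, like df_a.Framecount.
frame_counts = np.array([12.0, 13.0, 14.0, 20.0, 21.0])

# get_indexer returns the position of the interval that contains each value,
# or -1 when no interval contains it.
positions = idx_demo.get_indexer(frame_counts)

print(positions.tolist())
```

No frame count maps to position 0, because no whole number lies between 12.12 and 12.47, so a plain group-by on these positions has nothing to aggregate for that row.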

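This works very well, except when the start and stop times fall within the same second, as in the first row of df_b, where start and stop are 12.12 and 12.47. When that happens, I just want to return the df_a.probability value for the closest Framecount value. In this example, the first df_b start/stop interval is 12.12 to 12.47, and because both endpoints are in the same second, no df_a.Framecount value falls in that range. So I would like to return the df_a.probability array where df_a.Framecount == 12. How can I do this?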

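This may be a slightly longer snippet than I expected, but it does what you need. There may be simpler options that I have not considered. I used the code you provided to reproduce the problem.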

df_a.prob = df_a.prob.map(np.array)
idx = pd.IntervalIndex.from_arrays(df.start, df.stop, closed='both')
probs = []
for row, interval in enumerate(idx):
    # Boolean Series: True where the frame count falls inside this interval.
    series_bool = df_a.fc.apply(lambda a: a in interval)
    if series_bool.any():
        # At least one frame is in the interval. zip(*arrays) groups the values
        # position by position, so the mean is taken element-wise across arrays.
        probs.append([np.mean(k) for k in zip(*df_a.loc[series_bool, 'prob'])])
    else:
        # No frame falls in the interval: round the start value to the nearest
        # whole frame and use that frame's probability array unchanged.
        dfa_idx = int(round(df.loc[row, "start"]))
        probs.append(df_a.loc[dfa_idx, "prob"])
Now we can add the probs list to df_b as a new column:

df['probability'] = probs

Using the code you provided, df_b ends up looking like this:

    start    stop                                              probs
0   12.12   12.47  [61, 83, 62, 72, 25, 32, 82, 35, 43, 10, 30, 5...
1   13.44   20.82  [49.285714285714285, 57.142857142857146, 51.42...
2   20.88   29.63  [42.666666666666664, 42.55555555555556, 46.0, ...
3   31.61   33.33  [87.5, 49.0, 46.5, 54.5, 75.0, 47.0, 24.0, 40....
4   33.44   42.21  [48.55555555555556, 66.22222222222223, 45.7777...
5  880.44  887.92  [51.857142857142854, 50.57142857142857, 63.714...
6  888.63  892.07  [45.25, 23.5, 67.25, 68.0, 38.25, 47.25, 50.25...
7  892.13  895.30  [61.333333333333336, 44.0, 43.333333333333336,...
8  895.31  900.99  [68.2, 44.6, 50.8, 35.2, 53.2, 40.4, 34.8, 77....
9  907.58  908.35  [17.0, 78.0, 24.0, 33.0, 88.0, 3.0, 43.0, 2.0,...
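
The else branch in the loop rounds the interval's start to pick a frame. As a possible variation, the sketch below picks the frame closest to the interval midpoint instead, which can differ when the start is just below a half (for example 12.4 to 12.9). `nearest_frame` is a hypothetical helper, and `frames` is a toy stand-in for df_a.fc, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_a.fc: frame counts 0..20 (assumed for the example).
frames = pd.Series(np.arange(21, dtype=float), name="fc")


def nearest_frame(frame_counts: pd.Series, start: float, stop: float) -> int:
    """Return the index label of the frame count closest to the interval midpoint."""
    midpoint = (start + stop) / 2
    return int((frame_counts - midpoint).abs().idxmin())


# Interval from the first row of df_b: both approaches agree on frame 12.
first_row = nearest_frame(frames, 12.12, 12.47)

# A made-up interval where rounding the start gives 12 but the
# midpoint (12.65) is closer to frame 13.
skewed = nearest_frame(frames, 12.4, 12.9)

print(first_row, skewed)
```

The returned label can then be used as in the loop, for example `df_a.loc[label, "prob"]`, assuming the index labels match the frame counts as they do in the sample data.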

How about taking the floor and the ceiling of the endpoints?
idx = pd.IntervalIndex.from_arrays(np.floor(df.start), np.ceil(df.stop), closed='both')
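
One thing worth checking before adopting the floor/ceil idea: widening every interval can make neighbouring intervals overlap, so a frame may belong to two rows, and it also changes which frames are averaged for rows that already had matches. A small sketch using the first three start/stop pairs from the question (the names `exact` and `widened` are made up):

```python
import numpy as np
import pandas as pd

starts = pd.Series([12.12, 13.44, 20.88])
stops = pd.Series([12.47, 20.82, 29.63])

exact = pd.IntervalIndex.from_arrays(starts, stops, closed="both")
widened = pd.IntervalIndex.from_arrays(
    np.floor(starts), np.ceil(stops), closed="both"
)

# The original intervals are disjoint; the widened ones are not.
print(exact.is_overlapping)
print(widened.is_overlapping)

# Frame 13 now falls in both [12, 13] and [13, 21].
frame_13_hits = widened.contains(13.0)
print(frame_13_hits.tolist())
```

The loop in the answer tests each interval separately, so it still runs with overlapping intervals, but lookups that need a single match per value, such as `IntervalIndex.get_indexer`, do not accept an overlapping index.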
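Thanks @Elif! Just confirming before I accept: the mean aggregation under the if statement takes the mean of the elements that share the same array index, right? For example, it computes the mean of [1,1,1] and [2,0,1] as [1.5,0.5,1], right?
Yes, that is exactly what the zip approach does.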
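
A quick check of that example, as a sketch: `zip(*rows)` transposes the arrays so that each tuple holds the values at one position, and the mean of each tuple is the element-wise mean. The variable names are made up for the example.

```python
import numpy as np

rows = [np.array([1, 1, 1]), np.array([2, 0, 1])]

# zip(*rows) yields one tuple per array position: (1, 2), (1, 0), (1, 1).
columns = list(zip(*rows))
zip_mean = [float(np.mean(col)) for col in columns]

# Same result by stacking the arrays and averaging along the first axis.
stack_mean = np.mean(np.stack(rows), axis=0)

print(zip_mean)
print(stack_mean.tolist())
```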