Python: How to group by an interval index when there is no exact match


python pandas merge


I have two DataFrames, df_a and df_b. They look like this:

df_a
     Framecount                                        probability
0           0.0  [0.00019486549333333332, 4.883635666666667e-06...
1           1.0  [0.00104359155, 3.9232405e-05, 0.0015722045000...
2           2.0  [0.00048501002666666667, 1.668179e-05, 0.00052...
3           3.0  [4.994969500000001e-05, 4.0931635e-07, 0.00011...
4           4.0  [0.0004808829, 5.389742e-05, 0.002522127933333...
..          ...                                                ...
906       906.0  [1.677140566666667e-05, 1.1745095666666665e-06...
907       907.0  [1.5164155000000002e-05, 7.66629575e-07, 0.000...
908       908.0  [8.1334184e-05, 0.00012675669636333335, 0.0028...
909       909.0  [0.00014893802999999998, 1.0407592500000001e-0...
910       910.0  [4.178489e-05, 2.17477925e-06, 0.02094931, 0.0...

df_b
    start    stop                                              probs
0   12.12   12.47  [61, 83, 62, 72, 25, 32, 82, 35, 43, 10, 30, 5...
1   13.44   20.82  [49.285714285714285, 57.142857142857146, 51.42...
2   20.88   29.63  [42.666666666666664, 42.55555555555556, 46.0, ...
3   31.61   33.33  [87.5, 49.0, 46.5, 54.5, 75.0, 47.0, 24.0, 40....
4   33.44   42.21  [48.55555555555556, 66.22222222222223, 45.7777...
5  880.44  887.92  [51.857142857142854, 50.57142857142857, 63.714...
6  888.63  892.07  [45.25, 23.5, 67.25, 68.0, 38.25, 47.25, 50.25...
7  892.13  895.30  [61.333333333333336, 44.0, 43.333333333333336,...
8  895.31  900.99  [68.2, 44.6, 50.8, 35.2, 53.2, 40.4, 34.8, 77....
9  907.58  908.35  [17.0, 78.0, 24.0, 33.0, 88.0, 3.0, 43.0, 2.0,...

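When df_a.Framecount falls between df_b.start and df_b.stop, I want to merge df_a.probability into df_b. The aggregate statistic for df_a.probability should be the mean, but I ran into errors because each df_a.probability value is an np.array.
I am using the code provided by @TrentonMcKinney: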
import pandas as pd
import numpy as np

# setup df with start and stop ranges
data = {'start': [12.12, 13.44, 20.88, 31.61, 33.44, 880.44, 888.63, 892.13, 895.31, 907.58], 'stop': [12.47, 20.82, 29.63, 33.33, 42.21, 887.92, 892.07, 895.3, 900.99, 908.35]}
df = pd.DataFrame(data)

# setup sample df_a with Framecount as fc, and probability as prob
np.random.seed(365)
df_a = pd.DataFrame({'fc': range(911), 'prob': np.random.randint(1, 100, (911, 14)).tolist()})

# this will convert the column to np.arrays instead of lists; the remainder of the code works regardless
# df_a.prob = df_a.prob.map(np.array)

# create an IntervalIndex from df start and stop
idx = pd.IntervalIndex.from_arrays(df.start, df.stop, closed='both')
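
As a quick check of how these closed intervals treat whole-number frames, the sketch below looks up which interval each frame falls into. It is illustrative only and not part of the original code: it uses the first three start/stop pairs from above, a hypothetical `frames` array, and `IntervalIndex.get_indexer`, which returns -1 for values that no interval contains (this call requires non-overlapping intervals).

```python
import numpy as np
import pandas as pd

# First three start/stop pairs from the data above
start = [12.12, 13.44, 20.88]
stop = [12.47, 20.82, 29.63]
idx = pd.IntervalIndex.from_arrays(start, stop, closed='both')

# Hypothetical Framecount values 12.0 .. 22.0 (floats, like df_a.Framecount)
frames = np.arange(12, 23, dtype=float)

# Position of the interval containing each frame; -1 means no interval matches
pos = idx.get_indexer(frames)
print(pos.tolist())  # [-1, -1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
```

Frames 12 and 13 match nothing, so the interval 12.12 to 12.47 ends up with no frames to average.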

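The code from @TrentonMcKinney works very well, except when the start and stop times fall within the same second, as in the first row of df_b, where start and stop are 12.12 and 12.47. When that happens, I just want to return the df_a.probability value with the closest Framecount value. In this example, the first df_b start/stop interval is 12.12-12.47, and because both endpoints lie within the same second, no df_a.Framecount value falls in this range. So I want to return the df_a.probability array where df_a.Framecount == 12. How can I do this?

This may be a slightly longer snippet than you expected, but it does what you need. There may be simpler options that I have not considered.
I used the code you provided to reproduce the problem: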
df_a.prob = df_a.prob.map(np.array)
idx = pd.IntervalIndex.from_arrays(df.start, df.stop, closed='both')
probs = []
for row, interval in enumerate(idx):
    # Boolean Series: True where fc lies inside the current interval.
    in_interval = df_a.fc.apply(lambda fc: fc in interval)
    if in_interval.any():
        # zip(*arrays) groups the values at each array position, so the mean is
        # taken element-wise across the matching np.array objects.
        probs.append([np.mean(k) for k in zip(*df_a.loc[in_interval, 'prob'])])
    else:
        # No fc falls inside the interval: round start to the nearest integer
        # and take that row's prob (this relies on df_a's index matching fc).
        nearest_fc = int(round(df.loc[row, 'start']))
        probs.append(df_a.loc[nearest_fc, 'prob'])

Now we can add the probs list to df_b:
df['probability']=probs
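
For larger frames, one possible variation is to look up the interval positions once instead of testing every row inside the loop. The sketch below is an alternative under stated assumptions, not the answer's method: the stand-in data is hypothetical, the intervals must not overlap (a requirement of `get_indexer`), and the fallback picks the row whose fc is closest to start instead of rounding, so it does not depend on df_a's index matching fc.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data using the same column names as above
df = pd.DataFrame({'start': [0.2, 1.5], 'stop': [0.7, 3.2]})
df_a = pd.DataFrame({'fc': [0.0, 1.0, 2.0, 3.0, 4.0],
                     'prob': [[1.0, 1.0], [2.0, 0.0], [4.0, 2.0],
                              [6.0, 4.0], [8.0, 8.0]]})

idx = pd.IntervalIndex.from_arrays(df.start, df.stop, closed='both')
# Position of the interval containing each frame, or -1 if none does
pos = idx.get_indexer(df_a.fc.to_numpy())

probs = []
for i, start in enumerate(df.start):
    hits = df_a.loc[pos == i, 'prob']
    if len(hits):
        # Stack the matching rows and average them position by position
        probs.append(np.mean(np.stack(hits.to_list()), axis=0))
    else:
        # Fallback: the row whose fc is closest to the interval start
        nearest = (df_a.fc - start).abs().idxmin()
        probs.append(np.asarray(df_a.loc[nearest, 'prob']))

print(probs)
```

Here the first interval contains no frame and falls back to the row with fc 0, while the second averages the rows with fc 2 and 3.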

With the code you provided, df_b ends up looking like this:
    start    stop                                              probs
0   12.12   12.47  [61, 83, 62, 72, 25, 32, 82, 35, 43, 10, 30, 5...
1   13.44   20.82  [49.285714285714285, 57.142857142857146, 51.42...
2   20.88   29.63  [42.666666666666664, 42.55555555555556, 46.0, ...
3   31.61   33.33  [87.5, 49.0, 46.5, 54.5, 75.0, 47.0, 24.0, 40....
4   33.44   42.21  [48.55555555555556, 66.22222222222223, 45.7777...
5  880.44  887.92  [51.857142857142854, 50.57142857142857, 63.714...
6  888.63  892.07  [45.25, 23.5, 67.25, 68.0, 38.25, 47.25, 50.25...
7  892.13  895.30  [61.333333333333336, 44.0, 43.333333333333336,...
8  895.31  900.99  [68.2, 44.6, 50.8, 35.2, 53.2, 40.4, 34.8, 77....
9  907.58  908.35  [17.0, 78.0, 24.0, 33.0, 88.0, 3.0, 43.0, 2.0,...

How about taking the floor and ceiling of the endpoints? idx = pd.IntervalIndex.from_arrays(np.floor(df.start), np.ceil(df.stop), closed='both')
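
A small sketch of what that floor/ceil suggestion does to the first three intervals may help when weighing it. The variable name `wide` is hypothetical. Note two side effects under this approach: the first row would average frames 12 and 13 instead of returning frame 12 alone, and the widened intervals share endpoints, so some frames would count toward two rows.

```python
import numpy as np
import pandas as pd

# First three start/stop pairs from the data above
start = np.array([12.12, 13.44, 20.88])
stop = np.array([12.47, 20.82, 29.63])

# Widen every interval to whole-number endpoints
wide = pd.IntervalIndex.from_arrays(np.floor(start), np.ceil(stop), closed='both')

print(wide)                              # [12.0, 13.0], [13.0, 21.0], [20.0, 30.0]
print(12.0 in wide[0], 13.0 in wide[0])  # both frames now fall in the first interval
print(wide.is_overlapping)               # True: neighbouring intervals share frames
```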
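Thanks @Elif! Just to confirm before I accept: the mean aggregation under the if statement averages the elements at the same position across the arrays, right? For example, it computes the mean of [1,1,1] and [2,0,1] as [1.5,0.5,1], right?
Yes, the zip method does exactly that.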
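
The element-wise behaviour discussed in these comments can be checked on the two small arrays from the example. The names `via_zip` and `via_stack` are illustrative; the second line shows an equivalent NumPy formulation that stacks the arrays and averages along axis 0.

```python
import numpy as np

a = np.array([1, 1, 1])
b = np.array([2, 0, 1])

# zip pairs up the values at each position: (1, 2), (1, 0), (1, 1)
via_zip = [float(np.mean(k)) for k in zip(a, b)]

# Same result by stacking into a 2 x 3 array and averaging down the rows
via_stack = np.mean(np.stack([a, b]), axis=0)

print(via_zip)    # [1.5, 0.5, 1.0]
print(via_stack)  # [1.5 0.5 1. ]
```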