Python: how to group by an IntervalIndex when there is no exact match
Tags: python, pandas, group-by, merge
I have two dataframes. They look like this:
df_a
Framecount probability
0 0.0 [0.00019486549333333332, 4.883635666666667e-06...
1 1.0 [0.00104359155, 3.9232405e-05, 0.0015722045000...
2 2.0 [0.00048501002666666667, 1.668179e-05, 0.00052...
3 3.0 [4.994969500000001e-05, 4.0931635e-07, 0.00011...
4 4.0 [0.0004808829, 5.389742e-05, 0.002522127933333...
.. ... ...
906 906.0 [1.677140566666667e-05, 1.1745095666666665e-06...
907 907.0 [1.5164155000000002e-05, 7.66629575e-07, 0.000...
908 908.0 [8.1334184e-05, 0.00012675669636333335, 0.0028...
909 909.0 [0.00014893802999999998, 1.0407592500000001e-0...
910 910.0 [4.178489e-05, 2.17477925e-06, 0.02094931, 0.0...
df_b
start stop probs
0 12.12 12.47 [61, 83, 62, 72, 25, 32, 82, 35, 43, 10, 30, 5...
1 13.44 20.82 [49.285714285714285, 57.142857142857146, 51.42...
2 20.88 29.63 [42.666666666666664, 42.55555555555556, 46.0, ...
3 31.61 33.33 [87.5, 49.0, 46.5, 54.5, 75.0, 47.0, 24.0, 40....
4 33.44 42.21 [48.55555555555556, 66.22222222222223, 45.7777...
5 880.44 887.92 [51.857142857142854, 50.57142857142857, 63.714...
6 888.63 892.07 [45.25, 23.5, 67.25, 68.0, 38.25, 47.25, 50.25...
7 892.13 895.30 [61.333333333333336, 44.0, 43.333333333333336,...
8 895.31 900.99 [68.2, 44.6, 50.8, 35.2, 53.2, 40.4, 34.8, 77....
9 907.58 908.35 [17.0, 78.0, 24.0, 33.0, 88.0, 3.0, 43.0, 2.0,...
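When df_a.Framecount falls between df_b.start and df_b.stop, I want to merge df_a.probability into df_b. The aggregate statistic for df_a.probability should be the mean, but I run into errors because each df_a.probability entry is a NumPy array.
I am using the code provided by @TrentonMcKinney: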
import pandas as pd
import numpy as np
# setup df with start and stop ranges
data = {'start': [12.12, 13.44, 20.88, 31.61, 33.44, 880.44, 888.63, 892.13, 895.31, 907.58], 'stop': [12.47, 20.82, 29.63, 33.33, 42.21, 887.92, 892.07, 895.3, 900.99, 908.35]}
df = pd.DataFrame(data)
# setup sample df_a with Framecount as fc, and probability as prob
np.random.seed(365)
df_a = pd.DataFrame({'fc': range(911), 'prob': np.random.randint(1, 100, (911, 14)).tolist()})
# this will convert the column to np.arrays instead of lists; the remainder of the code works regardless
# df_a.prob = df_a.prob.map(np.array)
# create an IntervalIndex from df start and stop
idx = pd.IntervalIndex.from_arrays(df.start, df.stop, closed='both')
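As a side check that is not part of the quoted code, the sketch below shows one way to find which start/stop rows contain no integer Framecount at all. It assumes frames numbered 0 to 910 and relies on IntervalIndex.get_indexer, which returns -1 for values that fall outside every interval; the names frames, positions and empty_rows are illustrative only.

```python
import numpy as np
import pandas as pd

# Same start/stop values as in the setup code above
df_b = pd.DataFrame({
    'start': [12.12, 13.44, 20.88, 31.61, 33.44, 880.44, 888.63, 892.13, 895.31, 907.58],
    'stop': [12.47, 20.82, 29.63, 33.33, 42.21, 887.92, 892.07, 895.3, 900.99, 908.35],
})
idx = pd.IntervalIndex.from_arrays(df_b['start'], df_b['stop'], closed='both')

# Assumed frame numbers 0..910, as floats to match the interval dtype
frames = np.arange(911, dtype=float)

# Position of the interval containing each frame, or -1 if there is none
positions = idx.get_indexer(frames)

# Interval positions that never appear are intervals with no frame inside
empty_rows = np.setdiff1d(np.arange(len(idx)), positions).tolist()
print(empty_rows)
```

With these values only the first row (12.12 to 12.47) should come out empty, which is the case described in the question.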
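This works very well, except when the start and stop times fall within the same second, as in the first row of df_b, where start and stop are 12.12 and 12.47. When that happens, I just want to return the df_a.probability value with the closest Framecount value. In this example the first df_b start/stop interval is 12.12-12.47, and because both ends lie in the same second, no df_a.Framecount value falls inside it. So I would like to return the df_a.probability array where df_a.Framecount == 12. How can I do this?

This is probably a somewhat longer snippet than you expected, but it does what you need. There may be simpler options that I have not considered.
I used the code you provided to reproduce the problem: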
df_a.prob = df_a.prob.map(np.array)
idx = pd.IntervalIndex.from_arrays(df.start, df.stop, closed='both')
probs = []
for row, i in enumerate(idx):
    # for each interval, build a boolean Series marking which fc values fall inside it
    series_bool = df_a.fc.apply(lambda a: a in i)
    if any(series_bool):
        # at least one fc lies in the interval: zip(*) groups the array elements by position,
        # so the mean is taken element-wise across the np.array objects
        probs.append([np.mean(k) for k in zip(*df_a.iloc[series_bool[series_bool].index].prob)])
    else:
        # no fc lies in the interval: round the start value and use that row's prob array
        dfa_idx = int(round(df.loc[row, "start"]))
        probs.append(df_a.loc[dfa_idx, "prob"])
Now we can attach the probs list to df_b:
df['probability']=probs
With the code you provided, df_b ends up looking like this:
start stop probs
0 12.12 12.47 [61, 83, 62, 72, 25, 32, 82, 35, 43, 10, 30, 5...
1 13.44 20.82 [49.285714285714285, 57.142857142857146, 51.42...
2 20.88 29.63 [42.666666666666664, 42.55555555555556, 46.0, ...
3 31.61 33.33 [87.5, 49.0, 46.5, 54.5, 75.0, 47.0, 24.0, 40....
4 33.44 42.21 [48.55555555555556, 66.22222222222223, 45.7777...
5 880.44 887.92 [51.857142857142854, 50.57142857142857, 63.714...
6 888.63 892.07 [45.25, 23.5, 67.25, 68.0, 38.25, 47.25, 50.25...
7 892.13 895.30 [61.333333333333336, 44.0, 43.333333333333336,...
8 895.31 900.99 [68.2, 44.6, 50.8, 35.2, 53.2, 40.4, 34.8, 77....
9 907.58 908.35 [17.0, 78.0, 24.0, 33.0, 88.0, 3.0, 43.0, 2.0,...
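For comparison, here is a minimal sketch of the same idea using boolean masks on a stacked 2-D array instead of apply and zip. It is not the code from the answer: the toy df_a (each array is a multiple of its frame number), the three sample intervals, and the choice of the frame nearest to the interval midpoint as the fallback are all assumptions made for illustration. The answer itself rounds the start value instead.

```python
import numpy as np
import pandas as pd

# Toy data (assumed): the array for frame f is [f, 2f, 3f]
df_a = pd.DataFrame({
    'fc': np.arange(911, dtype=float),
    'prob': [np.array([f, 2 * f, 3 * f], dtype=float) for f in range(911)],
})
df_b = pd.DataFrame({'start': [12.12, 13.44, 907.58],
                     'stop': [12.47, 20.82, 908.35]})

idx = pd.IntervalIndex.from_arrays(df_b['start'], df_b['stop'], closed='both')

frames = df_a['fc'].to_numpy()
# Stack the per-frame arrays into one (n_frames, n_values) matrix
prob_matrix = np.vstack(df_a['prob'].to_numpy())

result = []
for interval in idx:
    mask = (frames >= interval.left) & (frames <= interval.right)
    if mask.any():
        # Element-wise mean over all frames inside the interval
        result.append(prob_matrix[mask].mean(axis=0))
    else:
        # Fallback (assumption): frame closest to the interval midpoint
        nearest = np.abs(frames - interval.mid).argmin()
        result.append(prob_matrix[nearest])

df_b['probability'] = pd.Series(result, index=df_b.index)
print(df_b)
```

Stacking the arrays once lets NumPy compute each mean along axis 0, which avoids calling a Python lambda for every frame and every interval.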
How about taking the floor and ceiling of the endpoints? idx = pd.IntervalIndex.from_arrays(np.floor(df.start), np.ceil(df.stop), closed='both')
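A quick sketch of what the floor/ceil suggestion does to the first three intervals, assuming the same start and stop values as above. The variable names wide and members are illustrative. One thing worth checking before adopting it is whether the widened intervals begin to overlap, because a frame could then contribute to more than one row's mean.

```python
import numpy as np
import pandas as pd

start = pd.Series([12.12, 13.44, 20.88])
stop = pd.Series([12.47, 20.82, 29.63])

# Widen every interval to whole-number endpoints
wide = pd.IntervalIndex.from_arrays(np.floor(start), np.ceil(stop), closed='both')
print(wide)

# True when at least two intervals share a point (including shared closed endpoints)
print(wide.is_overlapping)

# Frames that fall inside each widened interval
frames = np.arange(31, dtype=float)
members = [frames[(frames >= iv.left) & (frames <= iv.right)].tolist() for iv in wide]
print(members[0])
```

Here the first interval becomes [12, 13] and picks up frames 12 and 13, but frame 13 also belongs to the second interval, and frames 20 and 21 belong to both the second and third.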
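Thanks @Elif! Just confirming before I accept: the mean aggregation under the if statement averages the elements that sit at the same array index, right? For example, it would compute the mean of [1,1,1] and [2,0,1] as [1.5, 0.5, 1], right?
Yes, that is exactly what the zip approach does.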
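The element-wise behaviour discussed in these comments can be checked with a tiny example. The arrays a and b are the ones from the comment; np.mean with axis=0 is shown next to the zip version as an equivalent alternative, not as something the answer uses.

```python
import numpy as np

a = np.array([1, 1, 1])
b = np.array([2, 0, 1])

# zip(*...) pairs up the elements at the same position: (1, 2), (1, 0), (1, 1)
zip_mean = [np.mean(k) for k in zip(*[a, b])]

# Same result by averaging along axis 0 of the stacked arrays
axis_mean = np.mean([a, b], axis=0)

print(zip_mean)
print(axis_mean)
```

Both give 1.5, 0.5 and 1.0 for the three positions.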