
Python: how to group by an interval index, aggregate the mean over lists, and join to another dataframe?


I have two dataframes. They look like this:

df_a
     Framecount                                        probability
0           0.0  [0.00019486549333333332, 4.883635666666667e-06...
1           1.0  [0.00104359155, 3.9232405e-05, 0.0015722045000...
2           2.0  [0.00048501002666666667, 1.668179e-05, 0.00052...
3           3.0  [4.994969500000001e-05, 4.0931635e-07, 0.00011...
4           4.0  [0.0004808829, 5.389742e-05, 0.002522127933333...
..          ...                                                ...
906       906.0  [1.677140566666667e-05, 1.1745095666666665e-06...
907       907.0  [1.5164155000000002e-05, 7.66629575e-07, 0.000...
908       908.0  [8.1334184e-05, 0.00012675669636333335, 0.0028...
909       909.0  [0.00014893802999999998, 1.0407592500000001e-0...
910       910.0  [4.178489e-05, 2.17477925e-06, 0.02094931, 0.0...
and df_b, which has start and stop columns.

When df_a.Framecount falls between df_b.start and df_b.stop, I want to merge df_a.probability into df_b. The aggregate statistic for df_a.probability should be the mean, but I am getting errors because df_a.probability has dtype np array.
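Since each probability entry is an equal-length vector, the "mean" being asked for is an element-wise mean across the vectors of a group, not a single scalar. A minimal sketch of that operation with made-up vectors:

```python
import numpy as np

# two hypothetical per-frame probability vectors belonging to one group
group = [np.array([0.1, 0.2]), np.array([0.3, 0.4])]

# element-wise mean across the group: the result is one averaged vector
mean_vec = np.mean(group, axis=0)
print(mean_vec)  # [0.2 0.3]
```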

I am trying to use the following code:

idx = pd.IntervalIndex.from_arrays(df_text['start'], df_text['stop'])
df_text.join(df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount']))['probability'].apply(np.mean), how='left')
Line 1 creates the index and determines the grouping. In line 2 I am trying to perform the group-by and aggregate all the values from df_a.probability that fall into each group's interval. I want one array per group: the element-wise mean of all the arrays with that group index. This code gives the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-271-19c7d58fb664> in <module>
      1 idx = pd.IntervalIndex.from_arrays(df_text['start'], df_text['stop'])
      2 f = lambda x: np.mean(np.array(x.tolist()), axis=0)
----> 3 df_text.join(df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount']))['probability'].apply(np.mean), how='left')

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
   5808             group_keys=group_keys,
   5809             squeeze=squeeze,
-> 5810             observed=observed,
   5811         )
   5812 

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated)
    407                 sort=sort,
    408                 observed=observed,
--> 409                 mutated=self.mutated,
    410             )
    411 

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
    588 
    589         elif is_in_axis(gpr):  # df.groupby('name')
--> 590             if gpr in obj:
    591                 if validate:
    592                     obj._check_label_or_level_ambiguity(gpr, axis=axis)

~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __contains__(self, key)
   1848     def __contains__(self, key) -> bool_t:
   1849         """True if the key is in the info axis"""
-> 1850         return key in self._info_axis
   1851 
   1852     @property

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in __contains__(self, key)
   3898     @Appender(_index_shared_docs["contains"] % _index_doc_kwargs)
   3899     def __contains__(self, key) -> bool:
-> 3900         hash(key)
   3901         try:
   3902             return key in self._engine

TypeError: unhashable type: 'numpy.ndarray'

I get the same error.


How can I do this?

I was hoping someone would come up with a solution that doesn't involve loops, but since everything is already sorted, I don't think performance will actually be that bad (linear in the lengths of the two dataframes, with no memory overhead). I don't know the exact specification of your dataframes, so I'll first create some samples:

import numpy as np
import pandas as pd

n_a = 11
df_a = pd.DataFrame(
    {"Framecount": list(range(n_a)), "probability": np.random.rand(n_a)}
)

n_b = 6
start = np.linspace(0, n_a, n_b)
end = start + n_a / (n_b - 1) - 1e-5
df_b = pd.DataFrame({"start": start, "end": end, "mean": [np.nan] * n_b})

print(df_a)
    Framecount  probability
0            0     0.099412
1            1     0.492661
2            2     0.043000
3            3     0.382923
4            4     0.208177
5            5     0.110007
6            6     0.369756
7            7     0.324723
8            8     0.702838
9            9     0.182167
10          10     0.578837

print(df_b)
   start       end  mean
0    0.0   2.19999  NaN
1    2.2   4.39999  NaN
2    4.4   6.59999  NaN
3    6.6   8.79999  NaN
4    8.8  10.99999  NaN
5   11.0  13.19999  NaN
Now I will loop through the dataframes, aggregate all the values that fall between the current start and end, and assign the result to the corresponding row of df_b:

i = j = 0
while i < n_a and j < n_b:
    # seek to next row of df_b where start <= df_a[i]
    while i < n_a and df_a.loc[i, "Framecount"] < df_b.loc[j, "start"]:
        i += 1

    accum = 0
    count = 0
    while i < n_a and df_a.loc[i, "Framecount"] < df_b.loc[j, "end"]:
        accum += df_a.loc[i, "probability"]
        count += 1
        i += 1

    df_b.loc[j, "mean"] = accum / count if count else np.nan  # guard empty bins against ZeroDivisionError
    j += 1

print(df_b)
   start       end      mean
0    0.0   2.19999  0.211691
1    2.2   4.39999   0.29555
2    4.4   6.59999  0.239882
3    6.6   8.79999  0.513781
4    8.8  10.99999  0.380502
5   11.0  13.19999      NaN
  • The error occurs because idx.get_indexer_non_unique(df_vid['Framecount']) returns a tuple, and you cannot groupby a tuple this way.
    • df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount'])[0]), selecting the first array in the tuple, would work.
  • idx.get_indexer(df_a.fc) produces an array whose values are the index of the interval each fc belongs to. Where no interval matches, the index appears as -1.
  • df_a.groupby(idx.get_indexer(df_a.fc)) groups by that index array.
  • .agg({'prob': list}) aggregates all the lists for each fc group into a single list.
    • The result for each group is a list of lists.
  • .prob.map(np.mean) returns the overall mean of all the lists in a group.
  • .prob.apply(lambda x: [np.mean(v) for v in x]) returns a list of means, one per list.
  • No 'fc' values fall into the 12.12-12.47 bin.
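The hit-versus-miss behaviour of get_indexer described above can be seen with a tiny IntervalIndex; the interval bounds here are illustrative, not the question's data:

```python
import pandas as pd

# two closed intervals: [0, 5] and [10, 15]
idx = pd.IntervalIndex.from_arrays([0, 10], [5, 15], closed='both')

# 3 falls in interval 0, 12 in interval 1, 7 in no interval -> -1
print(idx.get_indexer([3, 12, 7]))  # [ 0  1 -1]
```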
    import pandas as pd
    import numpy as np
    # set up df with the start and stop ranges
    data = {'start': [12.12, 13.44, 20.88, 31.61, 33.44, 880.44, 888.63, 892.13, 895.31, 907.58],
            'stop': [12.47, 20.82, 29.63, 33.33, 42.21, 887.92, 892.07, 895.3, 900.99, 908.35]}
    df = pd.DataFrame(data)
    # set up a sample df_a with Framecount as fc and probability as prob
    np.random.seed(365)
    df_a = pd.DataFrame({'fc': range(911), 'prob': np.random.randint(1, 100, (911, 14)).tolist()})
    # this would convert the column to np.array instead of list; the rest of the code works either way
    # df_a.prob = df_a.prob.map(np.array)
    # create an IntervalIndex from df start and stop
    idx = pd.IntervalIndex.from_arrays(df.start, df.stop, closed='both')
    
    # this creates a list of means over axis=0
    dfg = df_a.groupby(idx.get_indexer(df_a.fc)).agg({'prob': list}).prob.apply(lambda x: np.mean(x, axis=0))
    # join df with dfg
    dfj = df.join(dfg)
    # display(dfj) with the lists of means
         start    stop                                                                               prob
    0    12.12   12.47                                                                                NaN
    1   13.44   20.82  [49.3, 57.1, 51.4, 45.9, 47.1, 45.9, 45.9, 55.3, 32.6, 48.0, 42.0, 45.0, 50.4, 54.4]
    2   20.88   29.63  [42.7, 42.6, 46.0, 45.9, 54.1, 55.9, 50.1, 55.2, 51.7, 54.0, 37.6, 60.9, 49.2, 45.6]
    3   31.61   33.33  [87.5, 49.0, 46.5, 54.5, 75.0, 47.0, 24.0, 40.5, 52.5, 21.0, 51.0, 72.5, 34.5, 50.5]
    4   33.44   42.21  [48.6, 66.2, 45.8, 64.7, 43.1, 69.0, 54.4, 52.1, 52.6, 59.6, 51.1, 42.1, 43.3, 38.0]
    5  880.44  887.92  [51.9, 50.6, 63.7, 47.7, 51.3, 34.9, 51.3, 53.0, 53.4, 65.1, 38.6, 49.4, 48.1, 44.1]
    6  888.63  892.07  [45.2, 23.5, 67.2, 68.0, 38.2, 47.2, 50.2, 75.8, 35.2, 46.8, 55.0, 57.5, 44.2, 78.0]
    7  892.13  895.30  [61.3, 44.0, 43.3, 36.3, 63.7, 89.7, 51.7, 57.0, 50.0, 68.7, 80.7, 46.3, 66.7, 11.3]
    8  895.31  900.99  [68.2, 44.6, 50.8, 35.2, 53.2, 40.4, 34.8, 77.4, 61.0, 35.2, 26.0, 47.8, 30.4, 55.4]
    9  907.58  908.35     [17.0, 78.0, 24.0, 33.0, 88.0, 3.0, 43.0, 2.0, 36.0, 48.0, 8.0, 87.0, 36.0, 34.0]
    
    # this creates one mean for each group
    dfg = df_a.groupby(idx.get_indexer(df_a.fc)).agg({'prob': list}).prob.map(np.mean)
    # join df with dfg
    dfj = df.join(dfg)
    # display(dfj) with the overall means
        start    stop       prob
    0   12.12   12.47        NaN
    1   13.44   20.82  47.877551
    2   20.88   29.63  49.380952
    3   31.61   33.33  50.428571
    4   33.44   42.21  52.182540
    5  880.44  887.92  50.224490
    6  888.63  892.07  52.303571
    7  892.13  895.30  55.047619
    8  895.31  900.99  47.171429
    9  907.58  908.35  38.357143
    
    Thanks for the reply. I'm confused: what are n_a and n_b? @connor449 They're just the lengths of the made-up dataframes; for you they would both be 906. I get this error: `---------------------------------------------------------------------- ZeroDivisionError Traceback (most recent call last) in 17 i += 1 18 ---> 19 df_b.loc[j, "mean"] = accum / count 20 j += 1 21 ZeroDivisionError: division by zero`
    df_text.join(df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount']))['probability'].apply(np.mean), how='left')
    
    df_text.join(df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount']))['probability'].mean(), how='left')
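Both attempts above still pass the return value of get_indexer_non_unique straight into groupby. As the answer notes, that method returns an (indexer, missing) tuple, so only its first element can serve as a grouping key. A minimal sketch with made-up intervals:

```python
import pandas as pd

idx = pd.IntervalIndex.from_arrays([0, 10], [5, 15], closed='both')

# get_indexer_non_unique returns a tuple of two arrays
indexer, missing = idx.get_indexer_non_unique([3, 12, 7])

# indexer holds the interval position per value (-1 for misses);
# missing holds the positions of values that matched no interval
print(indexer)  # [ 0  1 -1]
print(missing)  # [2]
```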
    