Python: computing the mean of groups of DataFrame rows given by a 2D list of indices with unequal lengths

python pandas numpy

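I have a DataFrame with n rows. I also have a 2D array of indices. This array also has n rows, but each row can have a variable length. I need to group the DataFrame rows according to these indices and compute the mean of a column.

For example, if I have a DataFrame df and an array ind, I need

[df.loc[idx, col_name].mean() for idx in ind]

I implemented this with the pandas apply function: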

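import numpy as np
import pandas as pd

size = 100000
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
np.random.seed(1)
# one row of indices per DataFrame row; here every row has length 5
ind = np.array([np.random.randint(0, size, size=5) for _ in range(size)])
def group(row):
    # look up this row's index list and average column 'a' over those rows
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()
df['avg'] = df.apply(group, axis=1)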

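But this is slow and scales poorly. In this case, it is much faster to do

df.a.values[ind].mean(axis=1)

However, as far as I know, this only works because all elements of ind have the same length; the following code does not work: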

new_ind = ind.tolist()
new_ind[0].pop()
df.a.values[new_ind].mean(axis=1)

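I have played around with pandas groupby, but without success. Is there another efficient way to group rows according to index lists of unequal length and return the mean of a column?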
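For reference, the unequal-length case can be sketched on a tiny input. This is only an illustration: `values` stands in for `df.a.values`, and the `ragged` lists are made up. Indexing with the whole ragged list at once fails (the exact error depends on the NumPy version), whereas indexing one sub-list at a time always works, at Python-loop speed.

```python
import numpy as np

values = np.arange(10)                       # stands in for df.a.values
ragged = [[5, 8, 9], [0, 1], [2, 4, 5, 2]]   # made-up index lists of unequal length

# A ragged list cannot form a rectangular 2D integer array,
# so values[ragged] is not a usable fancy-indexing expression.
# Indexing each sub-list separately is the slow but correct baseline:
loop_avg = np.array([values[idx].mean() for idx in ragged])
print(loop_avg)
```

Any faster method should reproduce the numbers from this loop, which makes it a convenient correctness check.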

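I think this is what you want... I set the size lower to make the demonstration easier.

This is a simplified version of your code with a reproducible (fixed) ind that you can test against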

import pandas as pd
import numpy as np
size = 10
df = pd.DataFrame(columns=['a'])
df['a'] = np.arange(size)
ind = np.array([[5, 8, 9, 5, 0],
       [0, 1, 7, 6, 9],
       [2, 4, 5, 2, 4],
       [2, 4, 7, 7, 9],
       [1, 7, 0, 6, 9],
       [9, 7, 6, 9, 1],
       [0, 1, 8, 8, 3],
       [9, 8, 7, 3, 6],
       [5, 1, 9, 3, 4],
       [8, 1, 4, 0, 3]])
def group(row):
    return df.loc[ind[df.index.get_loc(row.name)], 'a'].mean()
df['avg'] = df.apply(group, axis=1)

The following gives the same result

df['comparison'] = df.a.values[ind].mean(axis=1)

In [86]: (df['comparison'] == df['avg']).all()
Out[86]: True

Timing

Before
```
0.5263588428497314
```
After
```
0.014391899108886719
```


With bincount
0.03328204154968262


Comparison and scaling

To compare scaling, I set up three timeit functions (code at the bottom) and defined the sizes at which to test scaling
import timeit
sizes = [10, 100, 1000, 10000]
# map() is lazy in Python 3, so wrap it in list() to actually run the timings
res_mine = list(map(mine, sizes))
res_bincount = list(map(bincount, sizes))
res_original = list(map(original, sizes[:-1]))

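Timing code
Note that I had to reduce the number of runs for original because it takes a long time to set up

Keep the DataFrame short for demonstration purposes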
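import numpy as np
import pandas as pd

np.random.seed(1)

size = 10
df = pd.DataFrame(dict(a=np.arange(size)))

# list of variable-length sub-arrays (a plain list, because recent NumPy
# versions refuse to build an array from ragged rows without dtype=object)
ind = [
    np.random.randint(
        0, size, size=np.random.randint(1, 11)
    ) for _ in range(size)
]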


Solution

Use np.bincount with the weights parameter.

This should be a very fast solution
# get an array of the lengths of sub-arrays
lengths = np.array([len(x) for x in ind])
# simple np.arange for initial positions
positions = np.arange(len(ind))
# get at the underlying values of column `'a'`
values = df.a.values

# for each position repeated the number of times equal to
# the length of the sub-array at that position,
# add to the bin, identified by the position, the amount
# from values at the indices from the sub-array
# divide sums by lengths to get averages
avg = np.bincount(
    positions.repeat(lengths),
    values[np.concatenate(ind)]
) / lengths

df.assign(avg=avg)

   a       avg
0  0  3.833333
1  1  4.250000
2  2  6.200000
3  3  6.000000
4  4  5.200000
5  5  5.400000
6  6  2.000000
7  7  3.750000
8  8  6.500000
9  9  6.200000
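To make the bincount step easier to follow, here is a minimal worked sketch on made-up data (three values and three ragged groups; none of these numbers come from the answer above). Each flattened index is tagged with the id of the group it came from, `np.bincount` sums the selected values per group id, and dividing by the group lengths gives the means.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0])   # made-up column values
groups = [np.array([0, 1]), np.array([2]), np.array([0, 1, 2])]  # made-up ragged groups

lengths = np.array([len(g) for g in groups])        # [2, 1, 3]
bin_ids = np.arange(len(groups)).repeat(lengths)    # [0, 0, 1, 2, 2, 2]
flat = np.concatenate(groups)                       # [0, 1, 2, 0, 1, 2]

# weights=values[flat] adds each selected value to the bin of its group
sums = np.bincount(bin_ids, weights=values[flat], minlength=len(groups))
# sums is [10+20, 30, 10+20+30] = [30, 60, ...] per group: [30., 30., 60.]
avg = sums / lengths
print(avg)  # [15. 30. 20.]
```

One caveat under these assumptions: a group with no indices has length 0, so its mean would come out as 0/0 (nan); such groups would need separate handling.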


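Timing
In this table the fastest time in each row is set to 1, and every other value in that row is expressed as a multiple of that fastest time. The last column identifies the fastest method for the data length given by the row.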
Method pir      mcf Best
Size                    
10       1  12.3746  pir
30       1  44.0495  pir
100      1  124.054  pir
300      1    270.6  pir
1000     1  576.505  pir
3000     1  819.034  pir
10000    1  990.847  pir


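Comments

The issue here is scalability; you should post some timings at larger sizes.

I think your timings are off. Feel free to test it yourself.

@piRSquared I have posted the code.

@AlexanderMcFarlane OK. df.a.values[ind].mean(axis=1) does not work when ind is an array of variable-length sub-arrays, which is what the OP asked for.

Are you sure the intermediate results are not cached? Also, mcf is not my solution :) it is the original one.

@AlexanderMcFarlane I see where the disconnect is. Your solution requires the sub-arrays of ind to all have the same length. That makes the whole thing a 2D array, which can be used to slice df.a.values and then be mean'd. According to the OP, "however, each row can have a variable length". Your solution does not allow for that. If it were as simple as equal-length sub-arrays, I would not have gone to the trouble of using np.bincount.

I will mark this as the correct answer because it is a nice solution. It is not quite what I meant to ask, but I may not have been clear that ind is an array containing DataFrame index values, not positions in the numpy array of values. I think that if the index were shuffled, the answer should be different, because ind would then point to a different set of rows. In other words, I think this solution only holds when the DataFrame index is [0, (n-1)], but correct me if I am wrong. For ease of use, though, I can easily redefine ind as positions rather than index values. Thanks.

Code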

from timeit import timeit

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def mcf(d, i):
    # original approach: row-wise apply with one label lookup per row
    g = lambda r: d.loc[i[d.index.get_loc(r.name)], 'a'].mean()
    return d.assign(avg=d.apply(g, axis=1))

def pir(d, i):
    lengths = np.array([len(x) for x in i])
    positions = np.arange(len(i))
    values = d.a.values

    # sum the values of each group with bincount, then divide by group size
    avg = np.bincount(
        positions.repeat(lengths),
        values[np.concatenate(i)]
    ) / lengths

    return d.assign(avg=avg)

results = pd.DataFrame(
    index=pd.Index([10, 30, 100, 300, 1000, 3000, 10000], name='Size'),
    columns=pd.Index(['pir', 'mcf'], name='Method'),
    dtype=float
)

for i in results.index:

    df = pd.DataFrame(dict(a=np.arange(i)))
    # plain list of variable-length index arrays
    ind = [
        np.random.randint(
            0, i, size=np.random.randint(1, 11)
        ) for _ in range(i)
    ]

    for j in results.columns:

        stmt = '{}(df, ind)'.format(j)
        setp = 'from __main__ import df, ind, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; .at replaces it
        results.at[i, j] = timeit(stmt, setp, number=10)

results.div(results.min(axis=1), axis=0).round(2).pipe(lambda d: d.assign(Best=d.idxmin(axis=1)))

fig, (a1, a2) = plt.subplots(2, 1, figsize=(6, 6))
results.plot(loglog=True, lw=3, ax=a1)
results.div(results.min(axis=1), axis=0).round(2).plot.bar(logy=True, ax=a2)