Python: group by overlapping lists


I have a dataframe like this:

   data
0   1.5
1   1.3
2   1.3
3   1.8
4   1.3
5   1.8
6   1.5
And I have a list of lists like this:

indices = [[0, 3, 4], [0, 3], [2, 6, 4], [1, 3, 4, 5]]
I want to use the list of lists to produce the sum for each group from the dataframe, so

group1 = df[0] + df[3] + df[4]
group2 = df[0] + df[3]
group3 = df[2] + df[6] + df[4]
group4 = df[1] + df[3] + df[4] + df[5]
So I'm looking for something like
df.groupby(indices).sum()


I know this can be done iteratively with a for loop, applying the sum to each
df.iloc[sublist],
but I'm looking for a faster way.
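
For reference, a minimal sketch of that loop-based baseline (the sample dataframe and indices are taken from the question above):

import pandas as pd

df = pd.DataFrame({'data': [1.5, 1.3, 1.3, 1.8, 1.3, 1.8, 1.5]})
indices = [[0, 3, 4], [0, 3], [2, 6, 4], [1, 3, 4, 5]]

# loop over each sublist of row positions and sum the selected rows
sums = []
for sublist in indices:
    sums.append(df.iloc[sublist]['data'].sum())

print (sums)
# [4.6, 3.3, 4.1, 6.2] (up to float rounding)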

Use a list comprehension:

a = [df.loc[x, 'data'].sum() for x in indices]
print (a)
[4.6, 3.3, 4.1, 6.2]

A
groupby + sum
solution is also possible, but I'm not sure whether it performs better:

df1 = pd.DataFrame({
    # values: flatten all sublists into one long positional index
    'd' : df['data'].values[np.concatenate(indices)], 
    # group labels: repeat each sublist's number once per element
    'g' : np.arange(len(indices)).repeat([len(x) for x in indices])
})

print (df1)
      d  g
0   1.5  0
1   1.8  0
2   1.3  0
3   1.5  1
4   1.8  1
5   1.3  2
6   1.5  2
7   1.3  2
8   1.3  3
9   1.8  3
10  1.3  3
11  1.8  3
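
The aggregation itself is then a plain groupby + sum on df1 (the same call used in the timings below); values shown up to display rounding:

print (df1.groupby('g')['d'].sum())
g
0    4.6
1    3.3
2    4.1
3    6.2
Name: d, dtype: float64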

Performance, tested on the small sample data; it should differ on real data:

In [150]: %timeit [df.loc[x, 'data'].sum() for x in indices]
4.84 ms ± 80.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [151]: %%timeit
     ...: arr = df['data'].values
     ...: [arr[x].sum() for x in indices]
     ...: 
20.9 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [152]: %timeit pd.DataFrame({'d' : df['data'].values[np.concatenate(indices)],'g' : np.arange(len(indices)).repeat([len(x) for x in indices])}).groupby('g')['d'].sum()
1.46 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

On the real data:

In [37]: %timeit [df.iloc[x, 0].sum() for x in indices]
158 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [38]: arr = df['data'].values
    ...: %timeit \
    ...: [arr[x].sum() for x in indices]
5.99 ms ± 18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In[49]: %timeit pd.DataFrame({'d' : df['last'].values[np.concatenate(sample_indices['train'])],'g' : np.arange(len(sample_indices['train'])).repeat([len(x) for x in sample_indices['train']])}).groupby('g')['d'].sum()
   ...: 
5.97 ms ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Interesting.. both answers at the bottom are really fast.

Are the nested lists all the same length? No, they can be any length, but there are no duplicates within a sublist.
What about {f"Group{i+1}": df.reindex(x).sum() for i, x in enumerate(indices)}?
Haha, I tried grouping the way you mention, but the problem is that the lists really do overlap, so g gets overwritten. I think the best approach for now is just the list comprehension.
@Landmaster - interesting, there is broadcasting, so it works fine for me...
@Landmaster - the second solution seems like it should be the fastest; I added performance timings with the sample data.
It seems so.. the crazy part is that extracting the values array first and then applying the sum works best.
@Landmaster - if possible, could you test it on the real data? Which solution is fastest? I would guess the second, but maybe not.
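
Putting the thread's conclusion into code: per the timings above, extracting the underlying NumPy array once and indexing it directly is the fastest variant on the sample data and essentially tied with the groupby approach on the real data. A minimal sketch, assuming the df and indices defined above:

import numpy as np

arr = df['data'].values              # one-time extraction to a NumPy array
a = [arr[x].sum() for x in indices]  # fancy indexing per sublist, then sum
print (a)
# [4.6, 3.3, 4.1, 6.2] (up to float rounding)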