Python: Aggregating all pairwise row combinations of a DataFrame with pandas


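I use Python pandas to perform groupby and aggregation across DataFrames, but now I want to perform a specific pairwise aggregation of rows (n choose 2, i.e. statistical combinations). Here is some sample data; I would like to look at all pairs of genes in [mygenes]: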

import pandas
import itertools

mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4']

df = pandas.DataFrame({'Gene' : ['ABC1', 'ABC2', 'ABC3', 'ABC4','ABC5'],
                       'case1'   : [0,1,1,0,0],
                       'case2'   : [1,1,1,0,1],
                       'control1':[0,0,1,1,1],
                       'control2':[1,0,0,1,0] })
>>> df
   Gene  case1  case2  control1  control2
0  ABC1      0      1         0         1
1  ABC2      1      1         0         0
2  ABC3      1      1         1         0
3  ABC4      0      0         1         1
4  ABC5      0      1         1         0

The final result should contain one aggregated row per gene pair (applying np.sum by default is fine).

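The set of gene pairs is easy to obtain with itertools (itertools.combinations(mygenes, 2)), but I don't know how to aggregate specific rows based on their values. Can anyone offer advice? Thanks.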

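Before going too far, you should keep in mind that your data will get big quickly. For 5 rows, the output will have C(5,2) = 4+3+2+1 = 10 rows, and the count grows quadratically from there.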
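To make the growth concrete, here is a minimal sketch that counts the unordered row pairs for a few table sizes. The helper name n_pairs and the sizes chosen are illustrative assumptions, not something from the original post; math.comb requires Python 3.8 or later.

```python
import math


def n_pairs(n_rows: int) -> int:
    """Number of unordered row pairs, C(n_rows, 2)."""
    return math.comb(n_rows, 2)


# The pair count grows roughly with the square of the row count.
for n in (5, 50, 500):
    print(n, "rows ->", n_pairs(n), "pairs")
```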

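That said, I would consider doing this in numpy for speed (by the way, you may want to add a numpy tag to your question). In any case, this is not as vectorized as it could be, but it should at least be a start: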

import numpy as np
import pandas as pd

# Keep only the genes of interest, in the order given by mygenes
df2 = df.set_index('Gene').loc[mygenes].reset_index()

sz = len(df2)
sz2 = sz * (sz - 1) // 2          # C(sz, 2) as an integer, usable as an array shape

gene = df2['Gene'].tolist()
abc = df2.iloc[:, 1:].values      # numeric columns as a numpy array (.ix no longer exists)

arr = np.zeros([sz2, abc.shape[1]], dtype=abc.dtype)
gene2 = []
k = 0

for i in range(sz):
    for j in range(i + 1, sz):    # visit each unordered pair once (i < j)
        gene2.append(gene[i] + gene[j])
        arr[k] = abc[i] + abc[j]
        k += 1

pd.concat([pd.DataFrame(gene2), pd.DataFrame(arr)], axis=1)
Out[1780]: 
          0  0  1  2  3
0  ABC1ABC2  1  2  0  1
1  ABC1ABC3  1  2  1  1
2  ABC1ABC4  0  1  1  2
3  ABC2ABC3  2  2  1  0
4  ABC2ABC4  1  1  1  1
5  ABC3ABC4  1  1  2  1



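Depending on size and speed concerns, you may need to separate the string code from the numeric code and vectorize the numeric part. This code is unlikely to scale well to large data, and the size of your data may determine what kind of answer you need (you may also have to think about how to store the results).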
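One possible way to vectorize the numeric part, sketched under assumptions rather than taken from the answer: np.triu_indices enumerates every index pair with i < j, and integer-array indexing then adds the two sets of rows in a single step. The helper name pairwise_row_sums and the "+" separator in the row labels are illustrative choices.

```python
import numpy as np
import pandas as pd


def pairwise_row_sums(df, genes, key="Gene"):
    """Sum the numeric columns for every unordered pair of the given genes."""
    sub = df.set_index(key).loc[genes]
    values = sub.to_numpy()
    # All (i, j) index pairs with i < j, in the same order as itertools.combinations
    i, j = np.triu_indices(len(sub), k=1)
    sums = values[i] + values[j]          # the only numeric work, done in one shot
    names = sub.index.to_numpy()
    labels = [f"{a}+{b}" for a, b in zip(names[i], names[j])]  # string part kept separate
    return pd.DataFrame(sums, index=labels, columns=sub.columns)


df = pd.DataFrame({'Gene': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
                   'case1': [0, 1, 1, 0, 0],
                   'case2': [1, 1, 1, 0, 1],
                   'control1': [0, 0, 1, 1, 1],
                   'control2': [1, 0, 0, 1, 0]})
out = pairwise_row_sums(df, ['ABC1', 'ABC2', 'ABC3', 'ABC4'])
print(out)
```

The index arrays i and j each hold one entry per pair, so this still materializes the full result in memory; it removes the Python-level loop, not the size of the output.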
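
I can't think of a clever vectorized way to do this, but unless performance is a real bottleneck I tend to use the simplest approach that makes sense. In this case, I would probably set_index("Gene") and then use loc to select the rows: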
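>>> import pandas as pd
>>> from itertools import combinations
>>> df = df.set_index("Gene")
>>> cc = list(combinations(mygenes, 2))
>>> # list(c) makes the row selection explicit; tupleize_cols=False keeps each
>>> # pair as a single tuple label instead of building a MultiIndex
>>> out = pd.DataFrame([df.loc[list(c)].sum() for c in cc],
...                    index=pd.Index(cc, tupleize_cols=False))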
>>> out
              case1  case2  control1  control2
(ABC1, ABC2)      1      2         0         1
(ABC1, ABC3)      1      2         1         1
(ABC1, ABC4)      0      1         1         2
(ABC2, ABC3)      2      2         1         0
(ABC2, ABC4)      1      1         1         1
(ABC3, ABC4)      1      1         2         1
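
The question mentions np.sum only as the default, so the same loc-based pattern can be wrapped in a helper that accepts other aggregations. This is a minimal sketch: the function name aggregate_pairs and its parameters are hypothetical, and aggregations are passed as the string names that DataFrame.agg understands ("sum", "max", and so on).

```python
from itertools import combinations

import pandas as pd


def aggregate_pairs(df, genes, func="sum", key="Gene"):
    """Aggregate the numeric columns of every unordered pair of genes."""
    indexed = df.set_index(key)
    pairs = list(combinations(genes, 2))
    # One aggregated Series per pair; each becomes a row of the result
    rows = [indexed.loc[list(pair)].agg(func) for pair in pairs]
    # tupleize_cols=False keeps each pair as one tuple label
    return pd.DataFrame(rows, index=pd.Index(pairs, tupleize_cols=False))


mygenes = ['ABC1', 'ABC2', 'ABC3', 'ABC4']
df = pd.DataFrame({'Gene': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
                   'case1': [0, 1, 1, 0, 0],
                   'case2': [1, 1, 1, 0, 1],
                   'control1': [0, 0, 1, 1, 1],
                   'control2': [1, 0, 0, 1, 0]})
pair_sums = aggregate_pairs(df, mygenes)          # default: sum
pair_max = aggregate_pairs(df, mygenes, "max")    # e.g. "present in either gene"
print(pair_sums)
```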

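Ah, that's a nice use of loc, much simpler than mine. I just realized he is only asking for a subset of the genes (mygenes), so if there are only a few at a time then the simplest approach is definitely best. FWIW, I don't think it would be too hard to vectorize the numpy code I used (if it's worth it).

This approach looks easy to code; I will test how it scales this afternoon. The actual dataset has about 6k columns and 1700 rows, which results in slightly more than 1 million combinations (output rows). Thanks for the response.

@alexhli: ehh, that's a lot of rows. :-/ I don't think this will perform well in that regime... hey, wait. 1.4 million rows * 6000 columns is 8.4 billion numbers. Even if each value takes only 1 byte, that is about 8 GB. That is going to be somewhat hard to manage. (I was vectorizing it when I realized I was running out of memory because I was using floats.)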
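
The memory arithmetic in the last comment can be checked with a short sketch, together with one possible way around it: producing the pair sums in blocks rather than as a single array. Both function names (result_bytes, iter_pair_sum_blocks) and the block size are hypothetical, and the uint8 demo data assumes 0/1 values like those in the sample.

```python
import math

import numpy as np


def result_bytes(n_rows, n_cols, bytes_per_value):
    """Bytes needed to hold every pairwise row sum at once."""
    return math.comb(n_rows, 2) * n_cols * bytes_per_value


def iter_pair_sum_blocks(values, block_size=100_000):
    """Yield (i, j, sums) for blocks of row pairs instead of one huge array."""
    i, j = np.triu_indices(values.shape[0], k=1)
    for start in range(0, len(i), block_size):
        bi = i[start:start + block_size]
        bj = j[start:start + block_size]
        yield bi, bj, values[bi] + values[bj]


# Sizes quoted in the comments: about 1700 rows and 6000 columns.
for name, width in (("1-byte integers", 1), ("float64", 8)):
    print(name, round(result_bytes(1700, 6000, width) / 1e9, 1), "GB")

# Small demonstration on random 0/1 data stored as uint8.
rng = np.random.default_rng(0)
demo = rng.integers(0, 2, size=(6, 3), dtype=np.uint8)
blocks = list(iter_pair_sum_blocks(demo, block_size=4))
total_pairs = sum(len(bi) for bi, _, _ in blocks)
```

Each block could be written to disk or reduced further before the next one is computed, so peak memory is governed by block_size rather than by the total number of pairs. With the exact pair count of 1,444,150, the full result is about 8.7 GB at one byte per value and eight times that with float64.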