Python: finding the mean of columns with matching column names (python, pandas, group-by, average, mean)


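I have a dataframe similar to the following, but with thousands of rows and thousands of columns:

x  y  ghb_00hr_rep1  ghb_00hr_rep2  ghb_00hr_rep3  ghl_06hr_rep1  ghl_06hr_rep2
x  y              2              3              2              1              3
x  y              5              7              6              2              1

I would like my output to look like this:

ghb_00hr  ghl_06hr
     2.3         2
       6       1.5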

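My goal is to find the mean of the matching columns. I came up with this:

temp = df.groupby(name, axis=1).agg('mean')

but I am not sure how to define "name" so that it identifies the matching columns.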

My previous strategy was as follows:

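# Map each replicate column to its name without the trailing "_repN" part
name = pd.Series(
    ['_'.join(i.split('_')[:-1]) for i in df.columns[3:]],
    index=df.columns[3:],
)
temp = df.groupby(name, axis=1).agg('mean')
# Put the first three (non-numeric) columns back in front of the means
avg = pd.concat([df.iloc[:, :3], temp], axis=1)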

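However, the number of replicates varies between 1 and 4, so grouping by index position is not an option.

I am not sure whether there is a better way to do this, or whether I am on the right track.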

You can convert df.columns to a set and then iterate over it:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['a', 'a', 'a', 'b', 'b', 'b'])

# df[column] selects every column that shares this name
for column in set(df.columns):
    print(column, df[column].mean(axis=1))

This will output:

a 0    2.0
dtype: float64
b 0    5.0
dtype: float64

If the order matters, use sorted:

for column in sorted(set(df.columns)):
    print(column, df[column].mean(axis=1))

From here, you can produce the output in almost any format.
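As one possible way to collect the per-name means into a single table, here is a minimal sketch. The variable name `means` is introduced only for illustration, and the boolean column mask is a swapped-in detail: it is used instead of `df[name]` so that a name occurring only once (a single replicate) still yields a DataFrame rather than a Series.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['a', 'a', 'a', 'b', 'b', 'b'])

# One mean column per distinct column name, in sorted order.
# The boolean mask always returns a DataFrame, even when a name
# appears only once, so mean(axis=1) is valid in every case.
means = pd.DataFrame({
    name: df.loc[:, df.columns == name].mean(axis=1)
    for name in sorted(set(df.columns))
})
print(means)
```

For the sample row this gives a mean of 2.0 for `a` and 5.0 for `b`, matching the loop output above.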

One option is groupby with level=0:

(df.set_index(['name','x','y'])
   .groupby(level=0, axis=1)
   .mean().reset_index()
)

Output:

    name  x  y  ghb_00hr  ghl_06hr
0  gene1  x  y  2.333333       2.0
1  gene2  x  y  6.000000       1.5


Update, for the modified question:

d = df.filter(like='gh')
# or d = df.iloc[:, 2:]
# depending on your columns of interest

names = d.columns.str.rsplit('_', n=1).str[0]

d.groupby(names, axis=1).mean()

Output:


   ghb_00hr  ghl_06hr
0  2.333333       2.0
1  6.000000       1.5
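Note that `groupby(..., axis=1)` is deprecated as of pandas 2.1, so the snippets above may emit a warning or fail on newer releases. A sketch of the same grouping that avoids `axis=1` by transposing first is shown below. The sample frame is reconstructed from the question's example, the `gene1`/`gene2` labels are taken from the answer's output, and `avg` and `result` are illustrative names.

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption, not the real file)
df = pd.DataFrame({
    'name': ['gene1', 'gene2'],
    'x': ['x', 'x'],
    'y': ['y', 'y'],
    'ghb_00hr_rep1': [2, 5],
    'ghb_00hr_rep2': [3, 7],
    'ghb_00hr_rep3': [2, 6],
    'ghl_06hr_rep1': [1, 2],
    'ghl_06hr_rep2': [3, 1],
})

d = df.filter(like='gh')
# Strip the trailing "_repN" part, whatever the number of replicates
names = d.columns.str.rsplit('_', n=1).str[0]

# Transpose so replicate columns become rows, group those rows by the
# stripped name, take the mean, then transpose back.
avg = d.T.groupby(names).mean().T

# Re-attach the descriptive columns in front of the averaged ones
result = pd.concat([df[['name', 'x', 'y']], avg], axis=1)
print(result)
```

Because the grouping key comes from the column names rather than from positions, this works regardless of whether a condition has one replicate or four.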

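Comments:

Are name, x, y normal columns in your data? Also, what is your expected output?

name, x, y are columns, but I am not trying to do anything with them. I will add the desired output of the averaged file to the question, since it does not format correctly in a comment, and I will remove the first columns because they are not relevant; I can easily merge those columns with the temporary df I create.

The problem is that the full column names do not match exactly. I have edited the post accordingly and also listed what I used when I thought every column had only 3 replicates. Is there a solution similar to this approach that would work?

Thanks! Can you explain what d = df.filter is doing?

It extracts all columns whose names contain gh. Modify that line to suit your data.

Great. The line below it that uses iloc does the same thing, just by index instead of by name?

Yes, it uses the columns' numeric positions instead of their names.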
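To make the last exchange concrete, here is a small sketch comparing the two selections. The frame is a hypothetical cut-down version of the question's data, and `by_name` and `by_position` are illustrative names.

```python
import pandas as pd

# Hypothetical cut-down version of the question's data
df = pd.DataFrame({
    'x': ['x', 'x'],
    'y': ['y', 'y'],
    'ghb_00hr_rep1': [2, 5],
    'ghl_06hr_rep1': [1, 2],
})

# Keep every column whose label contains the substring "gh"
by_name = df.filter(like='gh')

# Keep every column from position 2 onward, regardless of its label
by_position = df.iloc[:, 2:]

print(by_name.equals(by_position))  # True for this layout
```

The two agree only because the columns of interest happen to start at position 2 here; `filter` is the safer choice when the column order may change, while `iloc` is the safer choice when unrelated columns might also contain `gh`.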