Python pandas: complex mapping between DataFrames of different sizes


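I need to map between two completely different DataFrames (thanks, biology). All the pandas tutorials I found cover much simpler transformations, and as a real newbie I couldn't solve this without 4 nested loops. I'm really curious about a Pythonic way to do this without falling back to Excel.

The first DataFrame looks like df1: observations of 0s and 1s for thousands of genes across the categories a-j.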

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])

print(df1)

        a  b  c  d  e  f  g  h  i  j
gene1   1  0  1  0  1  0  1  1  1  0
gene2   0  1  0  0  0  0  0  0  1  0
gene3   0  1  1  1  1  1  0  0  0  0
gene4   1  0  1  0  0  1  0  1  1  1
gene5   0  0  1  0  0  0  0  0  0  0
gene6   0  1  0  0  1  0  1  0  1  0
gene7   1  1  0  1  1  0  0  0  1  0
gene8   0  0  0  1  1  1  1  0  1  0
gene9   1  0  1  0  1  0  1  1  0  1
gene10  1  0  0  0  1  0  1  0  1  1
The second one looks like df2: a mapping from the lower-level categories to higher-level categories (X-W). This one has NaNs and no index.

df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
                       'Y': ['d', 'b', 'c','f'],
                       'Z':['g', 'h','e','NaN'],
                       'W': ['i', 'j','NaN','NaN']},index=None)

print(df2)

     W    X  Y    Z
0    i    a  d    g
1    j  NaN  b    h
2  NaN  NaN  c    e
3  NaN  NaN  f  NaN
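What I need is something like result1. There is one more tricky detail: gene4, for example, is in both category i and category j, which both belong to W, but I still want a '1' at result1.loc['gene4','W']. The final result still has to be binary.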

result1 = pd.DataFrame({'X': ['1','0','0','1','0','0','1','0','1','1'],
                   'Y': ['1','1','1','1','1','1','1','1','1','0'],
                   'Z': ['1','0','1','1','0','1','1','1','1','1'],
                   'W': ['1','1','0','1','0','1','1','1','1','1']}, index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
print(result1)


        W  X  Y  Z
gene1   1  1  1  1
gene2   1  0  1  0
gene3   0  0  1  1
gene4   1  1  1  1
gene5   0  0  1  0
gene6   1  0  1  1
gene7   1  1  1  1
gene8   1  0  1  1
gene9   1  1  1  1
gene10  1  1  0  1
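This could be another possible result format. [Updated to match the actual expected result.] If anyone is willing to show both (or a simple conversion between them), that would be much appreciated, and science will be grateful too.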

result1 = pd.DataFrame({'1': ['gene1','gene1','gene1','gene1'],
                       '2': ['gene2','gene4','gene2','gene3'],
                       '3': ['gene4','gene7','gene3','gene4'],
                       '4': ['gene6','gene9','gene4','gene6'],
                       '5': ['gene7','gene10','gene5','gene7'],
                       '6': ['gene8','NaN','gene6','gene8'],
                       '7': ['gene9','NaN','gene7','gene9'],
                       '8': ['gene10','NaN','gene8','gene10'],
                       '9': ['NaN','NaN','gene9','NaN'],
                       },
                       index = ['W','X','Y','Z'])
print(result1)

       1      2      3      4       5      6      7       8      9
W  gene1  gene2  gene4  gene6   gene7  gene8  gene9  gene10    NaN
X  gene1  gene4  gene7  gene9  gene10    NaN    NaN     NaN    NaN
Y  gene1  gene2  gene3  gene4   gene5  gene6  gene7   gene8  gene9
Z  gene1  gene3  gene4  gene6   gene7  gene8  gene9  gene10    NaN

Thank you very much for your patience in reading this long question.

Here we go! Let's try this:

df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])

df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
                       'Y': ['d', 'b', 'c','f'],
                       'Z':['g', 'h','e','NaN'],
                       'W': ['i', 'j','NaN','NaN']},index=None)

df2 = df2.replace('NaN',np.nan)

gmap = df2.stack().reset_index().drop('level_0',axis=1).set_index(0)['level_1']

df3 = df1.stack().replace(0,np.nan).dropna().reset_index(level=1)['level_1'].map(gmap).reset_index().drop_duplicates()

df_out = df3.groupby(['index','level_1'])['level_1'].count().unstack()

print(df_out)
Output:

level_1    W    X    Y    Z
index                      
gene1    1.0  NaN  NaN  NaN
gene10   1.0  1.0  1.0  1.0
gene2    1.0  1.0  1.0  1.0
gene3    1.0  1.0  1.0  1.0
gene4    1.0  NaN  1.0  1.0
gene5    1.0  NaN  1.0  NaN
gene6    1.0  1.0  1.0  1.0
gene7    NaN  1.0  1.0  1.0
gene8    NaN  NaN  1.0  1.0
gene9    1.0  NaN  NaN  1.0
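
Because df1 is regenerated from random numbers in the answer, the table above cannot be compared with the result1 shown in the question. As a cross-check, here is a minimal sketch that hard-codes df1 as printed in the question and collapses its columns by relabelling them and taking the maximum per group. This is a different route from the stack/map chain above, not the answerer's method, and the names genes, pairs, low_to_high and collapsed are only illustrative.

```python
import numpy as np
import pandas as pd

genes = [f"gene{i}" for i in range(1, 11)]

# df1 exactly as printed in the question (rows gene1..gene10, columns a..j).
df1 = pd.DataFrame(
    [
        [1, 0, 1, 0, 1, 0, 1, 1, 1, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 1, 0, 1, 1, 1],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
        [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
        [0, 0, 0, 1, 1, 1, 1, 0, 1, 0],
        [1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 0, 1, 0, 1, 0, 1, 1],
    ],
    index=genes,
    columns=list("abcdefghij"),
)

# df2 with real missing values instead of the string 'NaN'.
df2 = pd.DataFrame({
    "X": ["a", np.nan, np.nan, np.nan],
    "Y": ["d", "b", "c", "f"],
    "Z": ["g", "h", "e", np.nan],
    "W": ["i", "j", np.nan, np.nan],
})

# Long form: one row per (high-level, low-level) pair, missing cells dropped.
pairs = df2.melt(var_name="high", value_name="low").dropna()
low_to_high = pairs.set_index("low")["high"]

# Relabel a..j with their high-level category, then take the max within each
# label. The max of 0/1 values acts as a logical OR, so the result stays binary
# even when a gene hits several low-level categories of the same group.
collapsed = (
    df1.rename(columns=low_to_high.to_dict())
       .T.groupby(level=0).max().T
)

print(collapsed)
```

Transposing before grouping sidesteps grouping along columns directly, which newer pandas versions discourage. With this input, the gene4 row has a 1 under W even though gene4 is flagged in both i and j.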

Edited to add the alternative output format. Output:

cols        1      2      3      4      5      6      7      8      9       10
level_1                                                                       
W        gene1  gene2  gene3  gene4  gene5   None  gene7  gene8  gene9  gene10
X         None   None  gene3   None  gene5   None   None  gene8  gene9  gene10
Y        gene1  gene2  gene3  gene4  gene5  gene6  gene7  gene8  gene9  gene10
Z         None  gene2   None  gene4   None  gene6   None  gene8  gene9    None
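
The code that produced this second table is not included above. As a rough sketch of one way to go from a binary gene-by-category matrix to a per-category gene list, the following starts from the question's expected result1 and produces the compact layout the question asked for (genes packed to the left, missing values at the end), which is not the same as the gapped layout printed above. The names binary, members and gene_lists are assumptions made for illustration.

```python
import pandas as pd

genes = [f"gene{i}" for i in range(1, 11)]

# Binary gene-by-category matrix, copied from the question's expected result1.
binary = pd.DataFrame(
    {
        "W": [1, 1, 0, 1, 0, 1, 1, 1, 1, 1],
        "X": [1, 0, 0, 1, 0, 0, 1, 0, 1, 1],
        "Y": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        "Z": [1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
    },
    index=genes,
)

# For each category, collect the names of the genes flagged with 1.
members = {cat: binary.index[binary[cat] == 1].tolist() for cat in binary.columns}

# One row per category; shorter lists are padded with missing values.
gene_lists = pd.DataFrame.from_dict(members, orient="index")

# Number the columns from 1, as in the question's second result format.
gene_lists.columns = range(1, gene_lists.shape[1] + 1)

print(gene_lists)
```

Going the other way (from gene lists back to a 0/1 matrix) would need a separate step and is not covered by this sketch.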

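What exactly is the relationship here? You've provided a bunch of question marks and random data... If you're going to the trouble of creating these examples (very good for a first question!), just go the rest of the way and create the DataFrame you actually want.

You're right, sorry. Adding the question-mark array wasn't hard; I just created one and copied it 4 times. I'll work through the toy example and edit in the expected result DataFrame.

Amazing! Python really is magic. One small detail: I'd like zeros rather than NaNs in the final result, but of course that is quickly done with df_out.fillna(0). Thank you all!

@JRCX I edited the answer to add the optional output.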
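
The fillna(0) step mentioned in the comments can be sketched as below. The tiny df_out here is a made-up stand-in for the answer's output, not the real data, and the extra astype(int) is an optional assumption for readers who want 0/1 integers instead of 0.0/1.0 floats.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_out: 1.0 where a gene hits a category, NaN elsewhere.
df_out = pd.DataFrame(
    {"W": [1.0, np.nan], "X": [np.nan, 1.0]},
    index=["gene1", "gene2"],
)

# Replace missing values with 0, then cast the floats to plain integers.
binary = df_out.fillna(0).astype(int)

print(binary)
```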