Python pandas: complex mapping between DataFrames of different sizes
I need to map two completely different DataFrames onto each other (thanks, biology). Every pandas tutorial I have found covers much simpler transformations, and as a real beginner I could not solve this without four nested loops. I'm really curious about a pythonic way to do this without going back to Excel.
The first one looks like df1: observations of 0 and 1 for thousands of genes across the categories a-j.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
print(df1)
a b c d e f g h i j
gene1 1 0 1 0 1 0 1 1 1 0
gene2 0 1 0 0 0 0 0 0 1 0
gene3 0 1 1 1 1 1 0 0 0 0
gene4 1 0 1 0 0 1 0 1 1 1
gene5 0 0 1 0 0 0 0 0 0 0
gene6 0 1 0 0 1 0 1 0 1 0
gene7 1 1 0 1 1 0 0 0 1 0
gene8 0 0 0 1 1 1 1 0 1 0
gene9 1 0 1 0 1 0 1 1 0 1
gene10 1 0 0 0 1 0 1 0 1 1
The second one looks like df2. It maps the lower-level categories (a-j) to higher-level categories (X, Y, Z, W). This one has NaNs and no index.
df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
'Y': ['d', 'b', 'c','f'],
'Z':['g', 'h','e','NaN'],
'W': ['i', 'j','NaN','NaN']},index=None)
print(df2)
W X Y Z
0 i a d g
1 j NaN b h
2 NaN NaN c e
3 NaN NaN f NaN
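What I need is something like result1. There is one more tricky detail: gene4, for example, is in both category i and category j, which both belong to W, but I still want a single '1' in result1.loc['gene4','W']. The final result has to stay binary.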
result1 = pd.DataFrame({'X': ['1','0','0','1','0','0','1','0','1','1'],
'Y': ['1','1','1','1','1','1','1','1','1','0'],
'Z': ['1','0','1','1','0','1','1','1','1','1'],
'W': ['1','1','0','1','0','1','1','1','1','1']}, index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
print(result1)
W X Y Z
gene1 1 1 1 1
gene2 1 0 1 0
gene3 0 0 1 1
gene4 1 1 1 1
gene5 0 0 1 0
gene6 1 0 1 1
gene7 1 1 1 1
gene8 1 0 1 1
gene9 1 1 1 1
gene10 1 1 0 1
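This could be another possible result format. [Updated to match the actual expected result.] If anyone wants to show both (or a simple conversion between the two), that would be much appreciated, and science will be grateful too.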
result1 = pd.DataFrame({'1': ['gene1','gene1','gene1','gene1'],
'2': ['gene2','gene4','gene2','gene3'],
'3': ['gene4','gene7','gene3','gene4'],
'4': ['gene6','gene9','gene4','gene6'],
'5': ['gene7','gene10','gene5','gene7'],
'6': ['gene8','NaN','gene6','gene8'],
'7': ['gene9','NaN','gene7','gene9'],
'8': ['gene10','NaN','gene8','gene10'],
'9': ['NaN','NaN','gene9','NaN'],
},
index = ['W','X','Y','Z'])
print(result1)
1 2 3 4 5 6 7 8 9
W gene1 gene2 gene4 gene6 gene7 gene8 gene9 gene10 NaN
X gene1 gene4 gene7 gene9 gene10 NaN NaN NaN NaN
Y gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
Z gene1 gene3 gene4 gene6 gene7 gene8 gene9 gene10 NaN
Thank you very much for patiently reading this long question.
Here you go! Let's try this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
'Y': ['d', 'b', 'c','f'],
'Z':['g', 'h','e','NaN'],
'W': ['i', 'j','NaN','NaN']},index=None)
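# The placeholders in df2 are the string 'NaN'; turn them into real missing values.
df2 = df2.replace('NaN',np.nan)
# Lookup Series: lower-level category (a-j) -> higher-level category (X, Y, Z, W).
# dropna() makes the removal of the empty cells explicit.
gmap = df2.stack().dropna().reset_index().drop('level_0',axis=1).set_index(0)['level_1']
# Keep only the (gene, category) pairs observed as 1, map each category
# to its higher-level category, and drop repeated (gene, group) pairs.
df3 = (df1.stack().replace(0,np.nan).dropna()
       .reset_index(level=1)['level_1'].map(gmap)
       .reset_index().drop_duplicates())
# Count the remaining pairs per gene and group, then pivot the groups into columns.
df_out = df3.groupby(['index','level_1'])['level_1'].count().unstack()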
print(df_out)
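Output (df1 is generated randomly, so the values differ from the sample in the question; gene/group pairs with no hit show up as NaN):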
level_1 W X Y Z
index
gene1 1.0 NaN NaN NaN
gene10 1.0 1.0 1.0 1.0
gene2 1.0 1.0 1.0 1.0
gene3 1.0 1.0 1.0 1.0
gene4 1.0 NaN 1.0 1.0
gene5 1.0 NaN 1.0 NaN
gene6 1.0 1.0 1.0 1.0
gene7 NaN 1.0 1.0 1.0
gene8 NaN NaN 1.0 1.0
gene9 1.0 NaN NaN 1.0
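As an alternative to the stack-and-map chain above, the same rule (a gene scores 1 for a higher-level category when any of its member categories is 1) can be sketched with one any(axis=1) per group. This is only a sketch, not the answer's method: the values of df1 are typed in from the sample printed in the question so the result is reproducible, the missing cells of df2 are written as None instead of the string 'NaN', and the names groups and result are made up for illustration.

```python
import pandas as pd

genes = [f"gene{i}" for i in range(1, 11)]
# The 0/1 observations printed in the question, typed in for reproducibility.
rows = [
    [1, 0, 1, 0, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 0, 1, 1],
]
df1 = pd.DataFrame(rows, columns=list("abcdefghij"), index=genes)

# Same mapping as df2, with real missing values (assumption: None, not 'NaN').
df2 = pd.DataFrame({
    "X": ["a", None, None, None],
    "Y": ["d", "b", "c", "f"],
    "Z": ["g", "h", "e", None],
    "W": ["i", "j", None, None],
})

# Higher-level category -> list of its lower-level categories.
groups = {col: df2[col].dropna().tolist() for col in df2.columns}

# A gene gets 1 for a group if any of the group's member columns is 1.
result = pd.DataFrame(
    {name: df1[members].any(axis=1).astype(int) for name, members in groups.items()}
)
print(result)
```

On the question's sample this gives the same 0/1 pattern as the expected result1 (as integers rather than strings), and there are no NaNs to fill afterwards.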
Edit, to get the optional output.
Output:
cols 1 2 3 4 5 6 7 8 9 10
level_1
W gene1 gene2 gene3 gene4 gene5 None gene7 gene8 gene9 gene10
X None None gene3 None gene5 None None gene8 gene9 gene10
Y gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10
Z None gene2 None gene4 None gene6 None gene8 gene9 None
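One possible way to build this second layout, sketched under assumptions (it is not necessarily the code that produced the output above): start from a binary gene-by-group table, here the expected result1 from the question typed in as integers, collect the genes flagged 1 for each group, and let pandas pad the shorter rows. The second half goes the other way, since the question also asks for a conversion between the two formats. The names binary, members, wide, pairs and back are illustrative only.

```python
import pandas as pd

genes = [f"gene{i}" for i in range(1, 11)]
# Expected binary result from the question, typed in as integers.
binary = pd.DataFrame(
    {
        "W": [1, 1, 0, 1, 0, 1, 1, 1, 1, 1],
        "X": [1, 0, 0, 1, 0, 0, 1, 0, 1, 1],
        "Y": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        "Z": [1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
    },
    index=genes,
)

# Binary table -> one row per group listing its genes.
# Rows of unequal length are padded with missing values.
members = {g: binary.index[binary[g] == 1].tolist() for g in binary.columns}
wide = pd.DataFrame.from_dict(members, orient="index")
wide.columns = range(1, wide.shape[1] + 1)
print(wide)

# And back: melt to (group, gene) pairs, drop the padding, cross-tabulate.
pairs = (
    wide.reset_index()  # the unnamed index becomes a column called "index"
    .melt(id_vars="index", value_name="gene")
    .dropna(subset=["gene"])
)
back = pd.crosstab(pairs["gene"], pairs["index"]).reindex(
    index=binary.index, columns=binary.columns, fill_value=0
)
print(back)
```

The reindex at the end restores the original gene order (crosstab sorts gene names alphabetically, so gene10 would otherwise follow gene1) and fills any gene/group pair that never occurred with 0.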
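What exactly is the relationship here? You've provided a bunch of question marks and random data... If you're going to the trouble of creating these examples (very good for a first question!), just go the rest of the way and create the DataFrame you want.
You're right, sorry. Adding an array of question marks wasn't hard; I just created one and copied it 4 times. I'll work through the toy example and edit the expected result DataFrame.
Amazing! Python really is magic. One small detail: instead of NaNs I'd like zeros in the final result, but of course that is quickly done with df_out.fillna(0). Thanks everyone!
@JRCX I edited the answer to include the optional output.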
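The fillna(0) step mentioned in the comments leaves floats (1.0 and 0.0). A minimal sketch of getting integer 0/1 back and restoring the gene order, on a made-up stand-in for df_out (the values and the gene_order list are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_out (made-up values): float counts, NaN for "no hit",
# rows in the alphabetical order that groupby produces.
df_out = pd.DataFrame(
    {"W": [1.0, 1.0, np.nan], "X": [np.nan, 1.0, 1.0]},
    index=["gene1", "gene10", "gene2"],
)

# Replace the missing values with 0 and go back to integers.
binary = df_out.fillna(0).astype(int)

# Optional: restore the natural gene order (hypothetical list).
gene_order = ["gene1", "gene2", "gene10"]
binary = binary.reindex(gene_order)
print(binary)
```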