Python: finding the oldest ancestor in a genealogy tree with pandas
My dataframe looks like this:

plant  ancestor1  ancestor2  ancestor3  ancestor4  ancestor5
XX     XX1        XX2        XX3        XX4        XX5
YY     YY1        YY2        YY3        YY4
ZY     ZZ1        ZZ2        YY2        YY3        YY4
SS1    SS2        SS3

I want to get the oldest ancestor for each plant. The final output should look like this:

plant  oldest
XX     XX5
XX1    XX5
XX2    XX5
XX3    XX5
XX4    XX5
YY     YY4
YY1    YY4
YY2    YY4
YY3    YY4
ZY     YY4
ZZ1    YY4
ZZ2    YY4
SS1    SS3
SS2    SS3

How can I achieve this?
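# Forward-fill along each row so that ancestor5 always holds the oldest ancestor,
# then melt to long format using that column as the identifier.
df2 = df.ffill(axis=1).melt(id_vars='ancestor5', value_name='plant')
df2 = df2.rename(columns={'ancestor5': 'oldest'}).drop(columns='variable')
# Drop the rows where the forward fill copied the oldest ancestor onto itself.
df2 = df2[df2['oldest'] != df2['plant']]
print(df2)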
   oldest plant
0     XX5    XX
1     YY4    YY
2     YY4    ZY
3     SS3   SS1
4     XX5   XX1
5     YY4   YY1
6     YY4   ZZ1
7     SS3   SS2
8     XX5   XX2
9     YY4   YY2
10    YY4   ZZ2
12    XX5   XX3
13    YY4   YY3
14    YY4   YY2
16    XX5   XX4
18    YY4   YY3
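Explanation: use melt to convert the dataframe to long format, but first use ffill so that one column (ancestor5) always contains the oldest ancestor. Afterwards, drop the rows in which the forward fill merely duplicated a value.

Perhaps something along these lines:

import pandas as pd

df = pd.DataFrame({'plant': ['x', 'y', 'z'],
                   'ancestor1': ['X1', 'Y1', 'Z2'],
                   'ancestor2': ['X2', '', 'Z2'],
                   'ancestor3': ['X3', '', '']})
# For each row, drop the empty strings and keep the last remaining value.
df['oldest'] = [list(filter(len, list(df.iloc[i])))[-1] for i in range(len(df))]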
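Note that this adds one `oldest` value per row, for the plant only, rather than one row per intermediate ancestor. The same per-row result can probably be obtained without a Python-level loop; the following is a sketch under the assumption that empty strings mark missing ancestors, as in the snippet above (the name `filled` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'plant': ['x', 'y', 'z'],
                   'ancestor1': ['X1', 'Y1', 'Z2'],
                   'ancestor2': ['X2', '', 'Z2'],
                   'ancestor3': ['X3', '', '']})

# Turn empty strings into NaN, forward-fill along each row,
# then read the last column, which now holds the last non-empty value.
filled = df.mask(df == '').ffill(axis=1)
df['oldest'] = filled.iloc[:, -1]
print(df)
```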
Here is another approach using a list comprehension (perhaps a bit ugly):

dfout = pd.DataFrame([
    (y, x[-1]) for x in [[i for i in ii if i] for ii in df.values]
    for y in x[:-1]
], columns = ['plant', 'oldest']
)

Full example:

import pandas as pd

df = pd.DataFrame({
    'plant': ['XX','YY','ZY'],
    'ancestor1': ['XX1','YY1','ZZ1'],
    'ancestor2': ['XX2','YY2',''],
    'ancestor3': ['XX3','','']
})
df = df[['plant','ancestor1','ancestor2','ancestor3']]
dfout = pd.DataFrame([
    (y, x[-1]) for x in [[i for i in ii if i] for ii in df.values]
    for y in x[:-1]
], columns = ['plant', 'oldest']
)
print(dfout)

Returns:
plant oldest
0 XX XX3
1 XX1 XX3
2 XX2 XX3
3 YY YY2
4 YY1 YY2
5 ZY ZZ1
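The `if i` filter above works for empty strings and None, but NaN is truthy in Python, so a frame whose gaps are NaN would keep them. A hedged variant, assuming gaps may be '', None, or NaN (the sample frame and the names `rows` and `pairs` are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative frame mixing three kinds of placeholder: '', None and NaN.
df = pd.DataFrame({
    'plant': ['XX', 'YY', 'ZY'],
    'ancestor1': ['XX1', 'YY1', 'ZZ1'],
    'ancestor2': ['XX2', 'YY2', None],
    'ancestor3': ['XX3', '', np.nan],
})

# Keep only real values in each row: not NaN/None and not an empty string.
rows = [[v for v in row if pd.notna(v) and v != ''] for row in df.values]

# Pair every member of a row except the last with that row's last value.
pairs = [(member, row[-1]) for row in rows for member in row[:-1]]
dfout = pd.DataFrame(pairs, columns=['plant', 'oldest'])
print(dfout)
```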
Here is a fast approach using numpy's isin, repeat, and concatenate together with list comprehensions. It also allows empty ancestor slots to be an empty string, None, or any other placeholder.

import numpy as np
import pandas as pd

# df is the dataframe built in the setup further down
df_vals = df.values
# count the number of sub-ancestors in each row
repeats = (~np.isin(df_vals, ['', None])).sum(axis=1) - 1
# find the oldest ancestor in each row
oldest_ancestors = np.array([df_vals[row, col] for row, col in enumerate(repeats)])
# make the oldest column by repeating each oldest ancestor once per sub-ancestor
oldest = np.repeat(oldest_ancestors, repeats)
# make the plant column by getting all the sub-ancestors from each row and concatenating
plant = np.concatenate([df_vals[row][:col] for row, col in enumerate(repeats)])
df2 = pd.DataFrame({'plant': plant, 'oldest': oldest})

Setting up the dataframe:
df = pd.DataFrame({'plant': ['XX', 'YY', 'ZY', 'SS1'],
                   'ancestor1': ['XX1', 'YY1', 'ZZ1', 'SS2'],
                   'ancestor2': ['XX2', 'YY2', 'ZZ2', 'SS3'],
                   'ancestor3': ['XX3', 'YY3', 'YY2', None],
                   'ancestor4': ['XX4', 'YY4', 'YY3', None],
                   'ancestor5': ['XX5', None, 'YY4', None]})
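Sorry, I get TypeError: drop() got an unexpected keyword argument 'columns'.

That is odd. Can you print the output of the line up to that part (that is, df2.rename(columns={'ancestor5': 'oldest'}))?

Oh, I got it, I just changed the 'variable' part.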
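A likely cause of that TypeError, assuming an older pandas installation: the `columns` keyword of `DataFrame.drop` was added in pandas 0.21.0. The label-plus-axis spelling below should behave the same on both old and new versions. The small frame is only a stand-in for the melted result.

```python
import pandas as pd

# Stand-in for the melted frame after the rename step.
df2 = pd.DataFrame({
    'oldest': ['XX5', 'YY4'],
    'variable': ['plant', 'plant'],
    'plant': ['XX', 'YY'],
})

# Equivalent to df2.drop(columns='variable'), written with label + axis.
df2 = df2.drop('variable', axis=1)
print(df2)
```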