Python: finding the oldest ancestor in a genetic tree with pandas

python pandas

My DataFrame looks like this:

plant  ancestor1  ancestor2  ancestor3  ancestor4  ancestor5
XX     XX1        XX2        XX3        XX4        XX5
YY     YY1        YY2        YY3        YY4
ZY     ZZ1        ZZ2        YY2        YY3        YY4
SS1    SS2        SS3

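I want to find the oldest ancestor of every plant, including each intermediate ancestor. The final output should look like this: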

plant oldest
XX     XX5
XX1    XX5
XX2    XX5
XX3    XX5
XX4    XX5
YY     YY4
YY1    YY4
YY2    YY4
YY3    YY4
ZY     YY4
ZZ1    YY4
ZZ2    YY4
SS1    SS3
SS2    SS3

How can I do this?

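# Forward-fill across each row so that ancestor5 always holds the oldest ancestor
df2 = df.ffill(axis=1).melt(id_vars='ancestor5', value_name='plant')
# Keep only the (oldest, plant) pairs
df2 = df2.rename(columns={'ancestor5': 'oldest'}).drop(columns='variable')
# Drop the rows that exist only because of the forward fill
df2 = df2[df2['oldest'] != df2['plant']]
print(df2)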

   oldest plant
0     XX5    XX
1     YY4    YY
2     YY4    ZY
3     SS3   SS1
4     XX5   XX1
5     YY4   YY1
6     YY4   ZZ1
7     SS3   SS2
8     XX5   XX2
9     YY4   YY2
10    YY4   ZZ2
12    XX5   XX3
13    YY4   YY3
14    YY4   YY2
16    XX5   XX4
18    YY4   YY3

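Explanation: use melt to convert the DataFrame to long format, but first use ffill along the rows so that one column (ancestor5) always contains the oldest ancestor. Afterwards, drop the rows in which the forward fill merely copied the oldest ancestor into an empty slot.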

Possibly something along these lines, which takes the last non-empty value in each row:

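import pandas as pd

df = pd.DataFrame({'plant': ['x', 'y','z'], 
                   'ancestor1':['X1','Y1','Z2'],
                   'ancestor2':['X2','','Z2'],
                   'ancestor3':['X3','','']})
# filter(len, ...) drops the empty strings; the last remaining value is the oldest ancestor
df['oldest'] = [list(filter(len,list(df.iloc[i])))[-1] for i in range(len(df))]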

Here is another approach using a list comprehension (it may be a bit ugly):

dfout = pd.DataFrame([
        (y, x[-1]) for x in [[i for i in ii if i] for ii in df.values] 
        for y in x[:-1]
    ], columns = ['plant', 'oldest']
)

Full example:

import pandas as pd

df = pd.DataFrame({
    'plant': ['XX','YY','ZY'],
    'ancestor1': ['XX1','YY1','ZZ1'],
    'ancestor2': ['XX2','YY2',''],
    'ancestor3': ['XX3','','']
})

df = df[['plant','ancestor1','ancestor2','ancestor3']]
dfout = pd.DataFrame([
        (y, x[-1]) for x in [[i for i in ii if i] for ii in df.values] 
        for y in x[:-1]
    ], columns = ['plant', 'oldest']
)
print(dfout)

  plant oldest
0    XX    XX3
1   XX1    XX3
2   XX2    XX3
3    YY    YY2
4   YY1    YY2
5    ZY    ZZ1

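Here is a fast approach that uses NumPy's isin, repeat, and concatenate together with list comprehensions. It also allows empty ancestor slots to be an empty string, None, or any other placeholder.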

import numpy as np
import pandas as pd

df_vals = df.values
# count the non-empty cells in each row, minus one: this is both the number of
# entries to pair up and the column index of the oldest ancestor
repeats = (~np.isin(df_vals, ['', None])).sum(axis=1) - 1
# find the oldest ancestor in each row
oldest_ancestors = np.array([df_vals[row, col] for row, col in enumerate(repeats)])
# make the oldest column by repeating each oldest ancestor once per sub-ancestor
oldest = np.repeat(oldest_ancestors, repeats)
# make the plant column by getting all the sub-ancestors from each row and concatenating
plant = np.concatenate([df_vals[row][:col] for row, col in enumerate(repeats)])
df2 = pd.DataFrame({'plant': plant, 'oldest': oldest})

Setting up the DataFrame:

df = pd.DataFrame({'plant': ['XX', 'YY', 'ZY', 'SS1'],
                   'ancestor1': ['XX1', 'YY1', 'ZZ1', 'SS2'],
                   'ancestor2': ['XX2', 'YY2', 'ZZ2', 'SS3'],
                   'ancestor3': ['XX3', 'YY3', 'YY2', None],
                   'ancestor4': ['XX4', 'YY4', 'YY3', None],
                   'ancestor5': ['XX5', None, 'YY4', None]})
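
For convenience, the NumPy approach and this setup can be combined into one runnable sketch. The function name oldest_ancestor_table and its placeholders parameter are hypothetical, introduced only for illustration. The extra pd.isna check is also my addition: it covers the case where pandas stores a missing cell as NaN rather than None, which np.isin alone would not match.

```python
import numpy as np
import pandas as pd

def oldest_ancestor_table(df, placeholders=('', None)):
    """Hypothetical helper: pair each entry in a row with the row's last filled entry."""
    vals = df.to_numpy(dtype=object)
    # A cell is empty if it is a placeholder or a missing value (NaN/None).
    empty = np.isin(vals, list(placeholders)) | pd.isna(vals)
    # Filled cells per row, minus one: the column index of the oldest ancestor.
    counts = (~empty).sum(axis=1) - 1
    # Repeat each row's oldest ancestor once per younger entry in that row.
    oldest = np.repeat([vals[r, c] for r, c in enumerate(counts)], counts)
    # Collect the younger entries of every row into one flat column.
    plant = np.concatenate([vals[r, :c] for r, c in enumerate(counts)])
    return pd.DataFrame({'plant': plant, 'oldest': oldest})

df = pd.DataFrame({'plant': ['XX', 'YY', 'ZY', 'SS1'],
                   'ancestor1': ['XX1', 'YY1', 'ZZ1', 'SS2'],
                   'ancestor2': ['XX2', 'YY2', 'ZZ2', 'SS3'],
                   'ancestor3': ['XX3', 'YY3', 'YY2', None],
                   'ancestor4': ['XX4', 'YY4', 'YY3', None],
                   'ancestor5': ['XX5', None, 'YY4', None]})

table = oldest_ancestor_table(df)
print(table)
```

Because YY2 and YY3 appear in both the YY and ZY rows, they show up twice in table; calling table.drop_duplicates() afterwards leaves one row per plant.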

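Sorry, I get TypeError: drop() got an unexpected keyword argument 'columns'.

That's strange. Can you print the output of the line up to that part, i.e.

df2.rename(columns={'ancestor5': 'oldest'})

?

Oh, I see. I got it working; I just changed the 'variable' argument.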
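
The TypeError mentioned in the comments is consistent with an older pandas release: as far as I know, the columns keyword of DataFrame.drop was only added in pandas 0.21. The thread does not say what the commenter actually changed, so the following is just a sketch of an equivalent call that avoids that keyword. The small long_df frame is made-up illustration data, not the commenter's.

```python
import pandas as pd

# Made-up stand-in for the melted frame from the answer.
long_df = pd.DataFrame({'oldest': ['XX5', 'YY4'],
                        'variable': ['plant', 'plant'],
                        'plant': ['XX', 'YY']})

# Equivalent to .drop(columns='variable'), without the columns keyword.
trimmed = long_df.drop('variable', axis=1)
print(trimmed.columns.tolist())
```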