Python: remove duplicate entries and extract the required information

I have a 2-D matrix (a whitespace-delimited table) that looks like this:

DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16  44  23  49
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2   121 264 383
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2   96  5   95
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20   3   115 133 260
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.3e+03 3   21  277 295
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+03 14  29  345 360
DNA_pol3_beta   121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.9e-18 1   121 1   121
DNA_pol3_beta   121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+02 30  80  157 209
DNA_pol3_beta   121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94    2   101 273 369
SMC_N   220 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 1.2e-14 3   199 19  351
AAA_21  303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1   32  40  68
AAA_21  303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0015  231 300 279 352
AAA_15  369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 4e-05   4   53  19  67
AAA_15  369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 8.8e+02 347 363 332 348
AAA_23  200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0014  3   41  22  60
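
I want to filter the results. For example, the item "DNA_pol3_beta_3" has 2 entries. Of these, I only want to extract the row with the lowest value in column 5. That is, of the two entries: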

DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2   121 264 383

the one above should be in the result. Similarly, "DNA_pol3_beta_2" has 4 entries, and the program should extract only

DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20   3   115 133 260
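
because it has the lowest column 5 value of the 4 rows. In addition, the program should ignore entries whose column 5 value is less than 1E-5.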

I tried the following code:

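# lines is assumed to be a list of rows already split into fields
for i in range(len(lines) - 1):
    if lines[i + 1][0] == lines[i][0]:  # same name in column 1
        # compare column 5 numerically rather than as strings
        if float(lines[i + 1][4]) > float(lines[i][4]):
            evalue = lines[i][4]
        else:
            evalue = lines[i + 1][4]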
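
Comparing neighbouring rows only ever looks at two rows at a time, so it cannot pick the minimum of a group of four. A loop can still do the job by remembering the best row seen so far for each name. The following is a minimal sketch of that idea, not code from the original post; the function name `best_per_name` and the embedded sample rows are assumptions made for illustration, and the 1E-5 cut-off is deliberately left out.

```python
# Sample rows copied from the table above; in practice each row would come
# from line.split() on a line of the input file.
rows = [line.split() for line in [
    "DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16 44 23 49",
    "DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383",
    "DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2 96 5 95",
    "DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260",
    "DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.3e+03 3 21 277 295",
    "DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+03 14 29 345 360",
]]


def best_per_name(rows):
    """Return, for each name in column 1, the row with the lowest column 5 value."""
    best = {}  # name -> best row seen so far
    for row in rows:
        name, evalue = row[0], float(row[4])  # float() so 5e-20 < 3.7 holds
        if name not in best or evalue < float(best[name][4]):
            best[name] = row
    # dicts keep insertion order (Python 3.7+), so names stay in first-seen order
    return list(best.values())


for row in best_per_name(rows):
    print("\t".join(row))
```

When two rows of the same name tie on column 5, this sketch keeps the first one it meets.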

You are better off using pandas for this. See below:

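import pandas as pd

# Read the space-delimited file; skipinitialspace=True skips the extra spaces after each delimiter
df = pd.read_csv('yourfile.txt', sep=' ', skipinitialspace=True, names=range(9))

# Keep only rows whose column 5 value (label 4) is at least 1E-5
df = df[df[4] >= 0.00001]

# For each name in column 1 (label 0), keep the row with the smallest column 5 value
result = df.loc[df.groupby(0)[4].idxmin()].sort_index().reset_index(drop=True)

Output: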

>>> print(result)
                 0    1                                        2    3           4   5    6    7    8
0  DNA_pol3_beta_3  121  Paja_0001_peg_[locus_tag=BCY86_RS00005]  384  1200.00000  16   44   23   49
1  DNA_pol3_beta_2  116  Paja_0001_peg_[locus_tag=BCY86_RS00005]  384     3.70000   2   96    5   95
2    DNA_pol3_beta  121  Paja_0001_peg_[locus_tag=BCY86_RS00005]  384     0.94000   2  101  273  369
3           AAA_21  303  Paja_0002_peg_[locus_tag=BCY86_RS00010]  378     0.00011   1   32   40   68
4           AAA_15  369  Paja_0002_peg_[locus_tag=BCY86_RS00010]  378     0.00004   4   53   19   67
5           AAA_23  200  Paja_0002_peg_[locus_tag=BCY86_RS00010]  378     0.00140   3   41   22   60
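
The output follows the rule as literally stated in the question (drop values below 1E-5), which is why the 6.3e-27 and 5e-20 rows that the question expected do not appear. The sketch below reruns the same filter-then-`idxmin` approach on a few rows embedded as a string so it can be tried without a file. The function `lowest_per_name` and its `keep_below` flag are hypothetical names added here to show both readings of the threshold; `sep=r"\s+"` is used instead of `sep=' '` on the assumption that the real file may mix tabs and spaces.

```python
import io

import pandas as pd

# A few rows copied from the sample table in the question.
SAMPLE = """\
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16  44  23  49
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2   121 264 383
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2   96  5   95
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20   3   115 133 260
SMC_N   220 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 1.2e-14 3   199 19  351
AAA_23  200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0014  3   41  22  60
"""

# Any run of whitespace separates fields; columns are labelled 0..8.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", header=None, names=list(range(9)))


def lowest_per_name(df, threshold=1e-5, keep_below=False):
    """Filter on column label 4, then keep the minimum row per name (label 0).

    keep_below=False mirrors the answer (rows with value >= threshold survive);
    keep_below=True is the opposite reading (rows with value < threshold survive).
    """
    mask = df[4] < threshold if keep_below else df[4] >= threshold
    kept = df[mask]
    # idxmin() gives, per group, the index label of the smallest value
    winners = kept.groupby(0)[4].idxmin()
    return kept.loc[winners].sort_index().reset_index(drop=True)


as_answered = lowest_per_name(df)                   # literal reading of the question
as_expected = lowest_per_name(df, keep_below=True)  # matches the rows the question lists
print(as_answered)
print(as_expected)
```

Which reading is wanted depends on what column 5 means in the data; the sketch only makes the choice explicit.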
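
If you want to write the data back to a CSV file, you can save it with to_csv(), called on whichever frame you want to keep (result for the reduced table, df for the filtered one).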
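
As a rough illustration of the saving step, the sketch below writes a small frame as tab-separated text without the row numbers or the 0..8 column labels, so the output resembles the input layout. The two-row frame is a hypothetical stand-in for `result`, and an in-memory buffer is used in place of a real file path.

```python
import io

import pandas as pd

# Hypothetical stand-in for the `result` frame; values copied from the sample table.
result = pd.DataFrame([
    ["AAA_21", 303, "Paja_0002_peg_[locus_tag=BCY86_RS00010]", 378, 0.00011, 1, 32, 40, 68],
    ["AAA_23", 200, "Paja_0002_peg_[locus_tag=BCY86_RS00010]", 378, 0.0014, 3, 41, 22, 60],
])

buffer = io.StringIO()  # passing a file path instead writes to disk
# sep="\t" writes tab-separated fields; header=False and index=False
# leave out the column labels and the row numbers.
result.to_csv(buffer, sep="\t", header=False, index=False)

text = buffer.getvalue()
print(text)
```

Without `index=False`, pandas adds the row numbers as an extra first column, which is rarely wanted when the file is meant to be read back by another tool.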