Python: How do I find the highest GC-content percentage together with its ID?

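I ran into a problem on Rosalind that asks for the ID of the string with the highest GC content, followed by that percentage. I can print each ID with its percentage for the sample data, but I don't know how to find the highest value.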

#!/bin/env python3

import sys

input = input()
file = open(input, "r")
#print(input)

gc = 0
at = 0
unknown = 0

for line in file:
    if line.startswith(">"):
        if (gc + at) > 0:
            total = gc + at
            percentage = float(gc)/float(total) * 100
            result = percentage
            print(seq_id, result)
        seq_id = line.strip()
        gc = at = unknown = 0
    else:
        nuc_str = list(line.strip())
        for n in nuc_str:
            if n == "G" or n == "g" or n == "C" or n == "c":
                gc += 1.0
            elif n == "A" or n == "a" or n == "T" or n == "t":
                at += 1.0
            else:
                unknown += 1.0
total = gc + at
percentage = float(gc)/float(total) * 100
result = percentage
print(seq_id, result)
The desired output is:

Rosalind_0808
60.919540
The output I get is:

>Rosalind_6404 53.75
>Rosalind_5959 53.57142857142857
>Rosalind_0808 60.91954022988506
The input file is the sample dataset from Rosalind:

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT

I'm trying not to use Biopython.

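There are a few optimizations that can make the code better and more concise. For example, a list of tuples lets you associate each sequence header with its GC content.

So your loop would look like: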

# Create an empty list to collect results
results = []
for line in file:
    if line.startswith(">"):
        if (gc + at) > 0:
            total = gc + at
            percentage = float(gc)/float(total) * 100
            result = percentage
            # Add results to the list here
            results.append((seq_id, result))
        # Start a new record on every header line; drop the leading ">"
        seq_id = line.strip().lstrip(">")
        gc = at = unknown = 0
    else:
        nuc_str = list(line.strip())
        for n in nuc_str:
            if n == "G" or n == "g" or n == "C" or n == "c":
                gc += 1.0
            elif n == "A" or n == "a" or n == "T" or n == "t":
                at += 1.0
            else:
                unknown += 1.0
# The last record has no header after it, so add it once the loop ends
if (gc + at) > 0:
    results.append((seq_id, float(gc)/float(gc + at) * 100))
After the loop runs, results looks like this:

[
 ("Rosalind_6404", 53.75), 
 ("Rosalind_5959", 53.57142857142857), 
 ("Rosalind_0808", 60.91954022988506),
]
Then, to pick the sequence with the highest GC content, we can use itemgetter:

from operator import itemgetter
result = max(results, key=itemgetter(1))
With the key argument, we tell max to look for the list item whose value at tuple position 1 is the highest.
result then looks like this:

("Rosalind_0808", 60.91954022988506)
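For comparison, the same selection can be written with a lambda as the key instead of itemgetter. This is only a small sketch: it hard-codes the sample values shown above instead of reading a file, and the names best_by_itemgetter and best_by_lambda are made up for the illustration.

```python
from operator import itemgetter

# Sample (id, GC percentage) pairs copied from the results shown above
results = [
    ("Rosalind_6404", 53.75),
    ("Rosalind_5959", 53.57142857142857),
    ("Rosalind_0808", 60.91954022988506),
]

# itemgetter(1) and the lambda both use the second tuple element as the key
best_by_itemgetter = max(results, key=itemgetter(1))
best_by_lambda = max(results, key=lambda pair: pair[1])

print(best_by_itemgetter)
```

Both calls should select the same tuple; which spelling to use is mostly a matter of taste.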
To print the result in the format you need, we can use format:

 print("{}\n{:.6f}".format(*result))
Which outputs:

Rosalind_0808
60.919540

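Other optimizations:

  • Open the file with a with statement, which helps ensure it is closed properly after processing
  • Count only GC, and use the sequence length (from len) as the divisor
  • Use lower() so each base only has to be compared against lowercase letters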
I wrote my own version, which looks like this:

from operator import itemgetter

res = []
with open("test.fa", "r") as fh:
    for line in fh:
        line = line.strip()
        # startswith() is also safe on blank lines, unlike line[0]
        if line.startswith(">"):
            header = line[1:]
            # Assumes each sequence sits on the single line after its header
            seq = next(fh).strip()
            gc = sum(1 for base in seq if base.lower() in "gc")
            res.append((header, gc / len(seq) * 100))

# Pick the (header, percentage) tuple with the largest percentage
result = max(res, key=itemgetter(1))
print("{}\n{:.6f}".format(*result))
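The version above reads exactly one line per record. FASTA files often wrap a sequence over several lines, so here is a minimal sketch of one way to handle that, assuming plain-text FASTA input. The helper names read_fasta and gc_percent and the two-record sample are hypothetical and not part of the original answer.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining sequences that span several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # The final record has no header after it, so emit it here
    if header is not None:
        yield header, "".join(chunks)


def gc_percent(seq):
    """GC percentage over the full sequence length (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    gc = sum(1 for base in seq.lower() if base in "gc")
    return gc / len(seq) * 100


# Hypothetical input; the first record is wrapped over two lines
sample = io.StringIO(">seq_a\nGGCC\nAATT\n>seq_b\nGCGCGA\n")
best = max(((h, gc_percent(s)) for h, s in read_fasta(sample)),
           key=lambda pair: pair[1])
print("{}\n{:.6f}".format(*best))
```

To run it on a real file, pass an open file object (for example one from open("test.fa")) to read_fasta in place of the StringIO sample.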

You should append them to a list or a dictionary and then print the maximum; as the script stands, it will always print every record it is given.

OK, I'll try that.
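The dictionary variant suggested in the comment could look roughly like the sketch below. It hard-codes the percentages from the question's output, and the names gc_by_id and best_id are chosen only for this illustration.

```python
# Sample GC percentages from the question's output, keyed by sequence ID
gc_by_id = {
    "Rosalind_6404": 53.75,
    "Rosalind_5959": 53.57142857142857,
    "Rosalind_0808": 60.91954022988506,
}

# max() iterates over the keys; key=gc_by_id.get ranks each key by its value
best_id = max(gc_by_id, key=gc_by_id.get)

print(best_id)
print("{:.6f}".format(gc_by_id[best_id]))
```

A dictionary keeps one entry per ID, so a repeated header would overwrite the earlier value; the list of tuples keeps every record.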