How can I use Python to get conditional sequence counts in a FASTA file?
I have a FASTA file (a file in which each header line starts with >, followed by the sequence lines that belong to that header). I want the count of sequences whose header matches TRINITY, and also the total count of sequences starting with >K that follow each TRINITY sequence. I can get the TRINITY counts, but I am not sure how to get the >K counts for the corresponding TRINITY group. How can I do this in Python?
myfasta.fasta:
>TRINITY_DN12824_c0_g1_i1
TGGTGACCTGAATGGTCACCACGTCCATACAGA
>K00363:119:HTJ23BBXX:1:1212:18730:9403 1:N:0:CGATGTAT
CACTATTACAATTCTGATGTTTTAATTACTGAGACAT
>K00363:119:HTJ23BBXX:1:2228:9678:46223_(reversed) 1:N:0:CGATGTAT
TAGATTTAAAATAGACGCTTCCATAGA
>TRINITY_DN12824_c0_g1_i1
TGGTGACCTGAATGGTCACCACGTCCATACAGA
>K00363:119:HTJ23BBXX:1:1212:18730:9403 1:N:0:CGATGTAT
CACTATTACAATTCTGATGTTTTAATTACTGAGACAT
>TRINITY_DN555_c0_g1_i1
>K00363:119:HTJ23BBXX:1:2228:9658:46188_(reversed) 1:N:0:CGATGTAT
CGATGCTAGATTTAAAATAGACG
>K00363:119:HTJ23BBXX:1:2106:15260:10387_(reversed) 1:N:0:CGATGTAT
TTAAAATAGACGCTTCCATAGAGA
The result I want is:
reference reference_counts Corresponding_K_sequences
>TRINITY_DN12824_c0_g1_i1 2 3
>TRINITY_DN555_c0_g1_i1 1 2
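Below is the code I wrote. It only counts the TRINITY sequences, and I could not extend it to also count the corresponding >K sequences, so any help is much appreciated.
To run: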
python code.py myfasta.fasta output.txt
import sys
import os
from Bio import SeqIO
from collections import defaultdict
filename = sys.argv[1]
outfile = sys.argv[2]
dedup_records = defaultdict(list)
for record in SeqIO.parse(filename, "fasta"):
    #print(record)
    #print(record.id)
    if record.id.startswith('TRINITY'):
        #print(record.id)
        # Use the sequence as the key and then have a list of id's as the value
        dedup_records[str(record.seq)].append(record.id)
#print(dedup_records)
with open(outfile, 'w') as output:
    # # to get the counts of duplicated TRINITY ids (sorted order)
    for seq, ids in sorted(dedup_records.items(), key = lambda t: len(t[1]), reverse=True):
        #output.write("{} {}\n".format(ids,len(ids)))
        print(ids, len(ids))
You have the right idea, but you need to keep track of the last header that started with "TRINITY" and change your structure slightly:
from Bio import SeqIO
from collections import defaultdict
# TRIN holds the most recent TRINITY id; d maps id -> [TRINITY count, K count]
TRIN, d = None, defaultdict(lambda: [0,0])
for r in SeqIO.parse('myfasta.fasta', 'fasta'):
    if r.id.startswith('TRINITY'):
        TRIN = r.id
        d[TRIN][0] += 1
    elif r.id.startswith('K'):
        # K records that appear before any TRINITY header are skipped
        if TRIN:
            d[TRIN][1] += 1
print('reference\treference_counts\tCorresponding_K_sequences')
for k,v in d.items():
    print('{}\t{}\t{}'.format(k,v[0],v[1]))
Output:
reference reference_counts Corresponding_K_sequences
TRINITY_DN12824_c0_g1_i1 2 3
TRINITY_DN555_c0_g1_i1 1 2
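The same bookkeeping can be sketched without Biopython, since only the header lines matter for these counts. This is a minimal illustration rather than a replacement for SeqIO: the function name count_trinity_groups is hypothetical, and it assumes every header starts with > and that the record id is the first whitespace-separated token of the header.

```python
def count_trinity_groups(lines):
    """Return {trinity_id: [trinity_count, k_count]} in first-seen order."""
    counts = {}
    current = None  # id of the most recent TRINITY header
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not needed for counting
        fields = line[1:].split()
        if not fields:
            continue  # bare ">" with no id
        seq_id = fields[0]
        if seq_id.startswith("TRINITY"):
            current = seq_id
            counts.setdefault(current, [0, 0])[0] += 1
        elif seq_id.startswith("K") and current is not None:
            # attribute this K record to the TRINITY header above it
            counts[current][1] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Assumed usage: python code.py myfasta.fasta
    with open(sys.argv[1]) as handle:
        result = count_trinity_groups(handle)
    print("reference\treference_counts\tCorresponding_K_sequences")
    for ref, (n_ref, n_k) in result.items():
        print("{}\t{}\t{}".format(ref, n_ref, n_k))
```

Because the function accepts any iterable of lines, it can be fed an open file handle or a list of strings.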
Of course, I am ignoring the sequences themselves here, so I am not sure whether this is perfect. Thanks for your help.
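If the sequences should not be ignored, one possible variant groups on the pair (id, sequence) instead of the id alone, so that two TRINITY records with the same id but different sequences are counted separately. This is only a sketch under that assumption: read_fasta and count_by_id_and_sequence are hypothetical helper names, the parser is a simplified plain-Python stand-in for SeqIO.parse, and the ids in the usage example are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_by_id_and_sequence(lines):
    """Return {(trinity_id, sequence): [trinity_count, k_count]}."""
    counts = {}
    current = None  # key of the most recent TRINITY record
    for header, seq in read_fasta(lines):
        fields = header.split()
        seq_id = fields[0] if fields else ""
        if seq_id.startswith("TRINITY"):
            current = (seq_id, seq)
            counts.setdefault(current, [0, 0])[0] += 1
        elif seq_id.startswith("K") and current is not None:
            counts[current][1] += 1
    return counts


# Made-up example: the same id appears with two different sequences.
example = [">TRINITY_A", "ACGT", ">K1", "AA", ">TRINITY_A", "TTTT", ">K2", "CC"]
for (ref, seq), (n_ref, n_k) in count_by_id_and_sequence(example).items():
    print(ref, seq, n_ref, n_k)
```

Whether the id alone or the (id, sequence) pair is the right key depends on whether identical ids in the file are guaranteed to carry identical sequences.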