Python: compare a FASTA file with a txt file containing sequence IDs


I need help because I'm stuck. I have a txt file with sequence IDs; it looks like this -->

Then I have a typical FASTA file:

>sp|P00115|CYC6_SYNP3 Cytochrome c6 OS=Synechococcus sp. (strain ATCC 27167 / PCC 6312) OX=195253 GN=petJ PE=1 SV=2
MKTLLTILALTLVTLTTWLSTPAFAADIADGAKVFSANCAACHMGGGNVVMANKTLKKEA
LEQFGMNSADAIMYQVQNGKNAMPAFGGRLSEAQIENVAAYVLDQSSKNWAG
>tr|K9RTH7|K9RTH7_SYNP3 N-acyl-D-glucosamine 2-epimerase OS=Synechococcus sp. (strain ATCC 27167 / PCC 6312) OX=195253 GN=Syn6312_2130 PE=4 SV=1
MAPQINFPFSDLIAGYVTSYDTETDIFGLKTSDGREFPVKLSPMAYAKVIQNFDEGYPDA
TSTMRAWLTPGRFLFVYGVFYPDTDVFDAKQVVFAGKKEDDYVFEKQDWWIQQINALGKF
YVKAQFGQEEIDYRNYRTDLSVSGERSSVKFRQETDTISRLVYGFATAFMMTGDEVFLEA
AEKGTEYLRDHMRFVDRDEDIIYWYHGIDVQGEKELKIFASEFGDDYDAIPAYEQIYALA
GPIQTYRCTGDPRILSDAEQTIKLFDKFFLDQSEYGGYFSHIDPLMLDPRSDSLGRNKAR
KNWNSVGDHAPAYLINLWLATGEQKYADMLEYTFDTIEKYFPDYENSPFVQERFYEDWSH
DTTWGWQQNRAVVGHNLKIAWNLMRMQSLKPKEQYVGLAQKIADLMPSVGSDQQRGGWSD
TVERLLTNNSKFHQFVWHDRKAWWQQEQAILAYLILGGILEHDDYHRLGREAAAFYNAWF
LDLEDGGVYFNVLANGISYLARGNERAKGSHSMSGYHSFELCYLAAVYTNFLITKHPMDF
YFKPLPNGFPDRILRVSPDILPPGSILLESVEIDGKAYTDFDSQALTVKLPETKERVKVK
VRLAPKS
>tr|K9RXQ9|K9RXQ9_SYNP3 Uncharacterized protein OS=Synechococcus sp. (strain ATCC 27167 / PCC 6312) OX=195253 GN=Syn6312_3008 PE=4 SV=1
MKVEILKKRLNKECPMTTTRMPEDVIQELKQIASLLVFWGYQPLIGADIGQGLRTDLEQL
EDDKVSALVASLKRHRVSDEVLQTALMETTIN
I need to compare these two files, find the sequence description for each ID, and print it. My code:


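You're just missing some kind of loop of the form "for x in y": the ID has to be compared against each line of the txt file, not against the whole list at once.
Also, file handles in Python are iterable (in non-binary mode they yield one line at a time), which saves you from loading the entire file into memory before you start iterating (as .readlines() does).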

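# load the first file and build a useful structure
compare_dict = {}
with open("reference.txt") as fh:
    for line in fh:
        if line.strip():  # skip blank lines; a stricter check could go here
            compare_dict[line.strip()] = None
# form a tuple of possible header prefixes
compare_tuple = tuple(">" + a for a in compare_dict.keys())
with open("proteome.fasta") as fh:
    for line_no, line in enumerate(fh, 1):  # line numbers start at 1, not 0
        if line.startswith(compare_tuple):
            key, value = line.split(" ", 1)
            key = key[1:]  # strip the ">" from the prefix
            compare_dict[key] = value.strip()
            print("found {} on L{}: {}".format(key, line_no, value.strip()))
# (optional) show the keys that are not in the .fasta file
for key, value in compare_dict.items():
    if value is None:
        print("failed to find a definition for {}".format(key))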
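
A variant of the approach above, sketched as two small functions. It assumes (the sample txt file is not shown in the question) that reference.txt holds one ID per line and that each ID equals the full first token of a FASTA header, e.g. sp|P00115|CYC6_SYNP3. Instead of prefix matching with startswith, it compares the whole ID against a set, so a short ID cannot accidentally match a longer one. The function names load_ids and find_descriptions are made up for this illustration.

```python
def load_ids(path):
    """Read one sequence ID per line, skipping blank lines."""
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}


def find_descriptions(fasta_path, wanted):
    """Map each wanted ID found in the FASTA file to its header description."""
    found = {}
    with open(fasta_path) as fh:
        for line in fh:  # the file handle yields one line at a time
            if not line.startswith(">"):
                continue  # sequence line, not a header
            # a header looks like ">ID description ..."
            seq_id, _, description = line[1:].strip().partition(" ")
            if seq_id in wanted:
                found[seq_id] = description
    return found
```

With wanted = load_ids("reference.txt") and found = find_descriptions("proteome.fasta", wanted), the IDs that never appeared in the FASTA file are wanted - set(found).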

Could you provide a sample of your .fasta file?
Sure, I've edited it into the question.
The code from the question:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import sys

p = "proteome.fasta"
file = "reference.txt"
out = "jopik.txt"


with open(out, "w") as o:
    sys.stdout = o
    for seq_record in SeqIO.parse(open(p, mode = "r"),"fasta"):
        seq_record.description=' '.join(seq_record.description.split()[1:])
        with open(file,"r") as f:
            line = f.readlines()
            print(line)
            if (seq_record.id == line):
                    i = seq_record.description
                    print(i)
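
In the snippet above, f.readlines() returns a list, so seq_record.id == line compares a string with a list and is never true; the lines also still carry their trailing newlines. A possible Biopython-based rework is sketched below, assuming one ID per line in reference.txt and that each ID matches what SeqIO reports as record.id (the first whitespace-delimited token of the header). The helper names are hypothetical, and an output path is passed in rather than reassigning sys.stdout.

```python
def load_ids(path):
    """Read one sequence ID per line, skipping blank lines."""
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}


def matching_descriptions(records, wanted):
    """Yield (id, description) for every record whose id is in `wanted`."""
    for record in records:
        if record.id in wanted:
            # record.description starts with the id; drop that first token
            parts = record.description.split(None, 1)
            yield record.id, parts[1] if len(parts) > 1 else ""


def print_descriptions(fasta_path, ids_path, out_path):
    """Write the id and description of each matching record to out_path."""
    # Biopython is imported here so the helpers above work without it
    from Bio import SeqIO

    wanted = load_ids(ids_path)  # read the ID file once, not once per record
    with open(out_path, "w") as out:
        records = SeqIO.parse(fasta_path, "fasta")
        for seq_id, description in matching_descriptions(records, wanted):
            print(seq_id, description, file=out)
```

Calling print_descriptions("proteome.fasta", "reference.txt", "jopik.txt") would then reproduce the intent of the original script with the file names used in the question.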