Python 如何继续追加到一个列表行直到某个字符?
我试图在一个列表中添加“>”字符之前多行,以便将其转换为字典中的值。例如,我试图:Python 如何继续追加到一个列表行直到某个字符?,python,jupyter-notebook,Python,Jupyter Notebook,我试图在一个列表中添加“>”字符之前多行,以便将其转换为字典中的值。例如,我试图: > 1 AAA CCC > 2 成为AAACCC 代码如下: def parse_fasta(path): with open(path) as thefile: label = [] sequences = [] for k, line in enumerate(thefile): if line.startswith
> 1
AAA
CCC
> 2
成为AAACCC
代码如下:
def parse_fasta(path):
with open(path) as thefile:
label = []
sequences = []
for k, line in enumerate(thefile):
if line.startswith('>'):
labeler = line.strip('>').strip('\n')
label.append(labeler)
else:
seqfix = ''.join(line.strip('\n'))
sequences.append(seqfix)
dict_version = {k: v for k, v in zip(label, sequences)}
return dict_version
parse_fasta('small.fasta')
您可以边做边创建词典。这里有一个方法 编辑:删除了defaultdict(因此没有模块) 示例文件:
>1FN3:A|PDBID|CHAIN|SEQUENCE
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNAL
SALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>5OKT:A|PDBID|CHAIN|SEQUENCE
MGSSHHHHHHSSGLVPRGSHMELRVGNRYRLGRKIGSGSFGDIYLGTDIAAGEEVAIKLECVKTKHPQLHIESKIYKMMQ
GGVGIPTIRWCGAEGDYNVMVMELLGPSLEDLFNFCSRKFSLKTVLLLADQMISRIEYIHSKNFIHRDVKPDNFLMGLGK
KGNLVYIIDFGLAKKYRDARTHQHIPYRENKNLTGTARYASINTHLGIEQSRRDDLESLGYVLMYFNLGSLPWQGLKAAT
KRQKYERISEKKMSTPIEVLCKGYPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLK*
>2PAB:A|PDBID|CHAIN|SEQUENCE
GPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLTTEEQFVEGIYKVEIDTKSYWK
ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE*
>3IDP:B|PDBID|CHAIN|SEQUENCE
HHHHHHDRNRMKTLGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKNEVGVLRK
TRHVNILLFMGYSTKPQLAIVTQWCEGSSLYHHLHIIETKFEMIKLIDIARQTAQGMDYLHAKSIIHRDLKSNNIFLHED
LTVKIGDFGLATEKSRWSGSHQFEQLSGSILWMAPEVIRMQDKNPYSFQSDVYAFGIVLYELMTGQLPYSNINNRDQIIF
MVGRGYLSPDLSKVRSNCPKAMKRLMAECLKKKRDERPLFPQILASIELLARSLPKIHRS
>4QUD:A|PDBID|CHAIN|SEQUENCE
MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRN
LKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKIFNFFRGDRCRSLTGKPKLFII
QACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVN
RKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH
创建的字典的漂亮印刷体是:
{'1FN3:A|PDBID|CHAIN|SEQUENCE': 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR',
'2PAB:A|PDBID|CHAIN|SEQUENCE': 'GPTGTGESKCPLMVKVLDAVRGSPAINVAVHVFRKAADDTWEPFASGKTSESGELHGLTTEEQFVEGIYKVEIDTKSYWKALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE*',
'3IDP:B|PDBID|CHAIN|SEQUENCE': 'HHHHHHDRNRMKTLGRRDSSDDWEIPDGQITVGQRIGSGSFGTVYKGKWHGDVAVKMLNVTAPTPQQLQAFKNEVGVLRKTRHVNILLFMGYSTKPQLAIVTQWCEGSSLYHHLHIIETKFEMIKLIDIARQTAQGMDYLHAKSIIHRDLKSNNIFLHEDLTVKIGDFGLATEKSRWSGSHQFEQLSGSILWMAPEVIRMQDKNPYSFQSDVYAFGIVLYELMTGQLPYSNINNRDQIIFMVGRGYLSPDLSKVRSNCPKAMKRLMAECLKKKRDERPLFPQILASIELLARSLPKIHRS',
'4QUD:A|PDBID|CHAIN|SEQUENCE': 'MENTENSVDSKSIKNLEPKIIHGSESMDSGISLDNSYKMDYPEMGLCIIINNKNFHKSTGMTSRSGTDVDAANLRETFRNLKYEVRNKNDLTREEIVELMRDVSKEDHSKRSSFVCVLLSHGEEGIIFGTNGPVDLKKIFNFFRGDRCRSLTGKPKLFIIQACRGTELDCGIETDSGVDDDMACHKIPVEADFLYAYSTAPGYYSWRNSKDGSWFIQSLCAMLKQYADKLEFMHILTRVNRKVATEFESFSFDATFHAKKQIPCIVSMLTKELYFYH',
'5OKT:A|PDBID|CHAIN|SEQUENCE': 'MGSSHHHHHHSSGLVPRGSHMELRVGNRYRLGRKIGSGSFGDIYLGTDIAAGEEVAIKLECVKTKHPQLHIESKIYKMMQGGVGIPTIRWCGAEGDYNVMVMELLGPSLEDLFNFCSRKFSLKTVLLLADQMISRIEYIHSKNFIHRDVKPDNFLMGLGKKGNLVYIIDFGLAKKYRDARTHQHIPYRENKNLTGTARYASINTHLGIEQSRRDDLESLGYVLMYFNLGSLPWQGLKAATKRQKYERISEKKMSTPIEVLCKGYPSEFATYLNFCRSLRFDDKPDYSYLRQLFRNLFHRQGFSYDYVFDWNMLK*'}
编辑:在尝试后使用解决方案:
from pprint import pprint
def parse_fasta(path):
with open(path) as thefile:
label = []
sequences = ''
total_seq = []
for line in thefile:
line = line.strip()
if len(line) == 0:
continue
if line.startswith('>'):
line = line.strip('>')
label.append(line)
if len(sequences) > 0:
total_seq.append(sequences)
sequences = ''
else:
sequences += line
total_seq.append(sequences)
dict_version = {k: v for k, v in zip(label, total_seq)}
return dict_version
d = parse_fasta('fasta_sample.txt')
pprint(d)
您将看到我做了一些更改以获得正确的输出。我添加了一个数组total_seq
来保存每个序列头的序列。(您没有这个问题,这是您解决方案中的一个问题)。代码中的联接
没有执行任何操作。该值只是一个字符串,尽管您的想法是正确的。您将在修订后的代码中看到,join
将一个标题id的累积序列连接到一个fasta字符字符串中
我测试了空白行,如果该行是空白的,我做了一个continue
(len(line)==0
)
如果len(序列)>0,则进行测试,以查看是否已经看到任何序列。他们不会在第一张唱片上看到。它会在看到任何序列之前看到ID
for
循环完成后,需要添加最后一个序列
total_seq.追加(序列)
因为当检测到新ID时,除了最后一个序列之外的所有其他序列都被添加到总序列中
我希望这个解释能对您有所帮助,因为它更接近您的代码。为什么不使用string.split('>')
?您的示例有助于我们了解一般情况下的预期,但是,您能否进一步帮助我们,并给出一个生成您想要修复的seq1到seq4的文本示例?我重述了答案,以便更密切地遵循您的方法,进行必要的更改以更正错误。希望这会有帮助。很遗憾,我不能使用库来回答这个问题,但是谢谢:)添加“如果关键字不在dict_版本中:dict_版本[key]=''”将消除包含defaultdict的需要。非常感谢你的回答,非常有帮助。我理解这两个答案,只是你在我的解决方案方法中写的那一个是我尝试过但写不出来的。
from pprint import pprint
def parse_fasta(path):
with open(path) as thefile:
label = []
sequences = ''
total_seq = []
for line in thefile:
line = line.strip()
if len(line) == 0:
continue
if line.startswith('>'):
line = line.strip('>')
label.append(line)
if len(sequences) > 0:
total_seq.append(sequences)
sequences = ''
else:
sequences += line
total_seq.append(sequences)
dict_version = {k: v for k, v in zip(label, total_seq)}
return dict_version
d = parse_fasta('fasta_sample.txt')
pprint(d)