Python: removing whitespace after the lines have already been stripped?

I'm having trouble removing all of the whitespace from a FASTA file. Here is the program I'm using so far:

import re
for line in f:
    line = line.rstrip(' \n\r')
    if line.startswith(">"):
        seqid = re.search('Segment:[(0-9)]',line).group()
        seqID.append(seqid)
    else:
        numSeq = len(line)

This is what the test file looks like (I only used the first two to show the seqId):

When I print it out, it comes out like this:

ATTATATTCAGTATGGAAAGAATAAAAGAACTACGGAATCTGATGTCGCAGTCTCGCACTCGCGAGATAC 70
TGACAAAAACCACAGTGGACCATATGGCCATAATTAAGAAGTACACATCGGGGAGACAGGAAAAGAACCC 70
GTCACTTAGGATGAAATGGATGATGGCAATGAAATATCCAATCACTGCTGACAAAAGGGTAACAGAAATG 70
 0
ATTATATTCAGTATGGAAAGAATAAAAGAATTACGGAATCTGATGTCGCAATCTCGCACTCGCGAGATAC 70
TGACAAAAACCACAGTGGACCATATGGCCATAATTAAGAAGTACACATCGGGGAGACAGGAAAAGAACCC 70
GTCACTTAGGATGAAATGGATGATGGCAATGAAATACCCAATCACTGCTGACAAAAGAATAACAGAAATG 70
 0
ATTATATTCAGTATGGAAAGAATAAAAGAACTACGGAATCTGATGTCGCAGTCTCGCACTCGCGAGATAC 70
TGACAAAAACCACAGTGGACCATATGGCCATAATTAAGAAGTACACATCGGGGAGACAGGAAAAGAACCC 70
GTCACTTAGGATGAAATGGATGATGGCAATGAAATATCCAATCACTGCTGACAAAAGGGTAACAGAAATG 70
 0

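How do I get it to concatenate the lines and drop the lines with 0 nucleotides? Sorry for the poor wording, I'm short on sleep. If anything about my question is unclear, feel free to ask.

Here is the full program: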

from __future__ import division
import re
f = open('fastatest.fasta','r')
numGC = 0;
allGC = []; #array that contains all the GC%'s
sequences = []; #The array that contains all the sequences
seqID = []; #The array that contains all seqIds
seqLen = [];
numSeq = 0
GCPercent = 0
#Concatinating the FASTA file
for line in f:
    line = line.rstrip(' \n\r')
    if line.startswith(">"):
        seqid = re.search('Segment:[(0-9)]',line).group()
        seqID.append(seqid)
    else: #Find the Length and GC%
        numSeq = len(line)
        #print seqid, numSeq
        GCPercent = (( line.count('G') + line.count('C') ) / (numSeq)*100)
        allGC.append(GCPercent);
        sequences.append(line)
        seqLen.append(numSeq)
        print "%s\t%d\t%.2f" % (seqid,numSeq,GCPercent)

And the output I get:

Segment:1   70  40.00
Segment:1   70  44.29
Segment:1   70  38.57
Traceback (most recent call last):
  File "blah", line 20, in <module>
    GCPercent = (( line.count('G') + line.count('C') ) / (numSeq)*100)
ZeroDivisionError: division by zero

Would a conditional append work?

if not seqid.strip().startswith('0'):
    seqID.append(seqid)

If not, could we see what seqid looks like?

When the length of the line is 0, you can skip straight to the next iteration of the loop:

numSeq = len(line)  # from your code for reference
if not numSeq:
    continue
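
To make the effect of that guard concrete, here is a minimal, self-contained sketch. The function name gc_percent_per_line and the sample lines are made up for illustration and are not from the question's file. A line that is empty after stripping is skipped before its length is ever used as a divisor, which is what avoids the ZeroDivisionError:

```python
def gc_percent_per_line(lines):
    """Return (length, GC%) for every non-empty, non-header line.

    Hypothetical helper for illustration only.
    """
    results = []
    for line in lines:
        line = line.strip()          # drop the newline and any stray spaces
        if not line:                 # blank line: nothing to measure, skip it
            continue
        if line.startswith(">"):     # header line: not sequence data
            continue
        gc = line.count("G") + line.count("C")
        results.append((len(line), 100.0 * gc / len(line)))
    return results


# Made-up input containing an empty line and a whitespace-only line.
sample = [">demo header\n", "GGCCAATT\n", "\n", "ATAT\n", " \n"]
print(gc_percent_per_line(sample))
```

Note that this still reports one result per physical line, so it fixes the crash but does not join the wrapped 70-character lines into one sequence.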

If the file has an empty line after every sequence (including after the last one!), then this should work:

if line.startswith(">"):
    seqid = re.search('Segment:[(0-9)]',line).group()
    seqID.append(seqid)
    sequence = ""
elif len(line.strip()): 
    sequence += line.strip()  # three lines will make a sequence
else: #Find the Length and GC%
    numSeq = len(sequence)
    #print seqid, numSeq
    GCPercent = (( sequence.count('G') + sequence.count('C') ) / (numSeq)*100)
    allGC.append(GCPercent);
    sequences.append(sequence)
    seqLen.append(numSeq)
    print "%s\t%d\t%.2f" % (seqid,numSeq,GCPercent)

I just added three lines and replaced line with sequence in four places. It looks like a minimal-change solution, but I haven't tested it.
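
If you can't count on a blank line after the last record, a variant of the same idea is to emit a record whenever the next header shows up, and once more at the end of the input. This is only a rough sketch under those assumptions: read_fasta, gc_percent and the demo data are hypothetical names I made up, and it keeps the whole header rather than using the Segment regex from the question.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines.

    Hypothetical helper; blank lines are ignored rather than used as
    record separators.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                          # ignore empty lines entirely
        if line.startswith(">"):
            if header is not None:            # a new header closes the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)               # one wrapped piece of the sequence
    if header is not None:                    # emit the last record at end of input
        yield header, "".join(chunks)


def gc_percent(sequence):
    """GC content as a percentage; 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up records; the second one has no trailing blank line or newline.
demo = [">rec1\n", "GGCC\n", "AATT\n", "\n", ">rec2\n", "ATGC\n", "GC"]
for header, seq in read_fasta(demo):
    print("%s\t%d\t%.2f" % (header, len(seq), gc_percent(seq)))
```

With a real file you would pass the open file object instead of the demo list, since iterating over a file also yields one line at a time.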

You can ignore the empty lines by checking for them:

from __future__ import division
import re

numGC = 0;
allGC = []; #array that contains all the GC%'s
sequences = []; #The array that contains all the sequences
seqID = []; #The array that contains all seqIds
seqLen = [];
numSeq = 0
GCPercent = 0

with open('fastatest.fasta', 'r') as f:
    #Concatinating the FASTA file
    for line in f:
        line = line.rstrip(' \n\r')
        if line:  # non-empty line?
            if line.startswith(">"):
                seqid = re.search('Segment:[(0-9)]',line).group()
                seqID.append(seqid)
            else: #Find the Length and GC%
                numSeq = len(line)
                #print seqid, numSeq
                GCPercent = ((line.count('G') + line.count('C')) /
                             (numSeq)*100)
                allGC.append(GCPercent);
                sequences.append(line)
                seqLen.append(numSeq)
                print "%s\t%d\t%.2f" % (seqid,numSeq,GCPercent)

Output:

Segment:1   70  40.00
Segment:1   70  44.29
Segment:1   70  38.57
Segment:1   70  37.14
Segment:1   70  44.29
Segment:1   70  37.14

Try using Biopython:

from Bio import SeqIO
for record in SeqIO.parse("fasta.fas","fasta"):
     print record.id
     print record.seq

This will strip out all the newlines and so on.

An example of the input file would be really useful.

Could you provide an example of what the input file looks like? I ran your sample code with a sample input file, and the output I get is nowhere near what you listed. Is this the complete code?

No, this isn't the full code; I was just trying to figure out how to remove the empty lines.

That fixed the division-by-zero problem. Would you happen to know how to make it join the segments together instead of stopping after 70 nucleotides?

I think you could use ''.join(sequences) to solve that.