Python: a way to ignore/account for newlines with read()


I am having trouble extracting text from a fairly large (>GB) text file. The file is structured like this:

>header1
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
andEnds
>header2
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAlineAtPosition_80
MaybeAnotherTargetBBBBBBBBBBBrestText
andEndsSomewhereHere

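Now I know that, in the entry with the header header2, I need to extract the text from position X to position Y (the A's in this example), where position 1 is the first letter of the line after the header.
But: the positions do not take the newline characters into account. So when it says 1 to 95, it really means the 80 letters of the first line plus the first 15 letters of the next line.
My first solution was to use file.read(X-1) to skip the unwanted part in front and then file.read(Y-X) to get the part I want, but when the range stretches across newlines I extract a few characters too few.
Is there another Python function besides read() that solves this? I thought about replacing all newlines with empty strings, but the file can be quite large (millions of lines).
I also tried to account for the newlines by adding extractLength // 80 as extra length, but that breaks in cases like the example: when the 95 characters are split 2-80-13 over 3 lines, I actually need 2 additional positions, but 95 // 80 is 1.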
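The newline bookkeeping described in the question can be sketched with seek() and read(). The point is that the number of newlines to skip depends on where the span starts and ends, so each endpoint is converted to a byte offset separately instead of adding length // 80 once. This is only a sketch under assumptions: read_span, seq_offset and width are hypothetical names, positions are 0-based and inclusive, every sequence line except the last holds exactly width residues, lines end in a single "\n", and the file is opened in binary mode.

```python
import io

def read_span(handle, seq_offset, start, end, width):
    """Return residues start..end (0-based, inclusive) of one sequence.

    seq_offset is the byte offset of the sequence's first residue,
    width is the number of residues per line (assumed constant).
    """
    # Every full line in front of a position adds one newline byte.
    first = seq_offset + start + start // width
    last = seq_offset + end + end // width
    handle.seek(first)
    chunk = handle.read(last - first + 1)
    # Drop the newlines that fall inside the span.
    return chunk.replace(b"\n", b"").decode("ascii")

# Demo on an in-memory "file" with 10 residues per line.
data = b">header2\nabcdefghij\nklmnopqrst\nuvwxyz\n"
seq_offset = data.index(b"\n") + 1   # first byte after the header line
print(read_span(io.BytesIO(data), seq_offset, 8, 21, 10))  # ijklmnopqrstuv
```

Only the requested bytes are read, so the rest of the file never has to be loaded or rewritten. Finding seq_offset for a given header still requires one pass over the file (or an index built beforehand).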
Update:
I modified the code to use Biopython:
for s in SeqIO.parse(sys.argv[2], "fasta"): 
        #foundClusters stores the information for substrings I want extracted
        currentCluster = foundClusters.get(s.id)

        if(currentCluster is not None):

            for i in range(len(currentCluster)):

                outputFile.write(">"+s.id+"|cluster"+str(i)+"\n")

                flanking = 25

                start = currentCluster[i][0]
                end = currentCluster[i][1]
                left = currentCluster[i][2]

                if(start - flanking < 0):
                    start = 0
                else:
                    start = start - flanking

                if(end + flanking > end + left):
                    end = end + left
                else:
                    end = end + flanking

                #for debugging only
                print(currentCluster)
                print(start)
                print(end)

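                # Note: the comma makes this a tuple index, not a slice;
                # this is the line the traceback points at.
                outputFile.write(s.seq[start, end+1])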
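The flanking logic in the loop above can be condensed into a small helper. This is a sketch, not the poster's code: clamp_window is a hypothetical name, and it assumes that the third value of each cluster (left) is the number of residues remaining after end, which is how the loop appears to use it.

```python
def clamp_window(start, end, left, flanking=25):
    """Widen the inclusive window [start, end] by `flanking` on both sides.

    The start is clamped at 0; the end grows by at most `left` residues
    (assumed to be what remains of the sequence after `end`).
    """
    new_start = max(0, start - flanking)
    new_end = end + min(flanking, left)
    return new_start, new_end

# The cluster from the debug output: [1, 55, 2782]
print(clamp_window(1, 55, 2782))  # (0, 80)
```

The result matches the 0 and 80 printed by the debugging lines. Because the window is inclusive, the matching slice is seq[new_start:new_end + 1].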

And it works :) with the SeqRecord / SeqIO.write lines shown after the traceback below.
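Biopython lets you parse FASTA files and easily access each record's id, description and sequence. You then have a Seq object that you can manipulate conveniently without re-coding everything yourself (reverse complement and so on).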
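Why not use Biopython?
The rules for what you should extract are not clear at all, nor is it clear why you can't simply iterate over the file line by line.
@jeanrjc I have no experience with Biopython. How would you use it? As for why I don't iterate over the file line by line: the text I want to extract is given by the start and end positions I mentioned and spans several lines. It can start or end in the middle of a line, and I don't want the whole line.
But how do you know what the start and end positions are? Slicing part of a line would probably be easier than using .read.
I have a second file that lists every site I want to extract in the format headerX:startPosition,endPosition. Since I don't know how to work out the start/end position within a line, I can't slice part of a line, but it can probably be calculated.
I modified the code to use Biopython, but I get an error when using your example. Maybe you can help? I updated my opening post.
My mistake, you have to write outputFile.write(s.seq[start:end+1]) instead of outputFile.write(s.seq[start, end+1]).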
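The second file mentioned in the comments (one site per line, headerX:startPosition,endPosition) could be loaded into a dictionary keyed by header, similar in spirit to foundClusters. This is a minimal sketch: parse_sites is a hypothetical name, the exact file format is assumed from the comment, and the real foundClusters evidently stores a third value per cluster that is not covered here.

```python
from collections import defaultdict

def parse_sites(lines):
    """Map each header to a list of (start, end) tuples.

    Each non-empty line is assumed to look like 'header:start,end'.
    """
    sites = defaultdict(list)
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        # Split on the last colon so headers may contain ':' themselves.
        header, _, span = raw.rpartition(":")
        start, end = (int(value) for value in span.split(","))
        sites[header].append((start, end))
    return dict(sites)

example = ["header2:66,130", "", "header1:1,95", "header2:140,145"]
print(parse_sites(example))
```

With such a dictionary, each parsed record only needs one lookup by its id, as in the currentCluster = foundClusters.get(s.id) line of the question.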
[[1, 55, 2782]]
0
80
Traceback (most recent call last):
  File "findClaClusters.py", line 92, in <module>
    outputFile.write(s.seq[start, end+1])
  File "/usr/local/lib/python3.4/dist-packages/Bio/Seq.py", line 236, in __getitem__
   return Seq(self._data[index], self.alphabet)
TypeError: string indices must be integers
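The traceback comes from indexing with a comma instead of a colon: start, end+1 is a tuple, and the string that backs the sequence cannot be indexed by a tuple. The same behaviour can be reproduced on a plain str, as in this small sketch (the variable names are illustrative, and the exact wording of the error message differs between Python versions):

```python
text = "hereComesText"
start, end = 4, 8

try:
    text[start, end + 1]          # index is the tuple (4, 9)
except TypeError as exc:
    error = str(exc)              # "string indices must be integers ..."

piece = text[start:end + 1]       # a slice: characters 4..8 inclusive
print(error)
print(piece)
```

The slice returns "Comes", while the tuple index raises the TypeError.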

from Bio.SeqRecord import SeqRecord  # needed for SeqRecord below

outRecord = SeqRecord(s.seq[start:end+1], id=s.id+"|cluster"+str(i), description="Repeat-Cluster")
SeqIO.write(outRecord, outputFile, "fasta")

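from Bio import SeqIO

# 0-based positions of the first and last character to extract
X = 66
Y = 130
for s in SeqIO.parse("test.fst", "fasta"):
    if s.id == "header2":
        # Slices exclude the end index, hence Y+1
        print(s.seq[X:Y+1])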
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
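For readers without Biopython installed, the idea behind the loop above can be approximated with the standard library alone. The sketch below is not Biopython's implementation: iter_fasta is a hypothetical helper that streams the file one record at a time, joins the wrapped lines of each record, and leaves the slicing to ordinary string indexing. The demo data mimics the layout of the example file (37 characters per line), with "x" standing in for the non-target text.

```python
import io

def iter_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA text stream."""
    record_id, parts = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(parts)
            # The id is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            parts = []
        elif record_id is not None:
            parts.append(line)
    if record_id is not None:
        yield record_id, "".join(parts)

lines = [
    ">header1", "x" * 37, "andEnds",
    ">header2", "x" * 37, "x" * 29 + "A" * 8, "A" * 37, "A" * 20 + "x" * 17,
]
demo = io.StringIO("\n".join(lines) + "\n")

X, Y = 66, 130  # 0-based, inclusive
for record_id, seq in iter_fasta(demo):
    if record_id == "header2":
        target = seq[X:Y + 1]
        print(target)
```

Only one record is held in memory at a time, so this also works for files with millions of lines, as long as no single sequence is too large to keep as a string.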