Python: How can I extract DNA sequences based on a text file of binary (0/1) values?


Tags: python, python-2.7, bioinformatics, biopython, fasta


For example, I have a FASTA file containing the following sequences, in this order:

>human1
AGGGCGSTGC
>human2
GCTTGCGCTAG
>human3
TTCGCTAG

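How can I use Python to read a text file like the one below and use it to decide which sequences to extract? 1 means true and 0 means false; only sequences whose value is 1 should be extracted.

Example text file: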

0
1
1

Expected output:

>human2
GCTTGCGCTAG
>human3
TTCGCTAG

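I'm not familiar with the FASTA file format, but I hope this helps. You can open the file in Python as follows and collect the indices of the valid lines in a list: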

valid = []
with open('test.txt') as f:
    # strip newline characters from every line
    all_lines = [x.strip() for x in f]
    for i, line in enumerate(all_lines):
        if line == '1':  # the flag at this position is set
            valid.append(i)  # remember the zero-based index

print(valid)  # or use these indices to pick records from the FASTA file

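For a test.txt with the sample contents from the question (0, 1, 1), printing valid gives [1, 2], the zero-based indices of the lines set to 1.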

I think this will help you get going, but let me know in the comments if you need a more detailed answer.
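As a rough illustration of the "use these indices" step, the sketch below selects whole FASTA records by index. The helper `read_fasta` and the in-memory `fasta_lines` list are hypothetical names introduced here, not part of the answer; the reader assumes every record starts with a `>` header line and tolerates sequences wrapped over several lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; assumes '>' starts each record.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Stand-in for the lines of the FASTA file from the question.
fasta_lines = [">human1", "AGGGCGSTGC", ">human2", "GCTTGCGCTAG", ">human3", "TTCGCTAG"]
valid = [1, 2]  # indices of mask lines equal to '1'

records = list(read_fasta(fasta_lines))
selected = [records[i] for i in valid]
for header, seq in selected:
    print(">" + header)
    print(seq)
```

With a real file, `fasta_lines` could be replaced by an open file object, since the helper only iterates over lines.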

You can build a list that acts as a mask while reading the FASTA file:

with open('mask.txt') as mf:
    mask = [ s.strip() == '1' for s in mf.readlines() ]

Then use the mask to select records while reading the FASTA file.
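One possible way to apply such a mask, not necessarily the one this answer had in mind, is `itertools.compress` from the standard library, which keeps the items whose matching selector is true. The `records` list below is a hypothetical in-memory stand-in for already parsed FASTA entries, and `mask_lines` stands in for the lines of mask.txt.

```python
from itertools import compress

# Stand-in for the lines of mask.txt (assumed: one 0 or 1 per line).
mask_lines = ["0\n", "1\n", "1\n"]
mask = [s.strip() == "1" for s in mask_lines]

# Hypothetical parsed records: (header, sequence) pairs in file order.
records = [
    ("human1", "AGGGCGSTGC"),
    ("human2", "GCTTGCGCTAG"),
    ("human3", "TTCGCTAG"),
]

# compress() yields only the records whose mask entry is True.
kept = list(compress(records, mask))
for header, seq in kept:
    print(">" + header)
    print(seq)
```

Because `compress` is lazy, the same call should also work on a generator of records, so the whole FASTA file does not have to be held in memory.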

I think this may help you, though I really think you should spend some time learning Python. Python is a good language for bioinformatics.

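display = []
with open('test.txt') as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            display.append(int(line))

output_DNA = []
with open('XX.fasta') as f:
    index = -1  # index of the current record; -1 until the first header
    for line in f:
        if line.startswith('>'):
            index = index + 1

        # keep header and sequence lines of records flagged with 1
        if index >= 0 and display[index]:
            output_DNA.append(line)

# the lines keep their newlines, so join them to print FASTA text
print(''.join(output_DNA))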

For this, it is best to use Biopython:

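import sys
from Bio import SeqIO

# True for mask lines containing "1", False otherwise
with open("mask.txt") as mask_file:
    mask = [line.strip() == "1" for line in mask_file]

seqs = list(SeqIO.parse("input.fasta", "fasta"))
seqs_filter = [seq for flag, seq in zip(mask, seqs) if flag]
for seq in seqs_filter:
    # format("fasta") already ends with a newline
    sys.stdout.write(seq.format("fasta"))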

You get:

>human2
GCTTGCGCTAG
>human3
TTCGCTAG

Explanation

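parse fasta: in FASTA format a sequence may span several lines, so it is best to use a dedicated library to parse the input and write the output

mask: I read the mask file and convert each line to a boolean, giving [False, True, True]

filter: the zip function pairs each sequence with its mask value, and a list comprehension then keeps the sequences whose value is True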
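One detail of the zip-based filter is worth a small sketch: `zip` stops at the shorter input, so a mask with too few lines would silently drop trailing sequences. The function `filter_by_mask` below is a hypothetical helper, not part of the answer, that checks the lengths first; the records are plain tuples standing in for parsed sequences.

```python
def filter_by_mask(records, mask):
    """Return the records whose mask entry is True.

    Hypothetical helper; raises ValueError when the lengths differ,
    because zip() would otherwise truncate to the shorter input.
    """
    records = list(records)
    mask = list(mask)
    if len(records) != len(mask):
        raise ValueError(
            "mask has %d entries but there are %d records"
            % (len(mask), len(records))
        )
    return [rec for flag, rec in zip(mask, records) if flag]


# Stand-in data matching the example in the question.
records = [
    ("human1", "AGGGCGSTGC"),
    ("human2", "GCTTGCGCTAG"),
    ("human3", "TTCGCTAG"),
]
kept = filter_by_mask(records, [False, True, True])
print(kept)
```

Whether a length mismatch should be an error or merely a warning depends on how the mask file is produced; raising is simply the stricter choice.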

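Your question is very general. Have you written any code? Try to work it out first.

So you have a DNA sequence file and a separate text file with a 0 or 1 on each line, and you want to parse the text file to determine which sequences are valid?

Is the text file actually in a binary format, or are the 0s and 1s you mention written as plain text, in ASCII or some other text encoding?

The text file is plain-text 0s and 1s.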
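Since the mask is confirmed to be plain-text 0s and 1s, a reader can afford to be strict about it. The sketch below uses a hypothetical `parse_mask` helper (a name introduced here for illustration) and assumes blank lines should be ignored while anything other than 0 or 1 is an error.

```python
def parse_mask(lines):
    """Convert lines of '0'/'1' text into a list of booleans.

    Hypothetical helper; blank lines are skipped, other values rejected.
    """
    mask = []
    for number, line in enumerate(lines, start=1):
        value = line.strip()
        if not value:
            continue  # ignore blank lines, e.g. a trailing newline
        if value not in ("0", "1"):
            raise ValueError(
                "line %d: expected 0 or 1, got %r" % (number, value)
            )
        mask.append(value == "1")
    return mask


# Stand-in for the lines of the mask file from the question.
mask = parse_mask(["0\n", "1\n", "1\n", "\n"])
print(mask)  # [False, True, True]
```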
