Python: Extracting DNA sequences from a FASTA file using a BED file

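How can I extract DNA sequences from a FASTA file? I have tried bedtools and samtools. bedtools getfasta works well, but for some of my files it returns a warning that the chromosome was not found in the FASTA file, even though the chromosome names in the BED file are exactly the same as in the FASTA file. I am looking for an alternative way to do this task in Python.

BED file:
chr1:117223140-117223856 3 7
chr1:117223140-117223856 5 9

FASTA file:
>chr1:117223140-117223856
cgtggctagggctagccc

Desired output:
>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC
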
BioPython is what you want to use:

from collections import defaultdict

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# Read names and positions from the BED file
positions = defaultdict(list)
with open('positions.bed') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        name, start, stop = line.split()
        positions[name].append((int(start), int(stop)))

# Parse the FASTA file and turn it into a dictionary keyed by record id
records = SeqIO.to_dict(SeqIO.parse('sequences.fasta', 'fasta'))

# Extract the short sequences (positions are treated as 1-based, inclusive)
short_seq_records = []
for name in positions:
    for (start, stop) in positions[name]:
        long_seq = records[name].seq
        # Seq objects no longer carry an alphabet (removed in Biopython 1.78),
        # so slice the Seq directly instead of rebuilding it from a string
        short_seq = long_seq[start-1:stop]
        short_seq_record = SeqRecord(short_seq, id=name, description='')
        short_seq_records.append(short_seq_record)

# Write the extracted records to a FASTA file
with open('output.fasta', 'w') as f:
    SeqIO.write(short_seq_records, f, 'fasta')
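
Where Biopython is not available, the name-to-sequence dictionary that SeqIO.to_dict builds in the code above can be roughly approximated with the standard library. The following is only a sketch under stated assumptions: `read_fasta` is a hypothetical helper, it keeps plain sequence strings rather than SeqRecord objects, it takes the first word after '>' as the record name, and it does not check for duplicate names. The input lines are made up for illustration.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {record name: sequence}.

    The record name is the first whitespace-separated word after '>'.
    """
    sequences = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    # Join the wrapped sequence lines of each record into one string
    return {key: "".join(parts) for key, parts in sequences.items()}


# Hypothetical two-record input; a real file could be passed as open(path).
example = [">seqA first record", "AACC", "GGTT", "", ">seqB", "TTTT"]
print(read_fasta(example))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly.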

Try the following:

from Bio import SeqIO

# Load the whole FASTA file into memory as a dictionary keyed by record id
parser = SeqIO.parse("input.fasta", "fasta")
dict_fasta = {seq.id: seq for seq in parser}

with open("input.bed") as bed, open("output.fasta", "w") as output:
    for line in bed:
        if not line.strip():
            continue  # skip blank lines
        seq_id, begin, end = line.split()
        if seq_id in dict_fasta:
            # [int(begin)-1:int(end)] if the first base in a chromosome is numbered 1
            # [int(begin):int(end)+1] if the first base in a chromosome is numbered 0
            output.write(dict_fasta[seq_id][int(begin)-1:int(end)].format("fasta"))
        else:
            print(seq_id + " not found")

If the first base of a chromosome is numbered 1, you get:

>chr1:117223140-117223856
CGTGG
>chr1:117223140-117223856
TGGGC

If the first base of a chromosome is numbered 0, you get:

>chr1:117223140-117223856
GTGGG
>chr1:117223140-117223856
GGGCT
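
The two slicing conventions named in the code comments can be checked on a plain string, without Biopython. This is a minimal sketch: `extract` is a hypothetical helper, and the sequence below is made up for illustration rather than taken from the asker's file.

```python
def extract(seq, begin, end, one_based=True):
    """Return the bases from begin to end, both ends inclusive.

    one_based=True  : the first base of the sequence is numbered 1
    one_based=False : the first base of the sequence is numbered 0
    """
    if one_based:
        return seq[begin - 1:end]
    return seq[begin:end + 1]


# Hypothetical example sequence, chosen only for illustration.
seq = "CGCGTGGGCTAGGGGCTAGCCCC"

for begin, end in [(3, 7), (5, 9)]:
    one = extract(seq, begin, end, one_based=True)
    zero = extract(seq, begin, end, one_based=False)
    print(begin, end, one, zero)
```

Both conventions here treat the end position as inclusive. The BED format itself defines intervals with a 0-based start and an exclusive end, which would correspond to seq[begin:end], so it is worth checking which convention a given file actually follows.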
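
Your BED file needs to be tab-delimited for bedtools to use it. Replace the colons, dashes, and spaces with tabs.

The bedtools documentation page says: "bedtools requires that all BED input files (and input received from stdin) are tab-delimited."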
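
A small sketch of converting space-separated BED lines to tab-delimited ones in Python. Note one deliberate difference from the advice above: it only replaces the whitespace between fields and leaves the name column (including its colon and dash) untouched, on the assumption that the name should keep matching the FASTA header. `to_tab_delimited` is a hypothetical helper and the input lines are illustrative.

```python
def to_tab_delimited(line):
    """Split a BED line on any run of whitespace and rejoin the fields with tabs."""
    return "\t".join(line.split())


# Hypothetical input lines, separated by spaces as in the question.
lines = [
    "chr1:117223140-117223856 3 7",
    "chr1:117223140-117223856   5 9",
]
fixed = [to_tab_delimited(line) for line in lines]
for line in fixed:
    print(repr(line))  # repr makes the tab characters visible as \t
```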

Your desired output seems wrong. Chromosome positions are numbered from 0, so extracting bases 3 to 7 from chr1:117223140-117223856 yields GTGGG, and extracting bases 5 to 9 yields GGGCT.

That was helpful. How can I make sure the output order follows the BED file? Thanks.
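
One possible way to keep the output in BED order is to emit one result per BED line as it is read, rather than grouping positions by name first (grouping can reorder entries when names are interleaved). The sketch below uses plain strings instead of Biopython records; `extract_in_bed_order` and the data are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
def extract_in_bed_order(bed_lines, sequences):
    """Return (name, subsequence) pairs in the same order as the BED lines.

    sequences maps a FASTA record name to its sequence string.
    Coordinates are treated as 1-based and inclusive (an assumption).
    """
    results = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        name, start, stop = fields[0], int(fields[1]), int(fields[2])
        if name not in sequences:
            continue  # name missing from the FASTA data
        results.append((name, sequences[name][start - 1:stop]))
    return results


# Hypothetical data for illustration only.
sequences = {"seqA": "AACCGGTTAACC", "seqB": "TTTTGGGGCCCC"}
bed_lines = ["seqB 1 4", "seqA 3 6", "seqB 5 8"]

for name, sub in extract_in_bed_order(bed_lines, sequences):
    print(">" + name)
    print(sub)
```

The same idea carries over to the Biopython versions above: build the record dictionary once, then loop over the BED file and append or write one record per line.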