
How to edit a text (.fastq) file in Python

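I have a file like the small example below. Every 4 lines belong to one ID, and the second line of each ID starts with N. I want to remove the N at the start of those lines and leave everything else unchanged. I would like to do this in Python. Do you know how?

For example: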

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
NGCGACCTCAGATCAGACGTGGCGACC
+SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
#<<ABGGGGGGGGGGGGGGGGGGGGGG
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
NGCCGACATCGAAGGATCAA
+SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
#<<ABFGGGGGGGGGGGGGG
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
NACAAACCCTTGTGTCGAGGGC
+SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
#=ABBGGGGGGGGGGGGGGGGG
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
NGGGACATGACAGCCTGGACCATCG
+SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
#=ABBGGGGGGGGGGGGGGGGGGGG

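If I did exactly what you asked (removing the leading N from each sequence), the records would be left in an inconsistent state.

Every fourth line of a FASTQ file holds the quality values for the sequence two lines above it. So if you remove the first character from the sequence, you also need to remove the first character from the line with the quality values.

You can do something very simple in pure Python, such as: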
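To make the consistency requirement concrete, here is a minimal sketch that groups lines into 4-line records and checks that the sequence and quality strings have the same length. The helper names `read_fastq_records` and `is_consistent` are hypothetical, and the record is a shortened, made-up one rather than real data; the sketch also assumes every record is exactly four lines (no wrapped sequences).

```python
from itertools import islice


def read_fastq_records(lines):
    """Yield (header, sequence, separator, quality) tuples, four lines at a time."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(stripped, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def is_consistent(record):
    """A record is consistent when each base has exactly one quality character."""
    header, sequence, separator, quality = record
    return (
        header.startswith("@")
        and separator.startswith("+")
        and len(sequence) == len(quality)
    )


# Shortened, made-up record used only for illustration
example = ["@read1", "NGCGACC", "+", "#<<ABGG"]

record = next(read_fastq_records(example))
print(is_consistent(record))  # True

# Trimming only the sequence leaves one quality character too many
header, sequence, separator, quality = record
broken = (header, sequence[1:], separator, quality)
print(is_consistent(broken))  # False
```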

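with open("example.fastq") as f:
    for idx, line in enumerate(f):
        line = line.rstrip("\n")
        # Odd 0-based indices are the sequence and quality lines of each 4-line record
        if idx % 2:
            print(line[1:])  # drop the first base / first quality character
        else:
            print(line)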
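
But if you plan to work with biological data regularly, you really should start using a bioinformatics module such as Biopython. It will warn you if you try to do something that would leave the file in an inconsistent shape or that does not make sense.

The solution looks like this: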

from Bio import SeqIO
from Bio import Seq

new_records = []
for record in SeqIO.parse("example.fastq", "fastq"):
    sequence = str(record.seq)
    letter_annotations = record.letter_annotations

    # You first need to empty the existing letter annotations
    record.letter_annotations = {}

    new_sequence = sequence[1:]
    record.seq = Seq.Seq(new_sequence)


    new_letter_annotations = {'phred_quality': letter_annotations['phred_quality'][1:]}
    record.letter_annotations = new_letter_annotations

    new_records.append(record)


with open('without_starting_N.fastq', 'w') as output_handle:
    SeqIO.write(new_records, output_handle, "fastq")

which outputs:

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
GCGACCTCAGATCAGACGTGGCGACC
+
<<ABGGGGGGGGGGGGGGGGGGGGGG
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
GCCGACATCGAAGGATCAA
+
<<ABFGGGGGGGGGGGGGG
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
ACAAACCCTTGTGTCGAGGGC
+
=ABBGGGGGGGGGGGGGGGGG
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
GGGACATGACAGCCTGGACCATCG
+
=ABBGGGGGGGGGGGGGGGGGGGG

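Note that to keep the FASTQ format valid you also need to remove the first character of the quality line. What you are asking for does not keep the bases and their qualities matched.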
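
The advice about trimming the quality line together with the sequence can also be sketched in pure Python, one record at a time. Unlike the simple loop above, this version only trims reads that actually start with N, which is an assumption about what is wanted. The function name `trim_leading_n` and the sample records are hypothetical, and the sketch assumes every record is exactly four lines.

```python
def trim_leading_n(lines):
    """Yield FASTQ lines, dropping a leading N and its matching quality character."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue
        header, sequence, separator, quality = record
        record = []
        if sequence.startswith("N"):
            # Trim both lines so that bases and qualities stay aligned
            sequence, quality = sequence[1:], quality[1:]
        yield from (header, sequence, separator, quality)


# Made-up records: the first starts with N, the second does not
sample = [
    "@read1", "NGCGACC", "+", "#<<ABGG",
    "@read2", "ACGT", "+", "GGGG",
]

trimmed = list(trim_leading_n(sample))
print("\n".join(trimmed))
```

To process a real file, pass an open file handle as `lines` and write each yielded line, followed by a newline, to the output handle.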