Python: Find the unique first matching line, with the lines above and below it, in a fastq file from a fasta file

Tags: python, unix, awk, grep, fastq


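I have two files: one is a fasta file and the other is a fastq file. I want to take each line of the fasta file, search for it in the fastq file, and print the line above and the line below the match. This is what I have: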

fasta file

read1

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa c

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa g

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ga

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa c

+

CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############

@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821:N:0:TTAGGC

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

+
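;@?DBD

It sounds like you want a fastq containing only unique sequences.

This is an extremely inefficient way to do it, but it should work. It stores your fastq file as a list, so hopefully the file isn't too big. It only discards duplicates based on the sequence, not on quality scores or anything else.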

fastq = 'input.fastq'  # path to the input file (placeholder name)

# Read the whole file into a list; only sensible for small files.
with open(fastq) as handle:
    fastqFile = handle.read().splitlines()

seen = set()  # sequences that have already been written

with open('output.fastq', 'w') as output:
    # A fastq record is four lines: header, sequence, '+', quality.
    for start in range(0, len(fastqFile) - 3, 4):
        record = fastqFile[start:start + 4]
        sequence = record[1]
        if sequence not in seen:
            seen.add(sequence)
            output.write('\n'.join(record) + '\n')
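
If the fastq file is too large to hold in a list, the same idea can be written so that only the sequences seen so far stay in memory. The following is a minimal sketch rather than part of the answer above: the function name `unique_records` is made up for illustration, and it assumes every record occupies exactly four lines.

```python
from itertools import islice


def unique_records(lines):
    """Yield 4-line fastq records whose sequence has not been seen before.

    `lines` is any iterable of lines (an open file works). Records are
    assumed to be exactly four lines: header, sequence, '+', quality.
    """
    seen = set()          # sequences already yielded
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))  # next header/sequence/+/quality
        if len(record) < 4:              # end of input or truncated record
            return
        sequence = record[1].rstrip('\n')
        if sequence not in seen:
            seen.add(sequence)
            yield record
```

With real files, each yielded record could be passed to `writelines` on an output handle while the input handle is read lazily, so the whole file is never loaded at once.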

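I think I know what you're trying to say, so here is my code. As requested, it only takes the first occurrence of each fasta sequence. This may not be the best way to do it, but I'm not very familiar with Python.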

# read both files into lists of lines
with open('fasta1.fa') as fh:
    fasta = fh.read().splitlines()
with open('fastq1.fq') as fh:
    fastq = fh.read().splitlines()

# get rid of headers and empty lines in the fasta
# if headers important, please indicate in example
fastaseq = [s for s in fasta if s and '>' not in s]

# get rid of empty lines in the fastq
# (a real list, so that .index() also works on Python 3)
fastq = [s for s in fastq if s]

# new list
newfastq = []

# go through each item in your fasta list
# if it matches, get the line above and below
# put in the new list
for fa in fastaseq:
    if fa in fastq:
        ind = fastq.index(fa)  # first occurrence only
        newfastq.append(fastq[ind-1:ind+2])

# write everything to file
with open('fastq2.fq', 'w') as f:
    for block in newfastq:
        for item in block:
            f.write(item + '\n')
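
If complete four-line records are wanted rather than three lines, and the fastq is large enough that calling `fastq.index` once per fasta sequence becomes slow, the lookup can be done with a dictionary built in a single pass. This is only a sketch under the assumption that each fastq record spans exactly four lines. The function name `first_matches` is hypothetical and does not come from the answer above.

```python
def first_matches(fasta_lines, fastq_lines):
    """Return the first fastq record for each fasta sequence, in fasta order.

    Both arguments are lists of lines without trailing newlines. Each fastq
    record is assumed to span exactly four lines.
    """
    # Map each sequence to the first record that carries it (one pass).
    first_record = {}
    for start in range(0, len(fastq_lines) - 3, 4):
        record = fastq_lines[start:start + 4]
        first_record.setdefault(record[1], record)

    matches = []
    emitted = set()
    for line in fasta_lines:
        # Skip blank lines, headers, and sequences already emitted.
        if not line or line.startswith('>') or line in emitted:
            continue
        if line in first_record:
            emitted.add(line)
            matches.append(first_record[line])
    return matches
```

Each returned record is a list of four strings, so writing it out is a matter of joining the lines with newlines.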

Unless you have a good reason to do this yourself, use Biopython.

fasta:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG

fastq (based on yours, but not identical, because yours is not formatted correctly):

@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1209:10703:35361 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
@@@FFFFFHGHHHGIJHFDDDDDBDD69@6B-707537BDDDB75@@85
@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
+
@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B###########

Code:

from Bio import SeqIO

with open("fasta") as fh:
    fasta = fh.read().splitlines()

seen = set()

for record in SeqIO.parse('fastq', 'fastq'):
    seq = str(record.seq)
    # print a record only the first time its sequence is encountered
    if seq in fasta and seq not in seen:
        seen.add(seq)
        print(record.format('fastq'))

Edit: The code above prints the records in the order of the fastq file, not the fasta file. If the order doesn't matter, you should use that approach. Otherwise, you can add the records to a dictionary whose keys are their indices in the FASTA file, and print them at the end by sorting the dictionary:

from Bio import SeqIO
import sys

with open("fasta") as fh:
    fasta = fh.read().splitlines()

seen = set()
records = {}

for record in SeqIO.parse('fastq', 'fastq'):
    seq = str(record.seq)
    if seq in fasta and seq not in seen:
        seen.add(seq)
        # key each record by its position in the FASTA file
        records[fasta.index(seq)] = record

# iterate over the keys in sorted order, i.e. in FASTA order
for index in sorted(records):
    sys.stdout.write(records[index].format('fastq'))

(Here I also used sys.stdout.write instead of print, to avoid the extra newline.)

Biologists, what is wrong with you... always posting the worst-formatted questions! You need to read up on how formatting works on Stackoverflow when asking questions.

Sorry about that! It won't happen next time.

In the past few months you have asked 14 questions, all of them badly formatted, and all of them had to be thoroughly overhauled (mostly by me, it seems). If you want the community to help you, the least you can do is format your questions properly.

I just want to print the first line of the fasta file once in the fastq file @pauloAlmeida

That's not what is in your "final output file". You should have an example of the desired output in the question (I don't understand what you mean by "print the first line of the fasta file once in the fastq file"). In any case, once you have the records in Biopython objects, you can print only what you want.

In the final output file I mentioned, the first line appears only once! That means I have lines with the same sequence repeated in the fastq file, and I want to print the first one, with one line above and one line below it @paulo Alimeida

That is what the code does; it prints the line only once (every sequence that is printed is stored so that it isn't printed twice).

@abh, my original code did not order the sequences according to the fasta file; now it does.