Python: extracting sequences and headers from a FASTA file using known sequences


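I am trying to compare two files and extract the sequences from one file that contain an entry of the other file as a substring. I also want to extract the identifiers (headers). So far, however, I can only extract the matching sequences themselves. The example files are: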

text.fa
>header1
ETTTHAASCISATTVQEQ*TLFRLLP
>header2
SKSPCSDSDY**AAA
>header3
SSGAVAAAPTTA

and textref.fa.

When I run my code, I get the following output:

ETTTHAASCISATTVQEQ*TLFRLLP
SSGAVAAAPTTA

However, my expected output includes the headers:

>header1
ETTTHAASCISATTVQEQ*TLFRLLP
>header3
SSGAVAAAPTTA

My code has two parts: first I create a file containing the matching sequences, then I try to retrieve their headers from the original FASTA file:

def get_nucl(filename):
    # Collect every non-header line (the sequences) of a FASTA file.
    with open(filename, 'r') as fd:
        nucl = []
        for line in fd:
            if line[0] != '>':
                nucl.append(line.strip())
        return nucl


def finding(filename, reffile):
    # Yield each sequence that contains a line of reffile as a substring.
    nucl = get_nucl(filename)
    with open(reffile, 'r') as reffile2:
        for line in reffile2:
            for element in nucl:
                if line.strip() in element:
                    yield element


with open('sequencesmatched.txt', 'w') as output:
    results = finding('text.fa', 'textref.fa')
    for res in results:
        print(res)
        output.write(res + '\n')

So sequencesmatched.txt now contains the sequences from text.fa that have a line of textref.fa as a substring:

ETTTHAASCISATTVQEQ*TLFRLLP
SSGAVAAAPTTA

In the second part, I try to retrieve the corresponding headers together with these sequences:

def finding(filename, seqfile):
    with open(filename, 'r') as fastafile:
        with open(seqfile, 'r') as sequf:
            alls = []
            for line in fastafile:
                alls.append(line.strip())
            print(alls)
            sequfs = []
            for line2 in sequf:
                sequfs.append(line2.strip())
                if str(line.strip()) == str(line2.strip()):
                    num = alls.index(line.strip())
                    print(alls[num-1] + line)


print(finding('text.fa', 'sequencesmatched.txt'))

However, as output I only get a single sequence, the first match:

>header1
ETTTHAASCISATTVQEQ*TLFRLLP

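Maybe I could do this without the second file, but I could not write a loop that returns the sequences together with their respective headers, so I took the long way round.

I would be glad if you could help.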

If your files always have the same structure, you can do something simpler:

def get_nucl(filename):
    # Map each header (without the leading '>') to its sequence line.
    with open(filename, 'r') as fd:
        headers = {}
        key = ''
        for line in fd:
            if line.startswith('>'):
                key = line.strip()[1:]  # remove the '>'
            else:
                headers[key] = line.strip()

    return headers

Here I assume your file starts with a '>headerN' line; if it does not, you will have to add some checks. You now have a dictionary like this:

headers['header1'] = 'ETTTHAASCISATTVQEQ*TLFRLLP'
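The parser above keeps only one sequence line per header, which is fine for the example but not for FASTA files whose sequences are wrapped over several lines. A minimal sketch of a variant that concatenates wrapped lines could look like this; the name read_fasta and the wrapped sample data are assumptions made for illustration, not part of the original code:

```python
def read_fasta(lines):
    """Parse FASTA text lines into a {header: sequence} dict.

    Sequence lines that belong to the same record are concatenated.
    Lines appearing before the first header are ignored.
    """
    records = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            header = line[1:]      # drop the leading '>'
            records[header] = ''
        elif header is not None:
            records[header] += line
    return records


# Hypothetical input: header1 from the question, wrapped over two lines.
sample = [
    '>header1\n',
    'ETTTHAASCISATT\n',
    'VQEQ*TLFRLLP\n',
    '>header2\n',
    'SKSPCSDSDY**AAA\n',
]
print(read_fasta(sample))
```

Because it accepts any iterable of lines, an open file object can be passed directly, for example read_fasta(open('text.fa')).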

Now, to find the matches, just use that dict:

def finding(filename, reffile):
    # Return {header: sequence} for every sequence containing a line of reffile.
    headers = get_nucl(filename)
    with open(reffile, 'r') as f:
        matches = {}
        for line in f:
            for key, value in headers.items():
                if line.strip() in value and key not in matches:
                    matches[key] = value

    return matches

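So once you have a dict that maps each header to its sequence, you only need to check whether your substring occurs in one of the values; the header is then already available as the key.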
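To get the output format asked for in the question (each header followed by its sequence), the matching dict can be written back out in FASTA layout. The following is only a sketch: find_matches and write_fasta are illustrative names, and the reference substrings are made up, since the contents of textref.fa are not shown in the question.

```python
import io


def find_matches(records, patterns):
    """Return {header: sequence} for sequences containing any pattern."""
    # Strip newlines and drop blank lines: an empty string is a
    # substring of everything and would match every record.
    patterns = [p.strip() for p in patterns if p.strip()]
    return {
        header: seq
        for header, seq in records.items()
        if any(p in seq for p in patterns)
    }


def write_fasta(matches, out):
    """Write the matches as '>header' / sequence line pairs."""
    for header, seq in matches.items():
        out.write('>' + header + '\n' + seq + '\n')


records = {
    'header1': 'ETTTHAASCISATTVQEQ*TLFRLLP',
    'header2': 'SKSPCSDSDY**AAA',
    'header3': 'SSGAVAAAPTTA',
}
# Hypothetical reference lines (the real textref.fa is not shown).
patterns = ['SATTVQ\n', 'AAAPTT\n', '\n']

buffer = io.StringIO()
write_fasta(find_matches(records, patterns), buffer)
print(buffer.getvalue(), end='')
```

Replacing the StringIO buffer with open('sequencesmatched.txt', 'w') would write the same text to a file, which makes the second lookup pass from the question unnecessary.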

I just saw that you call print(finding(...)). Your function already prints its results, so just call it.
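The reason is that a function without a return statement returns None, so wrapping the call in print() shows None after whatever the function printed itself. A small illustration, using made-up function names:

```python
def report(items):
    """Print each item; there is no return statement."""
    for item in items:
        print(item)


def collect(items):
    """Return the items instead of printing them."""
    return list(items)


result = report(['>header1', 'ETTTHAASCISATTVQEQ*TLFRLLP'])
print(result)    # the implicit return value of report() is None

collected = collect(['>header1', 'ETTTHAASCISATTVQEQ*TLFRLLP'])
print(collected)  # here print() shows the returned list
```

Either print inside the function and call it on its own, or return the data and print it at the call site, but not both.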

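There is an error: all[num-1]. That is not your list alls but a Python built-in function. Did you misspell it? The s is missing.

@Bestasttung Thanks! I had not noticed. Now I get no error, but the output is not what I want. I am editing the question.

Thanks for the code. However, I only get None as output.

Yes, sorry, I forgot: line.stip() should be line.strip() in the line if line.strip() in value and key not in matches:

It works now, thank you very much. However, I still only get the first header and the first sequence; the second match is not shown.

I just tested it with the example you provided, and the matches dict contains header1 and header3. So you may have missed something, or your source files differ from the example. print(matches) gives:

{'header1': 'ETTTHAASCISATTVQEQ*TLFRLLP', 'header3': 'SSGAVAAAPTTA'}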
