How can I count the non-shared insertions and gaps in a sequence in Python?


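I'm trying to count the insertions and gaps in a set of sequences, relative to the reference they were aligned against; as a result of the alignment, all the sequences now have the same length.

For example: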

>reference
AGCAGGCAAGGCAA--GGAA-CCA
>sequence1
AAAA---AAAGCAATTGGAA-CCA
>sequence2
AGCAGGCAAAACAA--GGAAACCA
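In this example, sequence1 has two insertions (the two Ts) and three gaps. The last gap should not be counted, because it appears in both the reference and sequence1. Sequence2 has one insertion (an A before the last triplet) and no gaps. (Again, the gaps are shared with the reference and should not enter the count.) Sequence1 and sequence2 also contain 3 and 2 polymorphisms, respectively.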

My current script gives an estimate of the differences, but not a count of the "relevant gaps and insertions" described above. For example:

from Bio import SeqIO

records = list(SeqIO.parse("sequences.fasta", "fasta"))
reference = records[0]  # the reference is the first sequence in the file
del records[0]

for record in records:
   gaps = record.seq.count("-") - reference.seq.count("-")
   basesinreference = reference.seq.count("A") + reference.seq.count("C") + reference.seq.count("G") + reference.seq.count("T")
   basesinsequence = record.seq.count("A") + record.seq.count("C") + record.seq.count("G") + record.seq.count("T")
   print(record.id)
   print(gaps)
   print(basesinsequence - basesinreference)
#Gives
sequence1
1 #Which means sequence 1 has one more Gap than the reference
-1 #Which means sequence 1 has one base less than the reference
sequence2
-1 #Which means sequence 2 has one Gap less than the reference
1 #Which means sequence 2 has one more base than the reference

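I'm a Python newbie and still learning the tools the language offers. Is there a way to do this? I was thinking of splitting the sequences and iterating over them, comparing one position at a time and counting the differences, but I'm not sure whether that is feasible in Python (let alone that it would be terribly slow).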

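This is a job for the zip function. We iterate over the reference and the test sequence in parallel and check whether either of them contains a - at the current position. We use the result of that test to update the counts of insertions, deletions and unchanged positions in a dictionary.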

def kind(u, v):
    if u == '-':
        if v != '-':
            return 'I'  # insertion
    else:
        if v == '-':
            return 'D'  # deletion
    return 'U'          # unchanged

reference = 'AGCAGGCAAGGCAA--GGAA-CCA'

sequences = [
    'AGCA---AAGGCAATTGGAA-CCA',
    'AGCAGGCAAGGCAA--GGAAACCA',
]

print('Reference')
print(reference)
for seq in sequences:
    print(seq)
    counts = dict.fromkeys('DIU', 0)
    for u, v in zip(reference, seq):
        counts[kind(u, v)] += 1
    print(counts)
Output

Reference
AGCAGGCAAGGCAA--GGAA-CCA
AGCA---AAGGCAATTGGAA-CCA
{'I': 2, 'D': 3, 'U': 19}
AGCAGGCAAGGCAA--GGAAACCA
{'I': 1, 'D': 0, 'U': 23}

Here's an updated version that also checks for polymorphisms.

def kind(u, v):
    if u == '-':
        if v != '-':
            return 'I'  # insertion
    else:
        if v == '-':
            return 'D'  # deletion
        elif v != u:
            return 'P'  # polymorphism
    return 'U'          # unchanged

reference = 'AGCAGGCAAGGCAA--GGAA-CCA'

sequences = [
    'AAAA---AAAGCAATTGGAA-CCA',
    'AGCAGGCAAAACAA--GGAAACCA',
]

print('Reference')
print(reference)
for seq in sequences:
    print(seq)
    counts = dict.fromkeys('DIPU', 0)
    for u, v in zip(reference, seq):
        counts[kind(u, v)] += 1
    print(counts)
Output

Reference
AGCAGGCAAGGCAA--GGAA-CCA
AAAA---AAAGCAATTGGAA-CCA
{'D': 3, 'P': 3, 'I': 2, 'U': 16}
AGCAGGCAAAACAA--GGAAACCA
{'D': 0, 'P': 2, 'I': 1, 'U': 21}
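
The same classification can also be wrapped in a reusable function. The following is a minimal sketch rather than part of the answer above: the names `classify` and `count_events` are hypothetical, it uses `collections.Counter` instead of a plain dict, and it assumes the sequences have already been aligned to the same length.

```python
from collections import Counter


def classify(ref_char, seq_char):
    """Classify one alignment column relative to the reference."""
    if ref_char == '-':
        # A gap shared by the reference and the sequence counts as unchanged.
        return 'I' if seq_char != '-' else 'U'
    if seq_char == '-':
        return 'D'
    return 'P' if seq_char != ref_char else 'U'


def count_events(reference, seq):
    """Tally insertions, deletions, polymorphisms and unchanged columns."""
    if len(reference) != len(seq):
        raise ValueError("sequences must be aligned to the same length")
    return Counter(classify(u, v) for u, v in zip(reference, seq))


reference = 'AGCAGGCAAGGCAA--GGAA-CCA'
counts = count_events(reference, 'AAAA---AAAGCAATTGGAA-CCA')
print(counts['I'], counts['D'], counts['P'], counts['U'])  # 2 3 3 16
```

A Counter returns 0 for keys it has never seen, so a category that does not occur (for example 'D' in the second sequence) can still be looked up without a KeyError.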

Using Biopython and numpy:

from Bio import AlignIO
from collections import Counter
import numpy as np


alignment = AlignIO.read("alignment.fasta", "fasta")

events = []

for i in range(alignment.get_alignment_length()):
    this_column = alignment[:, i]

    # Mark insertions, polymorphism and deletions following PM 2Ring notation
    events.append(["U" if b == this_column[0] else
                   "I" if this_column[0] == "-" else
                   "P" if b != "-" else
                   "D" for b in this_column])

# Apply a Counter over the columns (axis 0) of the array
print(np.apply_along_axis(Counter, 0, np.array(events)))
This outputs an array of counts in the same order as the alignment:

[[Counter({'U': 23})
  Counter({'U': 15, 'P': 3, 'D': 3, 'I': 2})
  Counter({'U': 21, 'P': 2, 'I': 1})]]
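
If numpy is not available, a similar column-by-column tally could be sketched with the standard library alone. This is an illustration, not the answerer's code: `tally_alignment` is a hypothetical helper, and it takes plain strings (reference first) instead of a Biopython alignment object.

```python
from collections import Counter


def tally_alignment(rows):
    """Return one Counter per row; rows[0] is treated as the reference."""
    counters = [Counter() for _ in rows]
    # zip(*rows) walks the alignment one column at a time.
    for column in zip(*rows):
        ref = column[0]
        for counter, base in zip(counters, column):
            if base == ref:
                counter['U'] += 1   # unchanged, including shared gaps
            elif ref == '-':
                counter['I'] += 1   # insertion relative to the reference
            elif base != '-':
                counter['P'] += 1   # polymorphism
            else:
                counter['D'] += 1   # deletion
    return counters


rows = [
    'AGCAGGCAAGGCAA--GGAA-CCA',
    'AAAA---AAAGCAATTGGAA-CCA',
    'AGCAGGCAAAACAA--GGAAACCA',
]
for counter in tally_alignment(rows):
    print(dict(counter))
```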

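This option works for the gaps, but it can't detect polymorphisms. I've updated the sequences in the example so that they contain 3 and 2 polymorphisms, respectively.

@j91 You should mention that in ... You should explain what it means in terms of the sequences; not everyone who reads your question is a biologist. But I'll add an updated version of the code.

Interesting! I wonder how this compares with my code in terms of speed. IIRC, Counter is a little slower than a plain dict, but I expect your events.append to be more efficient than calling my kind function, since Python function calls are relatively slow.

Honestly, I prefer your answer for clarity. I only did this to play with Biopython.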
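
The speed question raised in the comments could be checked with timeit. The sketch below is only an illustration: the function names `with_function_call` and `with_inline_expression` are hypothetical, both variants use a plain dict so that only the cost of the function call differs, and the timings depend on the Python version and the machine, so no numbers are claimed here.

```python
import timeit

reference = 'AGCAGGCAAGGCAA--GGAA-CCA'
seq = 'AAAA---AAAGCAATTGGAA-CCA'


def kind(u, v):
    if u == '-':
        if v != '-':
            return 'I'  # insertion
    else:
        if v == '-':
            return 'D'  # deletion
        elif v != u:
            return 'P'  # polymorphism
    return 'U'          # unchanged


def with_function_call():
    counts = dict.fromkeys('DIPU', 0)
    for u, v in zip(reference, seq):
        counts[kind(u, v)] += 1
    return counts


def with_inline_expression():
    counts = dict.fromkeys('DIPU', 0)
    for u, v in zip(reference, seq):
        # Same classification as kind(), written as a conditional expression.
        key = ('U' if v == u else
               'I' if u == '-' else
               'P' if v != '-' else
               'D')
        counts[key] += 1
    return counts


# Both variants must agree before their timings are worth comparing.
assert with_function_call() == with_inline_expression()
for func in (with_function_call, with_inline_expression):
    seconds = timeit.timeit(func, number=10000)
    print(func.__name__, seconds)
```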