How to expand an ambiguous DNA sequence in Python

Tags: python, biopython, dna-sequence

Suppose you have a DNA sequence like this:

AATCRVTAA

where R and V are ambiguous nucleotide codes: R stands for A or G, and V stands for A, C, or G.

Is there a Biopython method that generates all the distinct sequences this ambiguous sequence can represent?

For example, the output here would be:

AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA
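
As a quick sanity check on the expected output, the number of expansions is the product of the number of choices at each position. The sketch below assumes a hypothetical two-entry table (AMBIGUOUS) and a hypothetical helper name (count_expansions); neither comes from Biopython.

```python
from math import prod

# Hypothetical lookup table covering only the two codes used in the question.
AMBIGUOUS = {"R": "AG", "V": "ACG"}

def count_expansions(seq, table=AMBIGUOUS):
    """Return how many concrete sequences an ambiguous sequence stands for."""
    # Letters missing from the table stand for themselves (a factor of 1).
    return prod(len(table.get(base, base)) for base in seq)

print(count_expansions("AATCRVTAA"))  # 2 choices for R * 3 choices for V = 6
```

This matches the six sequences listed above, and it is a cheap way to estimate the result size before expanding a long sequence.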

I ended up writing my own function:

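from itertools import product

# Import the table directly from Bio.Data; the Seq.IUPAC route is
# unavailable in recent Biopython releases.
from Bio.Data import IUPACData

def extend_ambiguous_dna(seq):
    """Return a list of all possible sequences given an ambiguous DNA input."""
    # Maps each IUPAC code to the bases it can stand for, e.g. "R" -> "AG".
    d = IUPACData.ambiguous_dna_values
    r = []
    for i in product(*[d[j] for j in seq]):
        r.append("".join(i))
    return r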

In [1]: extend_ambiguous_dna("AV")
Out[1]: ['AA', 'AC', 'AG']
It also lets you generate every sequence of a given length, for example:

In [2]: extend_ambiguous_dna("NN")

Out[2]: ['GG', 'GA', 'GT', 'GC',
         'AG', 'AA', 'AT', 'AC',
         'TG', 'TA', 'TT', 'TC',
         'CG', 'CA', 'CT', 'CC']

Hope this saves someone else some time.

I'm not sure whether there is a Biopython method for this, but here is one way to do it with itertools:

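import itertools

s = "AATCRVTAA"
ambig = {"R": ["A", "G"], "V": ["A", "C", "G"]}
# Group consecutive characters by whether they are unambiguous (True) or not (False).
groups = itertools.groupby(s, lambda char: char not in ambig)
splits = []
for b, group in groups:
    if b:
        # Unambiguous bases have exactly one option each.
        splits.extend([[g] for g in group])
    else:
        # Ambiguous codes contribute all of their possible bases.
        for nuc in group:
            splits.append(ambig[nuc])
answer = [''.join(p) for p in itertools.product(*splits)]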
Output:

In [189]: answer
Out[189]: ['AATCAATAA', 'AATCACTAA', 'AATCAGTAA', 'AATCGATAA', 'AATCGCTAA', 'AATCGGTAA']

Here is a shorter and possibly faster approach, since this function will most likely be used on very large data:

from itertools import product

from Bio.Data import IUPACData

def extend_ambiguous_dna(seq):
    """Return a list of all possible sequences given an ambiguous DNA input."""
    d = IUPACData.ambiguous_dna_values
    # Return a flat list of strings (not a list wrapped in another list).
    return list(map("".join, product(*map(d.get, seq))))
Using map lets the loop run in C rather than in Python, which turns out to be noticeably faster than a plain loop or even a list comprehension.
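
Since the number of expansions grows exponentially with the count of ambiguous positions, it may also help to iterate lazily rather than build the whole list. The following is only a sketch under assumptions: CODES is a hypothetical partial table standing in for the IUPAC one, and iter_ambiguous_dna is an illustrative name, not a Biopython function.

```python
from itertools import islice, product

# Hypothetical stand-in for the IUPAC table; only a few codes are listed.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "V": "ACG", "N": "GATC"}

def iter_ambiguous_dna(seq, table=CODES):
    """Yield each concrete sequence lazily instead of building a list."""
    # __getitem__ raises KeyError on an unknown letter, unlike dict.get.
    return map("".join, product(*map(table.__getitem__, seq)))

# Only the first three of 4**12 sequences are ever materialized here.
first_three = list(islice(iter_ambiguous_dna("N" * 12), 3))
print(first_three)
```

The caller can then stream results into a file or a filter without holding all of them in memory at once.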

A quick benchmark, using a simple dict as d instead of the dict provided by ambiguous_dna_values.

Output:

# len(seq) = 2:
List delay: 0.02 ms
Map delay: 0.01 ms

# len(seq) = 3:
List delay: 0.04 ms
Map delay: 0.02 ms

# len(seq) = 4:
List delay: 0.08 ms
Map delay: 0.06 ms

# len(seq) = 5:
List delay: 0.43 ms
Map delay: 0.17 ms

# len(seq) = 10:
List delay: 126.68 ms
Map delay: 77.15 ms

# len(seq) = 12:
List delay: 1887.53 ms
Map delay: 1320.49 ms

map is clearly faster, though only by a factor of roughly 1.3 to 2.5 in these runs. It can surely be optimized further.
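
The benchmark script itself is not shown in the thread. A minimal sketch of how such a comparison could be timed is given below; the plain dict D, the function names, and the chosen sequence lengths are all assumptions for illustration, and actual timings will vary by machine and Python version.

```python
import timeit
from itertools import product

# Hypothetical plain dict standing in for the IUPAC table.
D = {"N": "GATC"}

def expand_loop(seq):
    """Expansion using an explicit Python-level loop."""
    result = []
    for combo in product(*[D[base] for base in seq]):
        result.append("".join(combo))
    return result

def expand_map(seq):
    """Expansion using map, so the joining loop runs in C."""
    return list(map("".join, product(*map(D.get, seq))))

def time_ms(func, seq, repeat=3):
    """Best wall-clock time in milliseconds over a few single runs."""
    return min(timeit.repeat(lambda: func(seq), number=1, repeat=repeat)) * 1000

for n in (2, 4, 8):
    seq = "N" * n
    print(f"len(seq) = {n}: loop {time_ms(expand_loop, seq):.2f} ms, "
          f"map {time_ms(expand_map, seq):.2f} ms")
```

Taking the minimum of several runs reduces noise from other processes, which matters most for the sub-millisecond cases.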

Another itertools solution:

from itertools import product
import re

lu = {'R': 'AG', 'V': 'ACG'}

def get_seqs(seq):
    seqs = []
    # Split on the ambiguous codes, keeping each code as its own item.
    sp_seq = [a for a in re.split(r'(R|V)', seq) if a]
    # One string of candidate bases per ambiguous position, in order.
    pr_terms = [lu[a] for a in sp_seq if a in 'RV']
    # Turn every ambiguous code into a %s placeholder.
    template = ''.join(sp_seq).replace('R', '%s').replace('V', '%s')

    for cmb in product(*pr_terms):
        seqs.append(template % cmb)
    return seqs

seq = 'AATCRVTAA'

print('seq: ', seq)
print('\n'.join(get_seqs(seq)))

seq1 = 'RAATCRVTAAR'
print('seq: ', seq1)
print('\n'.join(get_seqs(seq1)))
A comment on this answer reports incorrect output in the special case of two or more identical adjacent ambiguous codes, such as "RRATCGGTAAA".

Output:

seq:  AATCRVTAA
AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA
seq:  RAATCRVTAAR
AAATCAATAAA
AAATCAATAAG
AAATCACTAAA
AAATCACTAAG
AAATCAGTAAA
AAATCAGTAAG
AAATCGATAAA
AAATCGATAAG
AAATCGCTAAA
AAATCGCTAAG
AAATCGGTAAA
AAATCGGTAAG
GAATCAATAAA
GAATCAATAAG
GAATCACTAAA
GAATCACTAAG
GAATCAGTAAA
GAATCAGTAAG
GAATCGATAAA
GAATCGATAAG
GAATCGCTAAA
GAATCGCTAAG
GAATCGGTAAA
GAATCGGTAAG
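
For the adjacent-code case raised in the comment on this answer, one option is to skip the regex split entirely and expand each character on its own, so that runs such as "RR" need no special handling. This is a sketch only; LU and expand_per_char are hypothetical names, and the table covers just the two codes used in the thread.

```python
from itertools import product

# Hypothetical table with the same two codes as the answer above.
LU = {"R": "AG", "V": "ACG"}

def expand_per_char(seq, table=LU):
    """Expand each character independently; unknown letters stand for themselves."""
    pools = [table.get(base, base) for base in seq]
    return ["".join(combo) for combo in product(*pools)]

print(expand_per_char("RRAT"))  # ['AAAT', 'AGAT', 'GAAT', 'GGAT']
```

Comparing its result against get_seqs on inputs with adjacent codes would be one way to confirm or rule out the reported problem.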