Python: efficiency of finding patterns with mismatches

Tags: python, python-3.x, algorithm, performance, bioinformatics

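I'm working on a simple bioinformatics problem. I have a working solution, but it is absurdly inefficient. How can I make it more efficient?

Problem:

Find patterns of length k in the string g, given that a k-mer may have up to d mismatches.

These strings and patterns are all genomic, so the set of possible characters is {A, T, C, G}.

I'll call the function frequentWordsMatch(g,k,d).

Here are some helpful examples:

frequentWordsMatch('AAAAAAAAAA',2,1)
→ ['AA','CA','GA','TA','AC','AG','AT']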
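
For small k, an exhaustive reference can help pin down what the function is expected to return. The sketch below is not part of my solution; it assumes that the wanted patterns are those k-mers over {A, C, G, T} that match the largest number of length-k windows of g with at most d mismatches. The names hamming and brute_force_frequent_words are made up for illustration. It tries all 4**k patterns, so it is only usable as a test oracle for short patterns.

```python
from itertools import product


def hamming(p, q):
    # Number of positions at which two equal-length strings differ.
    return sum(a != b for a, b in zip(p, q))


def brute_force_frequent_words(g, k, d, alphabet="ACGT"):
    # Every contiguous length-k window of g.
    windows = [g[i:i + k] for i in range(len(g) - k + 1)]
    counts = {}
    # Try every possible k-mer (4**k of them), so keep k small.
    for letters in product(alphabet, repeat=k):
        pattern = "".join(letters)
        counts[pattern] = sum(hamming(pattern, w) <= d for w in windows)
    top = max(counts.values())
    # Sort so that the result can be compared regardless of generation order.
    return sorted(p for p, c in counts.items() if c == top)


print(brute_force_frequent_words("AAAAAAAAAA", 2, 1))
# ['AA', 'AC', 'AG', 'AT', 'CA', 'GA', 'TA']
```

The printed list is the first example's expected output in sorted order.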

Here is a longer example, in case you implement this and want to test it:

frequentWordsMatch('CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC',10,2)
→ ['gcacacac','GCGCACACAC']

With my naive solution, the second example can take about 60 seconds, although the first one is very fast.


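Naive solution:

My idea was to take every k-length segment in g, find every possible "neighbor" of it (that is, every other k-length string with at most d mismatches), and add those neighbors as keys to a dictionary. I then count how many times each neighbor k-mer occurs in the string g and record that count in the dictionary.

Obviously this is a pretty bad way to do it: the number of neighbors grows very quickly as k and d increase, and the string has to be rescanned for every one of those neighbors, which makes the implementation terribly slow. But alas, that's why I'm asking for help.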
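
To put a number on how fast the neighborhood grows: a k-mer has C(k, i) * 3**i strings at exactly i mismatches (choose the i positions, then one of the 3 other letters for each), so summing over i = 0..d gives the neighborhood size. The helper below is a small sketch of that formula; the name neighborhood_size is made up for illustration.

```python
from math import comb


def neighborhood_size(k, d, alphabet_size=4):
    # Strings at exactly i mismatches: choose i positions, then one of the
    # (alphabet_size - 1) other letters at each of them.
    return sum(comb(k, i) * (alphabet_size - 1) ** i
               for i in range(min(d, k) + 1))


for k, d in [(2, 1), (10, 1), (10, 2), (10, 3)]:
    print(k, d, neighborhood_size(k, d))
# 2 1 7
# 10 1 31
# 10 2 436
# 10 3 3676
```

Since my code calls Count() for every neighbor of every window, and each call rescans all of g, these sizes multiply the number of windows twice over.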

I'll put my code below. There are definitely plenty of novice mistakes to unpack, so thanks for your time and attention.

def FrequentWordsMismatch(g, k, d):
    '''
    Finds the most frequent k-mer patterns in the string g, given that those 
    patterns can mismatch amongst themselves up to d times

    g (String): Collection of {A, T, C, G} characters
    k (int): Length of desired pattern
    d (int): Number of allowed mismatches
    '''
    counts = {}
    answer = []

    for i in range(len(g) - k + 1):
        kmer = g[i:i+k]
        for neighborkmer in Neighbors(kmer, d):
            counts[neighborkmer] = Count(neighborkmer, g, d)

    maxVal = max(counts.values())

    for key in counts.keys():
        if counts[key] == maxVal:
            answer.append(key)

    return(answer)


def Neighbors(pattern, d):
    '''
    Find all strings with at most d mismatches to the given pattern

    pattern (String): Original pattern of characters
    d (int): Number of allowed mismatches
    '''
    if d == 0:
        return [pattern]

    if len(pattern) == 1:
        return ['A', 'C', 'G', 'T']

    answer = []

    suffixNeighbors = Neighbors(pattern[1:], d)

    for text in suffixNeighbors:
        if HammingDistance(pattern[1:], text) < d:
            for n in ['A', 'C', 'G', 'T']:
                answer.append(n + text)
        else:
            answer.append(pattern[0] + text)

    return(answer)


def HammingDistance(p, q):
    '''
    Find the hamming distance between two strings

    p (String): String to be compared to q
    q (String): String to be compared to p
    '''
    ham = 0 + abs(len(p)-len(q))

    for i in range(min(len(p), len(q))):
        if p[i] != q[i]:
            ham += 1

    return(ham)


def Count(pattern, g, d):
    '''
    Count the number of times that the pattern occurs in the string g, 
    allowing for up to d mismatches

    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    return len(MatchWithMismatch(pattern, g, d))

def MatchWithMismatch(pattern, g, d):
    '''
    Find the indices at which the pattern occurs in the string g, 
    allowing for up to d mismatches

    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    answer = []
    for i in range(len(g) - len(pattern) + 1):
        if(HammingDistance(g[i:i+len(pattern)], pattern) <= d):
            answer.append(i)
    return(answer)

Going off the problem description only, and not the examples (for the reasons I explained in the comments), one approach would be:

s = "CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGC"\
    "CGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGG"\
    "CCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCG"\
    "GTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACAC"\
    "ACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"

def frequent_words_mismatch(g,k,d):
    def num_misspellings(x,y):
        return sum(xx != yy for (xx,yy) in zip(x,y))

    seen = set()
    for i in range(len(g)-k+1):
        seen.add(g[i:i+k])

    # For each unique sequence, add a (key,bin) pair to the bins dictionary
    #  (The bin is initialized to a list containing only the sequence, for now)
    bins = {seq:[seq,] for seq in seen}
    # Loop again through the unique sequences...
    for seq in seen:
        # Try to fit it in *all* already-existing bins (based on bin key)
        for bk in bins:
            # Don't re-add seq to its own bin
            if bk == seq: continue
            # Test bin keys, try to find all appropriate bins
            if num_misspellings(seq, bk) <= d:
                bins[bk].append(seq)

    # Get a list of the bin keys (one for each unique sequence) sorted in order of the
    #   number of elements in the corresponding bins
    sorted_keys = sorted(bins, key= lambda k:len(bins[k]), reverse=True)

    # largest_bin_key will be the key of the largest bin (there may be ties, so in fact
    #   this is *a* key of *one of the bins with the largest length*).  That is, it'll
    #   be the sequence (found in the string) that the most other sequences (also found
    #   in the string) are at most d-distance from.
    largest_bin_key = sorted_keys[0]

    # You can return this bin, as your question description (but not the examples) indicates:
    return bins[largest_bin_key]

largest_bin = frequent_words_mismatch(s,10,2)
print(len(largest_bin))     # 13
print(largest_bin)
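The problem description is ambiguous in several ways, so I'm going by the examples. You seem to want all k-length strings over the alphabet {A, C, G, T} such that the number of matches to contiguous substrings of g is maximal, where "a match" means character-by-character equality with at most d character inequalities.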

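I'm ignoring that your HammingDistance() function makes something up even when the inputs have different lengths, mostly because that doesn't make much sense to me ;-), but partly because it isn't needed to get the results you want in any of the examples you gave.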
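
For comparison, here is a minimal sketch of a Hamming distance that refuses inputs of different lengths instead of inventing a value for them. The name hamming_distance_strict is hypothetical; this is just one way to write it, not something the question's code needs.

```python
def hamming_distance_strict(p, q):
    # Hamming distance is only defined for strings of equal length.
    if len(p) != len(q):
        raise ValueError("inputs must have the same length")
    # True counts as 1, so summing the comparisons counts the mismatches.
    return sum(a != b for a, b in zip(p, q))


print(hamming_distance_strict("GATTACA", "GACTATA"))  # 2
```

Every window compared in this problem has length k, so the length check never fires here; it only documents the assumption.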

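The fwm() function below produces the results you want in all the examples, in the sense of producing permutations of the output lists you gave. If you want canonical output, I'd suggest sorting the output list before returning it.

The algorithm is pretty simple, but relies on itertools to do the heavy combinatorial lifting "at C speed". All the examples run in well under a second.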

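For each length-k contiguous substring of g, consider all combinations(k, d) sets of d distinct index positions. There are 4**d ways to fill those index positions with letters from {A, C, G, T}, and each such way is "a pattern" that matches the substring with at most d discrepancies. Duplicates are weeded out by remembering the patterns already generated; this is faster than making heroic efforts to generate only unique patterns to begin with.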

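So, in all, the time requirement is O(len(g) * k**d * 4**d) = O(len(g) * (4*k)**d), where k**d is, for reasonably small values of k and d, an overstated stand-in for the binomial coefficient combinations(k, d). The important thing to note is that, unsurprisingly, it's exponential in d.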
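
To see how much of that work is duplicated, the sketch below runs the same candidate generation for a single k-mer and reports how many candidates are tried versus how many distinct patterns result. The helper name candidate_stats is made up for this illustration, and the 10-mer is just a sample input.

```python
from itertools import combinations, product


def candidate_stats(base, d, alphabet="ACGT"):
    # Returns (candidates generated, distinct patterns among them).
    k = len(base)
    chars = list(base)
    generated = 0
    distinct = set()
    for ixs in combinations(range(k), d):            # which d positions to overwrite
        for newchars in product(alphabet, repeat=d):  # what to put there
            trial = chars[:]
            for i, c in zip(ixs, newchars):
                trial[i] = c
            generated += 1
            distinct.add("".join(trial))
    return generated, len(distinct)


generated, distinct = candidate_stats("GCGCACACAC", 2)
print(generated)         # 720, i.e. combinations(10, 2) * 4**2 = 45 * 16
print(distinct)          # 436
print(10 ** 2 * 4 ** 2)  # 1600, the looser (4*k)**d figure
```

So for k = 10 and d = 2, roughly 40% of the generated candidates are repeats that the seen set filters out.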

def fwm(g, k, d):
    from itertools import product, combinations
    from collections import defaultdict

    all_subs = list(product("ACGT", repeat=d))
    all_ixs = list(combinations(range(k), d))
    patcount = defaultdict(int)

    for starti in range(len(g)):
        base = g[starti : starti + k]
        if len(base) < k:
            break
        patcount[base] += 1
        seen = set([base])
        basea = list(base)
        for ixs in all_ixs:
            saved = [basea[i] for i in ixs]
            for newchars in all_subs:
                for i, newchar in zip(ixs, newchars):
                    basea[i] = newchar
                candidate = "".join(basea)
                if candidate not in seen:
                    seen.add(candidate)
                    patcount[candidate] += 1
            for i, ch in zip(ixs, saved):
                basea[i] = ch

    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]

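In your first example, your results include sequences that aren't even in the input string. So I figure, fine, they're all a distance of 1 away from "AA". But in your second example, only 2 sequences are returned. You've increased the number of sequences (by increasing the input string length) and you've increased the matching flexibility (by increasing d from 1 to 2), but the result set somehow got smaller. Why? Can valid sequences intersect?

@jedwards: I think it's only because we're now looking for 10-mers rather than 2-mers.

['CGGCCGCCGG', 'GGGCCGGCGG', 'CGGCCGGCGC', 'AGGCGGCCGG', 'CAGGCGCCGG', 'CGGCCGGCCG', 'CGGTAGCCGG', 'CGGCGGCCGC', 'CGGGCGCCGG', 'CCGGCGCCGG', 'CGGGCCCCGG', 'CCGCCGGCGG', 'GGGCCGCCGG']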
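def fwm(g, k, d):
    # Variant of fwm() above: the recursion generates each window's patterns
    # exactly once, so no per-window `seen` set is needed.
    from collections import defaultdict

    patcount = defaultdict(int)
    alphabet = "ACGT"
    # allbut[ch] holds the three letters that differ from ch.
    allbut = {ch: tuple(c for c in alphabet if c != ch)
              for ch in alphabet}

    def inner(i, rd):
        # i: next index of `base` to consider; rd: mismatches still allowed.
        if not rd or i == k:
            patcount["".join(base)] += 1
            return
        # Either leave position i unchanged...
        inner(i+1, rd)
        # ...or overwrite it with each of the other three letters,
        # spending one mismatch, then restore the original letter.
        orig = base[i]
        for base[i] in allbut[orig]:
            inner(i+1, rd-1)
        base[i] = orig

    for i in range(len(g) - k + 1):
        # inner() reads and mutates `base` through the enclosing scope.
        base = list(g[i : i + k])
        inner(0, d)

    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]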