Python: how to find the overlap between two sequences and return it

python algorithm

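I am new to Python and have already spent many hours on this problem; I hope somebody can help me. I need to find the overlap between two sequences. The overlap is at the right end of the first sequence and the left end of the second. I want the function to find the overlap and return it.

My sequences are:

s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

My function should be named

def getOverlap(left, right)

with s1 as the left sequence and s2 as the right sequence.

The result should be

'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC'

Thanks for your help.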

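The Knuth-Morris-Pratt algorithm is a nice method for finding one string inside another (since I saw DNA, I'm guessing you'd like this to scale to... billions?). David Eppstein's implementation is included in the code listings further down.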

(There's also a built-in one, which will be faster for small problems because of its runtime constants.)
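For the suffix/prefix overlap asked about in the question, one option is to reuse the prefix table that Knuth-Morris-Pratt is built on instead of running the full search. The following is a minimal sketch and not part of the original answer: `prefix_function` and `overlap_kmp` are hypothetical names, and the separator `#` is assumed not to occur in either sequence.

```python
def prefix_function(seq):
    """Return the KMP prefix table: table[i] is the length of the longest
    proper prefix of seq[:i + 1] that is also a suffix of it."""
    table = [0] * len(seq)
    k = 0
    for i in range(1, len(seq)):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and seq[i] != seq[k]:
            k = table[k - 1]
        if seq[i] == seq[k]:
            k += 1
        table[i] = k
    return table


def overlap_kmp(left, right, sep="#"):
    """Longest suffix of `left` that is also a prefix of `right`.

    Assumption: `sep` occurs in neither sequence.
    """
    if not left or not right:
        return ""
    # Only the last len(right) characters of `left` can take part.
    tail = left[-len(right):]
    table = prefix_function(right + sep + tail)
    # The last table entry is the longest prefix of `right` ending `tail`.
    return right[:table[-1]]


s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"
print(overlap_kmp(s1, s2))
```

The table is built in time linear in the combined length, which is the property that makes this approach attractive for long sequences.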

For performance, use a prefix table and hash windows of your string as base-4 integers (in biology you would call them k-mers or oligos).
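As an illustration of hashing windows as base-4 integers, here is a hedged sketch. The `BASE4` mapping and the names `kmer_codes` and `shared_kmers` are assumptions made for this example; the answer does not specify an encoding. Because each window of fixed length k maps to a distinct integer, equal codes imply equal k-mers.

```python
# Assumed encoding; any fixed mapping of the four bases to 0..3 would do.
BASE4 = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_codes(seq, k):
    """Yield (start, code) for every length-k window of seq, where code is
    the window read as a base-4 integer. Assumes seq contains only A, C, G, T."""
    if k <= 0 or len(seq) < k:
        return
    high = 4 ** (k - 1)  # weight of the leftmost digit in a window
    code = 0
    for ch in seq[:k]:
        code = code * 4 + BASE4[ch]
    yield 0, code
    for start in range(1, len(seq) - k + 1):
        # Drop the leftmost base, shift left, append the new rightmost base.
        code = (code - BASE4[seq[start - 1]] * high) * 4 + BASE4[seq[start + k - 1]]
        yield start, code


def shared_kmers(a, b, k):
    """Return all (i, j) with a[i:i+k] == b[j:j+k], using an index of codes."""
    index = {}
    for i, code in kmer_codes(a, k):
        index.setdefault(code, []).append(i)
    return [(i, j) for j, code in kmer_codes(b, k) for i in index.get(code, [])]


print(list(kmer_codes("ACGT", 2)))
print(shared_kmers("ACGT", "CGTA", 3))
```

Shared k-mers only give candidate seed positions; each seed would still have to be extended and checked to confirm a full overlap.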

Good luck!

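EDIT: There's also a nice trick where you sort a list containing every suffix of the first string (n total) and every suffix of the second string (n total). If the strings share a longest common substring, two suffixes that start with it, one from each string, must be adjacent in the sorted list. So for each element, find the nearest element from the other string in the sorted list and take the longest prefix on which the two match exactly. :)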
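The sorting trick can be sketched roughly as follows. This is a naive illustration under stated assumptions, not the answerer's code: the function names are hypothetical, and every suffix is materialised as its own string, so memory grows quadratically and the sketch is only suitable for short inputs.

```python
def common_prefix_len(a, b):
    """Number of leading characters on which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_common_substring_sorted(s1, s2):
    """Longest common substring found by sorting the suffixes of both strings."""
    # Tag each suffix with the string it came from (0 for s1, 1 for s2).
    suffixes = [(s1[i:], 0) for i in range(len(s1))]
    suffixes += [(s2[j:], 1) for j in range(len(s2))]
    suffixes.sort()

    best = ""
    for (a, src_a), (b, src_b) in zip(suffixes, suffixes[1:]):
        if src_a != src_b:  # only neighbours from different strings count
            n = common_prefix_len(a, b)
            if n > len(best):
                best = a[:n]
    return best


s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"
print(longest_common_substring_sorted(s1, s2))
```

Note that this finds a common substring anywhere in the two strings, which coincides with the wanted overlap for the sequences in the question but not in general.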

You can use difflib.SequenceMatcher (see the interactive session in the code listings below).

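Have a look at the difflib library, and more precisely at find_longest_match() (the get_overlap listing below uses it).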

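Comments:

This is the best I could come up with: left = "cgattccaggctcccagggtaccataactagtagatctc" right = "ggctcccagggtaccataactgactagatctcgtcgtccagagaccctagc" def getOverlap(left, right): if left == right[::-1]: return "" else: for i in left: for i in right: if left[len(left)-i] = (right[:-1])[len(right)-i]: if False: continue return right[:len(left)-i]

@Annie: you should put this comment in your post so that it is formatted more readably.

Thanks. If I swap s1 and s2 around, I should get the answer 'c', but I get the same answer as when s1 is the left one and s2 is the right one.

The question isn't about common substrings. It only looks at the suffix of the first string and the prefix of the second.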
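Since the question concerns only the end of the left sequence and the start of the right one, a direct check is also possible. The following is a minimal sketch using the function name requested in the question; it is not taken from any of the answers, and it does quadratic work in the worst case, so it is meant for short sequences.

```python
def getOverlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    # Try the longest candidate first, so the first hit is the answer.
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

print(getOverlap(s1, s2))
print(getOverlap(s2, s1))
```

Unlike a longest-common-substring search, the result depends on argument order: with the sequences swapped, only the single base shared by the end of s2 and the start of s1 is returned.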

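# Longest common substring by dynamic programming.
# M[x][y] is the length of the common run ending at S1[x-1] and S2[y-1].
def LongestCommonSubstring(S1, S2):
    M = [[0] * (1 + len(S2)) for _ in range(1 + len(S1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(S1)):
        for y in range(1, 1 + len(S2)):
            if S1[x - 1] == S2[y - 1]:
                M[x][y] = M[x - 1][y - 1] + 1
                if M[x][y] > longest:
                    longest = M[x][y]
                    x_longest = x  # end position (exclusive) of the best run in S1
            else:
                M[x][y] = 0
    return S1[x_longest - longest: x_longest]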

# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

from __future__ import generators

def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

>>> import difflib
>>> d = difflib.SequenceMatcher(None,s1,s2)
>>> match = max(d.get_matching_blocks(),key=lambda x:x[2])
>>> match
Match(a=8, b=0, size=39)
>>> i,j,k = match
>>> d.a[i:i+k]
'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC'
>>> d.a[i:i+k] == d.b[j:j+k]
True
>>> d.a == s1
True
>>> d.b == s2
True

import difflib

def get_overlap(s1, s2):
    s = difflib.SequenceMatcher(None, s1, s2)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2)) 
    return s1[pos_a:pos_a+size]

s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

print(get_overlap(s1, s2)) # GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC