Implementing the Waterman-Eggert algorithm in Python
I'm trying to implement the Waterman-Eggert algorithm for finding sub-optimal local sequence alignments, but I'm struggling to understand how to "declump" separate groups of alignments. I have the basic Smith-Waterman algorithm working fine. A simple test, aligning the following sequence against itself:
'HEAGHEAGHEAG'
'HEAGHEAGHEAG'
produces an fMatrix as follows:
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 8. 0. 0. 0. 8. 0. 0. 0. 8. 0. 0. 0.]
[ 0. 0. 13. 0. 0. 0. 13. 0. 0. 0. 13. 0. 0.]
[ 0. 0. 0. 17. 0. 0. 0. 17. 0. 0. 0. 17. 0.]
[ 0. 0. 0. 0. 23. 0. 0. 0. 23. 0. 0. 0. 23.]
[ 0. 8. 0. 0. 0. 31. 0. 0. 0. 31. 0. 0. 0.]
[ 0. 0. 13. 0. 0. 0. 36. 0. 0. 0. 36. 0. 0.]
[ 0. 0. 0. 17. 0. 0. 0. 40. 0. 0. 0. 40. 0.]
[ 0. 0. 0. 0. 23. 0. 0. 0. 46. 0. 0. 0. 46.]
[ 0. 8. 0. 0. 0. 31. 0. 0. 0. 54. 4. 0. 0.]
[ 0. 0. 13. 0. 0. 0. 36. 0. 0. 4. 59. 9. 0.]
[ 0. 0. 0. 17. 0. 0. 0. 40. 0. 0. 9. 63. 13.]
[ 0. 0. 0. 0. 23. 0. 0. 0. 46. 0. 0. 13. 69.]]
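To find a sub-optimal alignment, e.g.
'HEAGHEAGHEAG    '
'    HEAGHEAGHEAG'
you must first remove the optimal alignment (i.e. the one along the main diagonal) and then recalculate the fMatrix; this is known as "declumping", where a "clump" of alignments is defined as any alignments whose paths intersect or share one or more pairs of aligned residues. In addition to the fMatrix, there is a secondary matrix that contains information about the direction in which the fMatrix was constructed.
A snippet of the code that builds the fMatrix and the backtracking matrix is as follows: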
# Generates fMatrix.
for i in range(1, length):
    for j in range(1, length):
        matchScore = fMatrix[i-1][j-1] + simMatrixDict[seq[i-1]+seq[j-1]]
        insScore = fMatrix[i][j-1] + gap
        delScore = fMatrix[i-1][j] + gap
        fMatrix[i][j] = max(0, matchScore, insScore, delScore)

        # Generates matrix for backtracking.
        if fMatrix[i][j] == matchScore:
            backMatrix[i][j] = 2
        elif fMatrix[i][j] == insScore:
            backMatrix[i][j] = 3  # INSERTION in seq - Horizontal
        elif fMatrix[i][j] == delScore:
            backMatrix[i][j] = 1  # DELETION in seq - Vertical

        if fMatrix[i][j] >= backtrackStart:
            backtrackStart = fMatrix[i][j]
            endCoords = i, j
return fMatrix, backMatrix, endCoords
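To remove this optimal alignment, I have tried using the backtracking matrix to backtrack through the fMatrix (as in the original Smith-Waterman algorithm) and setting fMatrix[i][j] = 0 along the way, but this doesn't remove the entire clump, only the exact alignment in that clump.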
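A minimal sketch of that attempt, assuming the 1/2/3 direction codes from the snippet above and treating any other code as "stop". The function name and the tiny matrices are hypothetical, chosen only to show that cells off the traced path keep their scores:

```python
def zero_traceback_path(f_matrix, back_matrix, end_coords):
    """Walk back from end_coords, zeroing each visited fMatrix cell."""
    i, j = end_coords
    path = []
    while f_matrix[i][j] > 0:
        path.append((i, j))
        f_matrix[i][j] = 0
        move = back_matrix[i][j]
        if move == 2:      # match/mismatch: diagonal step
            i, j = i - 1, j - 1
        elif move == 3:    # insertion: horizontal step
            j -= 1
        elif move == 1:    # deletion: vertical step
            i -= 1
        else:              # no recorded direction: stop
            break
    return path[::-1]


# Hypothetical 2x2 example (plus the zero border row and column).
f_matrix = [[0, 0, 0],
            [0, 5, 1],
            [0, 1, 10]]
back_matrix = [[0, 0, 0],
               [0, 2, 3],
               [0, 1, 2]]
path = zero_traceback_path(f_matrix, back_matrix, (2, 2))
print(path)
print(f_matrix)
```

The two off-path cells still hold 1 afterwards, which is the behaviour described: only the exact alignment is cleared, not the rest of its clump.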
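For some background, the page on the Smith-Waterman algorithm explains how the fMatrix is constructed and how backtracking works. A rough explanation of the Waterman-Eggert algorithm is also available.
Thanks.
OK. Here is some code that does what you want. I used the pretty-print library (pprint) so the output looks nice. (It looks even better when the numbers in the matrix are single digits; with multi-digit numbers the alignment gets a bit messy.)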
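How does it work?
Since you only need to change the numbers on the main diagonal and on the diagonals directly above and below it, a single loop is enough:
matrix[i][i] is always on the main diagonal, because it is i rows down and i columns across.
matrix[i][i-1] is always on the lower adjacent diagonal, because it is i rows down and i-1 columns across.
matrix[i-1][i] is always on the upper adjacent diagonal, because it is i-1 rows down and i columns across.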
#!/usr/bin/python
import pprint
matrix = [
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,],
[ 0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0,],
[ 0, 0, 13, 0, 0, 0, 13, 0, 0, 0, 13, 0, 0,],
[ 0, 0, 0, 17, 0, 0, 0, 17, 0, 0, 0, 17, 0,],
[ 0, 0, 0, 0, 23, 0, 0, 0, 23, 0, 0, 0, 23,],
[ 0, 8, 0, 0, 0, 31, 0, 0, 0, 31, 0, 0, 0,],
[ 0, 0, 13, 0, 0, 0, 36, 0, 0, 0, 36, 0, 0,],
[ 0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 0, 40, 0,],
[ 0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 0, 46,],
[ 0, 8, 0, 0, 0, 31, 0, 0, 0, 54, 4, 0, 0,],
[ 0, 0, 13, 0, 0, 0, 36, 0, 0, 4, 59, 9, 0,],
[ 0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 9, 63, 13,],
[ 0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 13, 69,]]
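print("Original Matrix")
pprint.pprint(matrix)
print("")

# Zero the main diagonal and the two diagonals directly adjacent to it.
for i in range(len(matrix)):
    matrix[i][i] = 0
    if i > 0:  # row 0 and column 0 have no neighbour above or to the left
        matrix[i][i-1] = 0  # lower adjacent diagonal
        matrix[i-1][i] = 0  # upper adjacent diagonal

print("New Matrix")
pprint.pprint(matrix)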
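Zeroing fixed diagonals works for this self-alignment, where the best path is the main diagonal. For a path that lies elsewhere, a more general way to declump is to mark the residue pairs of each reported alignment as unusable and refill the matrix, so that every score that depended on those pairs is recomputed. The sketch below is a minimal illustration of that idea, not the original poster's code and not a complete Waterman-Eggert implementation. The function names, the match/mismatch/gap scores, and the choice to refill the whole matrix rather than only the affected region are all assumptions made for brevity:

```python
def fill_matrix(a, b, blocked, match=2, mismatch=-1, gap=-2):
    """Fill a Smith-Waterman score matrix; cells in `blocked` are held at 0."""
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if (i, j) in blocked:
                continue  # pair already used by a reported alignment
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i][j - 1] + gap,
                              score[i - 1][j] + gap)
    return score


def trace_best(score, a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned (i, j) pairs) for the highest-scoring cell."""
    cells = [(i, j) for i in range(len(score)) for j in range(len(score[0]))]
    i, j = max(cells, key=lambda c: score[c[0]][c[1]])
    best, pairs = score[i][j], []
    while score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i, j))  # diagonal step: residues i and j are aligned
            i, j = i - 1, j - 1
        elif score[i][j] == score[i][j - 1] + gap:
            j -= 1  # horizontal step: gap in sequence a
        else:
            i -= 1  # vertical step: gap in sequence b
    return best, pairs[::-1]


def top_alignments(a, b, count):
    """Report up to `count` alignments that share no aligned residue pair."""
    blocked, found = set(), []
    for _ in range(count):
        score = fill_matrix(a, b, blocked)
        best, pairs = trace_best(score, a, b)
        if best == 0:
            break  # nothing left to report
        found.append((best, pairs))
        blocked.update(pairs)  # these pairs may not be reused
    return found


seq = "HEAGHEAGHEAG"
for best, pairs in top_alignments(seq, seq, 3):
    print(best, pairs[0], pairs[-1])
```

Because blocked cells stay at 0 during the refill, any path that ran through an already reported pair loses the score it inherited from it, which is what removes the rest of the clump. Refilling the full matrix on every round is the simplest way to show this; it trades speed for clarity.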
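So what exactly do you want to remove? Do you just want to remove the values on the main diagonal?
In this simplified version, only the values greater than zero that lie along the main diagonal or directly adjacent to it. Starting at F(1,1) and ending at F(12,12), but also including values such as F(11,12) and F(12,11). My first thought was to remove them in a set of for/if loops, but when I actually wrote it, it seemed unnecessarily convoluted.
Is there any reason why all the numbers in the matrix are followed by periods rather than commas?
If this isn't quite what you meant, let me know more details.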