Python: finding the similarity between all IDs and their associated strings/sequences
I have a dataframe with two columns, as shown below. I want to compute the Smith-Waterman similarity between all of these sequences, using the function defined below.
import numpy as np

# sub_cost(a, b) is the substitution-score function; its definition is not shown in the question.
def smith_waterman(seq2, seq1, d=-8):
    m = len(seq1)
    n = len(seq2)
    mat = np.zeros((m+1, n+1))  # Score matrix; row 0 and column 0 stay at zero
    # Fill every cell from its diagonal, upper, and left neighbours
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = mat[i-1][j-1] + sub_cost(seq1[i-1], seq2[j-1])
            up = mat[i-1][j] + d
            left = mat[i][j-1] + d
            mat[i][j] = max(0, diag, up, left)  # local alignment: never below 0
    #print("Matrix:")
    #print(mat)
    # Finding highest value and storing its location (first occurrence on ties)
    highest_value = np.where(mat == np.amax(mat))
    highest_value_location = list(zip(highest_value[0], highest_value[1]))[0]
    traceback_seq1, traceback_seq2 = '', ''
    i, j = highest_value_location[0], highest_value_location[1]
    # Backward algorithm for getting traceback sequences
    # (stop at the matrix border or at the first zero cell)
    while i > 0 and j > 0:
        current_score = mat[i][j]
        if current_score == 0:
            break
        diag_score = mat[i-1][j-1]
        left_score = mat[i][j-1]
        up_score = mat[i-1][j]
        if current_score == diag_score + sub_cost(seq1[i-1], seq2[j-1]):
            t1, t2 = seq2[j-1], seq1[i-1]  # match or mismatch
            i, j = i-1, j-1
        elif current_score == up_score + d:
            t1, t2 = '-', seq1[i-1]  # gap in seq2
            i -= 1
        elif current_score == left_score + d:
            t1, t2 = seq2[j-1], '-'  # gap in seq1
            j -= 1
        traceback_seq1 += t1
        traceback_seq2 += t2
    # The traceback was built back to front, so reverse it
    traceback_seq1 = traceback_seq1[::-1]
    traceback_seq2 = traceback_seq2[::-1]
    #print("Highest value in matrix: ", np.amax(mat))
    #print("Traceback Sequences for", seq2, "versus", seq1)
    #print(traceback_seq1)
    #print(traceback_seq2)
    return np.amax(mat)
finalDF[['Variant ID','original_sequence']].head()
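The function above takes two strings and returns a number.
How can I compute the similarity between all of these sequences and put the most similar ones together in a dataframe, like this?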
   Variants                                Similarity RANK
0  rs1800872, rs139073251                  1
1  rs5743626, rs139352858                  2
2  rs139073251, rs139352858, rs141219090   3
....
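You can combine a clustering algorithm with your metric (hierarchical clustering or k-medoids), build a dataframe in which each row holds one group, and then assign a rank based on similarity. In your case, a group's similarity is the sum of the distances between the group's members and the cluster's centroid.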
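Both clustering options need a pairwise distance matrix as input. A minimal sketch of building one is shown below. The `toy_score` function is a hypothetical stand-in for `smith_waterman` (whose `sub_cost` helper is not shown in the question), and the IDs and sequences are made up. The sketch also assumes that the score is symmetric and that every sequence has a positive score against itself, so the self-scores can be used for normalisation.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def toy_score(a, b):
    """Hypothetical stand-in for smith_waterman: counts matching positions."""
    return float(sum(x == y for x, y in zip(a, b)))


def pairwise_scores(ids, seqs, score_fn):
    """Return a symmetric DataFrame holding score_fn for every pair of sequences."""
    n = len(ids)
    scores = np.zeros((n, n))
    for i in range(n):
        scores[i, i] = score_fn(seqs[i], seqs[i])
    for i, j in combinations(range(n), 2):
        # Assumes a symmetric score, so each pair is computed only once.
        scores[i, j] = scores[j, i] = score_fn(seqs[i], seqs[j])
    return pd.DataFrame(scores, index=ids, columns=ids)


def scores_to_distances(scores):
    """Normalise by the self-scores and flip: 0 means identical, 1 means nothing shared."""
    self_scores = np.diag(scores.values)
    norm = np.sqrt(np.outer(self_scores, self_scores))
    dist = 1.0 - scores.values / norm
    np.fill_diagonal(dist, 0.0)
    return pd.DataFrame(dist, index=scores.index, columns=scores.columns)


ids = ["id_a", "id_b", "id_c"]         # made-up IDs
seqs = ["ACGTAC", "ACGTTC", "TTTTTT"]  # made-up sequences
scores = pairwise_scores(ids, seqs, toy_score)
distances = scores_to_distances(scores)
print(distances.round(3))
```

With the real data, `ids` and `seqs` would presumably come from the `Variant ID` and `original_sequence` columns, and `score_fn` would be `smith_waterman`. The number of calls grows quadratically with the number of sequences, which may matter because each Smith-Waterman call is itself quadratic in sequence length.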
This approach has some drawbacks, though:
- You have to choose the number of clusters yourself
- Each sequence can only have one label
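The grouping and ranking steps could be sketched with SciPy's hierarchical clustering as follows. The IDs and the distance matrix are made up for illustration. Cutting the tree at a distance threshold (`t=0.5`, an arbitrary value) is one possible way to avoid fixing the number of clusters up front, although the threshold then has to be chosen instead. Since hierarchical clustering has no centroid, the sketch uses the medoid (the most central member) as a stand-in when computing a group's spread; that is an interpretation, not something the answer specifies.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up IDs and a made-up symmetric distance matrix (0 = identical).
ids = ["id_a", "id_b", "id_c", "id_d", "id_e"]
dist = np.array([
    [0.0, 0.1, 0.9, 0.8, 0.9],
    [0.1, 0.0, 0.8, 0.9, 0.9],
    [0.9, 0.8, 0.0, 0.3, 0.2],
    [0.8, 0.9, 0.3, 0.0, 0.3],
    [0.9, 0.9, 0.2, 0.3, 0.0],
])

# linkage expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(dist), method="average")
# Cut the tree at a distance threshold instead of fixing the cluster count.
labels = fcluster(Z, t=0.5, criterion="distance")

rows = []
for label in np.unique(labels):
    members = np.flatnonzero(labels == label)
    if len(members) < 2:
        continue  # a single sequence is not treated as a group here
    sub = dist[np.ix_(members, members)]
    # Sum of distances from the medoid (most central member) to the others.
    spread = sub.sum(axis=1).min()
    rows.append({"Variants": ", ".join(ids[m] for m in members), "spread": spread})

# Smaller spread means a tighter group, so it gets a better (lower) rank.
groups = pd.DataFrame(rows).sort_values("spread").reset_index(drop=True)
groups["Similarity RANK"] = np.arange(1, len(groups) + 1)
print(groups)
```

A summed spread tends to grow with group size, so dividing by the number of members may give a fairer ranking. The second drawback still applies: each ID ends up in exactly one group, so an ID cannot appear in several rows.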