Python: finding the similarity between all IDs and their associated strings/sequences

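I have a dataframe consisting of two columns, as shown below. I want to compute the Smith-Waterman similarity between all of these sequences using the function defined below.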

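import numpy as np

# NOTE: sub_cost(a, b) is not shown in the question; it is assumed to be
# defined elsewhere and to return the substitution score for two symbols.
def smith_waterman(seq2, seq1, d=-8):
    # Returns the best local-alignment score; d is the gap penalty.
    m = len(seq1)
    n = len(seq2)
    mat = np.zeros((m+1, n+1))      # Score matrix, initialised to zeros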
    
    # Add elements to all rows and columns
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = mat[i-1][j-1] + sub_cost(seq1[i-1], seq2[j-1])
            up = mat[i-1][j] + d
            left = mat[i][j-1] + d
            mat[i][j] = max(0, diag, up, left)
    
    #print("Matrix:")
    #print(mat)
    
    # Finding highest value and storing its location
    highest_value = np.where(mat == np.amax(mat))
    highest_value_location = list(zip(highest_value[0], highest_value[1]))[0]
    
    traceback_seq1, traceback_seq2 = '', ''
    i, j = highest_value_location[0], highest_value_location[1]
    
    # Backward algorithm for getting traceback sequences
    while i > 0 and j > 0:  # stop at the matrix border so i-1 / j-1 never wrap around
        current_score = mat[i][j]
        diag_score = mat[i-1][j-1]
        left_score = mat[i][j-1]
        up_score = mat[i-1][j]
                
        if (current_score==0):
            break
        
        if (current_score == diag_score + sub_cost(seq1[i-1], seq2[j-1])):
            t1, t2 = seq2[j-1], seq1[i-1]
            i,j = i-1,j-1
        elif (current_score == up_score + d):
            t1, t2 = '-', seq1[i-1]
            i -= 1
        elif (current_score == left_score + d):
            t1, t2 = seq2[j-1], '-' 
            j -= 1
        traceback_seq1 += t1
        traceback_seq2 += t2
    
    traceback_seq1 = (traceback_seq1[::-1])
    traceback_seq2 = (traceback_seq2[::-1])
    
    #print()
    #print("Highest value in matrix: ", np.amax(mat))
    #print()
    #print("Traceback Sequences for", seq2, "versus", seq1)
    #print(traceback_seq1)
    #print(traceback_seq2)
    return np.amax(mat)
finalDF[['Variant ID','original_sequence']].head()

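The function above takes two strings and returns a number.

How can I compute the similarity between all of these sequences and put the most similar ones together in a dataframe, like this?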

                Variants                 Similarity RANK
0   rs1800872, rs139073251                   1    
1   rs5743626, rs139352858                   2
2   rs139073251, rs139352858, rs141219090    3
....

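You can combine one of the clustering algorithms (hierarchical or k-medoids) with your metric, build a dataframe in which each row holds one group, and then assign a rank based on similarity. In your case, the similarity of a group is the sum of the distances between the group's objects and its cluster centroid.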

This approach does have some drawbacks, though:

  • You need to choose the number of clusters yourself

  • Each sequence can have only one label
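
To make the clustering-and-ranking idea concrete, here is a minimal sketch in plain Python. It assumes pairwise distances have already been derived from the similarity scores (how to convert a score into a distance is not specified in the answer, so that step is left out). The helper names `agglomerate`, `medoid_cost`, and `rank_groups`, the IDs, and the toy distances are all hypothetical. It uses a naive average-linkage agglomerative clustering as a stand-in for a library implementation, and it measures each group by the sum of distances from its members to the group's medoid (the member that minimises that sum).

```python
from itertools import combinations

def agglomerate(ids, dist, n_clusters):
    """Naive average-linkage agglomerative clustering (illustrative only).

    ids: list of hashable IDs.
    dist: dict mapping frozenset({a, b}) -> distance between a and b.
    n_clusters: number of groups to stop at (chosen by the caller).
    """
    clusters = [[i] for i in ids]

    def avg_link(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the two closest clusters (x < y always holds).
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_link(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

def medoid_cost(cluster, dist):
    """Sum of distances from all members to the medoid of the cluster."""
    return min(sum(dist[frozenset((m, o))] for o in cluster if o != m)
               for m in cluster)

def rank_groups(clusters, dist):
    """Order groups from tightest to loosest; singletons are placed last."""
    ordered = sorted(clusters, key=lambda c: (len(c) < 2, medoid_cost(c, dist)))
    return [(rank, sorted(c)) for rank, c in enumerate(ordered, start=1)]

# Made-up distances: every pair is far apart (10.0) unless listed below.
ids = ["id1", "id2", "id3", "id4", "id5"]
dist = {frozenset(p): 10.0 for p in combinations(ids, 2)}
dist[frozenset(("id1", "id2"))] = 1.0
dist[frozenset(("id3", "id4"))] = 2.0
dist[frozenset(("id3", "id5"))] = 3.0
dist[frozenset(("id4", "id5"))] = 3.0

ranked = rank_groups(agglomerate(ids, dist, n_clusters=2), dist)
for rank, members in ranked:
    print(rank, ", ".join(members))
```

Both drawbacks listed above are visible here: `n_clusters` has to be supplied up front, and every ID lands in exactly one group.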


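Hey! Do you have the same ID on several rows? Is that correct in your case? How many groups do you want?

@roddar92 No, all the IDs are different, and the groups can contain any number of similar IDs. But if I could get an ID-vs-ID matrix, I think I could define a threshold. I want to compute the similarity using the function I defined above, that's all.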
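
The ID-vs-ID matrix and threshold mentioned in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions: `pairwise_scores` and `groups_above` are hypothetical helpers, `score` stands for any symmetric two-string function such as `smith_waterman` from the question, and the toy `shared_prefix` scorer and the sequences are made up so that the example runs on its own. Grouping by threshold is interpreted here as connected components: two IDs end up in the same group if a chain of pairs scoring at or above the threshold links them.

```python
from itertools import combinations

def pairwise_scores(seqs, score):
    """Return a nested dict with matrix[id_a][id_b] = score(seq_a, seq_b).

    Assumes the score is symmetric, so each unordered pair is computed once.
    """
    matrix = {i: {} for i in seqs}
    for a, b in combinations(seqs, 2):
        matrix[a][b] = matrix[b][a] = score(seqs[a], seqs[b])
    for a in seqs:
        matrix[a][a] = score(seqs[a], seqs[a])
    return matrix

def groups_above(matrix, threshold):
    """Group IDs linked by scores >= threshold (connected components)."""
    parent = {i: i for i in matrix}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in matrix:
        for b, s in matrix[a].items():
            if a != b and s >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for i in matrix:
        groups.setdefault(find(i), []).append(i)
    # Keep only groups with at least two members.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

def shared_prefix(s, t):
    """Toy stand-in for smith_waterman: length of the common prefix."""
    n = 0
    for x, y in zip(s, t):
        if x != y:
            break
        n += 1
    return n

# Made-up IDs and sequences, purely for illustration.
seqs = {"id1": "ACGTAC", "id2": "ACGTTT", "id3": "TTGACA",
        "id4": "TTGACC", "id5": "GGGGGG"}
matrix = pairwise_scores(seqs, shared_prefix)
groups = groups_above(matrix, threshold=4)
print(groups)
```

With the dataframe from the question, `seqs` could presumably be built as `dict(zip(finalDF['Variant ID'], finalDF['original_sequence']))`, and `pandas.DataFrame(matrix)` would turn the nested dict into a labelled square table. Note that the number of `score` calls grows quadratically with the number of IDs.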