Fuzzy string matching on a DataFrame in Python PySpark


python-3.x pyspark


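I am using Python PySpark in a Jupyter notebook to run a fuzzy similarity match across all rows of the "name" column. The expected output is a column of similar strings, plus a new column holding the score for each of those strings. My question is very similar to another question, except that one is in R and uses two datasets (mine has only one). I am fairly new to Python, so I am not sure how to go about it. I also have some simple code with a similar function, but I do not know how to run it on a DataFrame.

The code is as follows: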

import numpy as np
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Fill the first column and first row with the prefix lengths (base cases).
    # Two separate loops, so the base cases are set even if one string is empty.
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions    
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions
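    # The bottom-right cell holds the distance between the full strings.
    # Indexing with rows-1 and cols-1 also works when s or t is empty,
    # where the loop variables row and col would be undefined.
    edits = distance[rows-1][cols-1]
    if ratio_calc:
        # Computation of the Levenshtein Distance Ratio
        total = len(s) + len(t)
        if total == 0:
            return 1.0  # two empty strings are identical
        return (total - edits) / total
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string s to string t
        return "The strings are {} edits away".format(edits)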

# Example on two plain strings (kept at module level, outside the function)
Str1 = "Apple Inc."
Str2 = "Jo Inc"
Distance = levenshtein_ratio_and_distance(Str1, Str2)
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1, Str2, ratio_calc=True)
print(Ratio)

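However, the code above only works on individual strings. I want to run it with a DataFrame as input instead of strings. For example, the input data is (the dataset is named customer):

The expected result is: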

   name      b_name                       dist
   Ace Co    Ace Co.                      0.64762
   Baes      Bayes Inc., Bayes,Bays, Bcy  0.80000,0.86667,0.70000,0.97778
   asd       asdf                         0.08333
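
To make the single-dataset case concrete, here is a minimal plain-Python sketch (no Spark) that compares every name with every other name and keeps the pairs above a threshold. The helper names `lev_ratio` and `similar_names` and the 0.7 threshold are hypothetical choices for illustration; the ratio uses a substitution cost of 2, as the function above does, so the scores will not necessarily reproduce the numbers in the expected table.

```python
from itertools import permutations


def lev_ratio(s, t):
    """Levenshtein ratio with substitution cost 2 (illustrative helper)."""
    total = len(s) + len(t)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    # prev holds the previous row of the distance matrix
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 2
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return (total - prev[-1]) / total


def similar_names(names, threshold=0.7):
    """Map each name to the other names whose ratio reaches the threshold."""
    matches = {name: [] for name in names}
    for a, b in permutations(names, 2):  # every ordered pair, a != b by position
        score = lev_ratio(a, b)
        if score >= threshold:
            matches[a].append((b, round(score, 5)))
    return matches


# Names taken from the expected-result table in the question
names = ["Ace Co", "Ace Co.", "Baes", "Bayes Inc.", "Bayes", "Bays", "Bcy", "asd", "asdf"]
for name, found in similar_names(names).items():
    print(name, found)
```

This is quadratic in the number of names, so it is only meant to show the shape of the result, not to scale to a large DataFrame.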

This is included directly in Spark.
