Python: comparing each cell of one column against an entire column in PySpark

python pyspark

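The reason I'm trying this is that I want to avoid a cross join of two different columns, and this is the only approach I could think of.

I have a table with two columns. I can compare column A with column B to create column C just fine: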

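import pandas as pd
from pyspark.sql import SparkSession, functions as f

# Reuse the active session (e.g. in a notebook) or start a new one
spark = SparkSession.builder.getOrCreate()

dfa = pd.DataFrame({
    "A": [
        'perfect match!',
        'almost perfect match',
        'not even close',
        'another perfect!',
        'zzzzzzzzzzz'
    ],
    "B": [
        'perfect match!',
        'almost perfect',
        'zzzzzzzzzzz',
        'another perfect!',
        'xxxxxxxxxxxxxx'
    ]
})

df = spark.createDataFrame(dfa)
# Row-wise equality check: A and B are compared on the same row only
df.select(['A', 'B', f.col('A') == f.col('B')]).show()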

+--------------------+----------------+-------+
|                   A|               B|(A = B)|
+--------------------+----------------+-------+
|      perfect match!|  perfect match!|   true|
|almost perfect match|  almost perfect|  false|
|      not even close|     zzzzzzzzzzz|  false|
|    another perfect!|another perfect!|   true|
|         zzzzzzzzzzz|  xxxxxxxxxxxxxx|  false|
+--------------------+----------------+-------+

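What I want to do is compare each row of column A against the whole of column B and get the best match.

I have some ideas about how to score a match (levenshtein, sequence_matcher), but I don't know how to compare one row against an entire column. Something like this: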

+--------------------+----------------+---------+
|                   A|               B|BestMatch|
+--------------------+----------------+---------+
|      perfect match!|  perfect match!|        0|
|almost perfect match|  almost perfect|        1|
|      not even close|     zzzzzzzzzzz|        3|
|    another perfect!|another perfect!|        3|
|         zzzzzzzzzzz|  xxxxxxxxxxxxxx|        2|
+--------------------+----------------+---------+

For example, the "zzzz" value in column A selects row 2 (counting from 0) as its best match.
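
One possible way to get this without a cross join is sketched below. It is only a sketch under assumptions that the question does not confirm: column B is small enough to collect to the driver, neither column contains nulls, and the order in which Spark returns the rows of B is the order you want to number them in (Spark does not guarantee row order in general). The scoring uses `difflib.SequenceMatcher` from the standard library, one of the two measures mentioned above. The names `best_match_index` and `add_best_match` are hypothetical helpers made up for this illustration.

```python
from difflib import SequenceMatcher


def best_match_index(value, candidates):
    """Return the position in `candidates` of the string most similar to `value`.

    Similarity is SequenceMatcher.ratio(); on ties the first position wins.
    """
    scores = [SequenceMatcher(None, value, c).ratio() for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def add_best_match(df, a_col="A", b_col="B"):
    """Hypothetical helper: add a BestMatch column to a Spark DataFrame.

    Assumes column `b_col` fits in driver memory and contains no nulls.
    """
    # Imported here so the scoring function above stays usable without Spark
    from pyspark.sql import functions as f
    from pyspark.sql.types import IntegerType

    # Pull the whole of column B to the driver once, as a plain Python list
    candidates = [row[b_col] for row in df.select(b_col).collect()]

    # The list is captured by the closure and shipped to the executors
    match_udf = f.udf(lambda a: best_match_index(a, candidates), IntegerType())
    return df.withColumn("BestMatch", match_udf(f.col(a_col)))
```

With the `df` from the question, `add_best_match(df).show()` would add the BestMatch column. The scoring function can also be tried on its own, for instance `best_match_index('zzzzzzzzzzz', list(dfa["B"]))`. If column B is too large to collect, this sketch does not apply and some form of join is probably unavoidable.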