Python: comparing a column cell by cell against an entire column's cells in PySpark
I'm trying to do this because I want to avoid a cross join between two different columns, and this is the only approach I could think of. I have a table with two columns. I can compare column A with column B row by row to create column C just fine:
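import pandas as pd
from pyspark.sql import functions as f  # `f` is used below, so it has to be imported

# Assumes an active SparkSession named `spark` (e.g. a notebook or the pyspark shell)
dfa = pd.DataFrame({
    "A": [
        'perfect match!',
        'almost perfect match',
        'not even close',
        'another perfect!',
        'zzzzzzzzzzz'
    ],
    "B": [
        'perfect match!',
        'almost perfect',
        'zzzzzzzzzzz',
        'another perfect!',
        'xxxxxxxxxxxxxx'
    ]
})
df = spark.createDataFrame(dfa)
# Row-wise comparison: each A value is compared only with the B value in the same row
df.select(['A', 'B', f.col('A') == f.col('B')]).show()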
+--------------------+----------------+-------+
|                   A|               B|(A = B)|
+--------------------+----------------+-------+
|      perfect match!|  perfect match!|   true|
|almost perfect match|  almost perfect|  false|
|      not even close|     zzzzzzzzzzz|  false|
|    another perfect!|another perfect!|   true|
|         zzzzzzzzzzz|  xxxxxxxxxxxxxx|  false|
+--------------------+----------------+-------+
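What I want to do is compare each row of column A against all of column B and get the best match.
I have some ideas about how to score a match (levenshtein, sequence_matcher), but I don't know how to compare a single row against an entire column. The result should look like this: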
+--------------------+----------------+---------+
|                   A|               B|BestMatch|
+--------------------+----------------+---------+
|      perfect match!|  perfect match!|        0|
|almost perfect match|  almost perfect|        1|
|      not even close|     zzzzzzzzzzz|        3|
|    another perfect!|another perfect!|        3|
|         zzzzzzzzzzz|  xxxxxxxxxxxxxx|        2|
+--------------------+----------------+---------+
For example, the "zzzz" value in column A selects row 2 (counting from 0) as its best match.
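
To pin down what "best match" means here, the following is a minimal plain-Python sketch, not a PySpark solution. It assumes the score is `difflib.SequenceMatcher.ratio()` (one of the two ideas mentioned above; a Levenshtein distance could be swapped in), and the helper name `best_match_index` is made up for illustration. Note that it still compares every A value with every B value, so it does the same amount of work as a cross join; it only shows the result I am after.

```python
from difflib import SequenceMatcher

# Same data as the two columns in the example table above.
A = ['perfect match!', 'almost perfect match', 'not even close',
     'another perfect!', 'zzzzzzzzzzz']
B = ['perfect match!', 'almost perfect', 'zzzzzzzzzzz',
     'another perfect!', 'xxxxxxxxxxxxxx']


def best_match_index(value, candidates):
    """Return the 0-based index of the candidate most similar to `value`.

    Hypothetical helper: similarity is SequenceMatcher.ratio(), which is
    1.0 for identical strings and lower for less similar ones.
    """
    scores = [SequenceMatcher(None, value, c).ratio() for c in candidates]
    # max() keeps the first index when several candidates tie.
    return max(range(len(candidates)), key=scores.__getitem__)


# One best-match index per row of A, computed against all of B.
best = [best_match_index(a, B) for a in A]

for a, idx in zip(A, best):
    print(f"{a!r} -> row {idx} ({B[idx]!r})")
```

With this scoring, an A value that appears verbatim somewhere in B always picks that row, which is the behaviour described for the "zzzz" value. The result for a value with no close counterpart, such as 'not even close', depends entirely on which similarity measure is chosen.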