Python Rapidfuzz match and merge
I'm very new to this, so any advice would be appreciated. I have a "Projects" dataset showing a list of institutions with project IDs:
project_id  institution_name
0           somali national university
1           aarhus university
2           bath spa
3           aa school of architecture
4           actionaid uk
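I want to fuzzy-match and merge it with the following "Universities" dataset, which holds institution names and their country codes: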
institution_name                 country_code
a tan kapuja buddhista foiskola  HU
aa school of architecture        UK
bath spa university              UK
aalto-yliopisto                  FI
aarhus universitet               DK
to get this back:
project_id  institution_name            Match    organisation               country_code
0           somali national university  []       NaN                        NaN
1           aarhus university           [(91)]   aarhus universitet         DK
2           bath spa                    [(90)]   bath spa university        UK
3           aa school of architecture   [(100)]  aa school of architecture  UK
4           actionaid uk                []       NaN                        NaN
Using rapidfuzz:
import pandas as pd
import numpy as np
from rapidfuzz import process, utils as fuzz_utils

def fuzzy_merge(baseFrame, compareFrame, baseKey, compareKey, threshold=90, limit=1, how='left'):
    # baseFrame: the left table to join
    # compareFrame: the right table to join
    # baseKey: key column of the left table
    # compareKey: key column of the right table
    # threshold: minimum similarity score (0-100) a candidate needs to count as a match
    # limit: the number of matches that will get returned, sorted high to low
    # return: dataframe with both keys and matches
    s_mapping = {x: fuzz_utils.default_process(x) for x in compareFrame[compareKey]}
    m1 = baseFrame[baseKey].apply(lambda x: process.extract(
        fuzz_utils.default_process(x), s_mapping, limit=limit, score_cutoff=threshold, processor=None
    ))
    baseFrame['Match'] = m1
    m2 = baseFrame['Match'].apply(lambda x: ', '.join(i[2] for i in x))
    baseFrame['organisation'] = m2
    return baseFrame.merge(compareFrame, on=baseKey, how=how)

Merged = fuzzy_merge(Projects, Universities, 'institution_name', 'institution_name')
Merged
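I get this (there is some extra text in the Match column, but I'm not going into that now). It is almost what I want, but the country code is only filled in when the match is 100%: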
project_id  institution_name            Match    organisation               country_code
0           somali national university  []       NaN                        NaN
1           aarhus university           [(91)]   aarhus universitet         NaN
2           bath spa                    [(90)]   bath spa university        NaN
3           aa school of architecture   [(100)]  aa school of architecture  UK
4           actionaid uk                []       NaN                        NaN
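I think the problem is in how I compare baseKey and compareKey to build the merged dataset: the final merge joins on the original institution name, so only exact matches pick up a country code. However, I can't work out how to merge on "organisation" instead; my attempts to plug it in cause various errors.
Never mind, I found the answer: I hadn't accounted for the empty cells! Replacing them with NaN and merging on the matched name works well: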
def fuzzy_merge(baseFrame, compareFrame, baseKey, compareKey, threshold=90, limit=1, how='left'):
    s_mapping = {x: fuzz_utils.default_process(x) for x in compareFrame[compareKey]}
    m1 = baseFrame[baseKey].apply(lambda x: process.extract(
        fuzz_utils.default_process(x), s_mapping, limit=limit, score_cutoff=threshold, processor=None
    ))
    baseFrame['Match'] = m1
    m2 = baseFrame['Match'].apply(lambda x: ', '.join(i[2] for i in x))
    baseFrame['organisations'] = m2.replace("", np.nan)
    return baseFrame.merge(compareFrame, left_on='organisations', right_on=compareKey, how=how)