Fuzzy logic on big datasets using Python

Tags: python, fuzzy-logic, fuzzy-comparison, fuzzywuzzy, record-linkage

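My team has been stuck trying to run a fuzzy logic algorithm on two large datasets. The first (the subset) is about 180K rows and contains the names, addresses, and emails of the people we need to match in the second (the superset). The superset contains 2.5M records. Both have the same structure, and the data has already been cleaned, i.e. addresses parsed, names normalized, etc.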

ContactID int
FullName varchar(150)
Address varchar(100)
Email varchar(100)

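The goal is to match the values in each subset row against the corresponding values in the superset, so the output would combine the subset and superset rows along with a similarity percentage for each field (token):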

ContactID
LookupContactID
FullName
LookupFullName
FullName_Similarity
Address
LookupAddress
Address_Similarity
Email
LookupEmail
Email_Similarity
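
To make the target output concrete, here is a minimal sketch of how one output row could be built for a single pair of records. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the scores are not guaranteed to be identical. The helper names `similarity_percent` and `compare_records`, the English column names, and the sample records are all illustrative assumptions, not part of the original code.

```python
from difflib import SequenceMatcher

# Fields compared one by one (assumed English names for the columns).
FIELDS = ("FullName", "Address", "Email")


def similarity_percent(a, b):
    """Return a 0-100 similarity score (a stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def compare_records(subset_row, superset_row):
    """Build one output row: both IDs plus per-field values and similarities."""
    out = {
        "ContactID": subset_row["ContactID"],
        "LookupContactID": superset_row["ContactID"],
    }
    for field in FIELDS:
        out[field] = subset_row[field]
        out["Lookup" + field] = superset_row[field]
        out[field + "_Similarity"] = similarity_percent(
            subset_row[field], superset_row[field])
    return out


# Made-up sample records, for illustration only.
subset_row = {"ContactID": 1, "FullName": "john smith",
              "Address": "12 main st", "Email": "jsmith@example.com"}
superset_row = {"ContactID": 907, "FullName": "jon smith",
                "Address": "12 main street", "Email": "jsmith@example.com"}

row = compare_records(subset_row, superset_row)
for key in sorted(row):
    print(key, "=", row[key])
```

Scoring each field separately, rather than one concatenated string, is what yields the three similarity columns.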

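To simplify and test the code first, we concatenated the fields into a single string, and we know the code works on a very small superset; however, once we increase the number of records, it gets stuck. We have tried different algorithms (Levenshtein, FuzzyWuzzy, etc.) to no avail. In my opinion, the problem is that Python processes the data row by row; however, I am not sure. We even tried running it on our Hadoop cluster using Hadoop Streaming; that has not yielded any positive results either.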
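
A rough back-of-the-envelope calculation may help explain why the brute-force approach stalls as the record count grows. The row counts are taken from the description above; the throughput figure is purely an assumed placeholder, not a measurement of FuzzyWuzzy or Levenshtein.

```python
SUBSET_ROWS = 180_000        # "about 180K rows" in the description
SUPERSET_ROWS = 2_500_000    # "2.5M records" in the description

# ASSUMPTION: an illustrative throughput, not a measured number.
ASSUMED_COMPARISONS_PER_SECOND = 1_000_000

# Every subset row is compared against every superset row.
total_comparisons = SUBSET_ROWS * SUPERSET_ROWS
seconds = total_comparisons / ASSUMED_COMPARISONS_PER_SECOND
days = seconds / 86_400

print("comparisons:", total_comparisons)
print("days at the assumed rate: %.1f" % days)
```

The work grows with the product of the two dataset sizes, so a script that finishes quickly on a small superset can still take days on the full one, whichever string metric is used.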

#!/usr/bin/env python
import sys
from fuzzywuzzy import fuzz
import datetime
import time
import Levenshtein

#init for comparison
with open('normalized_set_record_set.csv') as normalized_records_ALL_file:
# with open('delete_this/xab') as normalized_records_ALL_file:
    normalized_records_ALL_dict = {}
    for line in normalized_records_ALL_file:
        key, value = line.strip('\n').split(':', 1)
        normalized_records_ALL_dict[key] = value
        # normalized_records_ALL_dict[contact_id] = concat_record

def score_it_bag(target_contact_id, target_str, ALL_records_dict):
    '''
    INPUT target_str, ALL_records_dict
    OUTPUT sorted list by highest fuzzy match
    '''
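    # Score every record against the target, highest score first.
    # .items() works on Python 2 and 3; .iteritems() is Python 2 only.
    return sorted(
        [(value_str, contact_id_index_str, fuzz.ratio(target_str, value_str))
         for contact_id_index_str, value_str in ALL_records_dict.items()],
        key=lambda x: x[2], reverse=True)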

def score_it_closest_match_pandas(target_contact_id, target_str, place_holder_delete):
    '''
    INPUT target_str, ALL_records_dict
    OUTPUT closest match
    '''
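    # NOTE: relies on a module-level pandas DataFrame `df_ALL` with a
    # `concat_record` column; it is not defined anywhere in this script.
    # TODO: simply drop this index target_contact_id
    df_score = df_ALL.concat_record.apply(lambda x: fuzz.ratio(target_str, x))
    best_idx = df_score.idxmax()  # computed once and reused below

    return df_ALL.concat_record[best_idx], df_score.max(), best_idx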

def score_it_closest_match_L(target_contact_id, target_str, ALL_records_dict_input):
    '''
    INPUT target_str, ALL_records_dict
    OUTPUT closest match tuple (best matching str, score, contact_id of best match str)
    '''
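    # Only candidates with an edit distance below 100 are considered.
    best_score = 100
    # Initialise so the return below cannot raise UnboundLocalError
    # when no candidate beats the initial best_score.
    best_match_id = None
    best_match_str = None

    #score it (a lower Levenshtein distance means a closer match)
    for comparison_contactid, comparison_record_str in ALL_records_dict_input.items():
        if target_contact_id != comparison_contactid:
            current_score = Levenshtein.distance(target_str, comparison_record_str)

            if current_score < best_score:
                best_score = current_score
                best_match_id = comparison_contactid
                best_match_str = comparison_record_str

    return (best_match_str, best_score, best_match_id)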



def score_it_closest_match_fuzz(target_contact_id, target_str, ALL_records_dict_input):
    '''
    INPUT target_str, ALL_records_dict
    OUTPUT closest match tuple (best matching str, score, contact_id of best match str)
    '''
    best_score = 0
    # Initialise so the return below cannot raise UnboundLocalError
    # when no candidate scores above 0.
    best_match_id = None
    best_match_str = None

    #score it (a higher fuzz.ratio means a closer match)
    for comparison_contactid, comparison_record_str in ALL_records_dict_input.items():
        if target_contact_id != comparison_contactid:
            current_score = fuzz.ratio(target_str, comparison_record_str)

            if current_score > best_score:
                best_score = current_score
                best_match_id = comparison_contactid
                best_match_str = comparison_record_str

    return (best_match_str, best_score, best_match_id)

def score_it_threshold_match(target_contact_id, target_str, ALL_records_dict_input):
    '''
    INPUT target_str, ALL_records_dict
    OUTPUT first match scoring above the threshold as a tuple
           (matching str, score, contact_id), or (None, None, None)
    '''
    score_threshold = 95

    #score it, stopping at the first record that clears the threshold
    for comparison_contactid, comparison_record_str in ALL_records_dict_input.items():
        if target_contact_id != comparison_contactid:
            current_score = fuzz.ratio(target_str, comparison_record_str)

            if current_score > score_threshold:
                return (comparison_record_str, current_score, comparison_contactid)

    return (None, None, None)


def score_it_closest_match_threshold_bag(target_contact_id, target_str, ALL_records_dict):
    '''
    INPUT target_str, ALL_records_dict
    OUTPUT list of (matching str, score, contact_id) tuples scoring above
           the threshold, or None if there are none
    '''
    threshold_score = 80
    top_matches_list = []
    #score every record other than the target itself
    for comparison_contactid, comparison_record_str in ALL_records_dict.items():
        if target_contact_id != comparison_contactid:
            current_score = fuzz.ratio(target_str, comparison_record_str)

            if current_score > threshold_score:
                top_matches_list.append((comparison_record_str, current_score, comparison_contactid))

    # Return None explicitly (as before) when nothing clears the threshold.
    return top_matches_list if top_matches_list else None

def score_it_closest_match_threshold_bag_print(target_contact_id, target_str, ALL_records_dict):
    '''
    INPUT target_str, ALL_records_dict
    OUTPUT nothing returned; prints every match scoring above the threshold
    '''
    threshold_score = 80

    #iterate through dictionary
    for comparison_contactid, comparison_record_str in ALL_records_dict.items():
        if target_contact_id != comparison_contactid:

            #score it
            current_score = fuzz.ratio(target_str, comparison_record_str)
            if current_score > threshold_score:
                # print(...) with a single argument works on Python 2 and 3
                print(target_contact_id + ':' + str((target_str, comparison_record_str, current_score, comparison_contactid)))


#stream in all contacts ie large set
for line in sys.stdin:
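    # ERROR DIAG TOOL: log each input line with a timestamp to stderr
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    # sys.stderr.write works on Python 2 and 3 (print >> is Python 2 only)
    sys.stderr.write('%s %s\n' % (line.rstrip('\n'), st))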

    contact_id, target_str = line.strip().split(':', 1)

    score_it_closest_match_threshold_bag_print(contact_id, target_str, normalized_records_ALL_dict)
    # output = (target_str, score_it_closest_match_fuzz(contact_id, target_str, normalized_records_ALL_dict))
    # output = (target_str, score_it_closest_match_threshold_bag(contact_id, target_str, normalized_records_ALL_dict))
    # print contact_id + ':' + str(output)
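
For readers who want to try the logic on a handful of rows, the following is a small Python 3 sketch of the same `contact_id:record` line format and threshold filter. It substitutes `difflib.SequenceMatcher` for `fuzz.ratio` so that it runs without third-party packages, which means the scores may differ from FuzzyWuzzy's. The function names `parse_line`, `ratio`, and `matches_above` and the sample lines are hypothetical.

```python
from difflib import SequenceMatcher


def parse_line(line):
    """Split 'contact_id:record' on the first colon only, as the script does."""
    contact_id, record = line.rstrip("\n").split(":", 1)
    return contact_id, record


def ratio(a, b):
    """0-100 similarity; a stand-in for fuzz.ratio, not the same implementation."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def matches_above(target_id, target_str, records, threshold=80):
    """Return (record, score, contact_id) for other records above the threshold."""
    hits = []
    for contact_id, record in records.items():
        if contact_id == target_id:
            continue  # never compare a record with itself
        score = ratio(target_str, record)
        if score > threshold:
            hits.append((record, score, contact_id))
    return hits


# Made-up sample data in the same line format as the input file.
superset_lines = [
    "1:john smith 12 main st jsmith@example.com\n",
    "2:jon smith 12 main street jsmith@example.com\n",
    "3:maria garcia 4 oak ave mgarcia@example.com\n",
]
records = dict(parse_line(line) for line in superset_lines)
hits = matches_above("1", records["1"], records)
print(hits)
```

Splitting on the first colon only matters here: a record containing further colons stays intact in the value part.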
