MySQL data matching: a better option?

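I have customers and leads coming from different sources, and I need to figure out whether a customer is already registered as a lead.

I use 12 fields for matching: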

address1_clear
address2_clear
address_clear
contact_name_clear
email
invoice_mobile
invoice_phone
mobile
name_clear
phone
phone2
taxnum
(The _clear suffix means the data is lowercased, with spaces and punctuation removed.)
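For readers who want to reproduce the _clear columns, here is a minimal sketch of that normalization. The function name clear is hypothetical, and the exact rules used to build the original columns are not given in the question, so treat the character filtering below as an assumption.

```python
import unicodedata


def clear(value):
    """Lowercase a value and drop everything that is not a letter or digit."""
    if value is None:
        return None
    lowered = value.lower()
    # Unicode categories starting with "L" (letters) or "N" (numbers) are kept;
    # whitespace, punctuation and symbols are dropped.
    return "".join(
        ch for ch in lowered if unicodedata.category(ch)[0] in ("L", "N")
    )


print(clear("ACME, Inc."))
print(clear(" 12 Main St. "))
```

Applying the same function to both leads and customers is what makes plain equality comparisons on these columns meaningful.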

  • lead: 300k records
  • customers: 500k records
  • customers_leads: 460k records
This is the query used to perform the matching:

SELECT l.id as lead_id, c.id as customer_id FROM lead l
INNER JOIN sync_settings s ON s.account_id = l.account_id
INNER JOIN customers c ON c.setting_id = s.id
LEFT JOIN customers_leads cl ON cl.customer_id = c.id AND cl.lead_id = l.id
WHERE cl.lead_id IS NULL AND
(
    (l.phone IS NOT NULL AND l.phone IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.mobile IS NOT NULL AND l.mobile != "" AND l.mobile IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.invoice_phone IS NOT NULL AND l.invoice_phone != "" AND l.invoice_phone IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.invoice_mobile IS NOT NULL AND l.invoice_mobile != "" AND l.invoice_mobile IN (c.phone, c.phone2, c.invoice_phone, c.invoice_mobile)) OR
    (l.email IS NOT NULL AND l.email != "" AND l.email = c.email) OR
    (l.taxnum IS NOT NULL AND l.taxnum != "" AND l.taxnum = c.taxnum) OR
    (l.contact_name_clear IS NOT NULL AND l.contact_name_clear != "" AND l.contact_name_clear = c.contact_name_clear) OR
    (l.address1_clear IS NOT NULL AND l.address1_clear != "" AND l.address1_clear = c.address_clear) OR
    (l.address2_clear IS NOT NULL AND l.address2_clear != "" AND l.address2_clear = c.address_clear) OR
    (l.name_clear IS NOT NULL AND l.name_clear != "" AND l.name_clear IN (c.contact_name_clear, c.name_clear))
)
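It is extremely heavy: the response time is about 4 minutes. Because of the ORs and the additional conditions, indexes don't help much.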

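I would like to know: is there a better way? Perhaps using some NoSQL database to build a huge hash table, or some data-matching technique that I haven't been able to find on Google.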
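To make the hash table idea concrete, here is a rough in-memory sketch in Python. It indexes customers by (kind, value) and then looks up each lead's values. The CUSTOMER_KEYS and LEAD_KEYS mappings and the function names are hypothetical and cover only the phone, email and taxnum conditions of the query. The account join through sync_settings and the exclusion of pairs already in customers_leads are left out.

```python
from collections import defaultdict

# Hypothetical mapping: which columns feed each kind of key,
# mirroring a subset of the OR conditions in the query.
CUSTOMER_KEYS = {
    "phone": ("phone", "phone2", "invoice_phone", "invoice_mobile"),
    "email": ("email",),
    "taxnum": ("taxnum",),
}
LEAD_KEYS = {
    "phone": ("phone", "mobile", "invoice_phone", "invoice_mobile"),
    "email": ("email",),
    "taxnum": ("taxnum",),
}


def build_index(customers):
    """Map (kind, value) to the set of customer ids carrying that value."""
    index = defaultdict(set)
    for customer in customers:
        for kind, columns in CUSTOMER_KEYS.items():
            for column in columns:
                value = customer.get(column)
                if value:  # skips both None and ""
                    index[(kind, value)].add(customer["id"])
    return index


def match_lead(lead, index):
    """Return ids of customers sharing at least one key value with the lead."""
    matches = set()
    for kind, columns in LEAD_KEYS.items():
        for column in columns:
            value = lead.get(column)
            if value:
                matches |= index.get((kind, value), set())
    return matches


customers = [
    {"id": 1, "phone": "111", "email": "a@x.com"},
    {"id": 2, "phone2": "222", "taxnum": "T9"},
]
index = build_index(customers)
print(match_lead({"id": 10, "mobile": "222", "email": "a@x.com"}, index))
```

Each lead then costs a handful of dictionary lookups instead of a comparison against every customer, which is the effect the OR-heavy join cannot get from indexes.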


P.S. I know I could create separate tables for the matching fields, which would be faster, but I would still like to know my alternatives.
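For reference, the separate-table layout could look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The table names lead_keys and customer_keys and the kind column are hypothetical, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lead_keys (lead_id INTEGER, kind TEXT, value TEXT);
CREATE TABLE customer_keys (customer_id INTEGER, kind TEXT, value TEXT);
-- One composite index replaces the chain of OR conditions.
CREATE INDEX idx_customer_keys ON customer_keys (kind, value);
""")

# One row per non-empty matching value; phone, mobile, invoice_phone and
# invoice_mobile would all be stored under the same kind ("phone").
conn.executemany("INSERT INTO lead_keys VALUES (?, ?, ?)", [
    (10, "phone", "222"),
    (10, "email", "a@x.com"),
    (11, "phone", "999"),
])
conn.executemany("INSERT INTO customer_keys VALUES (?, ?, ?)", [
    (1, "email", "a@x.com"),
    (2, "phone", "222"),
    (2, "phone", "333"),
])

# A plain equality join that an index on (kind, value) can serve.
pairs = conn.execute("""
    SELECT DISTINCT lk.lead_id, ck.customer_id
    FROM lead_keys lk
    JOIN customer_keys ck ON ck.kind = lk.kind AND ck.value = lk.value
    ORDER BY lk.lead_id, ck.customer_id
""").fetchall()
print(pairs)
```

The account restriction and the check against customers_leads would still have to be added to the final query.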

The problem you are facing is called record linkage, and there is no database solution that will solve it.


There are many open source projects you can use, including Dedupe (I am one of its main authors).

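Another open source project to consider is the Python Record Linkage Toolkit. The project includes an overview of the record linkage process, code examples for beginners, and API documentation.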
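To give a feel for what such a process involves, here is a toy sketch using only the Python standard library. It is not the API of Dedupe or of the Python Record Linkage Toolkit. The blocking key (first three characters of name_clear), the equal weighting of name and email, and the threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def link(leads, customers, threshold=0.75):
    """Return (lead_id, customer_id, score) for pairs scoring above threshold."""
    # Blocking: only compare records that share a cheap key,
    # so we never score the full leads x customers product.
    blocks = defaultdict(list)
    for customer in customers:
        name = customer.get("name_clear") or ""
        if name:
            blocks[name[:3]].append(customer)

    links = []
    for lead in leads:
        name = lead.get("name_clear") or ""
        if not name:
            continue
        for customer in blocks.get(name[:3], []):
            # Comparison: average fuzzy similarity over two fields.
            score = (
                similarity(name, customer.get("name_clear"))
                + similarity(lead.get("email"), customer.get("email"))
            ) / 2
            # Classification: a fixed cut-off decides what counts as a match.
            if score >= threshold:
                links.append((lead["id"], customer["id"], round(score, 2)))
    return links


leads = [{"id": 10, "name_clear": "acmeinc", "email": "info@acme.com"}]
customers = [
    {"id": 1, "name_clear": "acmeincorporated", "email": "info@acme.com"},
    {"id": 2, "name_clear": "zetallc", "email": "z@zeta.com"},
]
print(link(leads, customers))
```

Real record linkage tools typically learn the weights and the cut-off from labeled examples rather than hard-coding them as done here.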