SAS runs out of memory when fuzzy matching with hash tables on a small sample dataset

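I have a list of about 5,000,000 rows of names, phone numbers, and addresses. I am trying to create a list of unique customers at each address and assign each unique customer a key. Customer names are not formatted consistently, so John Smith may appear as Smith, John or John Smith Jr., and so on.

The logic I want to follow is:

If two records at an address have the same phone number, they are the same customer regardless of whether their names differ, and they get the same customer key.

If two records at an address do not have the same phone number (or have no phone number), but a fuzzy match on their names exceeds some threshold, they are treated as the same customer and get the same customer key.

Note that the same customer name + phone number matched at two separate addresses should not be assigned the same customer key.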
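
To make the two rules concrete, here is a minimal sketch that lists candidate pairs within each address. The rows in `testdat` are hypothetical, and the built-in `COMPLEV` function (Levenshtein edit distance) stands in for the `jaroT` function used in the code in this post, so any cutoff applied to `edit_distance` is an assumption and is not comparable to the 0.8 threshold.

```sas
/* Hypothetical sample rows; names and values are illustrative only */
data testdat;
   infile datalines dsd truncover;
   length customer_name $50 customer_phone $11 clean_address $5;
   input customer_name :$50. customer_phone :$11. clean_address :$5.;
   datalines;
John Smith,5551234567,A0001
"Smith, John",5551234567,A0001
John Smith Jr.,,A0001
John Smith,5551234567,B0002
;
run;

/* Pair up records only within the same address, so identical
   name + phone combinations at different addresses never meet */
proc sql;
   create table candidate_pairs as
   select a.clean_address,
          a.customer_name as name_a,
          b.customer_name as name_b,
          /* rule 1: same non-missing phone number */
          (not missing(a.customer_phone)
             and a.customer_phone = b.customer_phone) as phone_match,
          /* rule 2 input: edit distance between the two names */
          complev(strip(a.customer_name), strip(b.customer_name)) as edit_distance
   from testdat as a
        inner join testdat as b
        on a.clean_address = b.clean_address
           and a.customer_name < b.customer_name;
quit;
```

The join condition `a.customer_name < b.customer_name` keeps each pair once and skips self-pairs; it also skips records whose names are exactly identical, which would need no fuzzy comparison anyway.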

Below is a sample table with sample input and the desired output:

The code I tried on this sample dataset is as follows:


data customer_keys(keep=customer_name customer_key customer_phone clean_address);
   length customer_name $50 Comp_Name $20 customer_key 8 clean_address comp_address $5 customer_phone comp_phone $11.;

   if _N_ = 1 then do;
      declare hash h(multidata:'Y');
      h.defineKey('Comp_Name','comp_address');
      h.defineData('Comp_Name', 'customer_key','comp_address', 'comp_phone');
      h.defineDone();
      declare hiter hi('h');

      declare hash hh(multidata:'Y');
      hh.defineKey('customer_key');
      hh.defineData('customer_key', 'Comp_Name','comp_address','comp_phone');
      hh.defineDone(); 

      _customer_key=0; 
   end;

   set testdat;

   rc=h.find(key:customer_name, key:clean_address);

   if rc ne 0 then do;
      rc=hi.first();
      do while (rc=0);
         if not missing(customer_phone) and clean_address=comp_address and customer_phone=comp_phone
            then do;
                  h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
                  hh.add();
            end;
         else if not missing(customer_name) and clean_address=comp_address and jaroT(customer_name,Comp_name) ge 0.8
            then do;
            rc=hh.find();

            do while (r ne 0);

               dist2=jaroT(customer_name,Comp_name);
               hh.has_next(result:r);

               if r=0 & dist2 ge 0.8 then do;
                  h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
                  hh.add();
                  output;return;
               end;

               else if r ne 0 & dist2 ge 0.8
                then rc=hh.find_next();

               else if dist2 < 0.8
            then leave;

            end;

         end;

         rc=hi.next();

      end;

      _customer_key+1;
      customer_key=_customer_key;
      h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
      hh.add(key:customer_key, data:customer_key, data:customer_name, data:clean_address, data:customer_phone);
   end;

   output;
run;


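I noticed that if I completely remove the additional logic that handles phone numbers, I do not get this memory problem. However, I would still assume that my approach will fail anyway, for lack of memory, when I try to run it on the full dataset.

Search for "sas.exe -memsize 16G" using the Windows search function to start a new SAS session that has 16G of memory. You can change the number before the G as well. Also make sure you have enough disk space. Cheers.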
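
MEMSIZE can only be set when SAS starts, so it belongs on the command line or in the configuration file. A small sketch for checking what the current session actually received (16G is simply the figure from the suggestion above, not a recommendation):

```sas
/* At startup (Windows command line or shortcut target):
      sas.exe -memsize 16G                                  */

/* Inside the session: report the MEMSIZE value in effect */
proc options option=memsize value;
run;

/* The same value, written to the log through a macro function call */
%put NOTE: MEMSIZE is %sysfunc(getoption(memsize));
```

Raising the limit only helps when the hash tables are legitimately large; it will not cure a table that keeps growing on a handful of input rows.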

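I managed to solve the memory problem, and the code now works as expected! The error was in the part that handles matching phone numbers. The corrected code is below: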


data test customer_keys(keep=customer_name customer_key customer_phone clean_address);
   length customer_name $50 Comp_Name $20 customer_key 8 clean_address comp_address $22 customer_phone comp_phone $14.;

   if _N_ = 1 then do;
      declare hash h(multidata:'Y');
      h.defineKey('Comp_Name','comp_address');
      h.defineData('Comp_Name', 'customer_key','comp_address', 'comp_phone');
      h.defineDone();
      declare hiter hi('h');

      declare hash hh(multidata:'Y');
      hh.defineKey('customer_key');
      hh.defineData('customer_key', 'Comp_Name','comp_address','comp_phone');
      hh.defineDone(); 

      _customer_key=0; 
   end;

   set testdat;

   rc=h.find(key:customer_name, key:clean_address);

   if rc ne 0 then do;
      rc=hi.first();
      do while (rc=0);

         if not missing(customer_phone) and clean_address=comp_address and customer_phone=comp_phone
            then do;
            rc=hh.find();

            do while (r ne 0);

               hh.has_next(result:r);

               if r=0 then do;
                  h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
                  hh.add();
                  output;return;
               end;


            else leave;

            end;

         end;


         if not missing(customer_name) and clean_address=comp_address and jaroT(customer_name,Comp_name) ge 0.8
            then do;
            rc=hh.find();

            do while (r ne 0);

               dist2=jaroT(customer_name,Comp_name);
               hh.has_next(result:r);

               if r=0 & dist2 ge 0.8 then do;
                  h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
                  hh.add();
                  output;return;
               end;

               else if r ne 0 & dist2 ge 0.8
                then rc=hh.find_next();

               else if dist2 < 0.8
            then leave;

            end;

         end;


         rc=hi.next();

      end;

      _customer_key+1;
      customer_key=_customer_key;
      h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
      hh.add(key:customer_key, data:customer_key, data:customer_name, data:clean_address, data:customer_phone);

   end;

   output;
run;


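Also, below I am posting the code for the jaroT function so that others can use it if they wish. I did not write this function, and you can replace it in the code above with whatever comparison algorithm you prefer (be sure to change the threshold as well):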

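function jaroT(string_1 $, string_2 $);
   if string_1 = string_2 then return(1);
   else do;
      length1 = length(string_1);
      if length1 > 26 then length1 = 26;
      length2 = length(string_2);
      if length2 > 26 then length2 = 26;
      range = (int(max(length1, length2) / 2) - 1);
      big = max(length1, length2);
      short = min(length1, length2);
      array String1{26} $ 1 _temporary_;
      array String2{26} $ 1 _temporary_;
      array String1Match{26} $ 1 _temporary_;
      array String2Match{26} $ 1 _temporary_;
      /* The following two do loops place the characters into arrays labeled string1 and string2.
         Here we also set up a second array of the same dimension, full of zeros. This will
         act as the match key, whereby values in the same relative position as in the original string
         will be set to 1 when we later find valid match candidates */
      do i = 1 to length1 by 1;
         String1{i} = substr(string_1, i, 1);
         String1Match{i} = '0';
      end;
      do i = 1 to length2 by 1;
         String2{i} = substr(string_2, i, 1);
         String2Match{i} = '0';
      end;
      /* We introduce m, which will keep track of the number of matches */
      m = 0;
      /* We set up a loop to compare one string to the other. We only need to loop the same number of
         times as there are characters in one of our strings. Hence "do while i [...] */

Sounds to me like you have a recursion error. The dataset has only a handful of rows, so you could not run out of memory unless some kind of recursion is happening. Try dumping the contents of h and hh after 1 iteration of the loop, then after 2 iterations.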
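
One way to follow that suggestion is the hash object's OUTPUT method together with its NUM_ITEMS attribute. The sketch below uses a toy hash and hypothetical names (`cust1`, `work.h_dump_1`, the limit of 1000); in the real step, the same calls could presumably be placed inside the `do while (rc=0)` loop for both `h` and `hh`.

```sas
data _null_;
   length name $20 key 8;

   declare hash h(multidata:'Y');
   h.defineKey('name');
   h.defineData('name', 'key');
   h.defineDone();

   do key = 1 to 3;
      name = cats('cust', key);
      rc = h.add();

      /* Log the item count after every iteration */
      n_items = h.num_items;
      put 'After iteration ' key= n_items=;

      /* Write a snapshot of the hash to work.h_dump_1, work.h_dump_2, ... */
      rc = h.output(dataset: cats('work.h_dump_', key));

      /* Stop early if the table grows far beyond what the input can justify */
      if n_items > 1000 then do;
         put 'ERROR: hash h is growing without bound';
         stop;
      end;
   end;
run;

/* Inspect the last snapshot */
proc print data=work.h_dump_3;
run;
```

Comparing consecutive snapshots shows which items were added in a given iteration, and the item-count guard turns a runaway loop into a quick, readable failure instead of an out-of-memory error.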