SAS runs out of memory when attempting fuzzy matching with hash tables on a small sample dataset


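I have a list of roughly 5,000,000 rows of names, phone numbers, and addresses. I am trying to build a list of unique customers at each address and assign each unique customer a key. Customer names are formatted inconsistently, so John Smith may appear as "Smith, John" or "John Smith Jr.", and so on.

The logic I want to follow is:

If two records at one address have the same phone number, they are the same customer regardless of whether their names differ, and they get the same customer key.

If two records at one address do not share a phone number (or have no phone number) but the fuzzy match on their names exceeds some threshold, they are treated as the same customer and get the same customer key.

Note that the same customer name + phone number matched at two separate addresses should not be assigned the same customer key.

Below is a sample table with example input and the desired output:

The code I tried on this sample dataset is as follows: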


data customer_keys(keep=customer_name customer_key customer_phone clean_address);
   length customer_name $50 Comp_Name $20 customer_key 8 clean_address comp_address $5 customer_phone comp_phone $11.;

   if _N_ = 1 then do;
      declare hash h(multidata:'Y');
      h.defineKey('Comp_Name','comp_address');
      h.defineData('Comp_Name', 'customer_key','comp_address', 'comp_phone');
      h.defineDone();
      declare hiter hi('h');

      declare hash hh(multidata:'Y');
      hh.defineKey('customer_key');
      hh.defineData('customer_key', 'Comp_Name','comp_address','comp_phone');
      hh.defineDone(); 

      _customer_key=0; 
   end;

   set testdat;

   rc=h.find(key:customer_name, key:clean_address);

   if rc ne 0 then do;
      rc=hi.first();
      do while (rc=0);
         if not missing(customer_phone) and clean_address=comp_address and customer_phone=comp_phone
            then do;
                  h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
                  hh.add();
            end;
         else if not missing(customer_name) and clean_address=comp_address and jaroT(customer_name,Comp_name) ge 0.8
            then do;
            rc=hh.find();

            do while (r ne 0);

               dist2=jaroT(customer_name,Comp_name);
               hh.has_next(result:r);

               if r=0 & dist2 ge 0.8 then do;
                  h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
                  hh.add();
                  output;return;
               end;

               else if r ne 0 & dist2 ge 0.8
                then rc=hh.find_next();

               else if dist2 < 0.8
            then leave;

            end;

         end;

         rc=hi.next();

      end;

      _customer_key+1;
      customer_key=_customer_key;
      h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
      hh.add(key:customer_key, data:customer_key, data:customer_name, data:clean_address, data:customer_phone);
   end;

   output;
run;




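I noticed that if I remove the additional logic that handles phone numbers entirely, the memory problem goes away. Even so, I would still expect my approach to fail for lack of memory once I run it on the full dataset.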
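One rough way to gauge whether the full run could fit in memory is the hash object's ITEM_SIZE attribute. The sketch below assumes the key and data layout of hash h from the question's code and the 5,000,000-row figure quoted above; the variable names item_bytes and est_mb are made up for illustration. ITEM_SIZE reports the size of a single item only, so the result should be read as a lower bound rather than a precise figure.

```sas
data _null_;
   /* Same variables and lengths as hash h in the question's code */
   length Comp_Name $20 comp_address $5 comp_phone $11 customer_key 8;
   call missing(Comp_Name, comp_address, comp_phone, customer_key);

   declare hash h(multidata: 'Y');
   h.defineKey('Comp_Name', 'comp_address');
   h.defineData('Comp_Name', 'customer_key', 'comp_address', 'comp_phone');
   h.defineDone();

   /* Bytes needed for one item, excluding hash object overhead */
   item_bytes = h.item_size;

   /* Assumption: one item per input row, 5,000,000 rows */
   est_mb = item_bytes * 5000000 / 1024**2;

   put item_bytes= est_mb= comma12.1;
run;
```

If the estimate is far below the session's memory limit, running out of memory on a handful of rows points to a table that keeps growing rather than to the data volume.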

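Type "sas.exe -memsize 16G" into the Windows search box to start a SAS session with 16G of memory. You can change the number before the G as well. Also make sure you have enough disk space. Cheers.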
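Before changing the setting, it may help to check what the current session was started with. A minimal sketch, where the macro variable name memsize_now is made up for illustration:

```sas
/* Print the current MEMSIZE setting to the log */
proc options option=memsize;
run;

/* Read the same value programmatically into a macro variable */
%let memsize_now = %sysfunc(getoption(memsize));
%put NOTE: MEMSIZE is currently &memsize_now;
```

MEMSIZE is fixed when SAS starts, which is why it is passed on the command line (or set in the configuration file) rather than with an OPTIONS statement inside the program.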

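I managed to resolve the memory issue, and the code now works as expected! The error was in the part that handles matching phone numbers. The corrected code is below: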


data test customer_keys(keep=customer_name customer_key customer_phone clean_address);
   length customer_name $50 Comp_Name $20 customer_key 8 clean_address comp_address $22 customer_phone comp_phone $14.;

   if _N_ = 1 then do;
      declare hash h(multidata:'Y');
      h.defineKey('Comp_Name','comp_address');
      h.defineData('Comp_Name', 'customer_key','comp_address', 'comp_phone');
      h.defineDone();
      declare hiter hi('h');

      declare hash hh(multidata:'Y');
      hh.defineKey('customer_key');
      hh.defineData('customer_key', 'Comp_Name','comp_address','comp_phone');
      hh.defineDone(); 

      _customer_key=0; 
   end;

   set testdat;

   rc=h.find(key:customer_name, key:clean_address);

   if rc ne 0 then do;
      rc=hi.first();
      do while (rc=0);

         if not missing(customer_phone) and clean_address=comp_address and customer_phone=comp_phone
            then do;
            rc=hh.find();

            do while (r ne 0);

               hh.has_next(result:r);

               if r=0 then do;
                  h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
                  hh.add();
                  output;return;
               end;


            else leave;

            end;

         end;


         if not missing(customer_name) and clean_address=comp_address and jaroT(customer_name,Comp_name) ge 0.8
            then do;
            rc=hh.find();

            do while (r ne 0);

               dist2=jaroT(customer_name,Comp_name);
               hh.has_next(result:r);

               if r=0 & dist2 ge 0.8 then do;
                  h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
                  hh.add();
                  output;return;
               end;

               else if r ne 0 & dist2 ge 0.8
                then rc=hh.find_next();

               else if dist2 < 0.8
            then leave;

            end;

         end;


         rc=hi.next();

      end;

      _customer_key+1;
      customer_key=_customer_key;
      h.add(key:customer_name,key:clean_address, data:customer_name, data:customer_key, data:clean_address, data:customer_phone);
      hh.add(key:customer_key, data:customer_key, data:customer_name, data:clean_address, data:customer_phone);

   end;

   output;
run;


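Also, below I have posted the code for the jaroT function so that others can use it if they wish. I did not write this function, and you can replace it in the code above with whatever comparison algorithm you prefer (be sure to change the threshold as well):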

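function jaroT(string_1 $, string_2 $);
   if string_1 = string_2 then return(1);
   else do;
      length1 = length(string_1);
      if length1 > 26 then length1 = 26;
      length2 = length(string_2);
      if length2 > 26 then length2 = 26;
      range = (int(max(length1, length2) / 2) - 1);
      big = max(length1, length2);
      short = min(length1, length2);
      array String1{26} $ 1 _temporary_;
      array String2{26} $ 1 _temporary_;
      array String1Match{26} $ 1 _temporary_;
      array String2Match{26} $ 1 _temporary_;
      /* The following two do loops place the characters into arrays labelled String1 and String2.
         Here we also set up a second array of the same dimension, full of zeros. This will
         act as the match key: values in the same relative position as in the original string
         will be set to 1 when we later find a valid match candidate. */
      do i = 1 to length1 by 1;
         String1{i} = substr(string_1, i, 1);
         String1Match{i} = '0';
      end;
      do i = 1 to length2 by 1;
         String2{i} = substr(string_2, i, 1);
         String2Match{i} = '0';
      end;
      /* We introduce m, which will keep track of the number of matches */
      m = 0;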
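      /* We set up a loop to compare one string to the other. We only need to loop the same number of
         times as there are characters in one of our strings. Hence "do while i [...]

Sounds to me like you have a recursion error. The dataset has only a handful of rows, so there is no way you could run out of memory unless some kind of recursion were happening. Try dumping the contents of h and hh after 1 iteration of the loop, then after 2 iterations.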
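Dumping a hash table part-way through a run can be done with the hash object's OUTPUT method, and its NUM_ITEMS attribute gives a quick count. The sketch below is self-contained and uses a made-up three-row testdat and a simplified table layout, not the poster's data or exact code; the names dsname, n_items and h_dump_ are likewise hypothetical.

```sas
/* Hypothetical stand-in for the real input; the values are made up */
data testdat;
   length customer_name $50 clean_address $5 customer_phone $11;
   infile datalines dsd truncover;
   input customer_name clean_address customer_phone;
   datalines;
John Smith,A1,5551230000
"Smith, John",A1,5551230000
Jane Doe,B2,
;
run;

data _null_;
   length dsname $41;
   if _n_ = 1 then do;
      declare hash h(multidata: 'Y');
      h.defineKey('customer_name', 'clean_address');
      h.defineData('customer_name', 'clean_address', 'customer_phone');
      h.defineDone();
   end;

   set testdat;
   rc = h.add();

   /* Report how many items the table holds after each input row */
   n_items = h.num_items;
   putlog 'After row ' _n_ 'hash h holds ' n_items 'item(s)';

   /* Snapshot the table after the first two rows only */
   if _n_ le 2 then do;
      dsname = cats('work.h_dump_', _n_);
      rc = h.output(dataset: dsname);
   end;
run;
```

In the matching step itself, the same PUTLOG line placed just before rc=hi.next() would show whether h gains items on every pass of the iterator loop, which is the kind of runaway growth the comment above suspects.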