SAS hash join on LIKE or =:


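Is it possible to perform a SAS hash lookup on a partial substring?

The hash table key will contain "LongString", but my target table key has "LongStr".

(The length of the target table key string may vary.)

You will have to use an iterator object to loop through the keys and do the matching yourself.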

 data keys;
length key  $10 value $1;
input key value;
cards;
LongString A
LongOther B
;
run;

proc sort data=keys;
by key;
run;


data data;
length short_key  $10;
input short_key ;
cards;
LongStr
LongSt
LongOther
LongOth
LongOt
LongO
LongSt
LongOther
;
run;

data match;
    set data;
    length key $20 outvalue value $1;
    drop key value rc;
    if _N_ = 1 then do;
       call missing(key, value);
       declare hash h1(dataset:"work.keys", ordered: 'yes');
       declare hiter iter ('h1');
       h1.defineKey('key');
       h1.defineData('key', 'value');
       h1.defineDone();
    end;
    rc = iter.first();/* reset to beginning */
    do while (rc = 0);/* loop through the long keys and find a match */
        if index(key, trim(short_key)) > 0 then do;
            outvalue = value;
            iter.last(); /* leave after match */
        end;
        rc = iter.next(); 
    end; 
run;
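
If only leading matches are wanted (the `=:` case from the title), the `index()` test in the loop can be swapped for SAS's begins-with comparison. The step below is a sketch of that variant, not the code from the original answer. It assumes `work.keys` and `work.data` exist as created above, and the output name `match_prefix` is made up for illustration.

```sas
data match_prefix;                      /* hypothetical output name */
    set data;
    length key $10 value outvalue $1;   /* key length matches work.keys */
    drop key value rc;
    if _N_ = 1 then do;
        call missing(key, value);
        declare hash h1(dataset: "work.keys", ordered: 'yes');
        declare hiter iter('h1');
        h1.defineKey('key');
        h1.defineData('key', 'value');
        h1.defineDone();
    end;
    rc = iter.first();                  /* reset to the first long key */
    do while (rc = 0);
        /* =: compares only as many characters as the shorter operand has, */
        /* so this is true when key begins with the trimmed short_key      */
        if key =: trim(short_key) then do;
            outvalue = value;
            leave;                      /* stop at the first match */
        end;
        rc = iter.next();
    end;
run;
```

Unlike `index()`, which also matches a short key found in the middle of a long key, this variant only accepts matches at the start of the key.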

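You can, but it isn't pretty, and you may not get the performance benefit you are looking for. Also, depending on the length of your strings and the size of your table, you may not be able to fit all of the hash table elements into memory.

The trick is to generate all possible substrings first, and then use the "multidata" option on the hash table.

Create a dataset containing the words we want to match against: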

data keys;
  length key  $10 value $1;
  input key;
  cards;
LongString
LongOther
;
run;

Generate all possible leading substrings (prefixes) of each key:

data abbreviations;
  length abbrev $10;
  set keys;
  do cnt=1 to length(key);
    abbrev = substr(key,1,cnt);
    output;
  end;
run;
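
Since each key contributes one row per character, the memory concern mentioned above can be checked before the expanded table is built. The query below is a rough sketch, assuming `work.keys` as created above; the column aliases `n_keys` and `n_prefixes` are illustrative names only.

```sas
proc sql;
    /* one prefix row is generated per character of each key,        */
    /* so the total character count is the row count of the expanded */
    /* abbreviations table that the hash object will have to hold    */
    select count(*)         as n_keys,
           sum(length(key)) as n_prefixes
    from work.keys;
quit;
```

For the two sample keys, "LongString" (10 characters) and "LongOther" (9 characters), this comes to 19 prefix rows. Shared prefixes such as "Long" are counted once per key, because the multidata hash stores each of them separately.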

Create a dataset containing the terms to search for:

data match_attempts;
  length abbrev  $10;
  input abbrev ;
  cards;
L
Long
LongO
LongSt
LongOther
;
run;

Perform the lookup:

data match;
  length abbrev key $10;

  set match_attempts;

  if _n_ = 1 then do;
    declare hash h1(dataset:'abbreviations', multidata: 'y');
    h1.defineKey('abbrev');
    h1.defineData('abbrev', 'key');
    h1.defineDone();

    call missing(abbrev, key);
  end;

  if h1.find() eq 0 then do;
    output;
    h1.has_next(result: r);
    do while(r ne 0);
      h1.find_next();
      output;
      h1.has_next(result: r);
    end;
  end;

  drop r;
run;

In the output, note that "Long" returns 2 matches, one for each key it abbreviates.

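A few more notes. The reason hash tables don't support anything like the like operator is that the key is "hashed" before the record is inserted into the hash table. When a lookup is performed, the value being looked up is hashed as well, and the match is made on the hashed values. With hashing, even a small change in the input produces a completely different result. In the example below, MD5 is used purely to illustrate this: hashing 2 nearly identical strings gives 2 completely different values: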

data _null_;
  length hashed_value $16;           /* md5() returns a 16-byte binary string */
  hashed_value = md5("String");
  put hashed_value= $hex32.;         /* character hex format: 32 hex digits */
  hashed_value = md5("String1");
  put hashed_value= $hex32.;
run;

Output:

hashed_value=27118326006D3829667A400AD23D5D98
hashed_value=0EAB2ADFFF8C9A250BBE72D5BEA16E29

That is why hash tables cannot use the like operator.

Finally, thanks to @vasja for providing some of the sample data.

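I had to accept vasja's response on efficiency grounds, but +1 for some interesting insights into hash tables. I'm sure the two responses could be combined (e.g. using the multidata option instead of the hiter)…

Actually, once the tables reach a certain size, my response will be the more efficient one. @vasja's solution should run in O(N*M/2) time, because it has to walk the iterator row by row until it either finds a match or runs out of rows; on average it traverses half of the lookup table. Mine has a higher initial overhead, but each lookup takes constant time, so it finds the matches in O(N) time overall and pulls further ahead as the number of observations grows. If you are wondering what these timings mean, search for "Big O" notation. For small datasets you are better off just using a SQL join with like; don't overcomplicate things with hash tables. Suppose @vasja's solution needs 1 ms to iterate over the whole hash, so it takes 0.5 ms on average to get a result for each row. If the big table has 10M rows, and SAS also needs 30 seconds just to read through 10M rows, the total comes to roughly 83 minutes 50 seconds (5,000 seconds of lookups plus the 30-second pass). In my solution, the search cost is that of applying the hash function to the search term and retrieving the result from memory, which is basically zero. So for the same 10M rows, the total time for my solution is just the 30 seconds SAS takes to read 10M rows, plus the setup time. Hope this helps.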
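
For reference, the plain SQL join with like suggested for small datasets might look like the sketch below. It is not taken from either answer. It assumes `work.data` (with `short_key`) and the first version of `work.keys` (with `key` and `value`) exist as created above, and the output name `match_sql` is made up for illustration.

```sas
proc sql;
    create table match_sql as           /* hypothetical output name */
    select d.short_key, k.key, k.value
    from work.data as d
    left join work.keys as k
      /* cats() strips trailing blanks from short_key and appends %, */
      /* so the pattern matches any key beginning with short_key     */
      on k.key like cats(d.short_key, '%');
quit;
```

One caveat: any `%` or `_` characters inside `short_key` would also be treated as wildcards by like, so this only behaves as a pure prefix match when the search terms contain neither.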
