SAS: counting the number of times a word occurs

I'm looking for a better way in SAS to count the number of times a word appears in a string. For example, searching for "wood" in the string:

how much wood could a woodchuck chuck if a woodchuck could chuck wood
...would return a result of 2.

I would normally do it like this, but it's a lot of code:

data _null_;
  length sentence word $200;

  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';
  found_count = 0;

  cnt=1;
  word = scan(sentence,cnt);
  do while (word ne '');
    num_times_found = sum(num_times_found, word eq search_term);
    cnt = cnt + 1;
    word = scan(sentence,cnt);
  end;

  put num_times_found=;

run;

I could put this into an fcmp function to make it more elegant, but I still feel there must be friendlier, more concise code to do this.

Try using prxchange to remove the word, then compare the word counts before and after:

data _null_;
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
count=countw(sentence,' ')-countw(prxchange('s/wood/$1/i',-1,sentence),' ');
put _all_;
run;
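
As a hedged variant of the one-liner above (a sketch, not the answerer's code): anchoring the pattern with \b word boundaries keeps the substitution from touching "woodchuck", and the o option asks SAS to compile the pattern only once. The variable name stripped is an assumption introduced here for illustration.

```sas
data _null_;
  length sentence $200;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';

  /* \b limits the match to whole words, so "woodchuck" is left untouched.   */
  /* -1 replaces every match; i ignores case; o compiles the pattern once.   */
  stripped = prxchange('s/\bwood\b//io', -1, sentence);

  /* Each removed whole word lowers the word count by one. */
  count = countw(sentence, ' ') - countw(stripped, ' ');
  put count=;
run;
```

The count is still derived indirectly from the difference between two word counts, so this stays less direct than a loop that tallies matches explicitly.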

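From a code-review point of view, the code in the question can be improved somewhat. The do loop can handle the cnt increment, and if you switch it to until, you don't even need the initial assignment. You also have an extraneous variable, found_count; I'm not sure what it is for. Otherwise I think this is reasonable, at least for a non-convoluted solution: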

data _null_;
  length sentence word $200;

  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';

  do cnt=1 by 1 until (word eq '');
    word = scan(sentence,cnt);
    num_times_found = sum(num_times_found, word eq search_term);
  end;

  put num_times_found=;

run;

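It's also quite fast: 1e6 iterations take just under 9 seconds on my machine. The PRX solution takes less time (6 seconds) once the o option is added to the regex options, so it may be preferable with very large datasets or a large number of variables, though I doubt the added time matters compared to I/O time. The FCMP solution is on the same order as this one (both around 8-9 seconds). Finally, the FINDW solution is the fastest, taking about 2 seconds.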

For completeness, here it is as an fcmp function:

FCMP definition:

options cmplib=work.temp.temp;

proc fcmp outlib=work.temp.temp;

  function word_freq(sentence $, search_term $) ;    
    length sentence word $200;

    do cnt=1 by 1 until (word eq '');
      word = scan(sentence,cnt);
      num_times_found = sum(num_times_found, word eq search_term);
    end;

    return (num_times_found);
  endsub;

run;

Usage:

data _null_;
  num_times_found = word_freq('how much wood could a woodchuck chuck if a woodchuck could chuck wood','wood');
  put num_times_found=;
run;

Result:

num_times_found=2

There's no reason to scan every word yourself when FINDW effectively does the scanning for you:

data _null_;
   length sentence search_term $200;
   sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
   search_term = 'wood';
   cnt=0;
   do s=findw(sentence,strip(search_term),1) by 0 while(s);
      cnt+1;
      s=findw(sentence,strip(search_term),s+1);
      end;
   put cnt= search_term=;
   stop;
   run;

cnt=2 search_term=wood

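I posted this here rather than on codereview because I don't think codereview has any SAS audience.

Doesn't countW do this?

@data_null_ No. That was my first thought too, but countW() just counts the total number of words, not the number of times a particular word occurs.

Oh! Hmm. How about using FINDW instead of scanning every word? Move the start column on each hit until it returns 0, keeping count as you do now. Same but different, with fewer iterations.

True, the SAS audience there is limited; I'd really like SAS questions to show up there so that it would be worth my time to check the site... That said, I think this question belongs here. Code Review is more for bigger things, i.e. complete pieces of code rather than a single function, with questions about structure and design and the like, although technically this might be on topic there.

Technically this turns woodchuck into chuck, of course, but that doesn't affect the result. Also, this is what I meant by a "convoluted solution": not that it's wrong, but that it's not as straightforward, and I'd avoid it on that principle (since it's harder for others to see what you're doing). You may want to add the o option to your prx; otherwise running many iterations takes quite a long time.

This is a nice approach. This solution wins the "most concise" award, but I have to agree with Joe that it's not as "friendly", so I've marked his answer as the accepted one.

Definitely a lot faster than the scan approach.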