SAS: count the number of times a word occurs
I'm looking for a better way in SAS to count the number of times a word occurs in a string. For example, searching for "wood" in the string:
how much wood could a woodchuck chuck if a woodchuck could chuck wood
... would return a result of 2.
I would usually do it like this, but it is a lot of code:
data _null_;
length sentence word $200;
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
search_term = 'wood';
found_count = 0;
cnt=1;
word = scan(sentence,cnt);
do while (word ne '');
num_times_found = sum(num_times_found, word eq search_term);
cnt = cnt + 1;
word = scan(sentence,cnt);
end;
put num_times_found=;
run;
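I could put this into an fcmp function to make it more elegant, but I still feel there must be friendlier, more concise code to achieve this.

Try using prxchange to remove the matches, then count the difference: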
data _null_;
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
count=countw(sentence,' ')-countw(prxchange('s/wood/$1/i',-1,sentence),' ');
put _all_;
run;
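A possible refinement of the prxchange idea, offered as a sketch rather than the answerer's own code: \b word boundaries restrict the removal to the standalone word, and the o suffix asks SAS to compile the pattern only once. The variable names stripped and count_whole are introduced here purely for illustration.

```sas
data _null_;
  length sentence stripped $200;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  /* Remove only whole-word matches of "wood"; woodchuck is left untouched. */
  /* i = ignore case, o = compile the regular expression once.             */
  stripped = prxchange('s/\bwood\b//io', -1, sentence);
  /* Each removed word lowers the word count by one. */
  count_whole = countw(sentence, ' ') - countw(stripped, ' ');
  put count_whole=;
run;
```

Because woodchuck is left intact here, the difference in word counts comes only from whole-word matches.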
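From a code review standpoint, the above can be improved slightly. The do loop can handle the cnt increment, and if you switch it to until, you don't even have to do the initial assignment. You also have an extraneous variable, found_count; I'm not sure what it is for. Otherwise I think this is reasonable, at least for a non-convoluted solution.

The prxchange answer alters woodchuck as well, of course, but that does not affect the result. It is what I would call a "convoluted solution": not because it is wrong, but because it is less straightforward, and on that principle it can be avoided (since it is harder for others to see what you are doing). You could add the o option to your prx.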
data _null_;
length sentence word $200;
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
search_term = 'wood';
do cnt=1 by 1 until (word eq '');
word = scan(sentence,cnt);
num_times_found = sum(num_times_found, word eq search_term);
end;
put num_times_found=;
run;
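For anyone wanting to compare the approaches on their own machine, a rough timing harness could look like the following. It is only a sketch: the 1e6 repetition count mirrors the figure quoted in this thread, the variables i, start and elapsed are names made up for the example, and results will vary by machine.

```sas
data _null_;
  length sentence word $200;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';
  start = datetime();                 /* wall-clock time before the loop */
  do i = 1 to 1e6;
    num_times_found = 0;              /* reset the counter for each repetition */
    do cnt = 1 by 1 until (word eq '');
      word = scan(sentence, cnt);
      num_times_found = sum(num_times_found, word eq search_term);
    end;
  end;
  elapsed = datetime() - start;       /* elapsed seconds */
  put num_times_found= elapsed= 8.2;
run;
```

Swapping the inner loop for one of the other approaches gives a like-for-like comparison on the same sentence.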
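This loop is also very fast: 1e6 iterations take under 9 seconds on my box. The PRX solution, with o added to the regex options, takes less time (6 seconds), so it may be preferable with very large data sets or a large number of variables, but I doubt the added time matters compared with I/O time. The FCMP solution is of the same order as this one (both around 8-9 seconds). Finally, the FINDW solution is the fastest, at roughly 2 seconds.

For completeness, here it is as an fcmp function.

FCMP definition: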
options cmplib=work.temp.temp;
proc fcmp outlib=work.temp.temp;
function word_freq(sentence $, search_term $) ;
length sentence word $200;
do cnt=1 by 1 until (word eq '');
word = scan(sentence,cnt);
num_times_found = sum(num_times_found, word eq search_term);
end;
return (num_times_found);
endsub;
run;

Usage:

data _null_;
num_times_found = word_freq('how much wood could a woodchuck chuck if a woodchuck could chuck wood','wood');
put num_times_found=;
run;

Result:

num_times_found=2
There is no reason to scan every word when FINDW will efficiently do the scanning for you:

data _null_;
length sentence search_term $200;
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
search_term = 'wood';
cnt=0;
do s=findw(sentence,strip(search_term),1) by 0 while(s);
cnt+1;
s=findw(sentence,strip(search_term),s+1);
end;
put cnt= search_term=;
stop;
run;

cnt=2 search_term=wood
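The same FINDW idea can also be written with a conventional do while loop, which some readers may find easier to follow than do ... by 0 while. This is a sketch that assumes words are separated by FINDW's default delimiters; the variable name pos is chosen here for illustration.

```sas
data _null_;
  length sentence search_term $200;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';
  cnt = 0;
  /* Position of the first whole-word match, or 0 if there is none. */
  pos = findw(sentence, strip(search_term), 1);
  do while (pos > 0);
    cnt + 1;
    /* Resume the search one character past the previous hit. */
    pos = findw(sentence, strip(search_term), pos + 1);
  end;
  put cnt= search_term=;
run;
```

The loop runs once per match rather than once per word, which is consistent with the speed advantage reported for FINDW in this thread.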
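I posted this here rather than on codereview because I don't think codereview has any SAS audience.

Doesn't countw do that?

@data_null_ No. That was my first thought too, but countw() just counts the total number of words, not the number of times a specific word occurs.

Oh! Hmm. How about using FINDW instead of scanning every word? Move the start column on each hit until it returns 0, keeping a count as you do now. Same but different, with fewer iterations.

True, the SAS audience there is limited; I'd really like SAS questions to show up there so that it would be worth my time to check that site... That said, I think this question belongs here. Code review is more for bigger things, i.e. complete pieces of code rather than a single function, with questions about structure and design and the like, although technically this might be on topic there.

Technically, this will turn woodchuck into chuck.

Without the o option on the regex, running many iterations takes quite a long time.

This is a nice approach. This solution wins the "most concise" award, but I have to agree with Joe that it is not as "friendly", so I have marked his answer as the accepted answer.

Definitely a lot faster than the scan approach.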