SAS: how to extract multiple words from a string
I have a list of degree programs of varying length that include the degree type (e.g., PhD). I want to remove the degree type and keep only the program name, e.g.:
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
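I tried using SCAN to split the string on "in" and extract all the text after it, but I don't understand the results. When I use -1 (counting from the right) as the word number, I get: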
data want;
format new_prog old_prog $200.;
set have (rename = (program = old_prog));
if count(old_prog, " in ") ge 1 then new_prog = scan(old_prog, -1, "in ");
run;
new_prog old_prog
tecture Master of Science in Architecture
g Master of Science in Sustainable Design
cs Master of Science in Building Performance and Diagnostics
t Master of Science in Architecture-Engineering and Construction Management
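I don't think this would work anyway, because I want the entire string after "in", not just the next word. But even when I use scan(old_prog, 2, "in "), which I would expect to give me the next word, it seems to return random fragments, for example: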
program old_prog
Bu PhD in Building Performance and Diagnostics
of Master of Science in Architecture-Engineering and Construction Management
Computat PhD in Computational Design
of Master of Science in Sustainable Design
Here is how to do this using SUBSTR and INDEX:
data want;
format new_prog old_prog $200.;
infile datalines dsd missover;
input old_prog :$200.;
if count(old_prog, " in ") ge 1 then new_prog = substr(old_prog,index(old_prog,"in") + 3);
datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;
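INDEX finds the position of "in" in the string and passes it to SUBSTR, which cuts the variable from that position + 3 to the end of the string.

Consider a data step and a PROC SQL solution that use the same functions: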
data want;
set have;
if count(old_prog, " in ") ge 1
then new_prog = substr(old_prog, index(old_prog, "in")+3);
run;
proc sql;
create table want as
select *,
case when index(old_prog, "in") > 0
then substr(old_prog, index(old_prog, "in")+3)
else old_prog
end as new_prog
from have;
quit;
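One caveat with the steps above: INDEX(old_prog, "in") matches the first "in" anywhere in the string, including inside a word such as "Engineering". The following is a minimal sketch of a stricter variant, assuming the input data set is named HAVE and already contains old_prog as in the steps above; the helper variable pos is introduced here only for illustration.

```sas
data want;
  set have;                          /* assumes HAVE contains old_prog */
  length new_prog $200;
  /* Search for the whole word: blank + "in" + blank */
  pos = find(old_prog, " in ");
  /* Skip the 4 characters of " in " and keep the rest of the string */
  if pos > 0 then new_prog = substr(old_prog, pos + 4);
  /* No " in " present: keep the original value unchanged */
  else new_prog = old_prog;
  drop pos;                          /* pos is only a helper variable */
run;
```

Because the search string includes the surrounding blanks, the offset becomes 4 instead of 3, and values without " in " (such as "Master of Urban Design") pass through as they are.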
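data have;
input @1 old_prog $60.;
/* Keep the text before "in"; the -1 drops the "i" itself */
if find(old_prog,'in') then new_prog=substr(old_prog,1,find(old_prog,'in')-1);
else new_prog=old_prog;
datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;
proc print data=have;
run;
Obs old_prog new_prog
1 Master of Science in Building Performance and Diagnostics Master of Science
2 Master of Science in Computational Design Master of Science
3 Master of Science in Sustainable Design Master of Science
4 Master of Urban Design Master of Urban Design
5 PhD in Architecture PhD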
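Others have already suggested many valid options. May I suggest a regular-expression approach to get what you want? I noticed three patterns in your sample data: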
/* Drop the pattern ids because they are not useful in the output data */
data have (drop=pattern_in pattern_of);
/* Read the raw data from datalines */
input @1 old_prog $60.;
/* Compile the first pattern: text following "in " */
pattern_in = prxparse('/in ([\w\s]*)/');
/* Compile the second pattern: text following "of " */
pattern_of = prxparse('/of ([\w\s]*)/');
/* If the string matches the "in " pattern */
if prxmatch(pattern_in,old_prog) then
/* then extract the capture buffer after "in " */
new_prog=prxposn(pattern_in,1,old_prog);
/* Otherwise, if the string matches the "of " pattern */
else if prxmatch(pattern_of,old_prog) then
/* then extract the capture buffer after "of " */
new_prog=prxposn(pattern_of,1,old_prog);
datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
PROC PRINT DATA=have;
run;
Result:
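As far as I know, the "in " argument you passed to the SCAN function is not treated as a single delimiter string: every character in it (i, n, and the blank) is used as a separate delimiter. That is why it does not work the way you expect.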
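A small sketch can make this delimiter behavior visible. It uses one of the sample strings from the question; the variable names s, second, last, rest, and pos are chosen here for illustration only.

```sas
data _null_;
  length s second last rest $60;
  s = "PhD in Building Performance and Diagnostics";
  /* "in " is a list of delimiters: 'i', 'n', and blank each split the string */
  second = scan(s, 2, "in ");        /* the piece of "Building" before its first 'i' */
  last   = scan(s, -1, "in ");       /* the piece of "Diagnostics" after its last 'i' */
  /* FIND treats " in " as one substring, so it locates the separator itself */
  pos = find(s, " in ");
  if pos > 0 then rest = substr(s, pos + 4);
  put second= last= rest=;           /* write the three values to the log */
run;
```

In the log, second and last should show the same kind of word fragments the question reports (such as "Bu" and "cs"), while rest holds the full program name after " in ".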