SAS: how to extract multiple words from a string


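I have a list of degree programs of varying lengths that include the degree type (e.g., PhD). I want to remove the degree type and keep only the program name. For example: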

Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design  
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture 
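I tried to use SCAN to split the string on "in" and extract all the text after it, but I don't understand the results I get. When I use -1 (counting from the right) as the starting point, I get: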

data want; 
    format new_prog old_prog $200.; 
    set have (rename = (program = old_prog)); 
    if count(old_prog, " in ") ge 1 then new_prog = scan(old_prog, -1, "in "); 
run; 


new_prog  old_prog
tecture   Master of Science in Architecture 
g         Master of Science in Sustainable Design 
cs        Master of Science in Building Performance and Diagnostics 
t         Master of Science in Architecture-Engineering and Construction Management 
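I don't think this would work anyway, because I want the whole string after "in", not just the next word. But even when I use scan(old_prog, 2, "in "), which I expected to give me the next word, it seems to return random fragments, for example: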

program  old_prog 
Bu       PhD in Building Performance and Diagnostics 
of       Master of Science in Architecture-Engineering and Construction Management 
Computat PhD in Computational Design 
of       Master of Science in Sustainable Design 

Here is how to do this using substr and index:

data want;
format new_prog old_prog $200.;
infile datalines dsd missover;
input old_prog :$200.;

if count(old_prog, " in ") ge 1 then new_prog = substr(old_prog,index(old_prog,"in") + 3); 

datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design  
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture 
;
run;

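INDEX finds the position of "in" in the string and passes it to SUBSTR, which cuts the variable from that position + 3 to the end of the string. Note that index(old_prog, "in") returns the first occurrence of the letters "in" anywhere, including inside a word, so this relies on no earlier word containing them.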
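
To guard against "in" appearing inside an earlier word, one option is to search for " in " with its surrounding blanks and skip four characters instead of three. The sketch below is only an illustration under assumptions: the last data line ("Bachelor of Engineering in Robotics") is a hypothetical record that is not in the original data, and keeping the original string when no " in " is found is a choice made here, not part of the answer above.

```sas
data want;
    length old_prog new_prog $200;
    infile datalines truncover;
    input old_prog $200.;

    /* Position of the blank that precedes a standalone "in"; 0 if not found */
    pos = index(old_prog, " in ");

    /* Skip the 4 characters of " in " and keep the rest of the string */
    if pos > 0 then new_prog = substr(old_prog, pos + 4);
    /* Assumption: with no " in " separator, keep the original value */
    else new_prog = old_prog;

    drop pos;

    /* The last record is hypothetical: "Engineering" contains the letters "in" */
    datalines;
Master of Science in Building Performance and Diagnostics
Master of Urban Design
PhD in Architecture
Bachelor of Engineering in Robotics
;
```

With index(old_prog, "in") + 3, the hypothetical last record would be cut in the middle of "Engineering", which is the case this variant is meant to avoid.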

Consider a data step solution and a PROC SQL solution, both using the substr and index functions:

data want;
    set have;
    if count(old_prog, " in ") ge 1 
       then new_prog = substr(old_prog, index(old_prog, "in")+3);
run;


proc sql;
    create table want as
    select *, 
    /* Search for " in " with surrounding blanks so "in" inside a word is not matched */
    case when index(old_prog, " in ") > 0 
         then substr(old_prog, index(old_prog, " in ")+4)
         else old_prog
    end as new_prog
    from have;
quit;
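
data have;
    input @1 old_prog $60.;
    /* Keep the text up to the first " in "; otherwise keep the whole string */
    if find(old_prog, ' in ') then new_prog = substr(old_prog, 1, find(old_prog, ' in '));
    else new_prog = old_prog;
    datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;

proc print data=have;
run;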

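Obs  old_prog                                                   new_prog
1    Master of Science in Building Performance and Diagnostics  Master of Science
2    Master of Science in Computational Design                  Master of Science
3    Master of Science in Sustainable Design                    Master of Science
4    Master of Urban Design                                     Master of Urban Design
5    PhD in Architecture                                        PhD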

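Others have already suggested many valid options. May I suggest a regular-expression approach to get what you want?

I noticed three patterns in your sample data:

  • The typical separator is "in", which you tried to use in your code example
  • If the typical separator is not used, another separator, "of", is used
  • The degree type can be spelled in different ways (for example, Master of Science or PhD)

REGEX is very useful when dealing with patterns in text, because you can define the text pattern to look for and extract the text when the pattern matches.

See the comments in the code for more information: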

    /* Dropping the pattern ids because they are not useful in the output data */
    data have (drop=pattern_in pattern_of);
        /* Reading in the raw data from datalines */
        input @1 old_prog $60.;
    
        /* Compiling the first regex: a standalone "in" followed by the text to capture.
           (.+) is used instead of ([\w\s]*) so hyphenated names are not cut short. */
        pattern_in = prxparse('/\bin\s+(.+)/');
    
        /* Compiling the second regex, based on the "of " pattern */
        pattern_of = prxparse('/\bof\s+(.+)/');
    
        /* If the string matches the "in " pattern */
        if prxmatch(pattern_in,old_prog) then 
        /* then extract the capture buffer after "in " */
        new_prog=prxposn(pattern_in,1,old_prog);
    
        /* Otherwise, if the string matches the "of " pattern */
        else if prxmatch(pattern_of,old_prog) then 
        /* then extract the capture buffer after "of " */
        new_prog=prxposn(pattern_of,1,old_prog);
        datalines;
    Master of Scinence in Building Performance and Diagnostics
    Master of Science in Computational Design 
    Master of Science in Sustainable Design 
    Master of Urban Design 
    PhD in Architecture 
    ;
    
    PROC PRINT DATA=have;
    run;
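
If only the "in" separator matters, the same idea can probably be written more compactly as a substitution with PRXCHANGE. This is a sketch, not part of the approach above: the output data set name `want` and the length of 60 are assumptions, and, unlike the code above, strings without a standalone "in" (such as "Master of Urban Design") are left unchanged instead of falling back to the "of" pattern.

```sas
data want;
    /* Read only old_prog so new_prog can be declared fresh below */
    set have (keep=old_prog);
    length new_prog $60;

    /* Delete everything from the start of the string up to and including
       the first standalone "in" and the blanks that follow it.
       The second argument (1) limits the change to one substitution. */
    new_prog = prxchange('s/^.*?\bin\s+//', 1, old_prog);
run;

proc print data=want;
run;
```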
    

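As far as I know, the "in " argument you pass to the SCAN function is treated as a list of individual delimiter characters, so "i", "n", and the blank each act as a separate delimiter. It does not split on the word "in" the way you expect.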
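
A minimal sketch that reproduces the behavior described in the question, using two of its own strings (the variable names `s1`, `s2`, `second`, and `last` are made up for this illustration):

```sas
data _null_;
    length s1 s2 $80 second last $80;
    s1 = "PhD in Building Performance and Diagnostics";
    s2 = "Master of Science in Architecture";

    /* "in " is read as three delimiter characters: i, n, and blank.
       The string is therefore split at every i, every n, and every blank. */
    second = scan(s1, 2, "in ");   /* second token: stops at the "i" in "Building" */
    last   = scan(s2, -1, "in ");  /* last token: starts after the "i" in "Architecture" */

    put second= last=;
run;
```

This should write second=Bu and last=tecture to the log, matching the fragments shown in the question.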