SAS: how to extract multiple words from a string


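I have a list of degree programs of varying length, each of which includes the degree type (e.g., PhD). I want to remove the degree type and keep only the program name, e.g.: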

Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design  
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture

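I tried using SCAN to split the string on "in" and extract all the text after it, but I don't understand the results I get. When I use -1 (counting from the right) as the starting point, I get: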

data want; 
    format new_prog old_prog $200.; 
    set have (rename = (program = old_prog)); 
    if count(old_prog, " in ") ge 1 then new_prog = scan(old_prog, -1, "in "); 
run; 


new_prog  old_prog
tecture   Master of Science in Architecture 
g         Master of Science in Sustainable Design 
cs        Master of Science in Building Performance and Diagnostics 
t         Master of Science in Architecture-Engineering and Construction Management

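I didn't expect this to work anyway, because I want the whole string after "in", not just the next word. But even when I use scan(old_prog, 2, "in"), which I expected to give me the next word, it seems to return random things, for example: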

program  old_prog 
Bu       PhD in Building Performance and Diagnostics 
of       Master of Science in Architecture-Engineering and Construction Management 
Computat PhD in Computational Design 
of       Master of Science in Sustainable Design

Here is how to do this with substr and index:

data want;
format new_prog old_prog $200.;
infile datalines dsd missover;
input old_prog :$200.;

if count(old_prog, " in ") ge 1 then new_prog = substr(old_prog,index(old_prog,"in") + 3); 

datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design  
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture 
;
run;

index finds the position of "in" in the string and passes it to substr, which cuts the variable from that position + 3 to the end of the string.
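One caveat worth illustrating: index(old_prog, "in") returns the first occurrence of the letters "in" anywhere in the string, including inside a word. The following is a minimal sketch, not part of the original answer, that searches for " in " with surrounding blanks and skips 4 characters instead of 3. The value "Master of Engineering in Robotics" is a hypothetical example chosen because "Engineering" contains "in".

```sas
data _null_;
    length old_prog new_prog $200;
    /* hypothetical value: "Engineering" contains the letters "in" */
    old_prog = "Master of Engineering in Robotics";

    /* look for the separator as a whole word, with a blank on each side */
    pos = index(old_prog, " in ");

    /* skip the 4 characters of " in "; keep the original when no separator exists */
    if pos > 0 then new_prog = substr(old_prog, pos + 4);
    else new_prog = old_prog;

    put new_prog=;   /* writes new_prog=Robotics to the log */
run;
```

With the plain "in" search, the same value would be cut inside "Engineering" rather than at the separator.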

Consider a data step solution and a PROC SQL solution, both using the substr and index functions:

data want;
    set have;
    if count(old_prog, " in ") ge 1 
       then new_prog = substr(old_prog, index(old_prog, "in")+3);
run;


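proc sql;
    create table want as
    select *, 
    case when index(old_prog, "in") > 0 
         then substr(old_prog, index(old_prog, "in")+3)
         else old_prog
    end as new_prog
    from have;   /* read from the source table, not the table being created */
quit;            /* PROC SQL is ended with QUIT rather than RUN */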

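data have;
    input @1 old_prog $60.;
    /* keep the text before " in "; rows without " in " are copied unchanged */
    if find(old_prog, ' in ') then new_prog = substr(old_prog, 1, find(old_prog, ' in '));
    else new_prog = old_prog;
datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;
proc print data=have;
run;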

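Obs  old_prog                                                     new_prog
1    Master of Science in Building Performance and Diagnostics    Master of Science
2    Master of Science in Computational Design                    Master of Science
3    Master of Science in Sustainable Design                      Master of Science
4    Master of Urban Design                                       Master of Urban Design
5    PhD in Architecture                                          PhD

You already have many valid options suggested by others. May I recommend a regular expression approach to get what you want?

I noticed three patterns in your sample data:

The typical separator is "in", which you tried to use in your code sample.

Where the typical separator is not used, another separator, "of", is used.

The degree type can be spelled in different ways (for example, Master of Science or PhD).

Regex is very useful for handling patterns in text, because you can define the text pattern to look for and extract the text when the pattern matches.

See the comments in the code for more information: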

/* Dropping pattern ids because they are not useful in data */
data have (drop=pattern_in pattern_of);
    /* Reading in the raw data from datalines */
    input @1 old_prog $60.;
    /* Compiling the first pattern, based on "in " */
    pattern_in = prxparse('/in ([\w\s]*)/');
    /* Compiling the second pattern, based on "of " */
    pattern_of = prxparse('/of ([\w\s]*)/');
    /* If the string satisfies the pattern with "in " */
    if prxmatch(pattern_in, old_prog) then
        /* then extract the capture buffer after the "in " pattern */
        new_prog = prxposn(pattern_in, 1, old_prog);
    /* If the string satisfies the pattern with "of " after the "in " pattern was not found */
    else if prxmatch(pattern_of, old_prog) then
        /* then extract the capture buffer after the "of " pattern */
        new_prog = prxposn(pattern_of, 1, old_prog);
datalines;
Master of Scinence in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
PROC PRINT DATA=have;
run;
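The capture group [\w\s]* above stops at the first character that is neither a word character nor whitespace, so a hyphenated value from the question such as "Architecture-Engineering and Construction Management" would be cut at the hyphen. A possible variation, offered only as a sketch and not as the answerer's method, uses prxchange with a word boundary (\b) before "in" and captures everything to the end of the string. It handles only the "in" separator, not the "of" fallback.

```sas
data _null_;
    length old_prog new_prog $200;
    old_prog = "Master of Science in Architecture-Engineering and Construction Management";

    /* ^.*?   : shortest possible prefix (the degree type)              */
    /* \bin\s+: the word "in" followed by whitespace                    */
    /* (.+)$  : everything after it, returned through $1                */
    /* trim() drops the trailing blanks that pad the 200-byte variable  */
    new_prog = prxchange('s/^.*?\bin\s+(.+)$/$1/', 1, trim(old_prog));

    put new_prog=;
run;
```

When the pattern does not match, prxchange returns the source string unchanged, so a value without a standalone "in" keeps its full text.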
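As far as I know, the "in" argument you pass to the SCAN function is not treated as a single delimiter string: every character in it is used as a separate delimiter. It will not work the way you expect.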
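A small sketch of that behavior, using one of the values from the question. The variable names last_word and second_word are made up for illustration. With the delimiter argument "in ", SCAN splits on every i, every n, and every blank:

```sas
data _null_;
    old_prog = "Master of Science in Sustainable Design";

    /* delimiters are the single characters i, n and blank, so the pieces are: */
    /* Master | of | Sc | e | ce | Susta | able | Des | g                      */
    last_word   = scan(old_prog, -1, "in ");   /* last piece  */
    second_word = scan(old_prog,  2, "in ");   /* second piece */

    put last_word= second_word=;   /* writes last_word=g second_word=of */
run;
```

This reproduces the "g" and "of" values shown in the question's output for the same program name, which is why the results looked random.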