String: extracting part of a text


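I have three versions of the same string:

 "123 aa456 aa678"
 "123 aa99 aa678"
 "45 aa28 aa234"

How can I extract only the value between the two "aa" markers?

I tried using if length(string)>15 then string=substr(string,8,8), but it only works for the first version…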

You can use the find and substr functions:

data g;
d= "123 aa456 aa678";
/* position of the first "aa" */
d_pos1=find(d,"aa");
/* position of the second "aa", searching from just past the first one */
d_pos2=find(d,"aa",d_pos1+2);
/* text between the two markers */
d_between_aa1=substr(d,d_pos1+2,d_pos2-d_pos1-2);
run;

Or use the prx functions:

data g;
d= "123 aa456 aa678";

/* capture group 1 holds everything between the two "aa" markers */
re= prxparse('/aa([\w\d\s\t]*)aa/');
pos = prxmatch(re, d) ;
text = prxposn(re, 1, d) ;
run;
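The find and substr approach above handles a single literal; the following is a minimal sketch of how it might be applied to all three strings at once. The dataset names have and want and the variable between are assumptions made for illustration, and the guard against a missing second "aa" is an addition that is not part of the original answer.

```sas
/* Hypothetical input dataset holding the three strings from the question */
data have;
  infile datalines truncover;
  input string $char15.;
  datalines;
123 aa456 aa678
123 aa99 aa678
45 aa28 aa234
;
run;

data want;
  set have;
  length between $15;
  /* position of the first "aa" (0 if absent) */
  pos1 = find(string, 'aa');
  /* look for the second "aa" only when the first one was found */
  if pos1 > 0 then pos2 = find(string, 'aa', pos1 + 2);
  /* extract only when both markers exist and enclose at least one character */
  if pos1 > 0 and pos2 > pos1 + 2 then
    between = strip(substr(string, pos1 + 2, pos2 - pos1 - 2));
  drop pos1 pos2;
run;
```

With the three sample strings, between should hold 456, 99 and 28; rows without two markers are left blank rather than raising an invalid-argument note from substr.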

You can use the prxnext function (page 12). In your case it looks like this (you need to initialize num, the number of values; in my example it is 3):

Given the dataset:

+=================+
|     string      |
+=================+
| 123 aa456 aa678 |
+-----------------+
| 123 aa99 aa678  |
+-----------------+
| 45 aa28 aa234   |
+-----------------+

The resulting dataset:

+=====+=====+=====+=================+
| id1 | id2 | id3 |     string      |
+=====+=====+=====+=================+
| 123 | 456 | 678 | 123 aa456 aa678 |
+-----+-----+-----+-----------------+
| 123 |  99 | 678 | 123 aa99 aa678  |
+-----+-----+-----+-----------------+
|  45 |  28 | 234 | 45 aa28 aa234   |
+-----+-----+-----+-----------------+
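A loop over CALL PRXNEXT that fills id1 to id3 might look like the sketch below. This is not the answerer's original program: the dataset names have and want, the pattern /\d+/ and the helper variables are assumptions chosen so that the result resembles the table above.

```sas
/* Hypothetical input dataset; the variable name follows the table above */
data have;
  infile datalines truncover;
  input string $char15.;
  datalines;
123 aa456 aa678
123 aa99 aa678
45 aa28 aa234
;
run;

data want;
  set have;
  array id[3] id1-id3;              /* num = 3 values expected per string */
  retain re;
  /* compile the pattern once; it matches each run of digits */
  if _n_ = 1 then re = prxparse('/\d+/');
  start = 1;
  stop  = length(string);
  /* find the first match, then keep asking for the next one */
  call prxnext(re, start, stop, string, pos, len);
  do i = 1 to dim(id) while (pos > 0);
    id[i] = input(substr(string, pos, len), best12.);
    call prxnext(re, start, stop, string, pos, len);
  end;
  drop re start stop pos len i;
run;
```

CALL PRXNEXT advances start past each match, so successive calls walk through the string; pos becomes 0 when no further match exists. In this sketch the id columns come after string in the output, unlike the table above.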

An example using the DLMSTR= option of the INFILE statement:

data test;
   infile cards dlmstr=' aa';
   input v1-v3;
   line = _infile_;
   cards;
123 aa456 aa678 
123 aa99 aa678
45 aa28 aa234
;;;;
   run;

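Assuming the data to be parsed is in a variable (or in _infile_), a loop with a SCAN termination condition can extract the text segments (words) between the delimiters. Because the string delimiter is 'aa', the SCAN function can use the single letter 'a' as its character delimiter: by default SCAN treats adjacent delimiters as one, so the empty field between the two letters of 'aa' is not returned as a word.

The INPUT function can convert each extracted text segment to a numeric value.

If you do not know how many items can be scanned out, first output a "vertical" list and then transpose it.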

data lines;
  input;
  line = _infile_;
datalines;
123 aa456 aa678
123 aa99 aa678
45 aa28 aa234
45 aa28 aa234 aa 999
45 aa this is wrong aa -234 aa 999
;
run;

data ids;
  set lines;

  rownum + 1;
  do _n_ = 1 by 1 while (scan(line, _n_, 'a') ne '');
    id = input ( scan(line, _n_, 'a'), ??best12. );
    output;
  end;
run;

proc transpose data=ids out=want(drop=_name_) prefix=id;
  by rownum;
  var id;
run;

This produces the output:

rownum    id1    id2     id3    id4

   1      123    456     678      .
   2      123     99     678      .
   3       45     28     234      .
   4       45     28     234    999
   5       45      .    -234    999