String: extracting part of the text
Tags: string, sas

I have three versions of the same string:
"123 aa456 aa678"
"123 aa99 aa678"
"45 aa28 aa234"
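How can I extract only the value between the two aa markers?
I tried if length(string)>15 then string=substr(string,8,8), but that only works for the first version…

You can use the FIND and SUBSTR functions: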
data g;
d= "123 aa456 aa678";
d_pos1=find(d,"aa");
d_pos2=find(d,"aa",d_pos1+2);
d_between_aa1=substr(d,d_pos1+2,d_pos2-d_pos1-2);
run;
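The step above hard-codes a single string. As a minimal sketch of the same FIND + SUBSTR idea applied to all three versions, the following assumes the strings sit in a dataset named have with a variable named string (both names are illustrative, not from the question), and it adds a guard for rows where the second "aa" is missing. With the three sample strings it should leave 456, 99 and 28 in between.

```sas
/* Hypothetical input dataset: the names have/string are assumptions. */
data have;
    infile datalines truncover;
    input string $char15.;
    datalines;
123 aa456 aa678
123 aa99 aa678
45 aa28 aa234
;
run;

data want;
    set have;
    length between $15;
    /* Position of the first "aa" (0 if absent) */
    pos1 = find(string, 'aa');
    /* Position of the next "aa", searching from just past the first one */
    if pos1 > 0 then pos2 = find(string, 'aa', pos1 + 2);
    /* Extract only when both markers exist and text lies between them */
    if pos1 > 0 and pos2 > pos1 + 2 then
        between = strip(substr(string, pos1 + 2, pos2 - pos1 - 2));
    drop pos1 pos2;
run;
```

Computing the length from the two positions, rather than using fixed offsets such as substr(string,8,8), is what lets one expression handle values of different widths.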
Or use the PRX functions:
data g;
d= "123 aa456 aa678";
/* capture group 1 holds whatever sits between the two "aa" markers */
re= prxparse('/aa([\w\d\s\t]*)aa/');
pos = prxmatch(re, d) ;
text = prxposn(re, 1, d) ;
run;
You can use the prxnext function (page twelve). In your case it would look like this (you need to initialize the number of values, which is 3 in my example):
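The answer's own code did not survive on this page, so what follows is only a sketch of how a CALL PRXNEXT loop could produce the id1-id3 columns shown in the tables, not the original author's code. The dataset names have and want, the array size of 3, and the \d+ pattern are assumptions made for illustration.

```sas
/* Hypothetical input dataset: the names have/string are assumptions. */
data have;
    infile datalines truncover;
    input string $char15.;
    datalines;
123 aa456 aa678
123 aa99 aa678
45 aa28 aa234
;
run;

data want;
    set have;
    array id[3] id1-id3;            /* assumes at most 3 values per string */
    re    = prxparse('/\d+/');      /* one run of digits */
    start = 1;
    stop  = length(string);
    /* Find the first match, then keep asking for the next one */
    call prxnext(re, start, stop, string, pos, len);
    do i = 1 to dim(id) while (pos > 0);
        id[i] = input(substr(string, pos, len), best12.);
        call prxnext(re, start, stop, string, pos, len);
    end;
    drop re start stop pos len i;
run;
```

CALL PRXNEXT advances start past each match, so the loop walks the string left to right and stops when pos comes back as 0 or the array is full.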
Have dataset:
+=================+
| string |
+=================+
| 123 aa456 aa678 |
+-----------------+
| 123 aa99 aa678 |
+-----------------+
| 45 aa28 aa234 |
+-----------------+
Want dataset:
+=====+=====+=====+=================+
| id1 | id2 | id3 | string |
+=====+=====+=====+=================+
| 123 | 456 | 678 | 123 aa456 aa678 |
+-----+-----+-----+-----------------+
| 123 | 99 | 678 | 123 aa99 aa678 |
+-----+-----+-----+-----------------+
| 45 | 28 | 234 | 45 aa28 aa234 |
+-----+-----+-----+-----------------+
Example using the DLMSTR= option of the INFILE statement:
data test;
infile cards dlmstr=' aa';
input v1-v3;
line = _infile_;
cards;
123 aa456 aa678
123 aa99 aa678
45 aa28 aa234
;;;;
run;
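Assuming the data to be parsed is in a variable (or in _infile_):
A loop with a SCAN termination condition can extract the text segments (words) between the delimiters.
Because the string delimiter is 'aa', the SCAN function can use the letter 'a' as its character delimiter: by default SCAN treats adjacent delimiters as one, so the empty field between the two letters of 'aa' is not returned as an extracted piece.
The INPUT function can convert each extracted text piece to a numeric value.
If you do not know how many items can be scanned out, output a "vertical" list first and then transpose it.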
data lines;
input;
line = _infile_;
datalines;
123 aa456 aa678
123 aa99 aa678
45 aa28 aa234
45 aa28 aa234 aa 999
45 aa this is wrong aa -234 aa 999
;
run;
data ids;
set lines;
rownum + 1;
do _n_ = 1 by 1 while (scan(line, _n_, 'a') ne '');
id = input ( scan(line, _n_, 'a'), ??best12. );
output;
end;
run;
proc transpose data=ids out=want(drop=_name_) prefix=id;
by rownum;
var id;
run;
This produces the output:
rownum  id1  id2   id3  id4
     1  123  456   678    .
     2  123   99   678    .
     3   45   28   234    .
     4   45   28   234  999
     5   45    .  -234  999