Counting values across columns in SAS
I have a dataset in SAS with individuals as rows and one variable per period as columns. It looks like this:
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;
run;
I want SAS to count how many times each number occurs in each time period. So I would like to get something like this:
data want;
input count t1 t2 t3;
cards;
111 1 3 1
112 3 1 0
123 0 0 3
;
run;
I can do this with proc freq, but its output does not work well when I have many columns.
Thanks.
If your machine has enough memory to hold the entire output, a hash object may be a viable solution:
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;
run;
data _null_;
    if _n_=1 then
        do;
            /* Construct a hash keyed on count; t1-t3 are carried as data */
            declare hash h(ordered:'a');
            h.definekey('count');
            h.definedata('count', 't1','t2','t3');
            h.definedone();
            call missing(count, t1,t2,t3);
        end;

    /* Rename to avoid a conflict between the input data and the hash object */
    set have(rename=(t1-t3=_t1-_t3)) end=last;

    /* The key is to set up two arrays, one for the input data and one for
       the hash feed, and to maneuver their index variable accordingly */
    array _t(*) _t:;
    array t(*) t:;

    do i=1 to dim(_t);
        count=_t(i);
        rc=h.find(); /* Search the hash and bring back the data elements if found */

        /* If there is a match, the corresponding 't' increases by 1 */
        if rc=0 then
            t(i)+1;
        else
            do;
                /* If there is no match, the corresponding 't' is initialized to 1
                   and all of the other 't' are reset to 0 */
                do j=1 to dim(t);
                    t(j)=0;
                end;
                t(i)=1;
            end;

        rc=h.replace(); /* Update the hash */
    end;

    if last then
        rc=h.output(dataset:'want');
run;
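First, restructure the data so that it is more vertical (one row per individual and period). That will be easier to work with. We also create a flag that will be used as a counter later.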
data have2;
set have;
array arr[*] t1-t3;
flag = 1;
do period=lbound(arr) to hbound(arr);
val = arr[period];
output;
end;
keep period val flag;
run;
Summarize the data so that we know how many times each value occurs in each period:
proc sql noprint;
create table smry as
select val,
period,
sum(flag) as count
from have2
group by 1,2
order by 1,2
;
quit;
Transpose the data so that there is one row per value, followed by the count for each period:
proc transpose data=smry out=want(drop=_name_) prefix=t;
by val;
id period;
var count;
run;
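A value that never occurs in a given period has no row in smry, so its cell comes out of the transpose as missing rather than 0. A minimal sketch of one way to fill those gaps, assuming the transposed dataset is WORK.WANT and that val itself is never missing (want_filled is a hypothetical output name, not part of the original answer):

```sas
/* Sketch: replace missing counts with 0 after the transpose.
   Assumption: every numeric column in WANT other than VAL is a count column. */
data want_filled;                 /* hypothetical output name */
    set want;
    /* _numeric_ picks up VAL and all count columns, whatever their names */
    array counts[*] _numeric_;
    do i = 1 to dim(counts);
        /* VAL is assumed non-missing, so only count cells are changed */
        if missing(counts[i]) then counts[i] = 0;
    end;
    drop i;
run;
```

Using _numeric_ keeps the step independent of how many period columns the transpose produced.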
Note that when you define the array in the first step, you can use this notation, which allows for a dynamic number of periods:
array arr[*] t:;
This assumes that every variable in the dataset whose name starts with "t" should go into the array.
Try the following:
%macro freq(dsn);
proc sql;
/* Look up the t-columns of the dataset passed in &dsn (assumed to be in WORK) */
select name into :name separated by ' '
from dictionary.columns
where libname='WORK' and memname=upcase("&dsn") and name like 't%';
quit;
%let ncol=%sysfunc(countw(&name,%str( )));
%do i=1 %to &ncol;
%let col=%scan(&name,&i);
proc freq data=&dsn;
table &col/out=col_&i(keep=&col count rename=(&col=count count=&col));
run;
%end;
data temp;
merge
%do i=1 %to &ncol;
col_&i
%end;
;
by count;
run;
data want;
set temp;
array vars t:;
do over vars;
if missing(vars) then vars=0;
end;
run;
%mend;
%freq(have)
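In general, it is a bad idea to have data in metadata, as is the case here: the period is encoded in the Tn variable names when you really want it to be a grouping variable. Having said that, you can still have your cake and eat it. PROC SUMMARY can quickly get the counts for each Tn, and then you have a much smaller dataset to work with. Here is one approach that should work well with many time periods: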
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;;;;
run;
proc print;
run;
proc summary data=have chartype;
class t:;
ways 1;
output out=want;
run;
proc print;
run;
data want;
set want;
p = findc(_type_,'1');
c = coalesce(of t1-t3);
run;
proc print;
run;
proc summary data=want nway completetypes;
class c p;
freq _freq_;
output out=final;
run;
proc print;
run;
proc transpose data=final out=morefinal(drop=_name_) prefix=t;
by c;
id p;
var _freq_;
run;
proc print;
run;
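_TYPE_ and _FREQ_ are being converted by SO into italics markers. Add a \ in front.
"In general, it is a bad idea to have data in metadata." Why?
Because it makes the data harder to work with. In this example the time period and the values (111, 112) occupy the same space. If the time period were a variable, it could be used directly in a CLASS statement. That is what I did in the second PROC SUMMARY.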
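To illustrate the point about keeping the period as a variable, here is a minimal sketch. It assumes the long dataset have2 (with columns val and period) built in the earlier answer exists; counts is a hypothetical output name.

```sas
/* Sketch: with the period stored as a variable, a single two-way table
   gives the count of every value in every period. */
proc freq data=have2 noprint;
    /* SPARSE keeps value/period combinations that never occur, with COUNT = 0 */
    tables val*period / out=counts(drop=percent) sparse;
run;
```

The result has one row per value and period, which can then be transposed into the wide layout if that is needed for reporting.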