Crosstable showing the frequencies of combinations of N variables in SAS
What I have:
- A table in SAS with 20 rows (originally 100k)
- Various binary attributes (columns)

What I want:
- A crosstable showing the frequency of each combination of attributes

Like this:
            Attribute1  Attribute2  Attribute3  Attribute4
Attribute1           5           0           1           2
Attribute2           0           3           0           3
Attribute3           2           0           5           4
Attribute4           1           2           0          10
*The actual sums of the combinations are made up and may not be 100% logical.
The code I currently have:
/*create dummy data*/
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
output;
end;
run;
I suppose this could be done more cleverly, but it seems to work. First, I create a table that should hold all the frequencies:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;output;output;output;output;
run;
Then I loop over all combinations and insert the counts into the crosstable:
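%macro lup();
  /* Loop over the upper triangle only: count(i,j) = count(j,i) */
  %do i=1 %to 4;
    %do j=&i %to 4;
      /* Count the rows where both attributes are 1 */
      proc sql noprint;
        select count(*) into :Antall&i&j
        from monthly_sales (where=(Attribute&i and Attribute&j));
      quit;
      /* Write the count into cells (j,i) and (i,j) of the crosstable */
      data crosstable;
        set crosstable;
        if _n_=&j then Attribute&i=&&Antall&i&j;
        if _n_=&i then Attribute&j=&&Antall&i&j;
      run;
    %end;
  %end;
%mend;
%lup;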
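Note that since the frequency count for (i,j) equals the count for (j,i), you don't need to compute both.
I would suggest using SAS's built-in tools for this kind of thing, and probably displaying the data in a slightly different way unless you really need a diagonal table, e.g.: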
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
count = 1;
output;
end;
run;
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 / out = frequency_table;
run;
proc summary nway data = monthly_sales;
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
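Either of these approaches gives you a table with one row for each combination of attributes that occurs in your data, which is slightly different from what you asked for but conveys the same information. You can force proc summary to include rows for combinations of class variables that do not occur in the data by using the completetypes option in the proc summary statement.
If you do statistical analysis in SAS, it is definitely worth taking the time to get familiar with proc summary: you can include additional output statistics and process multiple variables with minimal extra code and processing overhead.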
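As a rough illustration of the completetypes option mentioned above, here is a minimal sketch that reuses the monthly_sales data (with its count variable) from the example. The output names summary_complete and n_rows are made up for this sketch. It assumes that combinations absent from the data come back with _FREQ_ = 0 and a missing sum, which the second step sets to zero.

```sas
/* Sketch only: one row per combination of class levels, including
   combinations that never occur in the data (completetypes).      */
proc summary nway completetypes data = monthly_sales;
  class attribute1 attribute2 attribute3 attribute4;
  var count;
  output out = summary_complete(drop = _TYPE_)
         sum(count) = count     /* number of rows with this combination */
         n(count)   = n_rows;   /* extra statistic, for illustration    */
run;

/* Replace missing sums (combinations not present in the data) with 0 */
data summary_complete;
  set summary_complete;
  count = sum(count, 0);
run;
```

With four binary attributes this should give up to 2^4 = 16 rows, provided each attribute takes both values somewhere in the data.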
Update: it is possible to produce the desired table without resorting to macro logic, although it is a fairly involved process:
proc summary data = monthly_sales completetypes;
ways 1 2; /*Calculate only 1 and 2-way summaries*/
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
/*Eliminate unnecessary output rows*/
data summary_table;
set summary_table;
array a{*} attribute:;
sum = sum(of a[*]);
missing = 0;
do i = 1 to dim(a);
missing + missing(a[i]);
a[i] = a[i] * count;
end;
/*We want rows where two attributes are both 1 (sum = 2),
or one attribute is 1 and the others are all missing*/
if sum = 2 or (sum = 1 and missing = dim(a) - 1);
drop i missing sum;
edge = _n_;
run;
/*Transpose into long format - 1 row per combination of vars*/
proc transpose data = summary_table out = tr_table(where = (not(missing(col1))));
by edge;
var attribute:;
run;
/*Use cartesian join to produce table containing desired frequencies (still not in the right shape)*/
option linesize = 150;
proc sql noprint _method _tree;
create table diagonal as
select a._name_ as aname,
b._name_ as bname,
a.col1 as count
from tr_table a, tr_table b
where a.edge = b.edge
group by a.edge
having (count(a.edge) = 4 and aname ne bname) or count(a.edge) = 1
order by aname, bname
;
quit;
/*Transpose the table into the right shape*/
proc transpose data = diagonal out = want(drop = _name_);
by aname;
id bname;
var count;
run;
/*Re-order variables and set missing values to zero*/
data want;
informat aname attribute1-attribute4;
set want;
array a{*} attribute:;
do i = 1 to dim(a);
a[i] = sum(a[i],0);
end;
drop i;
run;
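A possibly shorter route, offered only as a sketch and not as part of the approach above: for 0/1 variables, the number of rows where attributes i and j are both 1 equals the sum over rows of x_i * x_j, which is the uncorrected sums-of-squares-and-crossproducts (SSCP) matrix. The code below assumes that proc corr with the sscp option writes that matrix to the outp= data set as rows with _TYPE_ = 'SSCP', possibly with an extra Intercept row and column. The data set names sscp_out and crosstable_sscp are made up.

```sas
/* Sketch only: co-occurrence counts of 0/1 variables via the SSCP matrix */
proc corr data = monthly_sales sscp noprint outp = sscp_out;
  var attribute1-attribute4;
run;

/* Keep only the SSCP rows for the attributes themselves;
   the keep statement also drops an Intercept column if one exists */
data crosstable_sscp;
  set sscp_out;
  where _TYPE_ = 'SSCP' and upcase(_NAME_) ne 'INTERCEPT';
  keep _NAME_ attribute1-attribute4;
run;
```

The diagonal then holds the number of rows where each single attribute is 1, and the table is symmetric. It is worth checking the result against the macro output on a small sample before relying on it.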
Yes, user667489 was right; I just added some extra code to make the cross-frequency table look nice. First, I created a table with 10 million rows and 10 variables:
data monthly_sales (drop=i);
do i=1 to 10000000;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
Attribute5=rand("Normal")>0.5;
Attribute6=rand("Normal")>0.5;
Attribute7=rand("Normal")>0.5;
Attribute8=rand("Normal")>0.5;
Attribute9=rand("Normal")>0.5;
Attribute10=rand("Normal")>0.5;
output;
end;
run;
Create an empty 10x10 crosstable:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;Attribute5=.;Attribute6=.;Attribute7=.;Attribute8=.;Attribute9=.;Attribute10=.;
output;output;output;output;output;output;output;output;output;output;
run;
Use proc freq to create the frequency table:
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 * attribute5 * attribute6 * attribute7 * attribute8 * attribute9 * attribute10
/ out = frequency_table;
run;
Loop over all attribute combinations and sum the "count" variable. Insert the result into the crosstable:
%macro lup();
%do i=1 %to 10;
%do j=&i %to 10;
proc sql noprint;
select sum(count) into :Antall&i&j
from frequency_table (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;
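The loop bounds and the Attribute&i naming are hard-coded in the macro above. Here is a sketch of how the column names could be looked up instead, assuming the data set is WORK.MONTHLY_SALES and the relevant columns all start with "Attribute". The names attr_list, n_attr and list_pairs are invented for this illustration.

```sas
/* Sketch only: collect the attribute column names from the dictionary tables */
proc sql noprint;
  select name into :attr_list separated by ' '
  from dictionary.columns
  where libname = 'WORK'
    and memname = 'MONTHLY_SALES'
    and upcase(name) like 'ATTRIBUTE%'
  order by varnum;
quit;
%let n_attr = &sqlobs;   /* number of names selected above */

/* Iterate over the upper triangle of name pairs, here just printing them */
%macro list_pairs();
  %local i j;
  %do i = 1 %to &n_attr;
    %do j = &i %to &n_attr;
      %put Pair: %scan(&attr_list, &i) x %scan(&attr_list, &j);
    %end;
  %end;
%mend;
%list_pairs;
```

Inside the counting macro, %scan(&attr_list, &i) would take the place of Attribute&i.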
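I did a lot of work on this. I ended up building a macro that reads the contents of the input file, finds the column names, counts the columns and then iterates over them... By the way, did you work out a lower bound on the run time? This way I end up at |Attributes|^2 * rows... (for now I don't know how to do it any better)

Try using SAS's built-in statistical procedures to do the processing in one go, then transpose the result. That way you only need to read the input dataset once, rather than O(n^2) times!

Hi, glad it worked. As long as you only have 100k rows and four attribute variables, the run time is very fast, so I didn't bother looking for a more efficient approach. Even with 10 million rows and 10 attributes it runs in under two minutes. However, looking at the other suggestion, using proc freq first and then running the same loop over the count variable would be faster.

@user667489 Why O(n^2) rather than O(k^2*n)? proc summary does show the "two-tuple" information. However, I couldn't find a way to leave out the unwanted information, which is exactly the challenge I'm facing. (Even using types and ways, or am I missing something here?)

It is possible, just a lot more complicated than I originally thought. It was fun to work out, though!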
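On the run-time question raised in the comments: one way to read the input only once is to accumulate the pair counts in a temporary two-dimensional array inside a single data step. This is a sketch, not something from the answers above. It assumes k = 10 columns named Attribute1-Attribute10 holding 0/1 values, and the names k, crosstable_onepass and rowname are made up.

```sas
%let k = 10;   /* assumption: number of Attribute columns */

data crosstable_onepass (keep = rowname Attribute1-Attribute&k);
  length rowname $32;
  set monthly_sales end = last;
  array a{&k} Attribute1-Attribute&k;
  array cnt{&k, &k} _temporary_;   /* retained across rows, starts missing */

  /* For each row, bump the count of every pair (i <= j) that is set */
  do i = 1 to &k;
    if a{i} then do j = i to &k;
      if a{j} then cnt{i, j} = sum(cnt{i, j}, 1);
    end;
  end;

  /* After the last row, write the symmetric k x k table */
  if last then do i = 1 to &k;
    rowname = vname(a{i});
    do j = 1 to &k;
      a{j} = sum(cnt{min(i, j), max(i, j)}, 0);
    end;
    output;
  end;
run;
```

This does at most k^2/2 comparisons per row in a single pass over the data, so the work is on the order of k^2 * n without rereading the data set for every pair.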