Displaying a crosstab of frequency combinations for N variables in SAS


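What I have:

A table in SAS with 20 rows (originally 100k)
Various binary attributes (columns)

What I want:

A crosstab showing the frequency of each combination of attributes

Like this: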

          Attribute1    Attribute2  Attribute3  Attribute4
Attribute1    5              0          1            2
Attribute2    0              3          0            3
Attribute3    2              0          5            4
Attribute4    1              2          0            10

*The actual counts shown here are made up and may not be 100% logically consistent.

The code I have so far:

    /*create dummy data*/

    data monthly_sales (drop=i);
        do i=1 to 20;
            Attribute1=rand("Normal")>0.5;
            Attribute2=rand("Normal")>0.5;
            Attribute3=rand("Normal")>0.5;
            Attribute4=rand("Normal")>0.5;
            output;
        end;
    run;

This could probably be done more cleverly, but it seems to work. First I create a table that will hold all the frequencies:

data crosstable;
  Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;output;output;output;output;
run;

Then I loop over all combinations and insert the counts into the crosstable:

%macro lup();
%do i=1 %to 4;
  %do j=&i %to 4;
    proc sql noprint;
      select count(*) into :Antall&i&j
      from monthly_sales (where=(Attribute&i and Attribute&j));
    quit;
    data crosstable;
      set crosstable;
      if _n_=&j then Attribute&i=&&Antall&i&j;
      if _n_=&i then Attribute&j=&&Antall&i&j;
    run;
  %end;
%end;
%mend;
%lup;

Note that since the frequency count for (i, j) is the same as for (j, i), you do not need to compute both.

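I would suggest using built-in SAS tools for this sort of thing, and probably displaying the data in a slightly different way, unless you really need a diagonal table. E.g.: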

   data monthly_sales (drop=i);
        do i=1 to 20;
            Attribute1=rand("Normal")>0.5;
            Attribute2=rand("Normal")>0.5;
            Attribute3=rand("Normal")>0.5;
            Attribute4=rand("Normal")>0.5;
            count = 1;
            output;
        end;
    run;

proc freq data = monthly_sales noprint;
    table  attribute1 * attribute2 * attribute3 * attribute4 / out = frequency_table;
run;

proc summary nway data = monthly_sales;
    class attribute1 attribute2 attribute3 attribute4;
    var count;
    output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;

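Either of these approaches gives you a table with 1 row for each combination of attributes in your data. That is slightly different from what you asked for, but it conveys the same information. You can force proc summary to include rows for combinations of class variables that do not occur in your data by using the completetypes option in the proc summary statement.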
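For instance, a minimal sketch of the completetypes variant, assuming the monthly_sales data set with the count variable created above (the output name summary_full is made up for illustration):

```sas
/* Sketch only: summary_full is a hypothetical output name. */
proc summary nway completetypes data = monthly_sales;
    class attribute1 attribute2 attribute3 attribute4;
    var count;
    /* _FREQ_ is kept here so that combinations absent from the data
       can be recognised by _FREQ_ = 0 */
    output out = summary_full(drop = _TYPE_) sum(count) = ;
run;
```

Provided each attribute takes both values 0 and 1 somewhere in the data, this should yield one row for each of the 16 possible combinations rather than only the observed ones.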

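If you do statistical analysis in SAS, it is definitely worth taking the time to get familiar with proc summary: you can include additional output statistics and process multiple variables with minimal extra code and processing overhead.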

Update: it is possible to produce the desired table without resorting to macro logic, although it is a fairly involved process:

proc summary data = monthly_sales completetypes;
    ways 1 2; /*Calculate only 1 and 2-way summaries*/
    class attribute1 attribute2 attribute3 attribute4;
    var count;
    output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;

/*Eliminate unnecessary output rows*/
data summary_table;
    set summary_table;
    array a{*} attribute:;
    sum = sum(of a[*]);
    missing = 0;
    do i = 1 to dim(a);
        missing + missing(a[i]);
        a[i] = a[i] * count;
    end;
    /*We want rows where two attributes are both 1 (sum = 2),
        or one attribute is 1 and the others are all missing*/
    if sum = 2 or (sum = 1 and missing = dim(a) - 1);
    drop i missing sum;
    edge = _n_;
run;

/*Transpose into long format - 1 row per combination of vars*/
proc transpose data = summary_table out = tr_table(where = (not(missing(col1))));
    by edge;
    var attribute:;
run;

/*Use cartesian join to produce table containing desired frequencies (still not in the right shape)*/
option linesize = 150;
proc sql noprint _method _tree;
    create table diagonal as
        select  a._name_ as aname, 
                        b._name_ as bname,
                        a.col1 as count
        from tr_table a, tr_table b
            where a.edge = b.edge
            group by a.edge
            having (count(a.edge) = 4 and aname ne bname) or count(a.edge) = 1
            order by aname, bname
            ;
quit;

/*Transpose the table into the right shape*/
proc transpose data = diagonal out = want(drop = _name_);
    by aname;
    id bname;
    var count;
run;

/*Re-order variables and set missing values to zero*/
data want;
    informat aname attribute1-attribute4;
    set want;
    array a{*} attribute:;
    do i = 1 to dim(a);
        a[i] = sum(a[i],0);
    end;
    drop i;
run;
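As a possible shortcut to the summary-and-transpose steps above, here is a minimal sketch that assumes SAS/IML is licensed (none of the answers in this thread use it). For 0/1 attributes, the matrix product X`*X holds the pairwise co-occurrence counts: entry (i, j) is the number of rows where attribute i and attribute j are both 1, and the diagonal holds each attribute's own count. The names X, C, rn and want_iml are made up for illustration.

```sas
/* Sketch only: requires SAS/IML; want_iml is a hypothetical output name. */
proc iml;
    use monthly_sales;
    /* Read the binary attributes into a matrix, keeping the column names */
    read all var {Attribute1 Attribute2 Attribute3 Attribute4}
        into X[colname = names];
    close monthly_sales;

    /* Cross-product of a 0/1 matrix = co-occurrence counts */
    C = X` * X;
    print C[rowname = names colname = names];

    /* Optionally write the crosstab to a data set, one row per attribute */
    rn = t(names);
    create want_iml from C[rowname = rn colname = names];
    append from C[rowname = rn];
    close want_iml;
quit;
```

This reads the input once and scales to more attributes by extending the variable list, at the cost of holding the attribute matrix in memory.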

Yes, user667489 is right; I have just added some extra code to make the cross-frequency table look nice. First, I created a table with 10 million rows and 10 variables:

data monthly_sales (drop=i);
        do i=1 to 10000000;
            Attribute1=rand("Normal")>0.5;
            Attribute2=rand("Normal")>0.5;
            Attribute3=rand("Normal")>0.5;
            Attribute4=rand("Normal")>0.5;
            Attribute5=rand("Normal")>0.5;
            Attribute6=rand("Normal")>0.5;
            Attribute7=rand("Normal")>0.5;
            Attribute8=rand("Normal")>0.5;
            Attribute9=rand("Normal")>0.5;
            Attribute10=rand("Normal")>0.5;
            output;
        end;
    run;

Create an empty 10x10 crosstable:

data crosstable;
  Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;Attribute5=.;Attribute6=.;Attribute7=.;Attribute8=.;Attribute9=.;Attribute10=.;
  output;output;output;output;output;output;output;output;output;output;
run;

Use proc freq to create a frequency table:

proc freq data = monthly_sales noprint;
    table  attribute1 * attribute2 * attribute3 * attribute4 * attribute5 * attribute6 * attribute7 * attribute8 * attribute9 * attribute10
            / out = frequency_table;
run;

Loop over all combinations of attributes and sum the "count" variable. Insert the result into the crosstable:

%macro lup();
%do i=1 %to 10;
  %do j=&i %to 10;
    proc sql noprint;
      select sum(count) into :Antall&i&j
      from frequency_table (where=(Attribute&i and Attribute&j));
    quit;
    data crosstable;
      set crosstable;
      if _n_=&j then Attribute&i=&&Antall&i&j;
      if _n_=&i then Attribute&j=&&Antall&i&j;
    run;
  %end;
%end;
%mend;
%lup;

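I did a lot of work on this. I ended up building a macro that reads the contents of the input file, locates the column names, counts the columns, and then iterates over them... By the way, did you work out the lower bound on the running time? This way I end up at |Attributes|^2 * rows... (and at the moment I do not know how to do better).

Try using SAS's built-in statistical procedures to do the processing in one pass and then transpose the results. That way you only need to read the input data set once, rather than O(n^2) times!

Hi, glad it worked. As long as you only have 100k rows and four attribute variables, the run time is very short, so I did not bother looking for a more efficient way. Even with 10 million rows and 10 attributes it runs in under two minutes. But looking at the other suggestion, it would be faster to use proc freq first and then run the same loop over the count variable.

@user667489 Why O(n^2) rather than O(k^2*n)? The summary procedure does show the "two-tuple" information. However, I could not find a way to omit the unwanted information, which is exactly the challenge I am facing (even using types and ways, or am I missing something here?).

It is possible, just a lot more complicated than I originally thought. It was fun to solve, though!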
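As a rough, back-of-the-envelope comparison of the two looping variants discussed in this thread (the symbols are introduced here for illustration: n is the number of rows, k the number of attributes, and every PROC SQL query is assumed to scan its whole input table):

```latex
\begin{aligned}
\text{queries issued:}\quad & \binom{k}{2} + k = \frac{k(k+1)}{2} \\
\text{loop over the raw data:}\quad & \frac{k(k+1)}{2}\cdot n \ \text{rows read} = O(k^{2} n) \\
\text{PROC FREQ first:}\quad & n + \frac{k(k+1)}{2}\cdot \min\!\left(n,\, 2^{k}\right) \ \text{rows read} \\
k = 10,\ n = 10^{7}:\quad & 55 \cdot 10^{7} = 5.5\times 10^{8}
  \quad\text{vs.}\quad 10^{7} + 55\cdot 1024 \approx 1.0\times 10^{7}
\end{aligned}
```

The second variant benefits because the frequency table can never have more than 2^k rows, however large the input is.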