SAS: sum observations not in a group, by multiple groups

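This post is a follow-up to an earlier one. Unfortunately, my minimal example there was a bit too minimal, and I could not apply the answer to my data.

Here is a complete example. What I have is: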

data have;
   input group1 group2 $ group3 $ value;
   datalines;
1 A X 2
1 A X 4
1 A Y 1
1 A Y 3
1 B Z 2
1 B Z 1
1 C Y 1
1 C Y 6
1 C Z 7
2 A Z 3
2 A Z 9
2 A Y 2
2 B X 8
2 B X 5
2 B X 5
2 B Z 7
2 C Y 2
2 C X 1
;
run;
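For each group, I need a new variable "sum" that holds the sum of all values in the column for the same subgroup (group1 and group2), excluding the group (group3) that the observation is in.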

data want;
   input group1 group2 $ group3 $ value sum;
   datalines;
1 A X 2 8
1 A X 4 6
1 A Y 1 9
1 A Y 3 7
1 B Z 2 1
1 B Z 1 2
1 C Y 1 13
1 C Y 6 8
1 C Z 7 7
2 A Z 3 11
2 A Z 9 5
2 A Y 2 12
2 B X 8 17
2 B X 5 20
2 B X 5 20
2 B Z 7 18
2 C Y 2 1
2 C X 1 2
;
run;
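My goal is to use either DATA steps or PROC SQL. I am doing this on about 30 million observations, and PROC MEANS and similar procedures in SAS seemed slower than those two in earlier, similar calculations.

The problem with the solution given in the linked post is that it uses the total of the whole column, and I don't know how to change it to use the total within the subgroup instead.
Any ideas?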

An SQL solution joins all the data to an aggregating sub-select:

proc sql;
  create table want as 
  select have.group1, have.group2, have.group3, have.value
    , aggregate.sum - value as sum 
  from 
    have
  join 
    (select group1, group2, sum(value) as sum
     from have
     group by group1, group2
    ) aggregate
  on
    aggregate.group1 = have.group1
  & aggregate.group2 = have.group2
;
quit;
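SQL may be slower than a hash solution, but more people can read SQL code than can read a SAS DATA step that uses hash objects (which can be faster than SQL).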
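
For the interpretation that the join above implements (the group1/group2 total minus the row's own value), PROC SQL can also let SAS remerge the summary statistic onto the detail rows. This is only a sketch and not part of the original answer; the table name want_remerge is made up for illustration, and it assumes remerging has not been disabled with the NOREMERGE option.

```sas
proc sql;
  /* want_remerge is a hypothetical name used only in this sketch */
  create table want_remerge as
  select group1, group2, group3, value
       , sum(value) - value as sum   /* group1/group2 total minus this row's value */
  from have
  group by group1, group2
  ;
quit;
```

Because the select list contains columns that are neither grouped nor aggregated, SAS writes a NOTE about remerging summary statistics back with the original data to the log. The explicit join spells out the same logic and is easier to port to other SQL dialects.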

The SAS documentation covers the hash object's suminc argument and has some examples.

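The question does not address this concept:

  • For each row, compute the tier 2 sum that excludes the tier 3 group the row is in
A hash-based solution needs to track both the two-tier and the three-tier sums: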


data want2;
  if 0 then set have; * prep pdv;

  declare hash T2 (suminc:'value');   * hash for two (T)iers;
  T2.defineKey('group1', 'group2');   * one hash record per combination of group1, group2;
  T2.defineDone();

  declare hash T3 (suminc:'value');             * hash for three (T)iers;
  T3.defineKey('group1', 'group2', 'group3');   * one hash record per combination of group1, group2, group3;
  T3.defineDone();

  do while (not hash_loaded);
    set have end=hash_loaded;
    T2.ref();                * adds value to internal sum of hash data record;
    T3.ref();
  end;

  T2_cardinality = T2.num_items;
  T3_cardinality = T3.num_items;    

  put 'NOTE: |T2| = ' T2_cardinality;
  put 'NOTE: |T3| = ' T3_cardinality;

  do while (not last_have);
    set have end=last_have;
    T2.sum(sum:t2_sum);         
    T3.sum(sum:t3_sum);
    sum = t2_sum - t3_sum;
    output;
  end;

  stop;

  drop t2_: t3:;
run;
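
For readers who prefer SQL, the same two-tier minus three-tier result can probably also be obtained with two aggregating sub-selects. This is a sketch under that assumption, not part of the original answer; the names want_sql, sum2, and sum3 are made up for illustration, and the row order of the output is not guaranteed to match have.

```sas
proc sql;
  /* want_sql, sum2 and sum3 are hypothetical names used only in this sketch */
  create table want_sql as
  select h.group1, h.group2, h.group3, h.value
       , t2.sum2 - t3.sum3 as sum   /* tier 2 total minus the row's own tier 3 total */
  from have as h
  inner join
    (select group1, group2, sum(value) as sum2
     from have
     group by group1, group2
    ) as t2
    on  t2.group1 = h.group1
    and t2.group2 = h.group2
  inner join
    (select group1, group2, group3, sum(value) as sum3
     from have
     group by group1, group2, group3
    ) as t3
    on  t3.group1 = h.group1
    and t3.group2 = h.group2
    and t3.group3 = h.group3
  ;
quit;
```

With the sample data, the row 1 A X 2 has a tier 2 total of 10 (2 + 4 + 1 + 3) and a tier 3 total of 6 (2 + 4), so sum is 4. On 30 million observations the two extra joins may cost more than the hash lookups, so it is worth timing both.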

Thank you, you read my mind: my "want" dataset was wrong, and the second solution you proposed is exactly what I need. I was really struggling with this and you solved it. Thank you very much!