SAS: Keep or delete a group of observations based on a characteristic of the group
A few minutes ago I answered a SAS question and realized that there is a generalization that might be more useful than that one. I have not seen this question on StackOverflow yet.
The general question is: how do you process and keep an entire BY-group based on some characteristic of the group that you may not know until you have looked at all of the observations in the group?
Using input data similar to the earlier question:
* For some reason, we are tasked with keeping only observations that
* are in groups of ID and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
My answer may not be the most efficient, especially for large datasets, and I would like to see other possible answers. Here it is:
* I realize the data are already sorted, but I think it is better
* not to assume they are.;
proc sort data=x;
  by id id_2;
run;
data obstokeep;
  keep id id_2 value;
  retain startptr haszero;
  * This SET statement reads through the dataset in sequence and
  * uses the CUROBS option to obtain the observation number. In
  * most situations, this will be the same as the _N_ automatic
  * variable, but CUROBS is probably safer.;
  set x curobs=myptr;
  by id id_2;
  * When this is the first observation in a BY-group, save the
  * current observation number (pointer).
  * Also initialize a flag variable that will become 1 if any
  * obs contains a VALUE of 0;
  * The variables are in a RETAIN statement, so they keep their
  * values as the SET statement above is executed for each obs
  * in the BY-group.;
  if first.id_2 then do;
    startptr=myptr;
    haszero=0;
  end;
  * This statement is executed for each observation. We check
  * whether VALUE is 0 and, if so, record that fact.;
  if value = 0 then haszero=1;
  * At the end of the BY-group, we check to see if there were
  * any observations with VALUE = 0. If so, we go back using
  * another SET statement, re-read them via direct access, and
  * write them to the output dataset.
  * (Note that if VALUE order is not relevant, you can gain a bit
  * more efficiency by writing the current obs first, then going
  * back to get the rest.);
  if last.id_2 and haszero then do;
    * When LAST and FIRST at the same time, there is only one
    * obs, so no need to backtrack, just output and go on.;
    if first.id_2 then output obstokeep;
    else do;
      * Here we assume that the observations are sequential
      * (which they will be for a sequential SET statement),
      * so we re-read these observations using another SET
      * statement with the POINT option for direct access
      * starting with the first obs of the by-group (the
      * saved pointer) and ending with the current one (the
      * current pointer).;
      do i=startptr to myptr;
        set x point=i;
        output obstokeep;
      end;
    end;
  end;
run;
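The note in the code comments above, about writing the current observation first and then going back for the rest, could be sketched roughly as follows. This is only an illustration: it assumes the sorted dataset x from above and that the order of rows within a BY-group does not matter, and the output name obstokeep_alt is made up for this example.

```sas
data obstokeep_alt;
  keep id id_2 value;
  retain startptr haszero;
  set x curobs=myptr;
  by id id_2;
  if first.id_2 then do;
    startptr = myptr;
    haszero = 0;
  end;
  if value = 0 then haszero = 1;
  if last.id_2 and haszero then do;
    * The current (last) obs is already in the PDV, so write it first.;
    output obstokeep_alt;
    * Then re-read only the earlier obs of the group by direct access.
    * For a one-obs group this loop runs zero times.;
    do i = startptr to myptr - 1;
      set x point=i;
      output obstokeep_alt;
    end;
  end;
run;
```

Compared with the step above, this saves one direct-access read per kept BY-group, at the cost of writing the last row of each group ahead of the others.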
Double DoW-loop solution:
data have;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
data want;
  do _n_ = 1 by 1 until(last.id_2);
    set have;
    by id id_2;
    flag = sum(flag,value=0);
  end;
  do _n_ = 1 to _n_;
    set have;
    if flag then output;
  end;
  drop flag;
run;
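The flag in the step above is built with SUM, which treats the initial missing value of flag as 0. A variant with a plain IF statement is sketched below for comparison. It assumes HAVE is the sample dataset defined above, and the output name want_if is made up for this example. Because flag is not retained, it is reset to missing at the top of each DATA step iteration, which here means once per BY-group, and a missing flag evaluates as false.

```sas
data want_if;
  * First pass over the BY-group: set the flag if any VALUE is 0.
  * _N_ ends up holding the number of obs in the group.;
  do _n_ = 1 by 1 until (last.id_2);
    set have;
    by id id_2;
    if value = 0 then flag = 1;
  end;
  * Second pass: re-read the same obs and write them if flagged.;
  do _n_ = 1 to _n_;
    set have;
    if flag then output;
  end;
  drop flag;
run;
```

On the sample data this should keep the same 8 observations as the SUM version; whether either form is measurably faster would have to be timed.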
I tested this against the POINT= approach using about 55 million rows and found no noticeable difference in performance. The dataset used:
data have;
  do ID = 1 to 10000000;
    do id_2 = 1 to ceil(ranuni(1)*10);
      do value = floor(ranuni(2) * 5);
        output;
      end;
    end;
  end;
run;
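You should look at DoW loops.
Thank you, and thanks to @Reeza for introducing me to DoW loops.
Oops, I missed the edit time limit. Anyway, @user667489, your code works and I appreciate its elegance, but I do not think it is the most efficient. Because the specification says that we either keep or reject an entire BY-group, there is no need to iterate through the HAVE dataset twice. That is why I used the POINT= option, so that the second SET statement is executed only when needed. I have one more question: is SUM(flag, value=0) more efficient than an IF statement and a simple assignment of 1 to flag?
I suspect you will find that the extra full sequential read costs the same as or less than random access via POINT=. SUM is convenient here because flag is missing at the start of each BY-group and SUM treats that missing value as 0.
I checked a stripped-down sequential read (DATA _NULL_, SET, and RUN) against a stripped-down DoW loop using the same code, except with SET x END=EOF inside a loop that ends when EOF is true. I ran both on about 135M observations, and the conventional sequential read consistently took 7% more time than the DoW-loop DATA step. So it looks like the DoW loop is inherently more efficient than the conventional approach, possibly because the PDV has to be created each time.
I also compared the efficiency of a sequential SET against SET with the POINT= option. Again using about 135M obs, the direct-access method took more than 3 times as long as the sequential method. Clearly, it is not recommended for resource-intensive processes.
Thanks for this SQL solution. However, it has a small problem: it returns too many rows, because sometimes a BY-group has more than one row with a value of 0. In this example, it returns two extra rows for id=B, id_2=3. I tweaked it to make it work: proc sql; select a.id, a.id_2, b.value from (select id, id_2, count(*) from have where value=0 group by id, id_2) a left join have b on a.id=b.id and a.id_2=b.id_2; quit;
I checked efficiency by running this code and @user667489's code on 27.5 million rows. PROC SQL took 51.49 (21.01 CPU) seconds, while the double DoW approach took only 13.37 (4.87 CPU) seconds. PROC SQL is sorting the data, which probably accounts for much of the difference.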
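The code behind the read-speed comparisons in the comments was not posted. A rough sketch of the three stripped-down steps being compared, assuming a dataset named x as in the comment, might look like this. The FULLSTIMER option is only there to get more detailed timings in the log, and actual numbers will depend on the system.

```sas
options fullstimer;

* Conventional sequential read, one DATA step iteration per obs.;
data _null_;
  set x;
run;

* Stripped-down DoW loop, where a single DATA step iteration reads
* every obs and the loop ends when EOF becomes true.;
data _null_;
  do until (eof);
    set x end=eof;
  end;
run;

* Direct access, reading every obs by number with POINT=.
* NOBS= is available before the loop starts, and STOP is needed
* because a POINT= read never reaches end of file by itself.;
data _null_;
  do i = 1 to n;
    set x point=i nobs=n;
  end;
  stop;
run;
```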
proc sql;
  select a.*, b.value
  from (select id, id_2 from have where value=0) a
  left join have b
  on a.id=b.id and a.id_2=b.id_2;
quit;
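The subquery in the query above returns one row per observation with VALUE = 0, so a BY-group with several zeros is joined more than once. One possible adjustment, sketched here as an alternative to a GROUP BY in the subquery, is to deduplicate the keys with SELECT DISTINCT before joining. The sketch assumes HAVE is the 13-row sample dataset, and the table name want_sql is made up for this example.

```sas
proc sql;
  create table want_sql as
  select b.id, b.id_2, b.value
  from (select distinct id, id_2
        from have
        where value = 0) as a
       inner join have as b
       on a.id = b.id and a.id_2 = b.id_2
  order by b.id, b.id_2;
quit;
```

On the sample data this should give the expected 8 rows. PROC SQL does not guarantee the original order of rows within a BY-group, so add a further sort key if that order matters.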