Random sampling without replacement in longitudinal data

Tags: r, sas, sas-macro, statistical-sampling

My data are longitudinal:

VISIT ID   VAR1
1     001  ...
1     002  ...
1     003  ...
1     004  ...
...
2     001  ...
2     002  ...
2     003  ...
2     004  ...

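Our end goal is to sample 10% of each visit for testing. I tried using PROC SURVEYSELECT to do simple random sampling (SRS) without replacement, with VISIT as the stratum. But the final sample contains duplicate IDs. For example, ID=001 might be selected in both VISIT=1 and VISIT=2.

Is there a way to do this with SURVEYSELECT or another procedure (R is fine too)? Thanks a lot.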

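This is possible with some fairly creative data step programming. The code below uses a greedy approach: it samples from each visit in turn, drawing only from IDs that have not been sampled before. If more than 90% of a visit's IDs have already been sampled, fewer than 10% of that visit's IDs are output. In the extreme case where every ID for a visit has already been sampled, no rows are output for that visit. Note that, as written, the 10% rate is applied to the not-yet-sampled IDs of each visit (available_per_visit), not to the visit's full row count.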

/*Create some test data*/
data test_data;
  call streaminit(1);
  do visit = 1 to 1000;
    do id = 1 to ceil(rand('uniform')*1000);
      output;
    end;
  end;
run;


data sample;
  /*Create a hash object to keep track of unique IDs not sampled yet*/
  if 0 then set test_data;
  call streaminit(0);
  if _n_ = 1 then do;
    declare hash h();
    rc = h.definekey('id');
    rc = h.definedata('available');
    rc = h.definedone();
  end;
  /*Find out how many not-previously-sampled ids there are for the current visit*/
  do ids_per_visit = 1 by 1 until(last.visit);
    set test_data;
    by visit;
    if h.find() ne 0 then do;
      available = 1;
      rc = h.add();
    end;
    available_per_visit = sum(available_per_visit,available);
  end;
  /*Read through the current visit again, randomly sampling from the not-yet-sampled ids*/
  samprate = 0.1;
  number_to_sample = round(available_per_visit * samprate,1);
  do _n_ = 1 to ids_per_visit;
    set test_data;
    if available_per_visit > 0 then do;
      rc = h.find();
      if available = 1 then do;
        if rand('uniform') < number_to_sample / available_per_visit then do;
          available = 0;
          rc = h.replace();
          samples_per_visit = sum(samples_per_visit,1);
          output;
          number_to_sample = number_to_sample - 1;
        end;
        available_per_visit = available_per_visit - 1;
      end;
    end;
  end;
run;

/*Check that there are no duplicate IDs*/
proc sort data = sample out = sample_dedup nodupkey;
by id;
run;
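
Since the question mentions that R is also acceptable, here is a minimal R sketch of the same greedy idea. It is not a translation of the data step above, just one way the approach might look. The function name greedy_sample, the column names visit and id, and the toy data are assumptions made for illustration. As in the SAS code, the rate is applied to the IDs that are still available in each visit.

```r
# Hypothetical helper: greedy per-visit sampling that never reuses an ID.
# Assumes `dat` has columns `visit` and `id`, with ids unique within a visit.
greedy_sample <- function(dat, rate = 0.1) {
  used <- dat$id[0]              # IDs drawn so far (empty, same type as id)
  picked_rows <- integer(0)      # row indices of the sampled records
  for (v in sort(unique(dat$visit))) {
    # Rows of this visit whose ID has not been sampled in an earlier visit
    rows <- which(dat$visit == v & !(dat$id %in% used))
    n <- round(length(rows) * rate)
    if (n > 0) {
      # sample.int avoids the sample(x) pitfall when only one row is left
      chosen <- rows[sample.int(length(rows), n)]
      picked_rows <- c(picked_rows, chosen)
      used <- c(used, dat$id[chosen])
    }
  }
  dat[picked_rows, , drop = FALSE]
}

# Toy data (assumed): 200 IDs, each seen at 5 visits
set.seed(1)
dat <- expand.grid(id = sprintf("%03d", 1:200), visit = 1:5,
                   stringsAsFactors = FALSE)
smp <- greedy_sample(dat)
anyDuplicated(smp$id)   # 0 when no ID is repeated
table(smp$visit)        # sample size per visit shrinks as IDs are used up
```

Because earlier visits consume IDs, later visits draw from a smaller pool, which is the same trade-off described for the data step version.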

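So you want 10% from each visit, but all the IDs in the final dataset should be unique?

Yes, exactly as you said.

As long as the IDs are unique within each visit, you can use ave: dat$picked ...

@Imo But that does not ensure that the final dataset has unique IDs.

Your constraints may mean that, by the time the last visit is sampled, there are no IDs left that have not already been sampled for an earlier visit. What do you want to do if that happens?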
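
The ave snippet in the comment above is cut off after dat$picked, so the following is only a guess at the kind of per-visit sampling it may have had in mind, not a reconstruction of it. The data frame dat, its columns, and the choice to round the 10% up with ceiling are all assumptions. The sketch also illustrates the objection raised in the reply: sampling each visit independently does nothing to stop the same ID from being picked in several visits.

```r
# Toy data (assumed): 200 IDs, each seen at 5 visits
set.seed(1)
dat <- expand.grid(id = sprintf("%03d", 1:200), visit = 1:5,
                   stringsAsFactors = FALSE)

# Within each visit, flag a random 10% of rows (rounded up) with picked = 1.
# sample.int(length(i)) is a random permutation; the n smallest ranks win.
dat$picked <- ave(seq_len(nrow(dat)), dat$visit, FUN = function(i) {
  n <- ceiling(length(i) * 0.1)
  as.integer(sample.int(length(i)) <= n)
})

tapply(dat$picked, dat$visit, sum)           # rows picked per visit
anyDuplicated(dat$id[dat$picked == 1]) > 0   # TRUE if an ID was picked in more than one visit
```

Each visit gets its full 10% this way, at the cost of the uniqueness constraint, which is why the greedy approach in the answer tracks previously sampled IDs instead.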