SAS: How do I select the first 5 observations with respect to duplicates?


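I have a large data set with more than 8,000,000 rows, sorted by "name" and "income" (both name and income contain duplicates). For the first name, I want the 5 lowest incomes. For the second name, I want the 5 lowest incomes (but incomes already taken by the first name are disqualified). And so on until the last name (if there are any incomes left by then).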

First, you want to rank the incomes within each name. So:

proc rank data=yourdata out=temp ties=low;
   by name;
   var income;
   ranks incomerank;
run;

Then you want to keep the 5 lowest incomes per name, so:

proc sql;
create table want as
select distinct *
from temp 
where incomerank < 6;
quit;


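Here is my interpretation of your problem and a solution.

Suppose a simplified version of your data looks like the following, and you want the 2 lowest incomes per name. For simplicity, I use a numeric variable as the name, but a character variable works as well.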

data have;
input n income;
datalines;
1 100
1 200
1 300
2 200
2 300
2 400
3 300
3 400
3 500
;

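Given this data, I assume your logic goes like this:

Start with n=1
Output the 2 observations with the lowest incomes (100 and 200)
Move on to the next name (n=2)
Output the 2 observations with the lowest incomes that have not already been output (300 and 400). 200 was already output in the n=1 group
... and so on

This gives the following desired result: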

data want;
input n income;
datalines;
1 100
1 200
2 300
2 400
3 500
;

Try the solution below and verify that you get the result posted above.

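data want(drop=c);

   /* One-time setup: H holds the incomes of the current name in ascending
      order; INC remembers every income that has already been output */
   if _N_ = 1 then do;
      dcl hash h(ordered : 'a', multidata : 'y');
      h.definekey('income');
      h.definedone();
      dcl hiter i('h');

      dcl hash inc();
      inc.definekey('income');
      inc.definedone();
   end;

   /* Load all incomes of one name (one BY group) into H */
   do until (last.n);
      set have;
      by n;
      h.add();
   end;

   /* Walk H from the lowest income upward. INC.ADD() returns 0 only for an
      income that has not been output before; stop after 2 outputs */
   do c = 0 by 0 while (i.next() = 0);
      if inc.add() = 0 then do;
         c + 1;
         output;
      end;
      if c = 2 then leave;
   end;

   /* Move the iterator off its current item so that it no longer locks H,
      then empty H for the next name */
   _N_ = i.first();
   _N_ = i.prev();
   h.clear();

run;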
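
One way to do the verification is PROC COMPARE. This is only a minimal sketch: it assumes the expected rows are stored under the hypothetical name `expected`, because the solution above writes its output to `want` and would otherwise overwrite an expected-result table of the same name.

```sas
/* Expected result, kept under a separate (hypothetical) name */
data expected;
input n income;
datalines;
1 100
1 200
2 300
2 400
3 500
;

/* Compare the two tables observation by observation */
proc compare base=expected compare=want;
run;

/* SYSINFO holds the PROC COMPARE return code; it must be read right
   after the procedure. A value of 0 means no differences were found */
%put NOTE: PROC COMPARE return code = &sysinfo;
```

Any differing values are listed in the procedure output, so a non-zero return code can be followed up there.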

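Finally, let's create representative sample data with 80 million observations. I changed the `if c = 2 then leave` statement to `if c = 5 then leave` to get back to the actual problem.
The code (the final pair of data steps on this page) runs in about 45 seconds on my system and processes the data in a single pass. Let me know if it works for you :-)

You will need to sort and track incomes.

Use an array to sort and track the lowest five incomes of a name.
Use a hash to track and check whether an income has already been eligible for output, which makes it ineligible for output for later names.

Example:
An insertion sort of the eligible low incomes is used. Because there are only 5 items, the insertion is fast.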
data have;
  call streaminit(1234);
  do name = 1 to 1e6;
    do seq = 1 to rand('integer', 20);
      income = rand('integer', 20000, 1000000);
      output;
    end;
  end;
run;

data
  want (label='Lowest 5 incomes (first occurring over all names) of each name')
  want_barren(keep=name label='Names whose all incomes were previously output for earlier names')
;
  array X(5) _temporary_;

  if _n_ = 1 then do;
    if 0 then set have;
    declare hash incomes();
    incomes.defineKey('income');
    incomes.defineDone();
  end;

  _maxmin5 = 1e15;
  x(1) = 1e15;
  x(2) = 1e15;
  x(3) = 1e15;
  x(4) = 1e15;
  x(5) = 1e15;

  do _n_ = 1 by 1 until (last.name);
    set have;
    by name;

    if incomes.check() = 0 then continue;

    * insert sort - lowest five not observed previously;

    if income > _maxmin5 then continue;

    do _i_ = 1 to 5;
      if income < x(_i_) then do;
        do _j_ = 5 to _i_+1 by -1;
          x(_j_) = x(_j_-1);
        end;
        x(_i_) = income;
        _maxmin5 = x(5);
        incomes.add();
        leave;
      end;
    end;
  end;

  _outflag = 0;
  do _n_ = 1 to _n_;
    set have;

    if income in x then do;
      _outflag = 1;
      OUTPUT want;
    end;
  end;

  if not _outflag then 
    OUTPUT want_barren;

  drop _:;
run;

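Comments:

Can you give an example of your data? What do you want the result to be?
By duplicates, do you mean ties? That is, do you want the 5 lowest distinct income levels?
Yes, that is what I meant!
When there are ties in income, does it matter which observation you get? If so, what is the criterion for picking just one observation?
The first person has the income values 150, 200, 300, 400, 500 and 600. The second person has the income values 100, 300, 600, 700, 900, 1100, 1400 and 1500. Then I want person 1 to get 150, 200, 300, 400, 500 and person 2 to get 100, 600, 700, 900 and 1100. Is that clearer? :)
Thanks a lot. This seems to work, except that the other variables in the file (besides name and income) come out wrong. For example, I get the same address value on the first 5 rows and another address value on the next 5 rows.
I had to add another answer to accommodate this, since the first answer was getting long. Does the other answer work for you? :-)
Yes, works great!! Thanks again!
Anytime :-) Remember to close the thread.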
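
The two-person example from the comments can be written as a small test data set. This is only a sketch: the data set name `have_example` and the numeric coding of the two persons as 1 and 2 are assumptions made for illustration. The income values and the expected result are taken from the comment itself.

```sas
/* Test data from the comment thread; HAVE_EXAMPLE is a hypothetical name */
data have_example;
input name income;
datalines;
1 150
1 200
1 300
1 400
1 500
1 600
2 100
2 300
2 600
2 700
2 900
2 1100
2 1400
2 1500
;

/* Expected result as stated in the comments:
   name 1: 150 200 300 400 500
   name 2: 100 600 700 900 1100  (300 was already used by name 1) */
```

Running either solution against this table, with the input data set and BY variable names adjusted to match, is a quick way to confirm the cross-name exclusion before processing the full file.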

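The hash solution scaled up to 80 million observations (8e5 names with 100 incomes each), using `if c = 5 then leave` and carrying the extra `address` variable through with `definedata(all : 'y')`:

data have;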
   do n = 1 to 8e5;
      do _N_ = 1 to 100;
         income = ceil(rand('uniform') * 1e4);
         address = cats('Address_', _N_);
         output;
      end;
   end;
run;

data want(drop=c);

   if _N_ = 1 then do;
      dcl hash h(dataset : 'have(obs=0)', ordered : 'a', multidata : 'y');
      h.definekey('income');
      h.definedata(all : 'y');
      h.definedone();
      dcl hiter i('h');

      dcl hash inc();
      inc.definekey('income');
      inc.definedone();
   end;

   do until (last.n);
      set have;
      by n;
      h.add();
   end;

   do c = 0 by 0 while (i.next() = 0);
      if inc.add() = 0 then do;
         c + 1;
         output;
      end;
      if c = 5 then leave;
   end;

   _N_ = i.first();
   _N_ = i.prev();
   h.clear();    

run;