Split a SAS dataset by row, in parallel, into mutually exclusive datasets?


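But with the advent of proc ds2, surely there must be a way to do this using threads?

I have written the following data step to split a dataset into &chunks. chunks. I tried to write the same thing in proc ds2, but it just fails. I am fairly new to proc ds2, so a simple explanation aimed at someone with a good understanding of the data step would be ideal.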

Data step code

%macro output_chunks(in, out, by, chunks);
data %do i = 1 %to &chunks.;
    &out.&i.(compress=char drop = i)
    %end;
;
    set &in.;
    by &by.;
    retain i 0;
    if first.&by. then do;
        i = i + 1;      
        if i = &chunks.+1 then i = 1;
    end;

    %do i = 1 %to &chunks.;
        if i = &i. then do;
            output  &out.&i.;
        end;
    %end;
run;
%mend;

PROC DS2 code

proc ds2; 
  thread split/overwrite=yes; 
    method run(); 
      set in_data; 
      thisThread=_threadid_; 
      /* can make below into macro but I can't seem to get it to work */
      if thisThread = 1 then do;
        output ds1;
      end;
      else if thisThread = 2 then do;
        output ds2;
      end;
    end; 
    method term();
      put '**Thread' _threadid_ 'processed'  count 'rows:';
    end;
  endthread; 
  run; 
quit;

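So, you are right in the sense that DS2 could potentially help here. However, I suspect it is a bit more complicated than that.

DS2 will happily run the data step in threads; the more challenging part is writing to several different datasets. That is because there is no good way to construct the output dataset names without using the macro language, which, as far as I know, does not play well with threading (though I am no expert here).

Here is an example using threads: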

PROC DS2;

     thread in_thread/overwrite=yes;
     dcl bigint count;
     drop count;
        method init();
           count=0;
        end;
         method run();
             set in_data;
             count+1;
             output;             
         end;
         method term();      
           put 'Thread' _threadid_ ' processed' count 'observations.';
         end;
     endthread;
     run;

     data out_data/overwrite=yes;
         dcl thread in_thread t_in; /* instance of the thread */
         method run();
           set from t_in threads=4;
           output;
         end;
     enddata;
     run;
quit;

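But this only writes out a single dataset, and changing threads=4 to 1 does not actually make it take any longer. Both run at a decent speed, but both are actually slower than a regular data step (by a factor of about 1.8 for me). When accessing SAS datasets, DS2 reads the data through a much slower method under the hood than the base data step does; the speed gains from DS2 really come into play when you are working in an RDBMS through SQL or something similar.

However, there is no good way to drive the output in parallel. Here is the version above turned into 4 datasets. Note that the actual choice of where to output is made in the main, non-threaded data program.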

PROC DS2;

     thread in_thread/overwrite=yes;
     dcl bigint count;
     dcl bigint thisThread;
     drop count;
        method init();
           count=0;
        end;
         method run();
             set in_data;
             count+1;
             thisThread = _threadid_;
             output;

         end;
         method term();      
           put 'Thread' _threadid_ ' processed' count 'observations.';
         end;
     endthread;
     run;

     data a b c d/overwrite=yes;
         dcl thread in_thread t_in; /* instance of the thread */
         method run();
           set from t_in threads=4;
           select(thisThread);
             when (1) output a;
             when (2) output b;
             when (3) output c;
             when (4) output d;
             otherwise;
           end;
         end;
     enddata;
     run;
quit;

So it is actually a lot slower than the non-threaded version. Oops.

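Really, the issue here is that disk I/O is the bottleneck, not CPU. Your CPU is barely doing any work. DS2 might help in some edge cases where you have a very fast SAN that can handle many simultaneous writes, but ultimately reading those million records takes X amount of time and writing a million records takes X amount of time, depending on your I/O constraints, and odds are that parallelization will not help with that.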
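One rough way to check whether a step is I/O-bound is to compare real time with CPU time in the log. The sketch below assumes sashelp.cars as a stand-in for the real input (it is far too small to show meaningful timings, so substitute your own table) and uses the FULLSTIMER system option. If real time is much larger than user plus system CPU time, the step is mostly waiting on I/O, and more threads are unlikely to help.

```sas
/* Assumption: sashelp.cars stands in for the real input table. */
options fullstimer;   /* log real time, CPU time and memory for each step */

/* Read-only pass: rows are read but nothing is written */
data _null_;
  set sashelp.cars;
run;

/* Read-and-write pass: the same rows are also written to disk */
data work.cars_copy;
  set sashelp.cars;
run;

options nofullstimer; /* restore the shorter log notes */
```

Comparing the two log notes gives a feel for how much of the elapsed time is spent reading and how much writing.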

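Hash tables would gain you more, I suspect, and they can certainly be used with DS2 here; for a data step version, see my new answer on the other question linked in the OP. DS2 probably will not make that solution any faster, and is more likely to make it slower; but if you want, you could implement roughly the same thing in DS2, and the child threads could then do their own output without involving the main thread.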
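For readers who have not seen the hash approach, here is a minimal data step sketch of the general idea. It is not necessarily the code from the linked answer. It assumes sashelp.class as sample input, split by sex, and the names class_sorted, grp and _seq are made up for illustration. Each BY group is loaded into a fresh hash object, and the hash OUTPUT method writes the group to a dataset whose name is built at run time, which is exactly the part that is awkward without hashes.

```sas
/* Assumption: sashelp.class is sample data; SEX is the split key. */
proc sort data=sashelp.class out=work.class_sorted;
  by sex;
run;

data _null_;
  /* A new hash instance is created on every iteration of the data step, */
  /* that is, once per BY group, because of the DO UNTIL loop below.     */
  declare hash grp(ordered: 'a');
  grp.defineKey('_seq');   /* row counter keeps the original row order */
  grp.defineData('name', 'sex', 'age', 'height', 'weight');
  grp.defineDone();

  /* Load one whole BY group into the hash */
  do _seq = 1 by 1 until (last.sex);
    set work.class_sorted;
    by sex;
    grp.add();
  end;

  /* Dataset name is computed at run time: work.class_F, work.class_M */
  grp.output(dataset: cats('work.class_', sex));
  grp.delete();            /* release the memory held by this group */
run;
```

This still reads and writes every row once, so it does not get around the I/O limit described above; it mainly avoids hard-coding one OUTPUT statement per target dataset.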

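Where DS2 would be really useful is if you were doing this in Teradata or something similar, where you could use the SAS in-database accelerator to execute this code on the database side. That would make things much more efficient. Then you could use something like my code above, or better yet, the hash solution.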
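As a side note on the hard-coded WHEN branches in the four-dataset version: macro code is resolved before PROC DS2 compiles the program, so the repeated parts can be generated by wrapping the whole step in a macro. The following is only a sketch under assumptions. The macro name ds2_split, its parameters, and the thread name split_thread are hypothetical, sashelp.cars is used as sample input, and it assumes, as the code above does, that thread ids run from 1 to the number of threads.

```sas
%macro ds2_split(in=, out=, threads=4);
  %local i;
  proc ds2;
    /* Thread program: tags each row with the id of the thread that read it */
    thread split_thread / overwrite=yes;
      dcl bigint thisThread;
      method run();
        set &in.;
        thisThread = _threadid_;
        output;
      end;
    endthread;
    run;

    /* Data program: the table list and the WHEN branches are generated */
    data %do i = 1 %to &threads.; &out.&i. %end; / overwrite=yes;
      dcl thread split_thread t_in;
      method run();
        set from t_in threads=&threads.;
        select (thisThread);
          %do i = 1 %to &threads.;
            when (&i.) output &out.&i.;
          %end;
          /* Assumption: ids are 1..&threads.; other ids are discarded */
          otherwise;
        end;
      end;
    enddata;
    run;
  quit;
%mend ds2_split;

/* Example call: writes part1 .. part4 */
%ds2_split(in=sashelp.cars, out=part, threads=4)
```

The routing still happens in the single data program, so this only removes the hand-written branches; it does not make the output any more parallel than before.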

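Here is an example of a user-defined DS2 package that splits a dataset using the hash-of-hashes (HoH) approach. The biggest drawback is that, because variable lists are of very limited use in DS2, the datasets cannot be named after their key values without a lot of fudging, so I opted for a simpler naming convention: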

data cars;
set sashelp.cars;
run;

proc ds2;

package hashSplit / overwrite=yes;

declare package hash  h  ();
declare package hash  hs ();
declare package hiter hi;

/**
  * create a child multidata hash object
  */
private method mHashSub(varlist k, varlist d) returns package hash;
  hs = _new_ [this] hash();
  hs.keys(k);
  hs.data(d);
  hs.multidata('Y');
  hs.defineDone();
  return hs;
end;

/**
  * constructor, create the parent and child hash objects
  */
method hashSplit(varlist k);
  h = _new_ [this] hash();
  h.keys(k);
  h.definedata('hs');
  h.defineDone();
end;

/**
  * adds key values to parent hash, if necessary
  * adds key values and data values to child hash
  * consolidates the FIND, ADD and nested ADD methods
  */
method add(varlist k, varlist d);
  declare double rc;

  rc = h.find();
  if rc ^= 0 then do;
    hs = mHashSub(k, d);
      h.add();
  end;
  hs.add();
end;

/**
  * outputs the child hashes to data sets with a fixed naming convention
  *
  * SAS needs to add more support for using variable lists with functions/methods besides hash
  */
method output();
  declare double rc;
  declare int i;

  hi = _new_ hiter('h');

  rc = hi.first();
  do i = 1 to h.num_items by 1 while (rc = 0);
    hs.output(catx('_', 'hashSplit', i));
      rc = hi.next();
  end;
end;

endpackage;
run;
quit;

/**
  * example of using the hashSplit package
  */
proc ds2;
data _null_;
varlist k [origin];
varlist d [_all_];
declare package hashSplit split(k);

method run();
  set cars;
  split.add(k, d);
end;

method term();
  split.output();
end;
enddata;
run;
quit;
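To check the result, something like the following could be run afterwards. It is a sketch that assumes the package wrote its output to the WORK library under the hashSplit_1, hashSplit_2, ... names used above. It lists each piece with its row count and then totals the rows, which should match the row count of work.cars if the pieces are mutually exclusive and complete.

```sas
/* Assumes the pieces are WORK.HASHSPLIT_1, WORK.HASHSPLIT_2, ... */
proc sql;
  /* One line per piece with its number of rows */
  select memname, nobs
    from dictionary.tables
    where libname = 'WORK'
      and memname like 'HASHSPLIT%';

  /* Total rows across all pieces; compare with the rows in WORK.CARS */
  select sum(nobs) as split_rows
    from dictionary.tables
    where libname = 'WORK'
      and memname like 'HASHSPLIT%';
quit;
```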

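Just to clarify: are you specifically looking for a PROC DS2 solution here? Or are you hoping to find a general solution that is more efficient than the regular data step? (For example, there is a hash-table approach that is more efficient than the solution I suggested in that thread.)

Note: I have added the hash solution to that question.

Do you have SAS/CONNECT installed? That might be another possible approach.

Does this actually use threads? If so, does it improve performance?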