SAS loop: summing observations vertically based on a condition


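I have a dataset that looks like this:

    Zip Codes   Total Cars
    11111       3
    11111       4
    23232       1
    44331       0
    44331       10
    18860       6
    18860       6
    18860       6
    18860       8

There are more than 3 million rows like this, with many different ZIP codes. I need to sum the total cars for each ZIP code, so that the resulting table looks like this:

    Zip Codes   Total Cars
    11111       7
    23232       1
    44331       10
    18860       26
    .
    .

Given the size of the dataset, manually typing the ZIP codes into the code is not an option. Any ideas?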

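  • The variable you want to sum by is "ZipCodes", so it goes in the "class" statement
  • You need to sum Total_cars, so it goes in the "var" statement
  • The input and output tables are self-explanatory

       /* Code */
       proc summary data=Input_table;
           class ZipCodes;
           var Total_cars;
           output out=Output_table
               sum()=;
       run;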
    

    You can use PROC SQL. It is a very simple step:

    proc sql;
        create table new as
        select Zipcodes, sum(Total_Cars) as total_cars
        from table_have
        group by Zipcodes;
    quit;


    Both answers so far are fine, but here is a more detailed explanation of both approaches:

    PROC SQL method

    PROC SQL;
      CREATE TABLE output_table AS
      SELECT ZipCodes,
      SUM(Total_Cars) as Total_Cars
      FROM input_table
      GROUP BY ZipCodes;
    QUIT;
    
    
    The GROUP BY clause can also be written as GROUP BY 1, omitting ZipCodes, because 1 refers to the first column in the SELECT clause.
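
    For illustration, here is a minimal sketch of the positional form. It reuses the input_table, output_table, ZipCodes and Total_Cars names from the query above and assumes nothing else:

```sas
/* Same aggregation as above, grouping by column position instead of name. */
PROC SQL;
  CREATE TABLE output_table AS
  SELECT ZipCodes,
         SUM(Total_Cars) AS Total_Cars
  FROM input_table
  GROUP BY 1;   /* 1 = first column in the SELECT list, i.e. ZipCodes */
QUIT;
```

    The positional form is shorter, but it silently changes meaning if the SELECT list is reordered, so naming the column is usually the safer habit.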

    PROC SUMMARY method

    PROC SUMMARY DATA=input_table NWAY;
                 CLASS ZipCodes;
                 VAR Total_Cars;
                 OUTPUT OUT=output_table (DROP=_TYPE_ _FREQ_) SUM()=;
    RUN;
    
    
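    This approach is similar to the other answer to this question, but I have added:

    • NWAY
      - Outputs only the highest summary level. It matters less here because you have only one class variable, which means only one summary level. Without NWAY, however, you get an extra row showing the total of Total_Cars across the whole dataset, which is not what you asked for in the question.

    • DROP=_TYPE_ _FREQ_
      - Drops the automatic variables:

      • _TYPE_
        - Shows the summary level (see the note above); here the column would contain only the value 1.
      • _FREQ_
        - Gives the frequency count for each ZipCodes value. It is useful, but not what you asked for in the question.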
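
    To see what NWAY and DROP= take away, it may help to run the step once without them. This is only a sketch: the three-row input_table is a cut-down stand-in for the question's data, and the output name with_all_levels is made up for illustration.

```sas
/* Tiny stand-in for the real table (assumed layout: ZipCodes, Total_Cars). */
DATA input_table;
  INPUT ZipCodes Total_Cars;
  DATALINES;
11111 3
11111 4
23232 1
;
RUN;

/* No NWAY and no DROP=: _TYPE_ and _FREQ_ are kept, and the overall
   total is written as an extra row alongside the per-ZIP rows. */
PROC SUMMARY DATA=input_table;
  CLASS ZipCodes;
  VAR Total_Cars;
  OUTPUT OUT=with_all_levels SUM=;
RUN;

PROC PRINT DATA=with_all_levels;
RUN;
```

    The printed dataset should contain one row with _TYPE_ = 0 and a missing ZipCodes holding the grand total, followed by one _TYPE_ = 1 row per ZIP code. NWAY keeps only the _TYPE_ = 1 rows.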
    Data step method

    PROC SORT DATA=input_table (RENAME=(Total_Cars = tc)) OUT=_temp;
      BY ZipCodes;
    RUN;
    
    DATA output_table (DROP=TC);
      SET _temp;
      BY ZipCodes;
      IF first.ZipCodes THEN Total_Cars = 0;
      Total_Cars+tc;
      IF last.ZipCodes THEN OUTPUT;
    RUN;
    
    

    This is included only for completeness; it is less efficient than the approaches above because it requires a pre-sort.

    To add to @mjsqu's answer, for (more) completeness:

    data testin;
        input Zip Cars;
        datalines;
    11111 3
    11111 4
    23232 1
    44331 0
    44331 10
    18860 6
    18860 6
    18860 6
    18860 8
    ;
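
    Before trying any of the approaches on the full table, it can be useful to know what this sample should produce. A minimal sketch, reusing the PROC SQL approach from the earlier answers against testin (the Cars_Sum alias is just a label chosen here):

```sas
/* Cross-check on the testin sample; ORDER BY makes the row order explicit. */
PROC SQL;
  SELECT Zip,
         SUM(Cars) AS Cars_Sum
  FROM testin
  GROUP BY Zip
  ORDER BY Zip;
QUIT;
/* Expected sums: 11111 -> 7, 18860 -> 26, 23232 -> 1, 44331 -> 10 */
```

    These totals match the result table given in the question.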
    
    PROC TABULATE method

    proc tabulate data=testin out=testout
    
        /*drop extra created vars and rename as needed*/
        (drop=_type_ _page_ _table_ rename=(Zip='Zip Codes'n Cars_Sum='Total Cars'n));
    
        /*grouping variable, also used to sort output in ascending order*/
        class Zip;
    
        /* variable to be analyzed*/
        var Cars;
    
        /*sum cars by zip code*/
        table Zip, Cars*(sum);
    run;
    
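    If you are using Enterprise Guide, this produces both a dataset and a results table. To suppress the results and output only the dataset, include this line before proc tabulate:

    and this line after run: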


    Try to be more specific in your answer and explain better what the code snippet is doing.
    Can you format the code in your answer by highlighting it and pressing Ctrl+K?
    Hi mjsqu - I've updated the formatting... hopefully it is easier to follow now.
    Since you are new to this, I suggest working through an SQL tutorial such as the following: . You will then be able to elaborate on the SQL answer @mjsqu provided below.