Calculating a percentage for each month in R

I have the following dataset with 2 million observations. The data cover the period from April 2008 to April 2010.

> head(df)
               Empst Gender Age Agegroup   Marst                         Education State Year Month
1           Employed Female  58    50-60 Married  Some college or associate degree    AL 2008    12
2 Not in labor force   Male  63      61+ Married   Less than a high school diploma    AL 2008    12
3           Employed   Male  60    50-60  Single  Some college or associate degree    AL 2008    12
4 Not in labor force   Male  55    50-60  Single High school graduates, no college    AL 2008    12
5           Employed   Male  36    30-39  Single  Some college or associate degree    AL 2008    12
6           Employed Female  42    40-49 Married       Bachelor's degree or higher    AL 2008    12
  YYYYMM   Weight
1 200812 1876.356
2 200812 2630.503
3 200812 2763.981
4 200812 2693.110
5 200812 2905.784
6 200812 3511.313

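I want to calculate and plot the unemployment rate for each month. To calculate the unemployment rate, I divide the sum of the weights of the unemployed by the sum of the weights of the employed and the unemployed: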

    sum(df[df$Empst=="Unemployed",]$Weight) / 
    sum(df[df$Empst %in% c("Employed","Unemployed"),]$Weight)

To calculate the monthly unemployment rate, I use a for loop:

UnR<-vector()
for(i in levels(factor(df$YYYYMM))){
  temp<-sum(df[df$Empst=="Unemployed" & df$YYYYMM == i,]$Weight) /
        sum(df[df$Empst %in% c("Employed","Unemployed") & df$YYYYMM == i,]$Weight)
  UnR<-append(UnR,temp)
  rm(temp)
}

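Have you considered using the plyr package, specifically ddply? You can pass the data frame in and split it on the unique timestamp, so you would end up with something like this: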

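library(plyr)

unemployment_rate.df <- ddply(.data = df,
                              .variables = "YYYYMM",
                              .fun = function(x){
                                # Weighted unemployed over the weighted labor force
                                # (everyone except "Not in labor force").
                                return(sum(x$Weight[x$Empst == "Unemployed"]) /
                                       sum(x$Weight[x$Empst != "Not in labor force"]))
                              })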
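A comment on this answer suggests passing `summarise` to `ddply` so the columns can be used by name without the repeated `x$`. The following is a minimal sketch of that idea, not the answerer's own code. The small `df` is a hypothetical stand-in with the same column names as the real data, and the denominator follows the question's definition (employed plus unemployed).

```r
library(plyr)

# Hypothetical toy data with the same column names as the real data set.
df <- data.frame(
  Empst  = c("Employed", "Unemployed", "Not in labor force",
             "Employed", "Unemployed", "Employed"),
  YYYYMM = c(200804, 200804, 200804, 200805, 200805, 200805),
  Weight = c(300, 100, 500, 450, 50, 500)
)

# summarise evaluates the expression once per YYYYMM group,
# so Weight and Empst refer to that month's rows only.
unemployment_rate.df <- ddply(
  df, "YYYYMM", summarise,
  UnR = sum(Weight[Empst == "Unemployed"]) /
        sum(Weight[Empst %in% c("Employed", "Unemployed")])
)

print(unemployment_rate.df)
```

The result is a data frame with one row per month and a named `UnR` column, which is convenient to pass on to a plotting function.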

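If the goal is to speed up the for loop, another approach (one you should generally apply to for loops) is to specify the length of the output vector in advance, if you know it. In this example, you know the output vector will have the same length as unique(df$YYYYMM). If you specify that up front, the loop should run faster, because R no longer has to grow the vector on every iteration; it simply fills in an existing (blank) element.

You can also skip the temporary assignment and the append, which take time as well, by assigning directly to output_vector[i]; otherwise the R session has to set aside some space on every iteration. With this example, you would end up with something like: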

#Create an output vector. We can specify length, because we know there'll
#be one entry for each unique value in the YYYYMM column.
#That saves time because it means R just modifies the vector in place.
months <- levels(factor(df$YYYYMM))
UnR <- numeric(length(months))

#And now, the for loop. Loop over positions rather than month labels, so each
#result lands in its preallocated slot instead of being added as a named element.
for(i in seq_along(months)){

  #Instead of creating a temporary object (which takes time), and then appending
  #(which takes time), we can just assign the result to the Ith element of the
  #output vector.
  UnR[i]<-sum(df[df$Empst=="Unemployed" & df$YYYYMM == months[i],]$Weight) /
        sum(df[df$Empst %in% c("Employed","Unemployed") & df$YYYYMM == months[i],]$Weight)
}
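Since the question also asks for a plot, here is a loop-free base R sketch that computes the same monthly rate with `tapply` and draws it. It is one possible alternative rather than anything proposed in the thread; the toy `df` and the names `labor`, `unemployed`, and `labor_force` are hypothetical and only serve the illustration.

```r
# Hypothetical toy data with the same column names as the real data set.
df <- data.frame(
  Empst  = c("Employed", "Unemployed", "Not in labor force",
             "Employed", "Unemployed", "Employed"),
  YYYYMM = c(200804, 200804, 200804, 200805, 200805, 200805),
  Weight = c(300, 100, 500, 450, 50, 500)
)

# Keep only people in the labor force; the rest never enter the rate.
labor <- df[df$Empst %in% c("Employed", "Unemployed"), ]

# Weighted unemployed and weighted labor force, summed per month.
# Multiplying by the logical zeroes out the weights of the employed.
unemployed  <- tapply(labor$Weight * (labor$Empst == "Unemployed"),
                      labor$YYYYMM, sum)
labor_force <- tapply(labor$Weight, labor$YYYYMM, sum)

# One rate per month, named by YYYYMM (0.25 for 200804, 0.05 for 200805).
UnR <- unemployed / labor_force

# Line plot with the month labels on the x axis.
plot(seq_along(UnR), as.numeric(UnR), type = "b", xaxt = "n",
     xlab = "Month", ylab = "Unemployment rate")
axis(1, at = seq_along(UnR), labels = names(UnR))
```

Because `tapply` returns the groups in sorted order, the months are already in chronological order when they reach `plot`.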

You can do this with dplyr, which is somewhat similar to the plyr approach:

library(dplyr)

df %>%
    group_by(YYYYMM) %>%
    # Numerator is the weighted unemployed; denominator is the weighted labor force.
    summarize(UnR = sum(Weight[Empst == "Unemployed"]) /
                    sum(Weight[Empst %in% c("Employed", "Unemployed")]))

dplyr will almost certainly be faster than plyr, but unless your data is very large you probably won't notice the difference.
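
dplyr is now preferred over plyr, yes :). I'm plyr-centric at the moment because Debian-based systems don't have R 3 yet. Bah.

Having options is always good :). Let's see if anyone posts a data.table solution…

@Ironholds I have R 3 on Ubuntu; which Ubuntu version?

I don't have it on Xubuntu, or on our servers (running LTS. Bah).

You can use summarise in the ddply call to avoid all the x$.

Wouldn't that leave some data unaccounted for? ("Not in labor force")

See the similar code in @shujaa's dplyr summarise.

@Ironholds What do you mean? The "Not in labor force" population is not considered when calculating the unemployment rate. Do you mean that the code in the answers could give wrong results because it does not account for "Not in labor force"?

No, I was replying to Henrik's comment: summarise would include "Not in labor force", because those entries are in each subset :).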
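
One of the comments above wonders whether someone will post a data.table solution; none appears in the thread. The following is only a rough sketch of what one could look like, assuming the data.table package is installed. The toy `df` is hypothetical and uses the same column names as the real data.

```r
library(data.table)

# Hypothetical toy data with the same column names as the real data set.
df <- data.frame(
  Empst  = c("Employed", "Unemployed", "Not in labor force",
             "Employed", "Unemployed", "Employed"),
  YYYYMM = c(200804, 200804, 200804, 200805, 200805, 200805),
  Weight = c(300, 100, 500, 450, 50, 500)
)

dt <- as.data.table(df)

# i: keep only the labor force; j: weighted unemployed over the total
# weight of the remaining rows; keyby: one row per month, sorted by YYYYMM.
unemployment_rate.dt <- dt[
  Empst %in% c("Employed", "Unemployed"),
  list(UnR = sum(Weight[Empst == "Unemployed"]) / sum(Weight)),
  keyby = YYYYMM
]

print(unemployment_rate.dt)
```

Filtering in `i` first means `sum(Weight)` in `j` is already the labor-force total, so the status test only has to be written once more for the numerator.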