R: Checking whether columns exist in a df using vectorization

Tags: r, vectorization, sapply

I have defined the following function to check whether a data frame contains several columns and, if it does not, to add them:

CheckFullCohorts <- function(df) {
  # Checks if year/cohort df contains all necessary columns 
  # Args:
  #  df: year/cohort df

  # Return:
  #  df: df, corrected if necessary 

  foo <- function(mydf, mystring) {
    if(!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }

  df <- foo(df, "age.16.20")
  df <- foo(df, "age.21.24")
  df <- foo(df, "age.25.49")
  df <- foo(df, "age.50.57")
  df <- foo(df, "age.58.65")
  df <- foo(df, "age.66.70")

  df
}

You can easily vectorize this:
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49",
               "age.50.57", "age.58.65", "age.66.70")

test[musthaves[!(musthaves %in% names(test))]] <- 0
#  age.16.20 lorem age.21.24 age.25.49 age.50.57 age.58.65 age.66.70
#1         x     y         0         0         0         0         0
#2         x     y         0         0         0         0         0
#3         x     y         0         0         0         0         0
#4         x     y         0         0         0         0         0
#5         x     y         0         0         0         0         0
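
The same idea can be wrapped in a small reusable helper. This is only a sketch: the name add_missing_cols and its fill argument are hypothetical and do not appear in the original post. It uses setdiff() to find the required names that are absent from names(df), and the fill value can be switched from 0 to NA if that suits the data better.

```r
# Hypothetical helper; the name and arguments are illustrative, not from the post.
add_missing_cols <- function(df, cols, fill = 0) {
  absent <- setdiff(cols, names(df))  # required names not yet present in df
  if (length(absent) > 0) {
    df[absent] <- fill                # a single assignment creates all of them
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49",
               "age.50.57", "age.58.65", "age.66.70")

test <- add_missing_cols(test, musthaves)
names(test)
```

Existing columns are left untouched, so calling the helper a second time with the same names changes nothing.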

Pass a vector of strings S to CheckFullCohorts, then replace the offending lines with a loop: for (s in S) {df ...

Sure, that works. Does that mean this is one of the cases where a loop is more efficient than a vectorized solution? If so, I would still like to understand my sapply attempt.

Whether the loop is efficient depends on whether the data frame is copied at each iteration, and I don't know whether that is the case here. But the talk about loops being inefficient is often overblown: is this step the bottleneck in your code? If not, you shouldn't spend effort optimizing it. As for sapply, good question. I tend to use plyr for these things, because its interface makes more sense to me. Also, @Roland's answer below is very useful and doesn't need a function!

The problem with your sapply attempt is that you should loop over col.list, not over df.

Wow, this is really elegant. In general I agree with the NA comment; in this particular case, 0 is what I want.

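The sapply attempt discussed above, calling a version of CheckFullCohorts that takes the column names as a second argument, gave the following:

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))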
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25"))

Warning messages:
1: In if (!(mystring %in% names(mydf))) { :
  the condition has length > 1 and only the first element will be used
2: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
  invalid factor level, NA generated
3: In if (!(mystring %in% names(mydf))) { :
  the condition has length > 1 and only the first element will be used
4: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
  invalid factor level, NA generated
> test
          age.16.20 lorem
          "x"       "y"  
          "x"       "y"  
          "x"       "y"  
          "x"       "y"  
          "x"       "y"  
age.16.20 NA        NA   
age.20.25 NA        NA  
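
The warnings fit the comment about looping over col.list instead of df. If the attempt was something like sapply(df, foo, col.list) (an assumption, since that code is not shown), sapply would hand each column of df to foo as a factor, which would explain both the `[<-.factor` warnings and the matrix that comes back. A minimal sketch that iterates over the column names instead, using the hypothetical name check_cols_loop:

```r
# Sketch only: loops over the column names (col.list), never over df itself.
# The function name and two-argument signature are assumptions for illustration.
check_cols_loop <- function(df, col.list) {
  for (col in col.list) {
    if (!(col %in% names(df))) {  # col is a single string, so the condition has length 1
      df[[col]] <- 0              # add the missing column, recycled to nrow(df)
    }
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- check_cols_loop(test, c("age.16.20", "age.20.25"))
names(test)
```

Because the data frame is passed through the loop and returned whole, the result stays a data frame with the original columns intact.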
