R: Checking whether columns exist in a df using vectorization
Tags: r, vectorization, sapply

I have defined the following function to check whether a data frame contains several columns and, if it does not, to add them:
CheckFullCohorts <- function(df) {
  # Checks if year/cohort df contains all necessary columns
  # Args:
  #   df: year/cohort df
  # Return:
  #   df: df, corrected if necessary
  foo <- function(mydf, mystring) {
    if (!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }
  df <- foo(df, "age.16.20")
  df <- foo(df, "age.21.24")
  df <- foo(df, "age.25.49")
  df <- foo(df, "age.50.57")
  df <- foo(df, "age.58.65")
  df <- foo(df, "age.66.70")
  df
}
You can easily vectorize this:
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49",
               "age.50.57", "age.58.65", "age.66.70")
test[musthaves[!(musthaves %in% names(test))]] <- 0
test
#  age.16.20 lorem age.21.24 age.25.49 age.50.57 age.58.65 age.66.70
#1         x     y         0         0         0         0         0
#2         x     y         0         0         0         0         0
#3         x     y         0         0         0         0         0
#4         x     y         0         0         0         0         0
#5         x     y         0         0         0         0         0
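If the check is needed in more than one place, the one-liner above can be wrapped in a small function. The following is only a sketch: the name AddMissingColumns and the fill argument are hypothetical and not part of the original answer, and setdiff() is swapped in for the %in% filter.

```r
# Hypothetical helper (name and `fill` argument are assumptions, not from the answer):
# add every column named in `musthaves` that `df` lacks, filled with `fill`.
AddMissingColumns <- function(df, musthaves, fill = 0) {
  # setdiff() keeps the names from `musthaves` that are not already in `df`
  missing.cols <- setdiff(musthaves, names(df))
  # Only assign when something is actually missing
  if (length(missing.cols) > 0) {
    df[missing.cols] <- fill
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49",
               "age.50.57", "age.58.65", "age.66.70")
result <- AddMissingColumns(test, musthaves)
names(result)
```

Passing fill = NA instead of the default would mark the added columns as missing rather than zero, if that suits the data better.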
Pass a character vector S to CheckFullCohorts, then replace the lines in question with for(S in S){df ...

Sure. Does that mean this is one of the cases where a loop is more efficient than a vectorized solution? If so, I would still like to know about my sapply attempt.

Whether the loop is efficient depends on whether the data frame is copied on each iteration, and I don't know whether that is the case here. But the talk about loops being inefficient is often exaggerated: is this step the bottleneck in your code? If not, you shouldn't spend effort optimizing it. As for sapply, good question. I tend to use plyr for these things; its interface makes more sense to me. Also, @Roland's answer below is useful and doesn't need a function!

The problem with your sapply attempt is that you should loop over col.list rather than over df.

Wow, this is really elegant. In general I agree with the NA comment; in this particular case, 0 is what I am looking for.
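The loop suggested in the comments, and the point about iterating over col.list rather than df, might look roughly like this. It is a sketch only: the name CheckFullCohortsLoop and the loop variable s are assumptions, and foo is copied from the question.

```r
# foo is taken from the question: add one column of zeros if it is missing
foo <- function(mydf, mystring) {
  if (!(mystring %in% names(mydf))) {
    mydf[mystring] <- 0
  }
  mydf
}

# Hypothetical loop version: iterate over the column names, not over df
CheckFullCohortsLoop <- function(df, col.list) {
  for (s in col.list) {
    df <- foo(df, s)
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
col.list <- c("age.16.20", "age.20.25")
looped <- CheckFullCohortsLoop(test, col.list)

# Applying over col.list (one name at a time) gives one TRUE/FALSE per name
is.missing <- vapply(col.list, function(s) !(s %in% names(test)), logical(1))
```

Here each call to foo receives the whole data frame and a single name, so the condition inside if always has length one.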
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25"))
Warning messages:
1: In if (!(mystring %in% names(mydf))) { :
the condition has length > 1 and only the first element will be used
2: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
invalid factor level, NA generated
3: In if (!(mystring %in% names(mydf))) { :
the condition has length > 1 and only the first element will be used
4: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
invalid factor level, NA generated
> test
          age.16.20 lorem
          "x"       "y"
          "x"       "y"
          "x"       "y"
          "x"       "y"
          "x"       "y"
age.16.20 NA        NA
age.20.25 NA        NA
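The warnings and the matrix-shaped result above are consistent with sapply being applied to the data frame itself, so that foo receives one column at a time together with the whole vector of names. The sapply call is not shown in the thread, so that is an assumption; the sketch below only demonstrates the two underlying facts. In R 4.2.0 and later, a condition of length greater than one inside if is an error rather than a warning.

```r
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
col.list <- c("age.16.20", "age.20.25")

# sapply() treats a data frame as a list of columns, so the function
# is called once per column and never sees the data frame as a whole
seen <- sapply(test, function(x) is.data.frame(x))

# A single column has no names, so every requested name looks "missing"
# and the test has one element per name instead of a single TRUE/FALSE
looks.missing <- !(col.list %in% names(test$age.16.20))
```

Both elements of seen are FALSE, and looks.missing has length two, which is what triggers the "condition has length > 1" message.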