R: changing data frame values based on a secondary data frame
I'm looking for a more efficient way to do some replacement/lookup.
My current approach is to use paste0 to build a lookup value, then match against that value to filter.
Given x:
x <- data.frame(var1 = c("AA","BB","CC","DD"),
var2 = c("--","AA","AA","--"),
val1 = c(1,2,1,4),
val2 = c(5,5,7,8))
var1 var2 val1 val2
1 AA -- 1 5
2 BB AA 2 5
3 CC AA 1 7
4 DD -- 4 8
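The paste0-based approach described above can be sketched as follows. It mirrors the body of na.replace.orig from the benchmark section further down, except that the lookup table is built with base indexing instead of dplyr's filter, so the snippet has no package dependencies (that substitution is an assumption made for illustration).

```r
x <- data.frame(var1 = c("AA","BB","CC","DD"),
                var2 = c("--","AA","AA","--"),
                val1 = c(1,2,1,4),
                val2 = c(5,5,7,8))

# Rows whose var2 is "--" act as the lookup table
# (base indexing used here in place of filter(); an assumption for this sketch)
lookup.df <- x[x$var2 == "--", ]

x[, c("val1", "val2")] <- lapply(c("val1", "val2"), function(column) {
  # Key for every row: its var2 pasted onto its value in this column
  var2.lookup <- paste0(x$var2, x[[column]])
  # Key for every lookup row: its var1 pasted onto its value in this column
  var1.lookup <- paste0(lookup.df$var1, lookup.df[[column]])
  # Where the keys coincide, the value duplicates the referenced row's value
  x[[column]][var2.lookup %in% var1.lookup] <- NA
  x[[column]]
})
```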
This does return what I expect:
> x
var1 var2 val1 val2
1 AA -- 1 5
2 BB AA 2 NA
3 CC AA NA 7
4 DD -- 4 8
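In practice, however, profiling the code shows that most of the time is spent in paste, so this is not the most efficient approach.
My real dataset has millions of rows and about 25 columns, and the run takes about 60 seconds. I think there should be a way to do a logical matrix replacement rather than accessing each column individually, but I can't work it out.
Any help is much appreciated. Thanks.
Edit: benchmarks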
library(dplyr)  # provides %>% and filter(), used in na.replace.orig

# Original approach: rebuilds the lookup table on every call
na.replace.orig <- function(x) {
  lookup.df <- x %>% filter(var2 == "--")
  x[, c("val1", "val2")] <- lapply(c("val1", "val2"), function(column) {
    # Build var/value keys and blank out values whose key is in the lookup
    var2.lookup <- paste0(x$var2, x[[column]])
    var1.lookup <- paste0(lookup.df$var1, lookup.df[[column]])
    x[[column]][var2.lookup %in% var1.lookup] <- NA
    return(x[[column]])
  })
  return(x)
}

# pulled out the lookup table since it causes a lot of overhead
# (lookup.df must already exist in the calling environment)
na.replace.orig.no.lookup <- function(x) {
  x[, c("val1", "val2")] <- lapply(c("val1", "val2"), function(column) {
    var2.lookup <- paste0(x$var2, x[[column]])
    var1.lookup <- paste0(lookup.df$var1, lookup.df[[column]])
    x[[column]][var2.lookup %in% var1.lookup] <- NA
    return(x[[column]])
  })
  return(x)
}

# Answer 1: match() plus nested for loops
na.replace.1 <- function(x) {
  inx <- match(x$var2, x$var1)           # row that each var2 refers to
  jnx <- which(!is.na(inx))              # rows that refer to another row
  inx <- inx[!is.na(inx)]
  knx <- grep("^val", names(x))          # positions of the value columns
  for (i in seq_along(inx))
    for (k in knx)
      if (x[[k]][inx[i]] == x[[k]][jnx[i]]) x[[k]][jnx[i]] <- NA
  return(x)
}

# Answer 2: vectorized %in% tests (also uses the external lookup.df)
na.replace.2 <- function(x) {
  for (col in c("val1", "val2")) {
    x[x[, 'var2'] %in% x[, 'var1'] & x[, col] %in% lookup.df[, col], col] <- NA
  }
  return(x)
}
> microbenchmark::microbenchmark(na.replace.orig(x), na.replace.orig.no.lookup(x), na.replace.1(x), na.replace.2(x), times = 10)
Unit: microseconds
                         expr     min     lq   mean median     uq    max neval
           na.replace.orig(x) 1267.23 1274.2 1441.9 1408.8 1609.8 1762.8    10
 na.replace.orig.no.lookup(x)  217.43  228.9  270.9  239.2  296.6  394.2    10
              na.replace.1(x)   98.46  106.3  133.0  123.9  136.6  239.2    10
              na.replace.2(x)  117.74  147.7  162.9  166.6  183.0  189.9    10
My code still works for this case, where an additional column var3 is part of the key:
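x[, c("val1", "val2")] <- lapply(c("val1", "val2"), function(column) {
  var2.lookup <- paste0(x$var2, x$var3, x[[column]])
  # var3 is taken from lookup.df here so that all three vectors have the same
  # length; the original used x$var3, which would recycle against lookup.df
  var1.lookup <- paste0(lookup.df$var1, lookup.df$var3, lookup.df[[column]])
  x[[column]][var2.lookup %in% var1.lookup] <- NA
  return(x[[column]])
})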
I find the solution below a little confusing (and I came up with it!), but it does work.
Contrary to popular belief, for loops are not much slower than the *apply family.
inx <- match(x$var2, x$var1)
jnx <- which(!is.na(inx))
inx <- inx[!is.na(inx)]
knx <- grep("^val", names(x))
for(i in seq_along(inx))
  for(k in knx)
    if(x[[k]][inx[i]] == x[[k]][jnx[i]]) x[[k]][jnx[i]] <- NA
x
# var1 var2 val1 val2
#1 AA -- 1 5
#2 BB AA 2 NA
#3 CC AA NA 7
#4 DD -- 4 8
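The same match-based idea can also be written without the explicit loops, which is closer to the "logical matrix replacement" the question asks about. The following is only a sketch under stated assumptions: the names x2, val_cols, vals and same are made up for illustration, all val columns are assumed to be numeric, and, like the loop above, it only considers the first row of var1 that matches each var2. Whether it is faster on large data has not been measured here.

```r
x2 <- data.frame(var1 = c("AA","BB","CC","DD"),
                 var2 = c("--","AA","AA","--"),
                 val1 = c(1,2,1,4),
                 val2 = c(5,5,7,8))

inx <- match(x2$var2, x2$var1)              # referenced row, NA if none
val_cols <- grep("^val", names(x2), value = TRUE)

vals <- as.matrix(x2[val_cols])             # numeric matrix of all val columns
same <- vals == vals[inx, , drop = FALSE]   # compare each row with its referenced row
same[is.na(same)] <- FALSE                  # no referenced row means no replacement

vals[same] <- NA                            # one logical-matrix assignment
x2[val_cols] <- as.data.frame(vals)
x2
```

Rows whose var2 has no match in var1 index an all-NA row of the matrix, so the comparison yields NA for them and they are left untouched.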
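The following solution uses only vectorized logic. It uses the lookup table you have already created. I think it may be even faster than Rui's solution.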
library(dplyr)
x <- data.frame(var1 = c("AA","BB","CC","DD"),
var2 = c("--","AA","AA","--"),
val1 = c(1,2,1,4),
val2 = c(5,5,7,8))
lookup.df <- x[ x[,'var2'] == "--", ]
x[x[,'var2'] %in% x[,'var1'] & x[,'val1'] %in% lookup.df[,'val1'] , 'val1'] <- NA
x[x[,'var2'] %in% x[,'var1'] & x[,'val2'] %in% lookup.df[,'val2'] , 'val2'] <- NA
x
#> var1 var2 val1 val2
#> 1 AA -- 1 5
#> 2 BB AA 2 NA
#> 3 CC AA NA 7
#> 4 DD -- 4 8
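Oh, and you will need to test this on your own dataset to see how differently the two approaches scale on larger data frames.
Edit 2: implemented Rui's suggestion for the lookup table. In order from slowest to fastest in the benchmark: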
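Before comparing speed on larger data, it may also be worth checking that the approaches agree. In the vectorized version above, the two %in% tests are evaluated independently, so a row can be flagged when its value matches the value of some lookup row other than the one its var2 refers to. The small data frame below is a hypothetical example constructed only to show this; y, lookup.y, indep and paired are illustrative names.

```r
# Row 2 refers to "AA" (val1 = 1) but holds 4, which is the val1 of "DD"
y <- data.frame(var1 = c("AA","BB","DD"),
                var2 = c("--","AA","--"),
                val1 = c(1, 4, 4))
lookup.y <- y[y$var2 == "--", ]

# Independent tests: var2 exists in var1, and val1 exists somewhere in the lookup
indep <- y$var2 %in% y$var1 & y$val1 %in% lookup.y$val1

# Paired test: the (var, value) combination itself must exist in the lookup
paired <- paste0(y$var2, y$val1) %in% paste0(lookup.y$var1, lookup.y$val1)

data.frame(y, indep, paired)
```

On the four-row example from the question the two tests give the same result, so the difference would only show up on data where values repeat across different var1 entries.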
lookup.df <- x %>% filter(var2 == "--")
lookup.df <- filter(x, var2 == "--")
lookup.df <- x[x[,'var2'] == "--", ]
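But note: the pipe is slower than indexing; try building lookup.df with plain indexing instead.
Thanks for the tip. Apparently there is some debate about filter versus subset, but the pipe definitely slows the operation down.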
set.seed(4)
microbenchmark::microbenchmark(na.replace.orig(x), na.replace.1(x), na.replace.2(x), times = 50)
#> Unit: microseconds
#>                expr     min      lq     mean   median      uq      max neval
#>  na.replace.orig(x) 184.348 192.410 348.4430 202.1615 223.375 6206.546    50
#>     na.replace.1(x)  68.127  86.621 281.3503  89.8715  93.381 9693.029    50
#>     na.replace.2(x)  95.885 105.858 210.7638 113.2060 118.668 4993.849    50