R: A "tidy" version of a function is much slower than the original, and I'm wondering why
I have data from subjects with unique IDs, collected over multiple visits, with each visit on its own row of a data frame. Some information, such as sex or year of birth, may be collected at only one visit but is relevant to every visit. For visits where the information was not collected, the field is NA. So I wrote a function that copies a subject's information for a given field to all of that subject's visits, replacing the NAs. It works, but the code is clunky, and now that I'm learning tidy data wrangling I wanted to rewrite it to make the code cleaner. I also hoped that would speed it up, but it didn't.

First, here is some toy data:
library(tidyverse)  # tibble(), %>% and distinct() are used below

data <- tibble(record_id = c(rep(LETTERS[1:4], 3)),
               year1 = c(NA, NA, 2000, 2001, 2002, rep(NA, 7)),
               year2 = c(rep(NA, 5), 2003, 2004, 2005, 2006, rep(NA, 3)))
Before I went tidy, I wrote this code, and it works fine:
mash.old <- function(data, variable){
  # Lookup table: one row per subject that has a non-missing value
  x <- data[!is.na(data[,variable]),] %>%
    distinct(record_id, .keep_all = T)
  x <- as.data.frame(x)
  # Visit every row; fill a missing value from that subject's lookup row
  for(i in 1:nrow(data)){
    if(is.na(data[i,variable]) &
       data[i, "record_id"] %in% x$record_id){
      id <- data[i, "record_id"]
      data[i,variable] <- x[x$record_id == as.character(id),
                            variable]
    }else{
      next
    }
  }
  rm(x, id, i)
  return(data)
}
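The biggest improvement is to call group_by() once. Right now you are grouping and ungrouping 12 times, which adds a lot of unnecessary overhead. Also, the new function reassigns everything back to itself: if we're working on year1, there is no reason to touch year2 or record_id.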
library(dplyr)
library(zoo)

data %>%
  arrange(record_id) %>%
  group_by(record_id) %>%
  mutate_at(vars(-group_cols()), function(x) zoo::na.locf(x[order(x)], na.rm = F)) %>%
  ungroup()
# A tibble: 12 x 3
   record_id year1 year2
   <chr>     <dbl> <dbl>
 1 A          2002  2006
 2 A          2002  2006
 3 A          2002  2006
 4 B            NA  2003
 5 B            NA  2003
 6 B            NA  2003
 7 C          2000  2004
 8 C          2000  2004
 9 C          2000  2004
10 D          2001  2005
11 D          2001  2005
12 D          2001  2005
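As a side note, mutate_at() is superseded by across() in dplyr 1.0.0 and later. Assuming a recent dplyr, a sketch of the same group-once approach on the toy data might look like the following. The name `filled` is introduced here only for illustration and is not part of the original answer.

```r
library(dplyr)
library(zoo)

# Toy data from the question
data <- tibble(record_id = rep(LETTERS[1:4], 3),
               year1 = c(NA, NA, 2000, 2001, 2002, rep(NA, 7)),
               year2 = c(rep(NA, 5), 2003, 2004, 2005, 2006, rep(NA, 3)))

# Group once, then sort each column within its group (NAs go last)
# and carry the last observed value forward over the trailing NAs.
# In a grouped mutate(), across(everything()) skips the grouping column.
filled <- data %>%
  arrange(record_id) %>%
  group_by(record_id) %>%
  mutate(across(everything(),
                ~ zoo::na.locf(.x[order(.x)], na.rm = FALSE))) %>%
  ungroup()

filled
```

This keeps the single group_by() call that the answer identifies as the main saving; only the column-selection syntax differs.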
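A data.table version of the same approach (cole_dt2 in the benchmark below; the code is in the full listing) is also the fastest: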
Unit: milliseconds
           expr     min       lq      mean   median       uq      max neval
     cole_dplyr  3.2388  3.39800  3.588391  3.47175  3.62610   6.6420   100
       cole_dt2  1.6135  1.83535  2.082963  1.96230  2.07435   6.7179   100
    mashing_old  4.6119  4.86305  5.175244  4.94930  5.10220   9.1026   100
    mashing_new 16.1860 16.82445 18.610696 17.30585 18.01270 101.6192   100
 OP_non_mashing 15.1633 15.57970 16.914889 16.10400 16.97860  46.5837   100
All of my code (the benchmarks are at the bottom):
library(tidyverse)

data <- tibble(record_id = c(rep(LETTERS[1:4], 3)),
               year1 = c(NA, NA, 2000, 2001, 2002, rep(NA, 7)),
               year2 = c(rep(NA, 5), 2003, 2004, 2005, 2006, rep(NA, 3)))

# This second definition overrides the first: subject B now has two
# different year2 values (2003 and 2002)
data <- tibble(record_id = c(rep(LETTERS[1:4], 3)),
               year1 = c(NA, NA, 2000, 2001, 2002, rep(NA, 7)),
               year2 = c(rep(NA, 5), 2003, 2004, 2005, 2006, 2002, rep(NA, 2)))
data

library(data.table)
dt <- as.data.table(data)
vars_n <- names(dt)[-1] #included if you want to make a function later
dt[, lapply(.SD, function(x) zoo::na.locf(x[order(x)], na.rm = F)), keyby = record_id, .SDcols = vars_n]

data %>%
  arrange(record_id) %>%
  group_by(record_id) %>%
  mutate_at(vars(-group_cols()), function(x) zoo::na.locf(x[order(x)], na.rm = F)) %>%
  ungroup()
mash.old <- function(data, variable){
  x <- data[!is.na(data[,variable]),] %>%
    distinct(record_id, .keep_all = T)
  x <- as.data.frame(x)
  for(i in 1:nrow(data)){
    if(is.na(data[i,variable]) &
       data[i, "record_id"] %in% x$record_id){
      id <- data[i, "record_id"]
      data[i,variable] <- x[x$record_id == as.character(id),
                            variable]
    }else{
      next
    }
  }
  rm(x, id, i)
  return(data)
}

mash.new <- function(data, variables, grouping.var = record_id){
  for(i in variables){
    data <- data %>%
      group_by(!!enquo(grouping.var)) %>%
      arrange((!!sym(i)), .by_group = T) %>%
      fill(!!sym(i)) %>%
      ungroup()
  }
  return(data)
}
library(microbenchmark)
microbenchmark(
  cole_dplyr = {
    data %>%
      arrange(record_id) %>%
      group_by(record_id) %>%
      mutate_at(vars(-group_cols()), function(x) zoo::na.locf(x[order(x)], na.rm = F)) %>%
      ungroup()
  }
  ,
  # cole_dt = {
  #   dt1 <- copy(dt)
  #
  #   vars_n <- names(dt1)[-1]
  #   dt1[, (vars_n) := lapply(.SD, function(x) zoo::na.locf(sort(x))), keyby = record_id]
  # },
  cole_dt2 = {
    dt[, lapply(.SD, function(x) zoo::na.locf(x[order(x)], na.rm = F)), keyby = record_id]
  },
  mashing_old = {
    data1 <- data
    data1 <- mash.old(data1, 'year1')
    data1 <- mash.old(data1, 'year2')
  }
  ,
  mashing_new = {
    mash.new(data, c('year1', 'year2'))
  }
  , OP_non_mashing = {
    data %>%
      group_by(record_id) %>%
      arrange(year1, .by_group = T) %>%
      fill(year1) %>%
      arrange(year2) %>%
      fill(year2)
  }
)
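data %>% group_by(record_id) %>% fill(-record_id) %>% fill(-record_id, .direction = 'up')?

fill is generally fairly slow. I would replace it with zoo::na.locf and see what happens.

Good answer. I would suggest moving as.data.table() out of the 'measured in the benchmark' function, and possibly making it a simple copy() (since the data changes). If I do this, the gap widens further in favor of data.table.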
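The fill-based suggestion from the first comment can be sketched with a single grouping step. This is a minimal sketch, assuming a tidyr version that supports .direction = "downup" (1.0.0 or later). The columns are named explicitly rather than selected with -record_id, and the name `filled` is introduced here only for illustration. No timing is claimed for this variant.

```r
library(dplyr)
library(tidyr)

# Toy data from the question
data <- tibble(record_id = rep(LETTERS[1:4], 3),
               year1 = c(NA, NA, 2000, 2001, 2002, rep(NA, 7)),
               year2 = c(rep(NA, 5), 2003, 2004, 2005, 2006, rep(NA, 3)))

# Group once, then fill each column down and then up within each subject.
# fill() respects the grouping, and the original row order is kept.
filled <- data %>%
  group_by(record_id) %>%
  fill(year1, year2, .direction = "downup") %>%
  ungroup()

filled
```

Unlike the sort-based versions, this leaves the rows in their original order, and any value that is already present stays on its own row.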
mash <- function(data, variables, grouping.var = record_id){
  data <- data %>%
    arrange(!!enquo(grouping.var)) %>%
    group_by(!!enquo(grouping.var)) %>%
    mutate_at(vars(!!!variables),
              function(x) zoo::na.locf(x[order(x)], na.rm = F)) %>%
    ungroup()
  return(data)
}
#Note that if there are two different entries for a given subject in a
#variable, this will fill with the data that comes last in the sort order
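To make the note about conflicting entries concrete, here is a small sketch of what the sort-then-fill step does to one subject's values. The vector `x` is a made-up example of a subject with two different recorded years and one missing visit.

```r
library(zoo)

# Hypothetical values for one subject: one missing visit, two different entries
x <- c(NA, 2003, 2002)

# order() places NA last, so the recorded values come first in ascending order
sorted <- x[order(x)]

# Carry the last observed value forward; na.rm = FALSE keeps the vector length
filled <- zoo::na.locf(sorted, na.rm = FALSE)

filled
# [1] 2002 2003 2003
```

The missing visit receives 2003, the value that sorts last. Because the values are sorted within the group before being written back, the recorded values can also end up on different rows than they started on.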