dplyr mutate/replace several columns on a subset of rows

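I'm trying out a dplyr-based workflow (rather than data.table, which I'm used to), and I've run into a problem for which I can't find an equivalent dplyr solution. I commonly need to conditionally update/replace several columns based on a single condition. Here's some example code, with my data.table solution: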

library(data.table)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit', 
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]

You can do this with magrittr's two-way pipe %<>%:
library(dplyr)
library(magrittr)

dt[dt$measure == "exit", ] %<>% mutate(qty.exit = qty,
                                       cf = 0,
                                       delta.watts = 13)

This reduces the amount of typing, but it is still much slower than data.table.
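As eipi10 shows above, there is no simple way to do a subset replacement in dplyr, because data.table (DT) uses pass-by-reference semantics whereas dplyr uses pass-by-value semantics. dplyr requires ifelse() over the whole vector, whereas DT performs the subset and updates by reference (returning the whole DT). So for this exercise DT will be substantially faster.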
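To make the contrast concrete, here is a small sketch on an invented three-row table (the names toy, toy_dplyr and toy_dt are assumptions for illustration, not part of the question). The dplyr version recomputes each column over every row and returns a new object, while the data.table version touches only the matching rows in place.

```r
library(data.table)
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(measure = c("cfl", "exit", "led"),
                  qty = c(5, 7, 9),
                  qty.exit = 0,
                  cf = c(0.2, 0.4, 0.6))

# dplyr: every column is rebuilt with ifelse() over the whole vector,
# and the condition has to be repeated for each column
toy_dplyr <- toy %>%
  mutate(qty.exit = ifelse(measure == "exit", qty, qty.exit),
         cf = ifelse(measure == "exit", 0, cf))

# data.table: only the rows where measure == "exit" are updated, by reference
toy_dt <- as.data.table(toy)
toy_dt[measure == "exit", `:=`(qty.exit = qty, cf = 0)]
```

Both approaches end up with the same values; the difference lies in how much data is rewritten and how often the condition must be typed.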
You could also subset first, then update, and finally recombine:
dt.sub <- dt[dt$measure == "exit",] %>%
  mutate(qty.exit= qty, cf= 0, delta.watts= 13)

dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])

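These solutions (1) maintain the pipeline, (2) do not overwrite the input, and (3) only require that the condition be specified once:
1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition: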
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)

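2) factor out condition Factor out the condition by making it an extra column that is later removed. Then use ifelse, replace, or arithmetic with logicals as illustrated. This also works for data tables.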
library(dplyr)

DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
       select(-is.exit)

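3) sqldf We could use SQL update via the sqldf package in the pipeline for data frames (but not data tables unless we convert them; this may represent a bug in dplyr). Because of the update statement it may seem that this code undesirably modifies the input, but in fact the update acts on a copy of the input in the temporarily generated database and not on the actual input.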
library(sqldf)

DF %>% 
   do(sqldf(c("update '.' 
                 set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13 
                 where measure = 'exit'", 
              "select * from '.'")))

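4) row_case_when Also check out row_case_when, a helper defined elsewhere (it is not exported by dplyr, so its definition must be in scope before running the code below). It uses a syntax similar to case_when but applies to rows.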
library(dplyr)

DF %>%
  row_case_when(
    measure == "exit" ~ data.frame(qty.exit = qty, cf = 0, delta.watts = 13),
    TRUE ~ data.frame(qty.exit, cf, delta.watts)
  )

Note 1: We used this as DF:

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

Here's a solution I like:
mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}
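For illustration, a call could look like the sketch below. It assumes, as the loop over dots implies, that the arguments alternate between a condition and a named list of replacement values. The toy data frame is invented for the example, and mutate_when is repeated only so the snippet runs on its own.

```r
library(dplyr)  # only needed here for the %>% pipe

# Same definition as above, repeated so the example is self-contained
mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}

# Toy data, made up for illustration only
toy <- data.frame(measure = c("cfl", "exit", "led", "exit"),
                  qty = c(5, 7, 9, 11),
                  qty.exit = 0,
                  cf = c(0.2, 0.4, 0.6, 0.8),
                  delta.watts = c(20.5, 30.5, 40.5, 50.5))

# One condition, followed by a named list of the columns to overwrite
result <- toy %>%
  mutate_when(measure == "exit",
              list(qty.exit = qty, cf = 0, delta.watts = 13))
```

Further condition/list pairs can be appended to the same call; they are applied in order.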

This is quite readable, although it may not be as performant as it could be.
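I just stumbled across this and really like mutate_cond() by @G. Grothendieck, but thought it might come in handy to also handle new variables. So the version below has two additions:
Unrelated: the assignment on the second-to-last line now uses filter(), which is a bit more dplyr-like.
Three new lines at the beginning get the variable names used in mutate() and initialize any new variables in the data frame before mutate() occurs. New variables are initialized for the remainder of the data.frame using new_init, which defaults to missing (NA).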
mutate_cond <- function(.data, condition, ..., new_init = NA, envir = parent.frame()) {
  # Initialize any new variables as new_init (%<>% requires library(magrittr))
  new_vars <- substitute(list(...))[-1]
  new_vars %<>% sapply(deparse) %>% names %>% setdiff(names(.data))
  .data[, new_vars] <- new_init

  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data %>% filter(condition) %>% mutate(...)
  .data
}

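Change Petal.Length to 88 where Species == "setosa", and also create a new variable x (NA in rows not included in the condition). This was not possible with the original function.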
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE)

Same as above, but rows not included in the condition are set to FALSE for x.
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE, new_init = FALSE)

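This example shows how new_init can be set to a list to initialize multiple new variables with different values. Here, two new variables are created, and the excluded rows are initialized with different values (x as FALSE, y as NA):
iris %>% mutate_cond(Species == "setosa" & Sepal.Length < 5,
                  x = TRUE, y = Sepal.Length ^ 2,
                  new_init = list(FALSE, NA))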
At the expense of breaking with the usual dplyr syntax, you can use within from base:
# within() evaluates a single expression, so the assignments are grouped in braces
dt %>% within({
  qty.exit[measure == 'exit'] <- qty[measure == 'exit']
  delta.watts[measure == 'exit'] <- 13
})

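mutate_cond runs into trouble if the condition evaluates to NA for some rows. A conditional mutate should simply leave such rows alone, matching the behavior of filter(), which returns rows when the condition is TRUE but omits rows where it is FALSE or NA.
With this small change the function works like a charm: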
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
    condition <- eval(substitute(condition), .data, envir)
    condition[is.na(condition)] = FALSE
    .data[condition, ] <- .data[condition, ] %>% mutate(...)
    .data
}
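A short sketch of the NA case, using an invented data frame (toy and res are illustrative names). Row 2 has a missing value in the column used by the condition. With the guard in place that row is left untouched, just as filter() would drop it from a subset.

```r
library(dplyr)

# Same definition as above, repeated so the example is self-contained
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  condition[is.na(condition)] <- FALSE  # treat NA like FALSE
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

# Toy data with a missing value in the column used by the condition
toy <- data.frame(x = c(1, NA, 3), y = c(10, 20, 30))

# x > 2 evaluates to FALSE, NA, TRUE; only the third row is updated
res <- toy %>% mutate_cond(x > 2, y = 0)

# filter() treats the NA row the same way: it is not selected
kept <- toy %>% filter(x > 2)
```

Without the is.na() line, the logical index would contain NA, which base R does not allow in a subscripted assignment on a data frame.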

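With the creation of rlang, a slightly modified version of Grothendieck's 1a example (mutate_rows) is possible that eliminates the need for the envir argument, since enquo() automatically captures the environment in which .p is created.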
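A minimal sketch of such a function, assuming the name mutate_rows and a predicate argument .p as mentioned above. The body, the toy data and the variable target are illustrative assumptions, not necessarily the original author's code.

```r
library(dplyr)

# Hypothetical reconstruction: only the names mutate_rows and .p come from the text
mutate_rows <- function(.data, .p, ...) {
  .p <- rlang::enquo(.p)                 # capture the expression and its environment
  .p_lgl <- rlang::eval_tidy(.p, .data)  # evaluate it with the data as a mask
  .data[.p_lgl, ] <- .data[.p_lgl, ] %>% mutate(...)
  .data
}

# Toy data, made up for illustration only
toy <- data.frame(measure = c("cfl", "exit", "led"),
                  qty = c(5, 7, 9),
                  qty.exit = 0,
                  cf = c(0.2, 0.4, 0.6),
                  delta.watts = c(20.5, 30.5, 40.5))

# `target` lives in the calling environment; the quosure finds it without envir
target <- "exit"
res <- toy %>%
  mutate_rows(measure == target, qty.exit = qty, cf = 0, delta.watts = 13)
```

Because the quosure carries its own environment, variables such as target resolve correctly even when mutate_rows is called from inside another function.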
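I don't actually see any changes to dplyr that would make this much easier. case_when is great when there are multiple different conditions and outcomes for one column, but it doesn't help when you want to change multiple columns based on one condition. Similarly, recode saves typing if you are replacing multiple different values in one column, but it doesn't help with doing so in multiple columns at once. Finally, mutate_at etc. only apply conditions to the column names, not to the rows of the data frame. You could potentially write a function for mutate_at that would do it, but I can't figure out how you would make it behave differently for different columns.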
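As a sketch of that limitation, here is the same idea with across(), which superseded mutate_at() in dplyr 1.0 (so this assumes dplyr >= 1.0 is installed; the toy data is invented). The lambda is identical for every selected column, so one call cannot set cf to 0 and delta.watts to 13.

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(measure = c("cfl", "exit", "led"),
                  cf = c(0.2, 0.4, 0.6),
                  delta.watts = c(20.5, 30.5, 40.5))

# The condition is written once, but both columns receive the same value (0)
res <- toy %>%
  mutate(across(c(cf, delta.watts), ~ replace(.x, measure == "exit", 0)))
```

This works when every column should get the same replacement, but not for the per-column values needed in the question.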
That said, here is how I would approach it using nest from tidyr and map from purrr:
library(data.table)
library(dplyr)
library(tidyr)
library(purrr)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

dt2 <- dt %>% 
  nest(-measure) %>% 
  mutate(data = if_else(
    measure == "exit", 
    map(data, function(x) mutate(x, qty.exit = qty, cf = 0, delta.watts = 13)),
    data
  )) %>%
  unnest()

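Another option is to split the data on the condition, mutate only the TRUE part, and bind the pieces back together:
library(tidyverse)
df1 %>%
  group_split(measure == "exit", .keep = FALSE) %>% # or `split(.$measure == "exit")`
  modify_at(2, ~mutate(., qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()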

#    site space measure qty qty.exit delta.watts          cf
# 1     1     4     led   1        0        73.5 0.246240409
# 2     2     3     cfl  25        0        56.5 0.360315879
# 3     5     4     cfl   3        0        38.5 0.279966850
# 4     5     3  linear  19        0        40.5 0.281439486
# 5     2     3  linear  18        0        82.5 0.007898384
# 6     5     1  linear  29        0        33.5 0.392412729
# 7     5     3  linear   6        0        46.5 0.970848817
# 8     4     1     led  10        0        89.5 0.404447182
# 9     4     1     led  18        0        96.5 0.115594622
# 10    6     3  linear  18        0        15.5 0.017919745
# 11    4     3     led  22        0        54.5 0.901829577
# 12    3     3     led  17        0        79.5 0.063949974
# 13    1     3     led  16        0        86.5 0.551321441
# 14    6     4     cfl   5        0        65.5 0.256845013
# 15    4     2     led  12        0        29.5 0.340603733
# 16    5     3  linear  27        0        63.5 0.895166931
# 17    1     4     led   0        0        47.5 0.173088800
# 18    5     3  linear  20        0        89.5 0.438504370
# 19    2     4     cfl  18        0        45.5 0.031725246
# 20    2     3     led  24        0        94.5 0.456653397
# 21    3     3     cfl  24        0        73.5 0.161274319
# 22    5     3     led   9        0        62.5 0.252212124
# 23    5     1     led  15        0        40.5 0.115608182
# 24    3     3     cfl   3        0        89.5 0.066147321
# 25    6     4     cfl   2        0        35.5 0.007888337
# 26    5     1  linear   7        0        51.5 0.835458916
# 27    2     3  linear  28        0        36.5 0.691483644
# 28    5     4     led   6        0        43.5 0.604847889
# 29    6     1  linear  12        0        59.5 0.918838163
# 30    3     3  linear   7        0        73.5 0.471644760
# 31    4     2     led   5        0        34.5 0.972078100
# 32    1     3     cfl  17        0        80.5 0.457241602
# 33    5     4  linear   3        0        16.5 0.492500255
# 34    3     2     cfl  12        0        44.5 0.804236607
# 35    2     2     cfl  21        0        50.5 0.845094268
# 36    3     2  linear  10        0        23.5 0.637194873
# 37    4     3     led   6        0        69.5 0.161431896
# 38    3     2    exit  19       19        13.0 0.000000000
# 39    6     3    exit   7        7        13.0 0.000000000
# 40    6     2    exit  20       20        13.0 0.000000000
# 41    3     2    exit   1        1        13.0 0.000000000
# 42    2     4    exit  19       19        13.0 0.000000000
# 43    3     1    exit  24       24        13.0 0.000000000
# 44    3     3    exit  16       16        13.0 0.000000000
# 45    5     3    exit   9        9        13.0 0.000000000
# 46    2     3    exit   6        6        13.0 0.000000000
# 47    4     1    exit   1        1        13.0 0.000000000
# 48    1     1    exit  14       14        13.0 0.000000000
# 49    6     3    exit   7        7        13.0 0.000000000
# 50    2     4    exit   3        3        13.0 0.000000000

# Data used in the example above
df1 <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50),
                 stringsAsFactors = F)

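Another option is base::replace() inside mutate(). Here df stands for a data frame with the same columns as the sample data. Because replace() takes the replacement values positionally, qty is subset with the same condition:
df %>% mutate(qty.exit = replace(qty.exit, measure == 'exit', qty[measure == 'exit']),
              cf = replace(cf, measure == 'exit', 0),
              delta.watts = replace(delta.watts, measure == 'exit', 13))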

#build an index-vector matching the condition
index.v <- which( df$measure == 'exit' )

df %>% mutate( qty.exit = replace( qty.exit, index.v, qty[ index.v] ),
               cf = replace( cf, index.v, 0 ),
               delta.watts = replace( delta.watts, index.v, 13 ) )

# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
# data.table   1.005018 1.053370 1.137456 1.112871 1.186228 1.690996   100
# wimpel       1.061052 1.079128 1.218183 1.105037 1.137272 7.390613   100
# wimpel.index 1.043881 1.064818 1.131675 1.085304 1.108502 4.192995   100

One concise option is to mutate the filtered subset and then add back the remaining rows:
library(dplyr)

dt %>% 
    filter(measure == 'exit') %>%
    mutate(qty.exit = qty, cf = 0, delta.watts = 13) %>%
    rbind(dt %>% filter(measure != 'exit'))