How to avoid loops in a specific multi-condition statement in R

Tags: r, performance, loops, bigdata

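I use R to draw inferences from electronic medical records (EMR). I have written a loop that works, but it becomes very slow when processing millions of records. Could anyone convert my command into something faster (vectorized computation or any other approach)? My goal is to determine whether a set of variables (here, p324 through p3274) contains any of a set of codes (here, I25.2, I21 and I22). Here is a sample of my data: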

test <- data.frame(p324 = c("I24.001", "I10.x04", "I48.x02", "I48.x01", "I25.201", "I25.201", "I25.101", "I25.101", "NA", "I50", "I25.101", "I25.101", "I25.101", "I45.102", "I50.902"),
p327 = c("I20.000", "K76.000", "E11.900", "I44.200", "NA", "I49.904", "I45.102", "I50.910", "NA", "I10  05", "J98.402", "NA", "NA", "R57.0", "I10.x04"),
p3291 = c("I50.903", "K80.100", "N39.000", "I25.103", "NA", "I50.908", "NA", "I10  04", "NA", "I25.101", "I10  03", "NA", "NA", "I25.101", "I10.x05"),
p3294 = c("I10.x05", "K76.807", "J98.414", "K81.100", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "J43", "I10.x06"),
p3297 = c("NA", "I83.900", "E87.801", "NA", "NA", "I21.620", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "I10.x07"),
p3281 = c("K80.100", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "I10.x08"),
p3284 = c("K76.807", "I21.620", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "I10.x09"),
p3287 = c("I83.900", "I10.x3", "I10.x2", "I10.x1", "I10.x0", "I10.x1", "I10.x2", "I10.x3", "I10.x4", "I10.x5", "I10.x6", "I10.x7", "I10.x8", "I10.x9", "I10.x10"),
p3271 = c("I50.908", "NA", "I10.x1", "I10.x0", "I10.x1", "I10.x2", "I10.x3", "I10.x4", "I10.x5", "I10.x6", "I10.x7", "I10.x8", "I10.x9", "I10.x10", "I10.x11"),
p3274 = c("NA", "I10.x1", "I10.x0", "I10.x1", "I10.x2", "I10.x3", "I10.x4", "I10.x5", "I10.x6", "I10.x7", "I10.x8", "I10.x9", "I10.x10", "I10.x11", "I10.x12"))

So, can anyone convert my command, or give me some advice, so that it runs quickly and efficiently even with more than one million cases? Many thanks.

I suggest using a regular expression and then vectorizing with sapply:

t_test <- as.data.frame(t(test))
chk <- function(x){
  grepl("I25\\.2|I21\\.|I22\\.",x)
}

sapply(t_test,chk)
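
Note that `sapply(t_test, chk)` returns a logical matrix with one column per record, so it still has to be collapsed to one value per record. A minimal sketch of that step, assuming a small made-up data frame (`toy` and its values are hypothetical, not the asker's data):

```r
# Hypothetical three-record example with two diagnosis columns.
toy <- data.frame(p324 = c("I24.001", "I25.201", "I21.620"),
                  p327 = c("I20.000", "NA", "I22.000"),
                  stringsAsFactors = FALSE)

chk <- function(x) grepl("I25\\.2|I21\\.|I22\\.", x)

# Same approach as above: transpose so that each column is one record.
t_toy <- as.data.frame(t(toy), stringsAsFactors = FALSE)
hits  <- sapply(t_toy, chk)        # logical matrix: variables x records

# Number of matching codes per record.
n_match <- unname(colSums(hits))
n_match
```

Applying `sapply` to the untransposed data frame and using `rowSums` instead gives the same counts without the transpose.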

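If you want better performance:

  • Don't run a loop over rows (sapply is also a loop).
  • Don't run vectorized operations inside a loop (why call substr row by row or column by column when you can call it just once?).
  • Avoid regular expressions; they are slow. If you are dealing with exact matches, just use == or %in% instead.

Here is a possible simple vectorized solution: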

    res <- (substr(unlist(test), 1, 5) == "I25.2") | 
           (substr(unlist(test), 1, 4) %in% c("I21.", "I22."))
    dim(res) <- dim(test)
    test$MI <- rowSums(res)
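
As a quick check of the approach above, here is a minimal sketch on a small made-up data frame (`toy` is a hypothetical example, not the asker's data). It compares the `substr`-based count with a `grepl`-based count anchored at the start of the string:

```r
toy <- data.frame(p324 = c("I24.001", "I25.201", "I21.620"),
                  p327 = c("I20.000", "NA", "I22.000"),
                  stringsAsFactors = FALSE)

codes <- unlist(toy, use.names = FALSE)   # column-major character vector

# Prefix comparison without regular expressions.
res <- (substr(codes, 1, 5) == "I25.2") |
       (substr(codes, 1, 4) %in% c("I21.", "I22."))
dim(res) <- dim(toy)                      # back to records x variables
mi_substr <- rowSums(res)

# Reference result from a regular expression anchored with ^.
ref <- grepl("^(I25\\.2|I21\\.|I22\\.)", codes)
dim(ref) <- dim(toy)
mi_grepl <- rowSums(ref)

stopifnot(identical(mi_substr, mi_grepl))
mi_substr
```

The anchor matters for the comparison: an unanchored pattern such as `"I25\\.2|I21\\.|I22\\."` would also match a code in which the prefix appears mid-string, whereas `substr` only inspects the first characters.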
    
    Validation

    Benchmark results


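Update 1

I compared @Sixiang.Hu's code using sapply & grepl(), @David Arenburg's code using grepl(), and @David Arenburg's code using substr. The sapply code appears to perform best. However, the substr code from @David Arenburg produces many NA values. What explains these NA values?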

    > # sapply & grepl()
    > start.time <- Sys.time()
    > test <- subset(I61, select = c("p324", "p327", "p3291", "p3294", "p3297", "p3281", "p3284", "p3287", "p3271", "p3274"))
    > MIchk <- function(x){
    +   grepl("I25\\.2|I21\\.|I22\\.",x)
    + }
    > test1 <- sapply(test,MIchk)
    > test$MI <- rowSums(test1)
    > end.time <- Sys.time()
    > end.time - start.time
    Time difference of 2.363007 secs
    > table(test$MI,exclude = NULL)
    
         0      1      2   <NA> 
    254495   3523     15      0 
    > 
    > # grepl() 
    > start.time <- Sys.time()
    > test <- subset(I61, select = c("p324", "p327", "p3291", "p3294", "p3297", "p3281", "p3284", "p3287", "p3271", "p3274"))
    > res <- grepl("I25\\.2|I21\\.|I22\\.", unlist(test))
    > dim(res) <- dim(test)
    > test$MI1 <- rowSums(res)
    > end.time <- Sys.time()
    > end.time - start.time
    Time difference of 2.51223 secs
    > table(test$MI1,exclude = NULL)
    
         0      1      2   <NA> 
    254495   3523     15      0 
    > 
    > # substr
    > start.time <- Sys.time()
    > test <- subset(I61, select = c("p324", "p327", "p3291", "p3294", "p3297", "p3281", "p3284", "p3287", "p3271", "p3274"))
    > res <- (substr(unlist(test), 1, 5) == "I25.2") | (substr(unlist(test), 1, 4) %in% c("I21.", "I22."))
    > dim(res) <- dim(test)
    > test$MI2 <- rowSums(res)
    > end.time <- Sys.time()
    > end.time - start.time
    Time difference of 3.473388 secs
    > table(test$MI2,exclude = NULL)
    
         0      1      2   <NA> 
    154897   2461     11 100664
    
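Update 2

The substr operation produced many NA values because my dataset contains NA values. I therefore ran the following code, after which all three operations gave consistent results: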

    library(dplyr)
    test %>% mutate_if(is.factor, as.character) -> test 
    test[is.na(test)]<-0
    
    > #=================================
    > # sapply & grepl()
    > end.time - start.time
    Time difference of 5.864876 secs
    > table(test$MI, exclude = NULL)
    
         0      1      2   <NA> 
    520339   3081     21      0 
    > #=================================
    > # grepl()
    > end.time - start.time
    Time difference of 17.20333 secs
    > table(test$MI1, exclude = NULL)
    
         0      1      2   <NA> 
    520339   3081     21      0 
    > #=================================
    > # substr
    > end.time - start.time
    Time difference of 4.386484 secs
    > table(test$MI2, exclude = NULL)
    
         0      1      2   <NA> 
    520339   3081     21      0 
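
The NA counts follow from how the three comparisons treat missing values: `==` propagates `NA`, whereas `%in%` and `grepl()` return `FALSE` for a missing input. A minimal sketch on a made-up vector (`x` is a hypothetical example):

```r
x <- c("I25.201", NA, "I21.620")

eq  <- substr(x, 1, 5) == "I25.2"                # TRUE  NA    FALSE
inn <- substr(x, 1, 4) %in% c("I21.", "I22.")    # FALSE FALSE TRUE
re  <- grepl("I25\\.2|I21\\.|I22\\.", x)         # TRUE  FALSE TRUE

# NA | FALSE is NA, so the missing entry survives the OR ...
combined <- eq | inn                             # TRUE  NA    TRUE

# ... and rowSums() returns NA for any record that contains it.
m <- matrix(combined, nrow = 1)
rowSums(m)                   # NA
rowSums(m, na.rm = TRUE)     # 2
```

Under these assumptions, recoding the missing values (as done above) is one way to get consistent results; passing `na.rm = TRUE` to `rowSums`, or using `%in%` for both comparisons, would be alternatives that leave the data unchanged.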
    
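Finally, I also ran a benchmark. It shows that the substr approach is slightly faster than sapply/grepl and clearly faster than the vectorized grepl on its own. Here are my code and results: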

    #--------------------------------
    SixHu <- function(df) { 
      MIchk <- function(x){
        grepl("I25\\.2|I21\\.|I22\\.",x)
      }
      test1 <- sapply(df,MIchk)
      rowSums(test1)
    }
    #--------------------------------
    # Vectorized grepl
    SixHuVec <- function(df) { 
      res <- grepl("I25\\.2|I21\\.|I22\\.", unlist(df))
      dim(res) <- dim(df) 
      rowSums(res)
    }
    #--------------------------------
    David <- function(df) { 
      tmp <- unlist(df)
      res <- (substr(tmp, 1, 5) == "I25.2") | (substr(tmp, 1, 4) %in% c("I21.", "I22."))
      dim(res) <- dim(df)
      rowSums(res)
    }
    > microbenchmark::microbenchmark(SixHu(test),
    +                                SixHuVec(test),
    +                                David(test))
    Unit: seconds
               expr       min        lq      mean    median        uq       max neval cld
        SixHu(test)  4.323772  4.598328  4.836165  4.760263  4.988194  5.801979   100  b 
     SixHuVec(test) 11.867062 12.826925 13.342357 13.243638 13.635339 18.705615   100   c
        David(test)  3.728264  4.180152  4.389600  4.344938  4.519908  6.396018   100 a  
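
Another candidate that could be added to such a benchmark is base R's `startsWith()`, which tests prefixes without regular expressions and without extracting substrings. The sketch below is not from the thread: the function name `PrefixVec` and the data frame `toy` are hypothetical, and no timing is claimed for it.

```r
# Hypothetical variant (not from the thread) using base R's startsWith().
PrefixVec <- function(df) {
  # startsWith() requires character input, so coerce factors explicitly.
  tmp <- as.character(unlist(df, use.names = FALSE))
  res <- startsWith(tmp, "I25.2") |
         startsWith(tmp, "I21.")  |
         startsWith(tmp, "I22.")
  dim(res) <- dim(df)
  rowSums(res, na.rm = TRUE)    # treat missing codes as non-matches
}

# Made-up data with one genuinely missing code.
toy <- data.frame(p324 = c("I24.001", "I25.201", "I21.620"),
                  p327 = c("I20.000", NA, "I22.000"),
                  stringsAsFactors = FALSE)
PrefixVec(toy)
```

Because `startsWith()` returns `NA` for a missing input, the sketch sums with `na.rm = TRUE` rather than recoding the data first.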
    