R: search multiple columns for values that start with any of several strings; count the matches

Tags: r, regex, count

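I have a dataset with 2 million observations. I need to search up to 50 character columns to count (and afterwards filter) which observations start with any of up to 20 strings.

I have written code that returns a count of how often each string is found, but it is too slow. Running it on 100k observations (9 columns, 33 search strings) takes 2 minutes, and it appears to scale linearly (meaning ≈30 minutes for the full dataset). I can do this in SAS in a few seconds, and I am running on a fast laptop with an SSD, so I assume my code is the problem (not the machine or the problem itself).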

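Note: I convert the search strings to a vector so that I can pass it to apply(), which gives a 3x speedup (compared with nested sapply() calls). I tried nested apply() statements, but that gave no speedup. As part of the regex syntax, I also prefix each search string with ^ to restrict the search to the start of the string. I am open to a completely different approach, but I must be able to search the start of the strings in multiple columns, using multiple search strings, and return a count for each search string.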
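The question's example data is not shown here. As a minimal sketch for trying out the answers, comparable data could be generated as follows. The names df_to_search and search_strings match the code used later in the thread; the sizes, the seed, and the value format (two characters followed by three digits) are assumptions made only for illustration.

```r
# Hypothetical stand-in for the question's data (sizes and format are assumed)
set.seed(42)                          # arbitrary seed, only for reproducibility
n_rows <- 1000L
n_cols <- 9L
prefix_chars <- c(LETTERS[1:10], 0:9) # "A".."J" and "0".."9", all as character

# Build one column of codes such as "AB123" or "G6871"
make_column <- function(n) {
  paste0(
    sample(prefix_chars, n, replace = TRUE),
    sample(prefix_chars, n, replace = TRUE),
    sample(100:999, n, replace = TRUE)
  )
}

cols <- replicate(n_cols, make_column(n_rows), simplify = FALSE)
names(cols) <- paste0("col", seq_len(n_cols))
df_to_search <- data.frame(cols, stringsAsFactors = FALSE)

# Regex patterns anchored at the start of the string, as described in the note
search_strings <- paste0("^", c("AB", "BC", "CD", "DE", "EF"))
```

With stringsAsFactors = FALSE the columns are plain character vectors, so no factor-to-character conversion is needed before searching.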

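Edit/Update: These solutions are much faster than mine. Thank you! Unfortunately, my example search strings were (unintentionally) misleading. Apologies. My actual search strings vary in length from 2 to 5 characters and are sometimes all digits. I should have used something more like: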

search_strings <- c("64651","BC","654","DEF","EF","G6","F8","25","I9","J7")

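Since you don't really care about the counts per column, one trick is to unlist() your data.frame. This produces a single vector of all the values. On that vector you can use stringr::str_count to count whether each pattern occurs, and then sum the counts. In short, all the "hard" steps are vectorized, and you only need to "loop" over the entries in search_strings.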

sapply(search_strings, function(i) sum(stringr::str_count(unlist(df_to_search), i)))

# ^AB ^BC ^CD ^DE ^EF ^G6 ^F8 ^H1 ^I9 ^J7 
# 394 392 387 389 359 417 397 780 378 382
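
To make the mechanics concrete, here is a small sketch of the same idea on toy data. It swaps in base R's grepl() for stringr::str_count (for a pattern anchored with ^, each value can match at most once, so the two give the same totals). The data frame toy and its values are hypothetical.

```r
# Toy data (hypothetical values): two character columns
toy <- data.frame(
  a = c("AB1", "BC2", "AB3"),
  b = c("XAB", "AB9", "BC0"),
  stringsAsFactors = FALSE
)

# Flatten all columns into one character vector
all_values <- unlist(toy, use.names = FALSE)

# Patterns anchored at the start of the string
patterns <- c("^AB", "^BC")

# grepl() is TRUE where a value starts with the pattern; sum() counts the TRUEs
counts <- vapply(patterns, function(p) sum(grepl(p, all_values)), integer(1L))
counts
# ^AB ^BC
#   3   2
```

Note that "XAB" is not counted for ^AB, because the anchor restricts the match to the start of the value.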

Edit: a fully vectorized approach, about 4-5x faster than the sapply() version

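The approach can be fully vectorized by pasting all values into one single string, with each entry preceded by a dummy separator such as --. Searching for the separator followed by a search string then matches only at the start of an entry.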
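
A toy illustration of the separator trick, written in base R with gregexpr() rather than stringr (a swapped-in technique, not the answer's own code). The helper count_fixed and the sample values are hypothetical.

```r
# Toy values (hypothetical)
values <- c("AB1", "BC2", "AB3", "XAB")

# Prefix every value with the separator and join everything into one string
joined <- paste0("--", values, collapse = "")  # "--AB1--BC2--AB3--XAB"

# Count non-overlapping literal occurrences of needle in haystack
count_fixed <- function(needle, haystack) {
  hits <- gregexpr(needle, haystack, fixed = TRUE)[[1]]
  if (hits[1] == -1L) 0L else length(hits)     # gregexpr() returns -1 for no match
}

prefixes <- c("AB", "BC", "ZZ")
counts <- vapply(paste0("--", prefixes), count_fixed, integer(1L),
                 haystack = joined)
names(counts) <- prefixes
counts
# AB BC ZZ
#  2  1  0
```

The trick relies on the separator never appearing inside the data. If a value could itself contain --, a different separator would be needed, otherwise matches in the middle of a value would be counted as well.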

Update: Simply use table() on the result of substr(). It is easy to read, and it is fast.

starts <- c("AB","BC","CD","DE","EF","G6","F8","H1","I9","J7")
table(substr(unlist(df_to_search, use.names = FALSE), 1, 2))[starts]
## 
##  AB  BC  CD  DE  EF  G6  F8  H1  I9  J7 
## 394 392 387 389 359 417 397 780 378 382 

system.time(table(substr(unlist(df_to_search, use.names = FALSE), 1, 2))[starts])
##    user  system elapsed 
##   0.105   0.000   0.105 
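
One caveat worth sketching: the table() approach assumes that every prefix has the same width (two characters here), and a prefix that never occurs in the data has no entry in the table, so indexing by it typically yields NA rather than 0. A possible workaround, shown on hypothetical toy values, is to turn the prefixes into a factor whose levels are the wanted starts:

```r
# Toy values (hypothetical)
values <- c("AB1", "BC2", "AB3", "XAB")
starts <- c("AB", "BC", "ZZ")   # "ZZ" does not occur in the data

# factor() with explicit levels keeps "ZZ" as a level with count 0
# and drops prefixes that are not of interest (such as "XA")
counts <- table(factor(substr(values, 1, 2), levels = starts))
counts
#
# AB BC ZZ
#  2  1  0
```

For prefixes of differing widths this fixed-width trick does not apply directly; a startsWith()-based count handles that case.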

Original answer: I would do it like this:

  • Convert the data.frame values from factor to character
  • Use startsWith(), introduced in R 3.3
  • Performance is quite fast

    # vector of starts you want to check
    starts <- c("AB","BC","CD","DE","EF","G6","F8","H1","I9","J7")
    
    # converting the data.frame to character
    df_to_search[] <- lapply(df_to_search, as.character)
    
    # searching and tabulating
    colSums(vapply(starts, function(x) {
      vapply(df_to_search, function(y) sum(startsWith(y, x)), integer(1L))
    }, integer(ncol(df_to_search))))
    #  AB  BC  CD  DE  EF  G6  F8  H1  I9  J7 
    # 394 392 387 389 359 417 397 780 378 382
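
Because startsWith() compares each value against the full prefix, the same pattern should also cover the variable-length search strings from the question's update, and it can be reused for the filtering step. A minimal sketch, assuming a hypothetical vector of already-unlisted character values:

```r
# Toy values (hypothetical); search strings taken from the question's update
values <- c("64651X", "6540", "BC12", "DEF9", "EF77", "25A", "2B5")
search_strings <- c("64651", "BC", "654", "DEF", "EF", "G6", "F8", "25", "I9", "J7")

# Count how many values start with each search string (no regex, no "^" needed)
counts <- vapply(search_strings,
                 function(p) sum(startsWith(values, p)),
                 integer(1L))

# Filter: keep values that start with at least one of the search strings
matches_any <- Reduce(`|`, lapply(search_strings,
                                  function(p) startsWith(values, p)))
values[matches_any]
# "64651X" "6540" "BC12" "DEF9" "EF77" "25A"
```

Note that startsWith() works on literal strings, so all-digit prefixes and characters that are special in regular expressions need no escaping.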
    
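    # Fully vectorized approach: paste all values into one string, each entry
    # preceded by the separator "--", then count "--<search string>" in it
    search_strings <- c("AB","BC","CD","DE","EF","G6","F8","H1","I9","J7")
    vec2 <- function() {
        # prepend the separator to every search string
        edited <- paste0("--", search_strings)
        # one long string: "--value1--value2--..."
        vec_to_search <- paste0(paste0("--", unlist(df_to_search)), collapse="")
        # str_count() is vectorized over the patterns
        result <- stringr::str_count(vec_to_search, edited)
        names(result) <- search_strings
        return(result)
    }
    vec2()
    #  AB  BC  CD  DE  EF  G6  F8  H1  I9  J7
    # 394 392 387 389 359 417 397 780 378 382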
    
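    # Variant of the table() approach: build a factor of the two-character
    # prefixes and count its levels with tabulate()
    x <- factor(substr(unlist(df_to_search, use.names = FALSE), 1, 2))
    setNames(tabulate(x), levels(x))[starts]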
    
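    # Timing comparison: each approach is wrapped in a function and followed by
    # the timing reported for it (in the format printed by system.time())
    starts <- c("AB","BC","CD","DE","EF","G6","F8","H1","I9","J7")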
    df_to_search[] <- lapply(df_to_search, as.character)
    myfun <- function() {
      colSums(vapply(starts, function(x) {
        vapply(df_to_search, function(y) sum(startsWith(y, x)), integer(1L))
      }, integer(ncol(df_to_search))))
    } 
    #  user  system elapsed 
    # 0.199   0.000   0.199 
    
    myfun_unlist <- function() {
      temp <- unlist(df_to_search, use.names = FALSE)
      vapply(starts, function(x) sum(startsWith(temp, x)), integer(1L))
    }
    #  user  system elapsed 
    # 0.245   0.000   0.245 
    
    cPakfun <- function() {
      sapply(search_strings, function(i) sum(stringr::str_count(unlist(df_to_search), i)))
    }
    #  user  system elapsed 
    # 5.614   0.000   5.613 
    
    cPakfun2 <- function() {
      edited <- paste0("--", starts)
      vec_to_search <- paste0(paste0("--", unlist(df_to_search)), collapse="")
      result <- stringr::str_count(vec_to_search, edited)
      names(result) <- starts
      return(result)
    }
    #  user  system elapsed 
    # 0.902   0.000   0.901 
    
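    # The question's original approach, for comparison.
    # It needs stringr for str_detect() and magrittr for %>%.
    library(stringr)
    library(magrittr)
    opfun <- function() {
      sapply(search_strings, function(y)
        apply(df_to_search, 1, function(x) {
          str_detect(x, y)
        })) %>% colSums()
    }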
    #   user  system elapsed 
    # 44.988   0.000  45.078 
    
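    library(microbenchmark)
    
    ## Add tabulate to the options
    myfun_tabulate <- function() {
      df_to_search[] <- lapply(df_to_search, as.character)
      x <- factor(substr(unlist(df_to_search, use.names = FALSE), 1, 2))
      setNames(tabulate(x), levels(x))[starts]
    }
    
    ## myfun_table() is benchmarked below but not defined in the original;
    ## assumed here to wrap the table() approach shown earlier
    myfun_table <- function() {
      table(substr(unlist(df_to_search, use.names = FALSE), 1, 2))[starts]
    }
    
    res <- microbenchmark(myfun_tabulate(), myfun_table(), myfun(), myfun_unlist(), cPakfun2())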
    # Unit: milliseconds
    #              expr       min       lq     mean   median       uq       max neval
    #  myfun_tabulate()  90.19794 100.2941 120.5411 102.7271 153.4527  238.6175   100
    #     myfun_table()  96.87556 110.1965 146.5356 154.3941 168.2660  562.4599   100
    #           myfun() 125.68799 127.8053 162.0679 130.0665 182.7757  577.3027   100
    #    myfun_unlist() 136.92772 138.4104 170.4002 140.0188 198.8845  613.7919   100
    #        cPakfun2() 859.22835 911.5291 940.6695 935.6335 955.3801 1154.5395   100
    
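    # autoplot() is a ggplot2 generic; microbenchmark supplies the method for its results
    library(ggplot2)
    autoplot(res, log = FALSE)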