Splitting a single column into multiple observations using R


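I am working with HCUP data, where a single column holds a range of values that needs to be split into multiple rows (one observation per code). Here is the HCUP data frame for reference: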

code            label
61000-61003     excision of CNS
0169T-0169T     ventricular shunt
The desired output should be:

code            label
61000           excision of CNS
61001           excision of CNS
61002           excision of CNS
61003           excision of CNS
0169T           ventricular shunt
My approach to this problem is to use the splitstackshape package with the following code:

library(data.table)
library(splitstackshape)

cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]
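This approach runs into memory problems. Is there a better way to solve this?

Some comments:

  • The data contains many letters besides "T"
  • A letter can be at the front or at the end, but never between two digits
  • Within a single range the letter does not change (for example, from "T" to "U")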

    • Here is a solution using dplyr and all.is.numeric from Hmisc:

      library(dplyr)
      library(Hmisc)
      library(tidyr)
      dat %>% separate(code, into=c("code1", "code2")) %>%
              rowwise %>%
              mutate(lists = ifelse(all.is.numeric(c(code1, code2)),
                               list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))),
                               list(code1))) %>%
              unnest(lists) %>%
              select(code = lists, label)
      
      Source: local data frame [5 x 2]
      
         code             label
        (chr)            (fctr)
      1 61000   excision of CNS
      2 61001   excision of CNS
      3 61002   excision of CNS
      4 61003   excision of CNS
      5 0169T ventricular shunt
      
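      Edit to fix ranges with character values. Simplified a bit:

      library(stringr)  # provides str_pad(); extract_numeric() is from tidyr (deprecated in later tidyr releases)
      dff %>% mutate(row = row_number()) %>%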
              separate(code, into=c("code1", "code2")) %>%
              group_by(row) %>%
              summarise(lists = if(all.is.numeric(c(code1, code2)))
                                    {list(str_pad(as.character(
                                         seq(from = as.numeric(code1), to = as.numeric(code2))),
                                               nchar(code1), pad="0"))}
                                else if(grepl("^[0-9]", code1))
                                    {list(str_pad(paste0(as.character(
                                         seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                            strsplit(code1, "[0-9]+")[[1]][2]),
                                               nchar(code1), pad = "0"))}
                                else
                                    {list(paste0(
                                            strsplit(code1, "[0-9]+")[[1]],
                                            str_pad(as.character(
                                          seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                               nchar(gsub("[^0-9]", "", code1)), pad="0")))},
                         label = first(label)) %>%
              unnest(lists) %>%
              select(-row)
      Source: local data frame [15 x 2]
      
                     label lists
                     (chr) (chr)
      1    excision of CNS 61000
      2    excision of CNS 61001
      3    excision of CNS 61002
      4  ventricular shunt 0169T
      5  ventricular shunt 0170T
      6  ventricular shunt 0171T
      7    excision of CNS 01000
      8    excision of CNS 01001
      9    excision of CNS 01002
      10    some procedure A2543
      11    some procedure A2544
      12    some procedure A2545
      13    some procedure A0543
      14    some procedure A0544
      15    some procedure A0545
      
      Data:

      dff

      A less elegant way:

      # the data
      hcup <- data.frame(code=c("61000-61003", "0169T-0169T"),
                         label=c("excision of CNS", "ventricular shunt"), stringsAsFactors = F)
      hcup
      >         code             label
      >1 61000-61003   excision of CNS
      >2 0169T-0169T ventricular shunt
      
      # reshaping
      # split the code ranges into separate columns
      seq.ends <- cbind(do.call(rbind.data.frame, strsplit(hcup$code, "-")), hcup$label)
      # create a list with a data.frame for each original line
      new.list <- apply(seq.ends, 1, FUN=function(x){data.frame(code=if(grepl("\\d{5}", x[1])){
                           z<-x[1]:x[2]}else{z<-x[1]}, label=rep(x[3], length(z)),
                           stringsAsFactors = F)})
      # collapse the list into a df
      new.df <- do.call(rbind, lapply(new.list, data.frame, stringsAsFactors=F))
      
      new.df
      >     code             label
      >1.1 61000   excision of CNS
      >1.2 61001   excision of CNS
      >1.3 61002   excision of CNS
      >1.4 61003   excision of CNS
      >2   0169T ventricular shunt
      
      
      Original answer: see the update below.

      First, I added the first row to the bottom to make the example data more challenging:

      dff <- structure(list(code = c("61000-61003", "0169T-0169T", "61000-61003"
      ), label = c("excision of CNS", "ventricular shunt", "excision of CNS"
      )), .Names = c("code", "label"), row.names = c(NA, 3L), class = "data.frame")
      
      dff
      #          code             label
      # 1 61000-61003   excision of CNS
      # 2 0169T-0169T ventricular shunt
      # 3 61000-61003   excision of CNS
      
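      We try to apply the sequence operator : to each element returned by strsplit(). If x[1]:x[2] cannot be computed, only the values of those elements are returned; otherwise the sequence x[1]:x[2] is built. Then we simply replicate the values of the label column according to the lengths of the results in xx to get the new label column.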
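The code that originally defined xx is not shown here. A minimal sketch of the idea described above might look like the following; the helper name expand_range and the use of tryCatch are assumptions made for illustration, not necessarily what the answer used.

```r
# Example data as in the answer above (first row repeated at the bottom).
dff <- data.frame(
  code  = c("61000-61003", "0169T-0169T", "61000-61003"),
  label = c("excision of CNS", "ventricular shunt", "excision of CNS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: expand one split range with the sequence operator,
# falling back to the unique endpoints when they are not purely numeric.
expand_range <- function(x) {
  tryCatch(
    as.character(as.integer(x[1]):as.integer(x[2])),
    warning = function(w) unique(x),  # e.g. "NAs introduced by coercion"
    error   = function(e) unique(x)
  )
}

xx <- lapply(strsplit(dff$code, "-", fixed = TRUE), expand_range)

# Repeat each label once per expanded code.
out <- data.frame(
  code  = unlist(xx),
  label = rep(dff$label, lengths(xx)),
  stringsAsFactors = FALSE
)
print(out)
```

Note the limits of this sketch: ranges whose endpoints contain letters are not expanded (only the distinct endpoints are kept), and leading zeros in purely numeric codes are dropped by the integer conversion.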


      Update: here is my response to your edit. Replace the xx above with

      xx <- lapply(strsplit(dff$code, "-", TRUE), function(x) {
          s <- stringi::stri_locate_first_regex(x, "[A-Z]")
          nc <- nchar(x)[1L]
          fmt <- function(n) paste0("%0", n, "d")
          if(!all(is.na(s))) {
              ss <- s[1,1]
              fmt <- fmt(nc-1)
              if(ss == 1L) {
                  xx <- substr(x, 2, nc)
                  paste0(substr(x, 1, 1), sprintf(fmt, xx[1]:xx[2]))
              } else {
                  xx <- substr(x, 1, ss-1)
                  paste0(sprintf(fmt, xx[1]:xx[2]), substr(x, nc, nc))
              }
          } else {
              sprintf(fmt(nc), x[1]:x[2])
          }
      })
      
      Then, running the code above that uses xx, we get the following result:

      data.frame(code = unlist(xx), label = rep(df2$label, lengths(xx)))
      #     code             label
      # 1  61000   excision of CNS
      # 2  61001   excision of CNS
      # 3  61002   excision of CNS
      # 4  61003   excision of CNS
      # 5  0169T ventricular shunt
      # 6  0170T ventricular shunt
      # 7  0171T ventricular shunt
      # 8  0172T ventricular shunt
      # 9  0173T ventricular shunt
      # 10 0174T ventricular shunt
      # 11 61000   excision of CNS
      # 12 61001   excision of CNS
      # 13 61002   excision of CNS
      # 14 61003   excision of CNS
      # 15 T0169 ventricular shunt
      # 16 T0170 ventricular shunt
      # 17 T0171 ventricular shunt
      # 18 T0172 ventricular shunt
      # 19 T0173 ventricular shunt
      # 20 T0174 ventricular shunt
      

      Create a sequencing rule for codes like these:

      seq_code <- function(from,to){
      
          ext = function(x, part) gsub("([^0-9]?)([0-9]*)([^0-9]?)", paste0("\\",part), x)
      
          pre = unique(sapply(list(from,to), ext, part = 1 ))
          suf = unique(sapply(list(from,to), ext, part = 3 ))
      
          if (length(pre) > 1 | length(suf) > 1){
              return("NO!")
          }
      
          num = do.call(seq, lapply(list(from,to), function(x) as.integer(ext(x, part = 2))))
          len = nchar(from)-nchar(pre)-nchar(suf)
      
          paste0(pre, sprintf(paste0("%0",len,"d"), num), suf)
      
      }
      

      Applying seq_code to each row with data.table (the call appears further below) gives:

          row             label  code
       1:   1   excision of CNS 61000
       2:   1   excision of CNS 61001
       3:   1   excision of CNS 61002
       4:   2 ventricular shunt 0169T
       5:   2 ventricular shunt 0170T
       6:   2 ventricular shunt 0171T
       7:   3   excision of CNS 01000
       8:   3   excision of CNS 01001
       9:   3   excision of CNS 01002
      10:   4    some procedure A2543
      11:   4    some procedure A2544
      12:   4    some procedure A2545
      13:   5    some procedure A0543
      14:   5    some procedure A0544
      15:   5    some procedure A0545
      

      Data copied from @jeremycg's answer:

      dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", 
      "A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", 
      "excision of CNS", "some procedure", "some procedure")), .Names = c("code", 
      "label"), row.names = c(NA, 5L), class = "data.frame")
      

      dff

      If you are patient enough, you could probably parse the strings into separate pieces instead of using the eval/parse trick. Alas, I am not, so:

      fancy.seq = function(x) eval(parse(text=sub(', \\)', ')', sub('\\(, ', '(',
                     sub('.*?([0-9]+)(.*)-(.*?)([1-9][0-9]*).*',
                         'paste0("\\3",
                                 formatC(\\1:\\4, width=log10(\\4)+1, format="d", flag="0"),
                                 "\\2")',
                         x)))))
      # using example from jeremycg's answer; dt is assumed to be that data
      # as a data.table, e.g. dt <- as.data.table(dff)
      dt[, .(fancy.seq(code), label), by = 1:nrow(dt)]
      #    nrow    V1             label
      # 1:    1 61000   excision of CNS
      # 2:    1 61001   excision of CNS
      # 3:    1 61002   excision of CNS
      # 4:    2 0169T ventricular shunt
      # 5:    2 0170T ventricular shunt
      # 6:    2 0171T ventricular shunt
      # 7:    3 01000   excision of CNS
      # 8:    3 01001   excision of CNS
      # 9:    3 01002   excision of CNS
      #10:    4 A2543    some procedure
      #11:    4 A2544    some procedure
      #12:    4 A2545    some procedure
      #13:    5 A0543    some procedure
      #14:    5 A0544    some procedure
      #15:    5 A0545    some procedure
      

      If it is unclear what the above does, just run the sub commands one at a time on a single "code" string.

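      Hmmm, I am not very experienced with data.table, but I cannot see how your approach would work: code_1 (shouldn't it be code_1?) and code_2 have to be numeric if you want to build a sequence, e.g. hcup

      Thanks. I have accepted the edit. I am not particular about splitstackshape itself. Is it possible to write a function that can handle this?

      This may help, from the splitstackshape documentation: if you know that all values in the column will have the same number of values per row after being split, you should use the cSplit_f function instead, which uses fread instead of strsplit and is generally faster.

      So maybe you can give us some more information. Is the letter always T? Is it always at the end of the string?

      Second-guessing the question, I think expanding the data frame may not be what you ultimately want to do. Splitting the code column into begin and end, and storing code.prefix and code.sufix, seems like it would make matching much simpler, which is probably the use case here.

      This looks good. But it ignores codes like "0169T" in the final output.

      This solution is very close, but it still ignores codes where the letter comes first. For example, the code "A4245" is not added to the final data frame.

      This works great. But the input data has codes like "0005T-0006T". In that case only 0005T is flagged in the final output, and code 0006T is missing. Sorry, the data set is so large that I missed it. Yes, I would like both codes to be included in the final output.

      Not sure whether your example is feasible. I would guess each label appears only once in the original data.

      \\2:\\4

      Awesome!

      The data.table call that applies seq_code (defined above) to each row:

      setDT(dff)[,.(
        label = label[1], 
        code  = do.call(seq_code, tstrsplit(code,'-'))
      ), by=.(row=seq(nrow(dff)))]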
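One of the comments above suggests keeping the ranges instead of expanding them, by storing begin, end, code.prefix and code.sufix. A rough sketch of that idea in base R could look like this; the helper names (code_prefix, code_suffix, code_number, lookup_label) and the three-row sample data are assumptions for illustration, and the sketch relies on the stated rule that letters appear only before or after the digits.

```r
# Illustrative data in the same shape as dff above.
dff <- data.frame(
  code  = c("61000-61002", "0169T-0171T", "A2543-A2545"),
  label = c("excision of CNS", "ventricular shunt", "some procedure"),
  stringsAsFactors = FALSE
)

# Hypothetical helpers: letters before the digits, letters after the
# digits, and the digits themselves as an integer.
code_prefix <- function(x) sub("^([^0-9]*).*$", "\\1", x)
code_suffix <- function(x) sub("^.*[0-9]([^0-9]*)$", "\\1", x)
code_number <- function(x) as.integer(gsub("[^0-9]", "", x))

# Split each range into its two endpoints.
ends <- strsplit(dff$code, "-", fixed = TRUE)
from <- vapply(ends, `[`, character(1), 1)
to   <- vapply(ends, `[`, character(1), 2)

# One row per range; nothing is expanded.
ranges <- data.frame(
  code.prefix = code_prefix(from),
  code.sufix  = code_suffix(from),
  begin       = code_number(from),
  end         = code_number(to),
  label       = dff$label,
  stringsAsFactors = FALSE
)

# Return the label(s) of every range that contains a single code.
lookup_label <- function(code) {
  hit <- ranges$code.prefix == code_prefix(code) &
         ranges$code.sufix  == code_suffix(code) &
         ranges$begin <= code_number(code) &
         code_number(code) <= ranges$end
  ranges$label[hit]
}

print(ranges)
print(lookup_label("0170T"))
```

This keeps memory use proportional to the number of ranges rather than the number of codes. It ignores the zero-padded width of a code, so "0170T" and "170T" would match the same range; storing the width as an extra column would be one way to tighten that.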
      