Split a string into an unknown number of new data frame columns

Tags: regex, r, string

I have a data frame with a character column that holds email metadata as multiple strings, separated by the newline character "\n":

  person                                                                                                                                                 myString
1   John                                                                                                            To name5@email.com by sender6 on 01-12-2014\n
2   Jane To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n
3    Tim                                                                To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n
I would like to split the individual substrings of myString into separate columns, so that the result looks like this:

  person                                                     email1                                      email2                                        email3
1   John                To name5@email.com by sender6 on 01-12-2014                                        <NA>                                          <NA>
2   Jane To name@email.com,name4@email.com by sender1 on 01-22-2014 To name3@email.com by sender2 on 02-03-2014 To email5@domain.com by sender1 on 06-21-2014
3    Tim                To name2@email.com by sender2 on 05-11-2014  To name@email.com by sender2 on 06-03-2015                                          <NA>
With the approach I currently use, however, I have to specify manually that there are three columns to extract.

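I would like to improve this process with one or both of the following:

  • A way to automatically count the maximum number of occurrences of the delimiter (i.e., how many new variables are needed)
  • A different method for splitting into an unknown number of columns

A good solution that returns the data in long form rather than wide form would also be welcome.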

Sample data:

    df <- structure(list(person = c("John", "Jane", "Tim"), myString = c("To name5@email.com by sender6 on 01-12-2014\n", 
        "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n", 
        "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
        )), .Names = c("person", "myString"), row.names = c(NA, -3L), class = "data.frame")
    

This seems a bit hacky, but here you go.

Use strsplit to split the character vector. Take the maximum length of the pieces and use it as the number of new columns.

    df <- data.frame(
      person = c("John", "Jane", "Tim"),
      myString = c("To name5@email.com by sender6 on 01-12-2014\n",
                   "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
                   "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
      ), stringsAsFactors=FALSE
    )
    
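    # Split each string on the newline delimiter (one character vector per row)
    a <- strsplit(df$myString, "\n")
    # The longest split determines how many new columns are needed
    max_len <- max(sapply(a, length))
    for (i in seq_len(max_len)) {
      # "[" with an out-of-range index returns NA, which pads the shorter rows
      df[, paste0("email", i)] <- sapply(a, "[", i)
    }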
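
A loop-free variant of the same idea can be sketched as follows. This is only an illustration under assumptions: the short strings are a simplified stand-in for the question's data, and the names parts, padded and wide are hypothetical. Each split vector is padded to the maximum length and the results are bound into a matrix.

```r
# Simplified stand-in for the question's data (hypothetical short strings)
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c("a1\n", "b1\nb2\nb3\n", "c1\nc2\n"),
  stringsAsFactors = FALSE
)

# Split on the literal newline; strsplit drops the trailing empty piece
parts <- strsplit(df$myString, "\n", fixed = TRUE)

# Number of new columns = longest split
max_len <- max(lengths(parts))

# Index every vector up to max_len (out-of-range positions become NA),
# then stack the padded vectors row by row into a character matrix
padded <- do.call(rbind, lapply(parts, "[", seq_len(max_len)))
colnames(padded) <- paste0("email", seq_len(max_len))

wide <- data.frame(person = df$person, padded, stringsAsFactors = FALSE)
```

Building the matrix with rbind keeps the result two-dimensional even when max_len is 1, which a transposed sapply result would not guarantee.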
    

Here is an efficient route to long form:

    a <- strsplit(df$myString, "\n")
    lens <- vapply(a, length, integer(1L)) # or lengths(a) in R >= 3.2
    # Repeat each row once per piece, then attach the pieces themselves
    longdf <- df[rep(seq_along(a), lens),]
    longdf$string <- unlist(a)
    
Then, if it is really necessary, go to wide form:

    longdf$myString <- NULL
    longdf$token <- sequence(lens)
    widedf <- reshape(longdf, timevar="token", idvar="person", direction="wide")
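
Putting the long and wide steps together on simplified data shows what reshape returns. This is a sketch under assumptions: the short strings are a hypothetical stand-in for the question's data, and the final renaming step is an addition that is not part of the answer. reshape names the new columns after the value column and the token number (string.1, string.2, ...), so a sub call can map them onto the email1, email2, ... names from the question.

```r
# Simplified stand-in for the question's data (hypothetical short strings)
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c("a1\n", "b1\nb2\nb3\n", "c1\nc2\n"),
  stringsAsFactors = FALSE
)

# Long form: one row per piece
a <- strsplit(df$myString, "\n")
lens <- lengths(a)
longdf <- df[rep(seq_along(a), lens), ]
longdf$string <- unlist(a)

# Wide form: number the pieces within each person, then reshape
longdf$myString <- NULL
longdf$token <- sequence(lens)
widedf <- reshape(longdf, timevar = "token", idvar = "person", direction = "wide")

# Optional: rename string.1, string.2, ... to email1, email2, ...
names(widedf) <- sub("^string\\.", "email", names(widedf))
```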
    
This may be sufficient:

    library(data.table)
    dt = as.data.table(df) # or setDT to convert in place
    
    dt[, strsplit(myString, split = "\n"), by = person]
    #   person                                                         V1
    #1:   John                To name5@email.com by sender6 on 01-12-2014
    #2:   Jane To name@email.com,name4@email.com by sender1 on 01-22-2014
    #3:   Jane                To name3@email.com by sender2 on 02-03-2014
    #4:   Jane              To email5@domain.com by sender1 on 06-21-2014
    #5:    Tim                To name2@email.com by sender2 on 05-11-2014
    #6:    Tim                 To name@email.com by sender2 on 06-03-2015
    
The result can then easily be converted to wide format:

    dcast(dt[, strsplit(myString, split = "\n"), by = person][, idx := 1:.N, by = person],
          person ~ idx, value.var = 'V1')
    #   person                                                          1                                           2                                             3
    #1:   Jane To name@email.com,name4@email.com by sender1 on 01-22-2014 To name3@email.com by sender2 on 02-03-2014 To email5@domain.com by sender1 on 06-21-2014
    #2:   John                To name5@email.com by sender6 on 01-12-2014                                          NA                                            NA
    #3:    Tim                To name2@email.com by sender2 on 05-11-2014  To name@email.com by sender2 on 06-03-2015                                            NA
    
    # (load reshape2 and use dcast.data.table instead of dcast if not using 1.9.5+)
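
If only the wide result is needed, data.table's tstrsplit may be a shorter route. This is a swapped-in technique rather than the method of the answer above, and the short strings and the names parts and cols are hypothetical. tstrsplit splits every string and transposes the result, so it returns one vector per new column, padded with NA.

```r
library(data.table)

# Simplified stand-in for the question's data (hypothetical short strings)
dt <- data.table(
  person = c("John", "Jane", "Tim"),
  myString = c("a1\n", "b1\nb2\nb3\n", "c1\nc2\n")
)

# One list element per output column; shorter rows are filled with NA
parts <- tstrsplit(dt$myString, "\n", fixed = TRUE)

# Name the columns after however many pieces were found, then add them by reference
cols <- paste0("email", seq_along(parts))
dt[, (cols) := parts]
```

The number of columns is taken from length(parts), so nothing has to be specified by hand.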
    

I would suggest cSplit from my "splitstackshape" package:

    library(splitstackshape)
    cSplit(df, "myString", "\n")
    #    person                                                 myString_1
    # 1:   John                To name5@email.com by sender6 on 01-12-2014
    # 2:   Jane To name@email.com,name4@email.com by sender1 on 01-22-2014
    # 3:    Tim                To name2@email.com by sender2 on 05-11-2014
    #                                     myString_2
    # 1:                                          NA
    # 2: To name3@email.com by sender2 on 02-03-2014
    # 3:  To name@email.com by sender2 on 06-03-2015
    #                                       myString_3
    # 1:                                            NA
    # 2: To email5@domain.com by sender1 on 06-21-2014
    # 3:                                            NA

You could also try stri_split_fixed from the "stringi" package with the argument simplify = TRUE (although, for the sample data, this adds an extra empty column at the end). The approach would be something like:

    library(stringi)
    data.frame(person = df$person,
               stri_split_fixed(df$myString, "\n",
                                simplify = TRUE))

@SamFirke you don't need reshape2 with the latest version of data.table (I edited the note at the end to make that clearer).

Thanks for the clarification. I have the latest CRAN version of data.table, but I see that it is 1.9.4 and that 1.9.5 is the current development version on GitHub.

This works great for me, and it is the simplest function I have seen for this problem. It looks like there is other good stuff in the package too. Thanks!

@SamFirke, thanks. I hope you noticed that cSplit also has a "direction" argument, which you can set to "long" if you want the long form.

Just stopping by to say thanks. cSplit is a thing of beauty!
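
The extra empty column noted for the stringi approach comes from the trailing "\n" in each string: unlike strsplit, stri_split_fixed keeps the empty piece after the last delimiter. A possible way around it, sketched on hypothetical short strings, is to combine omit_empty = TRUE with simplify = NA, so that empty pieces are dropped and short rows are padded with NA rather than "". The names parts and wide are illustrative only.

```r
library(stringi)

# Simplified stand-in for the question's data (hypothetical short strings)
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c("a1\n", "b1\nb2\nb3\n", "c1\nc2\n"),
  stringsAsFactors = FALSE
)

# omit_empty = TRUE drops the empty token produced by the trailing "\n";
# simplify = NA returns a character matrix padded with NA instead of ""
parts <- stri_split_fixed(df$myString, "\n", omit_empty = TRUE, simplify = NA)
colnames(parts) <- paste0("email", seq_len(ncol(parts)))

wide <- data.frame(person = df$person, parts, stringsAsFactors = FALSE)
```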