Regex to split a string into an unknown number of new data frame columns
I have a data frame with a character column that holds email metadata as multiple strings, separated by the newline character \n:
person myString
1 John To name5@email.com by sender6 on 01-12-2014\n
2 Jane To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n
3 Tim To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n
I would like to split the different substrings of myString into separate columns, so that it looks like this:
person email1 email2 email3
1 John To name5@email.com by sender6 on 01-12-2014 <NA> <NA>
2 Jane To name@email.com,name4@email.com by sender1 on 01-22-2014 To name3@email.com by sender2 on 02-03-2014 To email5@domain.com by sender1 on 06-21-2014
3 Tim To name2@email.com by sender2 on 05-11-2014 To name@email.com by sender2 on 06-03-2015 <NA>
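With my current approach, however, I have to specify by hand that there are three columns to extract.
I would like to improve this process in one or both of the following ways: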
df <- structure(list(person = c("John", "Jane", "Tim"), myString = c("To name5@email.com by sender6 on 01-12-2014\n",
"To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
"To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
)), .Names = c("person", "myString"), row.names = c(NA, -3L), class = "data.frame")
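This seems a bit hacky, but here you go.
Use strsplit to split the character vector, take the maximum length, and use that as the number of columns.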
df <- data.frame(
person = c("John", "Jane", "Tim"),
myString = c("To name5@email.com by sender6 on 01-12-2014\n",
"To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
"To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
), stringsAsFactors=FALSE
)
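a <- strsplit(df$myString, "\n")    # one character vector of email strings per row
max_len <- max(sapply(a, length))   # largest number of emails found in any row
# Indexing past the end of a vector returns NA, which fills the shorter rows
for (i in seq_len(max_len)) {
  df[, paste0("email", i)] <- sapply(a, "[", i)
}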
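The same idea can be written without an explicit loop. What follows is only a sketch in base R, not part of the answer above: it relies on the fact that indexing a character vector past its end returns NA, so each split result can be padded to the maximum length and the pieces row-bound into a matrix. The names parts, n_max, padded and wide are illustrative.

```r
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c("To name5@email.com by sender6 on 01-12-2014\n",
               "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
               "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"),
  stringsAsFactors = FALSE
)

# Split every row on the newline; strsplit drops the empty piece after the final "\n"
parts <- strsplit(df$myString, "\n", fixed = TRUE)

# The number of columns is taken from the data, not typed in by hand
n_max <- max(sapply(parts, length))

# Pad each vector to n_max (out-of-range indices give NA) and stack the rows
padded <- do.call(rbind, lapply(parts, function(p) p[seq_len(n_max)]))
colnames(padded) <- paste0("email", seq_len(n_max))

wide <- cbind(df["person"], as.data.frame(padded, stringsAsFactors = FALSE))
```

Because n_max is computed from the data, the number of email columns never has to be specified manually.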
Here is an efficient route to long form:
a <- strsplit(df$myString, "\n")
lens <- vapply(a, length, integer(1L)) # or lengths(a) in R 3.2
longdf <- df[rep(seq_along(a), lens),]
longdf$string <- unlist(a)
Then, if it is really necessary, go to wide form:
longdf$myString <- NULL
longdf$token <- sequence(lens)
widedf <- reshape(longdf, timevar="token", idvar="person", direction="wide")
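reshape() builds the new column names from the value column and the token, so the wide result above should come out with columns string.1, string.2 and string.3 rather than the email1, email2, email3 asked for in the question. A small sketch of a renaming step follows, rebuilt from the sample data so that it runs on its own; the sub() pattern and the row name reset are my own additions, not part of the answer.

```r
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c("To name5@email.com by sender6 on 01-12-2014\n",
               "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
               "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"),
  stringsAsFactors = FALSE
)

# Long form: one row per email string, with a running index per person
a <- strsplit(df$myString, "\n")
lens <- vapply(a, length, integer(1L))
longdf <- df[rep(seq_along(a), lens), ]
longdf$string <- unlist(a)
longdf$myString <- NULL
longdf$token <- sequence(lens)

# Wide form: reshape() names the value columns string.1, string.2, ...
widedf <- reshape(longdf, timevar = "token", idvar = "person", direction = "wide")

# Rename to the email1, email2, ... layout from the question
names(widedf) <- sub("^string\\.", "email", names(widedf))
rownames(widedf) <- NULL
```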
This may be sufficient:
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
dt[, strsplit(myString, split = "\n"), by = person]
# person V1
#1: John To name5@email.com by sender6 on 01-12-2014
#2: Jane To name@email.com,name4@email.com by sender1 on 01-22-2014
#3: Jane To name3@email.com by sender2 on 02-03-2014
#4: Jane To email5@domain.com by sender1 on 06-21-2014
#5: Tim To name2@email.com by sender2 on 05-11-2014
#6: Tim To name@email.com by sender2 on 06-03-2015
which can then easily be converted to wide format:
dcast(dt[, strsplit(myString, split = "\n"), by = person][, idx := 1:.N, by = person],
person ~ idx, value.var = 'V1')
# person 1 2 3
#1: Jane To name@email.com,name4@email.com by sender1 on 01-22-2014 To name3@email.com by sender2 on 02-03-2014 To email5@domain.com by sender1 on 06-21-2014
#2: John To name5@email.com by sender6 on 01-12-2014 NA NA
#3: Tim To name2@email.com by sender2 on 05-11-2014 To name@email.com by sender2 on 06-03-2015 NA
# (load reshape2 and use dcast.data.table instead of dcast if not using 1.9.5+)
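If the long table is not needed as an intermediate step, data.table also provides tstrsplit(), which splits and transposes in one call and pads short rows with NA. The following is a hedged sketch rather than part of the answer above; it assumes a data.table version that includes tstrsplit(), and the email column names and the parts and cols variables are illustrative.

```r
library(data.table)

dt <- data.table(
  person = c("John", "Jane", "Tim"),
  myString = c("To name5@email.com by sender6 on 01-12-2014\n",
               "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
               "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n")
)

# tstrsplit() returns one vector per output column; its length is the
# largest number of pieces found in any row
parts <- tstrsplit(dt$myString, "\n", fixed = TRUE)

# Column names are derived from the number of pieces, not hard-coded
cols <- paste0("email", seq_along(parts))
dt[, (cols) := parts]
dt[, myString := NULL]
```

Unlike the dcast route, this keeps the rows in their original order and does not sort by person.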
I would suggest cSplit from my "splitstackshape" package:
library(splitstackshape)
cSplit(df, "myString", "\n")
# person myString_1
# 1: John To name5@email.com by sender6 on 01-12-2014
# 2: Jane To name@email.com,name4@email.com by sender1 on 01-22-2014
# 3: Tim To name2@email.com by sender2 on 05-11-2014
# myString_2
# 1: NA
# 2: To name3@email.com by sender2 on 02-03-2014
# 3: To name@email.com by sender2 on 06-03-2015
# myString_3
# 1: NA
# 2: To email5@domain.com by sender1 on 06-21-2014
# 3: NA
You can also try stri_split_fixed from the "stringi" package with the argument simplify = TRUE (although with the sample data this adds an extra empty column at the end). The approach would be something like:
library(stringi)
data.frame(person = df$person,
stri_split_fixed(df$myString, "\n",
simplify = TRUE))
@SamFirke You don't need reshape2 with the latest version of data.table (I edited the last comment to make that clearer).
Thanks for the clarification. I have the latest CRAN version of data.table, but I see that it is 1.9.4 and that 1.9.5 is the current development version on GitHub.
This works well for me, and it is the simplest function I have seen for this problem. It looks like there is other good stuff in this package too. Thanks!
@SamFirke Thanks. I hope you noticed that cSplit also has a "direction" argument, which you can set to "long" if you want the long form.
I just wanted to stop by and say thank you. cSplit is a thing of beauty!
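On the extra empty column mentioned for stri_split_fixed: each sample string ends in \n, and stringi keeps the empty token after that final separator, whereas base strsplit() drops it. Below is a sketch of one way around this, assuming the omit_empty and simplify = NA arguments behave as I understand the stringi documentation (omit_empty = TRUE drops empty tokens, simplify = NA pads short rows with NA instead of empty strings). The email column names are illustrative.

```r
library(stringi)

df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c("To name5@email.com by sender6 on 01-12-2014\n",
               "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
               "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"),
  stringsAsFactors = FALSE
)

# omit_empty = TRUE removes the empty piece after the trailing "\n";
# simplify = NA returns a character matrix padded with NA
m <- stri_split_fixed(df$myString, "\n", omit_empty = TRUE, simplify = NA)
colnames(m) <- paste0("email", seq_len(ncol(m)))

wide <- data.frame(person = df$person, m, stringsAsFactors = FALSE)
```

Stripping the trailing newline first, for example with sub("\n$", "", df$myString), should have the same effect if you would rather not rely on omit_empty.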