Separate a column into multiple variables with unique column names in R
Here is what I would like the data frame to look like:
record color size  height weight
1      blue  large        heavy
1      red
2      green small tall   thin
However, the data df currently looks like this:
record vars
1 color = "blue", size = "large"
2 color = "green", size = "small"
2 height = "tall", weight = "thin"
1 color = "red", weight = "heavy"
Code for df:
df <- structure(list(record = c(1L, 2L, 2L, 1L), vars = structure(c(1L,
2L, 4L, 3L), .Label = c("color = \"blue\", size = \"large\"",
"color = \"green\", size = \"small\"",
"color = \"red\", weight = \"heavy\"",
"height = \"tall\", weight = \"thin\""), class = "factor")),
class = "data.frame", row.names = c(NA, -4L))
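For each record, I want to split the vars column on the delimiter and create a new column named after each variable. If a particular variable has more than one value, the record should be repeated.
I know that to do this with the tidyverse I need dplyr::group_by and tidyr::separate, but I am not clear on how to get the new variable names into the into argument of separate. Do I need some kind of regular expression that identifies any text before the equals sign (=) as a new variable name for into? Any suggestions are welcome.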
df %>%
group_by(record) %>%
separate(col = vars, into = c(regex expression?? / character vector?), sep = ",")
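Here is a tidyverse option. Create a sequence column "rn", then split the rows on the comma delimiter, remove the quotes with str_remove_all, separate the column into two, and reshape from "long" to "wide" with pivot_wider.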
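The steps described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the answerer's exact code: the intermediate column names key and val, the separator patterns, and the use of separate_rows for the row split are choices made here for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same data as in the question, with vars stored as character for simplicity
df <- data.frame(
  record = c(1L, 2L, 2L, 1L),
  vars = c('color = "blue", size = "large"',
           'color = "green", size = "small"',
           'height = "tall", weight = "thin"',
           'color = "red", weight = "heavy"'),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(rn = row_number()) %>%                  # unique id, since record repeats
  separate_rows(vars, sep = ",\\s*") %>%         # one key/value pair per row
  mutate(vars = str_remove_all(vars, '"')) %>%   # drop the double quotes
  separate(vars, into = c("key", "val"), sep = "\\s*=\\s*") %>%
  pivot_wider(names_from = key, values_from = val) %>%  # long to wide
  select(-rn)

res
```

Because rn identifies each input row, this sketch returns one output row per input row (four here) and does not collapse the two rows of record 2.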
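Since the values in this column are already written almost like the R code that defines a list, you can parse and evaluate them and then unnest the result.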
library(tidyverse)
df %>%
mutate(vars = map(as.character(vars), ~ eval(rlang::parse_expr(paste('list(', .x, ')'))))) %>%
unnest_wider(vars)
#   record color size  height weight
#    <int> <chr> <chr> <chr>  <chr>
# 1      1 blue  large NA     NA
# 2      2 green small NA     NA
# 3      2 NA    NA    tall   thin
# 4      1 red   NA    NA     heavy
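To make the parse-and-evaluate idea concrete, here is a minimal base R sketch on a single string. It uses parse(text = ...) in place of parse_expr, which is a substitution made here for illustration. Note that evaluating strings as code is only reasonable when the input is trusted.

```r
# One value of the vars column, as a plain character string
x <- 'color = "blue", size = "large"'

# Wrap the string in list( ... ) so it becomes a valid R call, then evaluate it
parsed <- eval(parse(text = paste0("list(", x, ")")))

str(parsed)
```

The result is a named list, which is the shape unnest_wider expects for each cell.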
Another approach is to convert each entry into a two-row matrix and merge. We need a helper function that turns a vector into a matrix and uses its first row as the header.
FUN <- function(x) {m <- matrix(x, 2);as.data.frame(rbind(`colnames<-`(m, m[1, ])[-1, ]))}
Then strip out the non-word characters and merge.
l <- lapply(strsplit(trimws(gsub("\\W+", " ", as.character(df$vars))), " "), FUN)
l <- Map(`[<-`, l, 1, "record", df$record) # add the record column
Reduce(function(...) merge(..., all=TRUE), l) # merge
# record color weight size height
# 1 1 blue <NA> large <NA>
# 2 1 red heavy <NA> <NA>
# 3 2 green thin small tall
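To see why the helper works, it can help to look at the intermediate objects for a single string. The sketch below is illustrative only; the names tokens and m are introduced here. It assumes keys and values contain only word characters, so a value with a space or punctuation in it would break the alternation.

```r
x <- 'color = "blue", size = "large"'

# Replace every run of non-word characters with one space, then split on spaces
tokens <- strsplit(trimws(gsub("\\W+", " ", x)), " ")[[1]]
tokens

# Filling a 2-row matrix column by column puts keys in row 1 and values in row 2
m <- matrix(tokens, 2)
m
```

Row 1 of m becomes the column names and row 2 the single data row, which is what FUN does.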
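I just noticed that none of the answers posted so far (including the accepted answer) fully reproduces the OP's expected result. Although the input data has 4 rows, the expected result shows only 3.

If I understand correctly, the key-value pairs of record 2 can be arranged in a single row because no variable has more than one value. For record 1, the variable color has two values, which appear in rows 1 and 2 as the OP requested: "if a particular variable has more than one value, the record should be repeated". All other variables of record 1 have one value or none and are placed in row 1.

So, for each record, a sub-table with a ragged bottom is created, in which each column is filled separately from top to bottom.

I tried to reproduce this in two different ways: first with data.table, which I am more fluent in, and then with dplyr/tidyr. Finally, I propose an alternative representation of the repeated values using toString(). The data.table and dplyr/tidyr versions and the toString() variants are given in the code blocks that follow.

In the dplyr/tidyr version, the grouped row_number() call is the replacement for data.table::rowid(). Adding the argument values_fill = list(val = ...) to pivot_wider replaces the NAs with blanks.

Alternative representation: the toString() variants are not meant to reproduce the OP's expected result as closely as possible, but to propose a more concise representation with one row per record. During reshaping, a function can be used to aggregate the data in each cell; toString() concatenates the character strings.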
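Comments:
I cannot find weight = heavy in your structure.
Why does your desired output not have weight = heavy in the second row for record 1?
That was a typo, sorry about that. I have changed it.
Looking at the code, can you explain why the "rn" sequence is created? After it is created in mutate, which step in the pipe needs it?
@mdb_ftl It is because your "record" values are repeated; "rn" is useful for uniquely identifying the rows when we do the pivot_wider step.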
record color size  height weight
1      blue  large        heavy
1      red
2      green small tall   thin
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record + rowid(record, V1) ~ fct_inorder(V1), value.var = "V2")][
, record_1 := NULL][]
record color size height weight
1: 1 blue large <NA> heavy
2: 1 red <NA> <NA> <NA>
3: 2 green small tall thin
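The key to the ragged layout is rowid(), which numbers repeated combinations of its arguments. A small sketch on made-up vectors (record and key are illustrative names, not columns from the answer) shows the idea:

```r
library(data.table)

record <- c(1, 1, 1, 2)
key    <- c("color", "size", "color", "color")

# Running count within each (record, key) combination
ids <- rowid(record, key)
ids
```

The second "color" of record 1 gets id 2, so in the dcast call it lands in a second row for that record instead of colliding with the first.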
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(vars, sep = ", ") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars,c("key", "val")) %>%
group_by(record, key) %>%
mutate(keyid = row_number(key)) %>%
pivot_wider(id_cols = c(record, keyid), names_from = key, values_from = val) %>%
arrange(record, keyid) %>%
select(-keyid)
# A tibble: 3 x 5
# Groups:   record [2]
#   record color size  height weight
#    <int> <chr> <chr> <chr>  <chr>
# 1      1 blue  large NA     heavy
# 2      1 red   NA    NA     NA
# 3      2 green small tall   thin
# the grouped row_number() step that stands in for data.table::rowid()
group_by(record, key) %>%
mutate(keyid = row_number(key))
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record ~ fct_inorder(V1), toString, value.var = "V2")]
record color size height weight
1: 1 blue, red large heavy
2: 2 green small tall thin
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(vars, sep = ", ") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars,c("key", "val")) %>%
pivot_wider(names_from = key, values_from = val, values_fn = list(val = toString))
# A tibble: 2 x 5
#   record color     size  height weight
#    <int> <chr>     <chr> <chr>  <chr>
# 1      1 blue, red large NA     heavy
# 2      2 green     small tall   thin
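For readers unfamiliar with toString(), the aggregation it performs in each cell can be shown in base R alone. This is a minimal sketch on a made-up long table (long and wide are names introduced here), using tapply rather than dcast or pivot_wider:

```r
# toString() collapses a vector into one comma-separated string
toString(c("blue", "red"))

# A tiny long-format table with a repeated key for record 1
long <- data.frame(
  record = c(1L, 1L, 2L),
  key    = c("color", "color", "color"),
  val    = c("blue", "red", "green"),
  stringsAsFactors = FALSE
)

# Aggregate the values of each (record, key) cell with toString
wide <- tapply(long$val, list(long$record, long$key), toString)
wide
```

The same function is what gets passed as the aggregation function to dcast and as values_fn to pivot_wider in the code above.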