Separating a column into multiple variables with unique column names in R

This is how I would like the data frame to look:

record    color    size    height    weight
1         blue     large             heavy
1         red                        
2         green    small   tall      thin
However, the data df currently looks like this:

record    vars
1         color = "blue", size = "large"
2         color = "green", size = "small"
2         height = "tall", weight = "thin"
1         color = "red", weight = "heavy"
Code for df:

df <- structure(list(
  record = c(1L, 2L, 2L, 1L),
  vars = structure(c(1L, 2L, 4L, 3L),
    .Label = c("color = \"blue\", size = \"large\"",
               "color = \"green\", size = \"small\"",
               "color = \"red\", weight = \"heavy\"",
               "height = \"tall\", weight = \"thin\""),
    class = "factor")),
  class = "data.frame", row.names = c(NA, -4L))

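For each record, I want to split the vars column on the delimiter and create a new column for each variable, named after it. If there is more than one value for a particular variable, the record should be repeated.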

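I know that to do this with the tidyverse I need dplyr::group_by and tidyr::separate, but I am not clear on how to work the new variable names into the into argument of separate. Do I need some kind of regular expression to identify any text before the equals sign = as a new variable name in into? Any suggestions are welcome.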

df %>%
  group_by(record) %>%
  separate(col = vars, into = c(regex expression?? / character vector?), sep = ",")
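
Here is a tidyverse option. Create a sequence column 'rn', split the 'vars' column into rows on the comma delimiter, remove the quotes with str_remove_all, separate the column into two, and reshape from 'long' to 'wide' with pivot_wider.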
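
A minimal sketch of the steps just described. It assumes separate_rows for the row split; the regular expressions and the column names key and val are illustrative choices, and the data frame is rebuilt with a character vars column so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same data as in the question, with vars as character rather than factor
df <- data.frame(
  record = c(1L, 2L, 2L, 1L),
  vars = c('color = "blue", size = "large"',
           'color = "green", size = "small"',
           'height = "tall", weight = "thin"',
           'color = "red", weight = "heavy"'),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(rn = row_number()) %>%                     # unique id per input row
  separate_rows(vars, sep = ",\\s*") %>%            # one key-value pair per row
  mutate(vars = str_remove_all(vars, '"')) %>%      # drop the double quotes
  separate(vars, into = c("key", "val"), sep = "\\s*=\\s*") %>%
  pivot_wider(names_from = key, values_from = val) %>%  # long to wide
  select(-rn)

res
```

Because rn is part of the row identifier, the result has one row per input row (four here), so pairs that belong to the same record are not collapsed into one row.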


Since the values in this column are already almost written as R code defining a list, you can parse and evaluate them, and then unnest them.

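library(tidyverse)

df %>% 
  # vars is a factor, so convert it to character before building the call;
  # parse_expr comes from rlang, which library(tidyverse) does not attach
  mutate(vars = map(as.character(vars),
                    ~ eval(rlang::parse_expr(paste('list(', .x, ')'))))) %>% 
  unnest_wider(vars)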

# record color size  height weight
#    <int> <chr> <chr> <chr>  <chr> 
# 1      1 blue  large NA     NA    
# 2      2 green small NA     NA    
# 3      2 NA    NA    tall   thin  

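Another approach is to convert each entry to a two-row matrix and merge. We need a helper function that converts a vector into a matrix with the first row as header.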

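FUN <- function(x) {
  m <- matrix(x, 2)  # row 1 holds the names, row 2 the values
  # use row 1 as column names, drop it, and return a one-row data frame
  as.data.frame(rbind(`colnames<-`(m, m[1, ])[-1, ]))
}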
Then strip the non-word characters and merge:

l <- lapply(strsplit(trimws(gsub("\\W+", " ", as.character(df$vars))), " "), FUN)
l <- Map(`[<-`, l, 1, "record", df$record)      # add the record column to each piece
Reduce(function(...) merge(..., all=TRUE), l)   # full outer merge of all pieces
#   record color weight  size height
# 1      1  blue   <NA> large   <NA>
# 2      1   red  heavy  <NA>   <NA>
# 3      2 green   thin small   tall
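
To make the helper easier to follow, here is a small worked example on a single entry. The variable names raw, x and one_row are illustrative only; the cleaning expression is the one used in the answer.

```r
# Helper from the answer above
FUN <- function(x) {
  m <- matrix(x, 2)
  as.data.frame(rbind(`colnames<-`(m, m[1, ])[-1, ]))
}

# One entry of vars, as it appears in the question
raw <- 'color = "blue", size = "large"'

# Replace every run of non-word characters with a space, trim, and split
x <- strsplit(trimws(gsub("\\W+", " ", raw)), " ")[[1]]
x
# [1] "color" "blue"  "size"  "large"

# Filling a two-row matrix column by column puts the names in row 1
# and the values in row 2
matrix(x, 2)

one_row <- FUN(x)
one_row
#   color  size
# 1  blue large
```

Each entry thus becomes a one-row data frame whose columns are named after the variables, which is what makes the later merge by shared column names possible.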

I just noticed that none of the answers posted so far (including the accepted one) reproduces the OP's expected result exactly:

record    color    size    height    weight
1         blue     large             heavy
1         red                        
2         green    small   tall      thin

The expected result shows 3 rows although the input data has 4 rows.

If I understand correctly, the key-value pairs for record 2 can be arranged in one row because no variable has duplicate values. For record 1, the variable color has two values, which appear in rows 1 and 2, respectively, as the OP requested:

"If there is more than one value for a particular variable, the record should be repeated."

All other variables of record 1 have one value or none and are arranged in row 1.

So, for each record, a sub-table with a ragged bottom is created in which each column is filled separately from top to bottom.

I have tried to reproduce this in two different ways: first with data.table, which I am more fluent in, and then with dplyr/tidyr. Finally, I propose an alternative representation of the duplicate values using toString.

data.table

library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
  , V1 := str_remove_all(V1, '"')][
    , tstrsplit(V1, " = "), by = .(rn, record)][
      , dcast(.SD, record + rowid(record, V1) ~ fct_inorder(V1), value.var = "V2")][
        , record_1 := NULL][]
   record color  size height weight
1:      1  blue large   <NA>  heavy
2:      1   red  <NA>   <NA>   <NA>
3:      2 green small   tall   thin

dplyr/tidyr

library(dplyr)
library(tidyr)
library(stringr)
df %>% 
  separate_rows(vars, sep = ", ") %>% 
  mutate(vars = str_remove_all(vars, '"')) %>% 
  separate(vars,c("key", "val")) %>% 
  group_by(record, key) %>% 
  mutate(keyid = row_number(key)) %>% 
  pivot_wider(id_cols = c(record, keyid), names_from = key, values_from = val) %>% 
  arrange(record, keyid) %>% 
  select(-keyid)
# A tibble: 3 x 5
# Groups:   record [2]
  record color size  height weight
   <int> <chr> <chr> <chr>  <chr> 
1      1 blue  large NA     heavy 
2      1 red   NA    NA     NA    
3      2 green small tall   thin

Here,

  group_by(record, key) %>% 
  mutate(keyid = row_number(key))

is a replacement for data.table::rowid.

Adding the parameter values_fill = list(val = "") to pivot_wider replaces the NAs with blanks.

Alternative representation

The purpose of the following is not to reproduce the OP's expected result as closely as possible but to propose a more concise representation of the result, with one row per record.

During reshaping, a function can be used to aggregate the data in each cell. The toString function concatenates character strings into a single comma-separated string.

data.table

library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
  , V1 := str_remove_all(V1, '"')][
    , tstrsplit(V1, " = "), by = .(rn, record)][
      , dcast(.SD, record ~ fct_inorder(V1), toString, value.var = "V2")]
   record     color  size height weight
1:      1 blue, red large         heavy
2:      2     green small   tall   thin

dplyr/tidyr

library(dplyr)
library(tidyr)
library(stringr)
df %>% 
  separate_rows(vars, sep = ", ") %>% 
  mutate(vars = str_remove_all(vars, '"')) %>% 
  separate(vars,c("key", "val")) %>% 
  pivot_wider(names_from = key, values_from = val, values_fn = list(val = toString))
# A tibble: 2 x 5
  record color     size  height weight
   <int> <chr>     <chr> <chr>  <chr> 
1      1 blue, red large NA     heavy 
2      2 green     small tall   thin

Comments:

weight = heavy cannot be found in your structure.

Why does your desired output not have weight = heavy in the second record 1?

That was a typo, sorry about that; I have changed it.

Looking at the code, can you explain the reason for creating the 'rn' sequence? After it is created in mutate, which step in the pipe needs it?

@mdb_ftl It is because your 'record' values are duplicated; 'rn' is useful for uniquely identifying the rows when we do the pivot_wider step.