Separating a column into multiple variables with unique column names in R

This is how I would like the data frame to look:

record    color    size    height    weight
1         blue     large             heavy
1         red                        
2         green    small   tall      thin

However, the data df looks like this:

record    vars
1         color = "blue", size = "large"
2         color = "green", size = "small"
2         height = "tall", weight = "thin"
1         color = "red", weight = "heavy"

Code for df:

df <- structure(list(record = c(1L, 2L, 2L, 1L),
                     vars = structure(c(1L, 2L, 4L, 3L),
                                      .Label = c("color = \"blue\", size = \"large\"",
                                                 "color = \"green\", size = \"small\"",
                                                 "color = \"red\", weight = \"heavy\"",
                                                 "height = \"tall\", weight = \"thin\""),
                                      class = "factor")),
                class = "data.frame", row.names = c(NA, -4L))

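For each record, I want to split the vars column on a delimiter and create a new column with the specified variable name. If a particular variable has multiple values, the record should be repeated.

I know that to do this with the tidyverse I need dplyr::group_by and tidyr::separate, but it is not clear to me how to get the new variable names into the into argument of separate. Do I need some kind of regular expression that identifies any text before an equals sign (=) as a new variable name for into? Any suggestions are welcome.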

df %>%
  group_by(record) %>%
  separate(col = vars, into = c(regex expression?? / character vector?), sep = ",")

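Here is a tidyverse option. Create a sequence column 'rn', split the rows on the ", " delimiter, remove the quotes with str_remove_all, separate the column into two, and reshape from 'long' to 'wide' with pivot_wider.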
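No code accompanies this description on the page, so what follows is a minimal sketch of the steps just described, not the answerer's exact code. It assumes tidyr 1.0 or later (for pivot_wider). The intermediate column names key and val, the split patterns, and the use of separate_rows for the row split are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Example data from the question, with `vars` kept as character
df <- data.frame(
  record = c(1L, 2L, 2L, 1L),
  vars = c('color = "blue", size = "large"',
           'color = "green", size = "small"',
           'height = "tall", weight = "thin"',
           'color = "red", weight = "heavy"'),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(rn = row_number()) %>%                     # unique id per input row
  separate_rows(vars, sep = ",\\s*") %>%            # one key = value pair per row
  mutate(vars = str_remove_all(vars, '"')) %>%      # drop the double quotes
  separate(vars, into = c("key", "val"),            # assumed names: key, val
           sep = "\\s*=\\s*") %>%
  pivot_wider(names_from = key, values_from = val) %>%  # long -> wide
  select(-rn)

res
```

Because rn identifies every input row, this sketch returns one output row per input row (four here) rather than the three-row layout shown in the question.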

Because the column values are already almost written as R code that defines a list, you can parse and evaluate them and then unnest the result.

library(tidyverse)

# parse_expr() comes from rlang, which library(tidyverse) does not attach
df %>% 
  mutate(vars = map(as.character(vars), ~ eval(rlang::parse_expr(paste('list(', .x, ')'))))) %>% 
  unnest_wider(vars)

# record color size  height weight
#    <int> <chr> <chr> <chr>  <chr> 
# 1      1 blue  large NA     NA    
# 2      2 green small NA     NA    
# 3      2 NA    NA    tall   thin

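Another approach is to convert each entry to a two-row matrix and merge the results. We need a helper function that converts a vector into a matrix and uses its first row as the header.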

FUN <- function(x) {m <- matrix(x, 2);as.data.frame(rbind(`colnames<-`(m, m[1, ])[-1, ]))}
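To make the helper easier to follow, here is the same function written over several lines and applied to one already-split entry. The input vector x is a hand-made example, not output copied from the answer.

```r
# Same helper as above, spread over several lines
FUN <- function(x) {
  m <- matrix(x, 2)                       # row 1: variable names, row 2: values
  m <- `colnames<-`(m, m[1, ])            # use the first row as column names
  as.data.frame(rbind(m[-1, ]))           # drop the name row, keep a one-row data frame
}

# One entry after quotes and punctuation have been stripped and split on spaces
x <- c("color", "blue", "size", "large")
out <- FUN(x)
out
```

The result is a one-row data frame with the columns color and size, which is the shape that merge needs in the next step.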

Then strip the non-word characters and merge.

l <- lapply(strsplit(trimws(gsub("\\W+", " ", as.character(df$vars))), " "), FUN)       
l <- Map(`[<-`, l, 1, "record", df$record)      # cbind record column
Reduce(function(...) merge(..., all=TRUE), l)   # merge
#   record color weight  size height
# 1      1  blue   <NA> large   <NA>
# 2      1   red  heavy  <NA>   <NA>
# 3      2 green   thin small   tall

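I just noticed that none of the answers posted so far (including the accepted answer) fully reproduce the OP's expected result, which shows 3 rows although the input data has 4 rows.

If I understand correctly, the key-value pairs for record 2 can be arranged in one row because no variable has duplicate values. For record 1, the variable color has two values, which appear in rows 1 and 2 respectively, as requested by the OP:

"if there are multiple values for a particular variable then the record should be repeated"

All other variables of record 1 have only one value or none, and are arranged in row 1.

So, for each record, a sub-table with a ragged bottom is created in which each column is filled separately from top to bottom.

I have tried to reproduce this with two different approaches: first with data.table, which I am more fluent in, and then with dplyr/tidyr. Finally, I will propose an alternative representation of duplicate values using toString.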

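In the dplyr/tidyr version, the two lines group_by(record, key) %>% mutate(keyid = row_number(key)) are a replacement for data.table::rowid.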
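As a small check of that equivalence, the sketch below computes the within-group counter both ways on a hand-made long table. The data frame long is illustrative, and row_number() is called without an argument here, which numbers the rows of each group in their existing order.

```r
library(data.table)
library(dplyr)

# Illustrative long table: one key per row, in input order
long <- data.frame(
  record = c(1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L),
  key    = c("color", "size", "color", "size",
             "height", "weight", "color", "weight"),
  stringsAsFactors = FALSE
)

# data.table: running count within each record/key combination
id_dt <- rowid(long$record, long$key)

# dplyr: the same counter via grouping
id_dplyr <- long %>%
  group_by(record, key) %>%
  mutate(keyid = row_number()) %>%
  pull(keyid)

all(id_dt == id_dplyr)
```

Only the second color entry of record 1 receives the counter value 2, which is what pushes "red" into its own output row.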

Add the argument values_fill = list(val = "") to pivot_wider to replace the NAs with blanks.
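As an illustration of the values_fill argument, the sketch below reshapes a tiny hand-made long table. The data frame long and its column names are assumptions chosen to match the key and val columns used in the dplyr/tidyr code on this page.

```r
library(tidyr)

# Illustrative long table: record 2 has no size entry
long <- data.frame(
  record = c(1L, 1L, 2L),
  key    = c("color", "size", "color"),
  val    = c("blue", "large", "green"),
  stringsAsFactors = FALSE
)

# Missing record/key combinations are filled with "" instead of NA
wide <- pivot_wider(long, names_from = key, values_from = val,
                    values_fill = list(val = ""))
wide
```

The fill value has to be compatible with the type of the value column, so an empty string works here because val is character.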

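Alternative representation

The following is not intended to reproduce the OP's expected result as closely as possible, but to propose a more concise representation of the result with one row per record.

During reshaping, a function can be used to aggregate the data in each cell. The toString function concatenates character strings.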
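For reference, this is what toString does on its own, shown on values taken from the example data. The variable names are illustrative.

```r
# Several values for one cell are collapsed into one comma-separated string
two_values <- toString(c("blue", "red"))

# A single value is returned unchanged
one_value <- toString("large")

# An empty input gives an empty string rather than NA
no_value <- toString(character(0))

two_values
one_value
no_value
```

The last case is a plausible reason why empty cells appear as blanks rather than NA in the data.table output of the aggregated version.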

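I can't find weight = heavy in your structure. Why does your desired output not have weight = heavy in the second record 1?

That was a typo, my apologies. I have changed it.

Looking at the code, can you explain why the 'rn' sequence is created? Once it has been created in mutate, which later step in the pipe needs it?

@mdb_ftl That is because your 'record' values are duplicated; 'rn' is useful for uniquely identifying the rows when we do the pivot_wider step.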

# data.table approach: repeat a record only where a variable has several values
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
  , V1 := str_remove_all(V1, '"')][
    , tstrsplit(V1, " = "), by = .(rn, record)][
      , dcast(.SD, record + rowid(record, V1) ~ fct_inorder(V1), value.var = "V2")][
        , record_1 := NULL][]

   record color  size height weight
1:      1  blue large   <NA>  heavy
2:      1   red  <NA>   <NA>   <NA>
3:      2 green small   tall   thin

# dplyr/tidyr approach: keyid counts repeated keys within each record
library(dplyr)
library(tidyr)
library(stringr)
df %>% 
  separate_rows(vars, sep = ", ") %>% 
  mutate(vars = str_remove_all(vars, '"')) %>% 
  separate(vars,c("key", "val")) %>% 
  group_by(record, key) %>% 
  mutate(keyid = row_number(key)) %>% 
  pivot_wider(id_cols = c(record, keyid), names_from = key, values_from = val) %>% 
  arrange(record, keyid) %>% 
  select(-keyid)

# A tibble: 3 x 5
# Groups:   record [2]
  record color size  height weight
   <int> <chr> <chr> <chr>  <chr> 
1      1 blue  large NA     heavy 
2      1 red   NA    NA     NA    
3      2 green small tall   thin

# Alternative representation with data.table: one row per record, values joined by toString
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
  , V1 := str_remove_all(V1, '"')][
    , tstrsplit(V1, " = "), by = .(rn, record)][
      , dcast(.SD, record ~ fct_inorder(V1), toString, value.var = "V2")]

   record     color  size height weight
1:      1 blue, red large         heavy
2:      2     green small   tall   thin

# Alternative representation with dplyr/tidyr: one row per record, values joined by toString
library(dplyr)
library(tidyr)
library(stringr)
df %>% 
  separate_rows(vars, sep = ", ") %>% 
  mutate(vars = str_remove_all(vars, '"')) %>% 
  separate(vars,c("key", "val")) %>% 
  pivot_wider(names_from = key, values_from = val, values_fn = list(val = toString))

# A tibble: 2 x 5
  record color     size  height weight
   <int> <chr>     <chr> <chr>  <chr> 
1      1 blue, red large NA     heavy 
2      2 green     small tall   thin