R - Split a character vector of i comma-separated IDs into i discrete vectors of a data frame


r regex string vector


The data frame df contains two character vectors. Here are the first 10 rows:

rowid  codes_raw                            
a      15-1132, 15-1133                     
b      21-1091, 21-1094, 21-1099            
c      25-9011, 25-9021, 25-9031, 25-9099   
d      31-9093, 31-9099                     
e      33-9092, 33-9099                     
f      37-2011, 37-2019                     
g      39-4011, 39-4021                     
h      47-5051, 47-5099                     
i      49-2094, 49-2095                     
j      49-9041

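df$codes_raw contains between 1 and i discrete identifiers for a given row. These identifiers need to be spread across new vectors in the same data frame. The result should look like this: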

rowid codes_raw                            code_1     code_2     code_3     code_4
a     15-1132, 15-1133                     15-1132    15-1133
b     21-1091, 21-1094, 21-1099            21-1091    21-1094    21-1099
c     25-9011, 25-9021, 25-9031, 25-9099   25-9011    25-9021    25-9031    25-9099
d     31-9093, 31-9099                     31-9093    31-9099
e     33-9092, 33-9099                     33-9092    33-9099
f     37-2011, 37-2019                     37-2011    37-2019
g     39-4011, 39-4021                     39-4011    39-4021
h     47-5051, 47-5099                     47-5051    47-5099
i     49-2094, 49-2095                     49-2094    49-2095
j     49-9041                              49-9041

My current solution is to call if_else() separately for each part of the string, which is clunky. For example:

df$code_2 <- if_else(
  grepl(',', df$codes_raw),
  sub('.*,\\s*', '', df$codes_raw),
  ' ')


Use separate() from library(tidyr). To have the column names generated automatically, I suggest doing this:

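library(tidyverse)

df %>% 
  # one row per individual code
  separate_rows(codes_raw, sep = ", ") %>% 
  group_by(rowid) %>% 
  # number the codes within each rowid
  mutate(id_cols = row_number()) %>% 
  # id_cols is named explicitly rather than passed by position
  pivot_wider(id_cols = rowid, names_from = id_cols, values_from = codes_raw, names_prefix = "code_") %>% 
  ungroup()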

# A tibble: 10 x 5
   rowid code_1  code_2  code_3  code_4 
   <chr> <chr>   <chr>   <chr>   <chr>  
 1 a     15-1132 15-1133 NA      NA     
 2 b     21-1091 21-1094 21-1099 NA     
 3 c     25-9011 25-9021 25-9031 25-9099
 4 d     31-9093 31-9099 NA      NA     
 5 e     33-9092 33-9099 NA      NA     
 6 f     37-2011 37-2019 NA      NA     
 7 g     39-4011 39-4021 NA      NA     
 8 h     47-5051 47-5099 NA      NA     
 9 i     49-2094 49-2095 NA      NA     
10 j     49-9041 NA      NA      NA 


You said the maximum number of columns is 20, so one way to do this is with a regular expression containing capture groups (using library(namedCapture)).
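
The capture-group idea can be sketched roughly as follows. This is not the namedCapture code from the answer, which did not survive on this page; it swaps in str_match() from stringr, and both the max_codes bound and the pattern are assumptions chosen for illustration.

```r
library(stringr)

# A few rows copied from the question's example data.
df <- data.frame(
  rowid = c("a", "b", "c", "j"),
  codes_raw = c("15-1132, 15-1133",
                "21-1091, 21-1094, 21-1099",
                "25-9011, 25-9021, 25-9031, 25-9099",
                "49-9041"),
  stringsAsFactors = FALSE
)

# Hypothetical upper bound on codes per row (4 is enough for this sample).
max_codes <- 4

# One mandatory capture group followed by (max_codes - 1) optional ", code" groups.
pattern <- paste0("^([^,]+)", strrep("(?:,\\s*([^,]+))?", max_codes - 1), "$")

# str_match() puts the full match in column 1 and one column per capture group;
# optional groups that did not take part in the match come back as NA.
m <- str_match(df$codes_raw, pattern)[, -1, drop = FALSE]
colnames(m) <- paste0("code_", seq_len(max_codes))

result <- cbind(df, as.data.frame(m, stringsAsFactors = FALSE))
result
```

The pattern has to be rebuilt if a row can hold more than max_codes identifiers, which is why the answers below derive the number of columns from the data instead.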

Do it dynamically like this (creating the column names). This will work for any number of strings joined together.

df
#>    rowid                          codes_raw
#> 1      a                   15-1132, 15-1133
#> 2      b          21-1091, 21-1094, 21-1099
#> 3      c 25-9011, 25-9021, 25-9031, 25-9099
#> 4      d                   31-9093, 31-9099
#> 5      e                   33-9092, 33-9099
#> 6      f                   37-2011, 37-2019
#> 7      g                   39-4011, 39-4021
#> 8      h                   47-5051, 47-5099
#> 9      i                   49-2094, 49-2095
#> 10     j                            49-9041

library(tidyr)
library(stringr)

df %>% separate(codes_raw, into = paste0('code_', seq_len(1 + max(str_count(df$codes_raw, ',')))),
                remove = FALSE, sep = ', ')
#> Warning: Expected 4 pieces. Missing pieces filled with `NA` in 9 rows [1, 2, 4,
#> 5, 6, 7, 8, 9, 10].
#>    rowid                          codes_raw  code_1  code_2  code_3  code_4
#> 1      a                   15-1132, 15-1133 15-1132 15-1133    <NA>    <NA>
#> 2      b          21-1091, 21-1094, 21-1099 21-1091 21-1094 21-1099    <NA>
#> 3      c 25-9011, 25-9021, 25-9031, 25-9099 25-9011 25-9021 25-9031 25-9099
#> 4      d                   31-9093, 31-9099 31-9093 31-9099    <NA>    <NA>
#> 5      e                   33-9092, 33-9099 33-9092 33-9099    <NA>    <NA>
#> 6      f                   37-2011, 37-2019 37-2011 37-2019    <NA>    <NA>
#> 7      g                   39-4011, 39-4021 39-4011 39-4021    <NA>    <NA>
#> 8      h                   47-5051, 47-5099 47-5051 47-5099    <NA>    <NA>
#> 9      i                   49-2094, 49-2095 49-2094 49-2095    <NA>    <NA>
#> 10     j                            49-9041 49-9041    <NA>    <NA>    <NA>

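Created on 2021-05-25 by the reprex package (v2.0.0)

You can use str_split() from the stringr library to split the codes into a list, then convert the list of vectors (of unequal length) into a matrix, and then use mutate() to attach it to the original data frame. Here is an example:
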
# Your example data
df <- data.frame(
  rowid = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  codes_raw = c("15-1132, 15-1133", "21-1091, 21-1094, 21-1099",
                "25-9011, 25-9021, 25-9031, 25-9099", "31-9093, 31-9099",
                "33-9092, 33-9099", "37-2011, 37-2019", "39-4011, 39-4021",
                "47-5051, 47-5099", "49-2094, 49-2095", "49-9041")
)

library(stringr)
library(dplyr)

# Split codes_raw on commas, also dropping the whitespace after each comma
# (splitting on "," alone would leave a leading space on every code but the first)
l <- str_split(df$codes_raw, ",\\s*")
# Count the codes in each row
n.codes <- sapply(l, length)
# Find the largest number of codes and make a sequence from 1 to that number
seq.max <- seq_len(max(n.codes))
# Indexing each vector by seq.max pads short rows with NA; build a matrix, then a data frame
codes_in_columns <- t(sapply(l, "[", i = seq.max)) %>% 
  data.frame(.)
# Set the desired column names
names(codes_in_columns) <- paste0("code_", seq.max)
# Combine the original data frame with the separated codes
df <- df %>% mutate(codes_in_columns)

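This solution works well, but (a) it requires manually specifying the number of "into" variables, and (b) it produces a series of warnings because of the NA values in the "into" vectors. A suppressWarnings() wrapper obviously takes care of that, but it introduces an obvious problem: if the original data frame is updated ...

You can set the argument "extra" to do what you want. See the documentation.

What if the number of columns these values have to be split into is unknown?

So I suggest naming the columns dynamically. I have edited the code, please check:
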
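library(tidyr)
library(stringr)

# One column name per code: the largest comma count in any row, plus one
nm <- paste0("code_", seq_len(max(str_count(df$codes_raw, pattern = ",")) + 1))

df %>% 
  separate(
    codes_raw, 
    into = nm, 
    sep = ", ")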
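
The warnings raised for rows with fewer codes than columns can be avoided without suppressWarnings(). The sketch below assumes the fill argument of tidyr's separate(), which controls what happens when a row has too few pieces (extra covers the opposite case of too many). The sample rows and the object name wide are only for illustration.

```r
library(tidyr)
library(stringr)

# A few rows copied from the question's example data.
df <- data.frame(
  rowid = c("a", "b", "c", "j"),
  codes_raw = c("15-1132, 15-1133",
                "21-1091, 21-1094, 21-1099",
                "25-9011, 25-9021, 25-9031, 25-9099",
                "49-9041"),
  stringsAsFactors = FALSE
)

# Column names derived from the data, as in the answer above.
nm <- paste0("code_", seq_len(max(str_count(df$codes_raw, ",")) + 1))

# fill = "right" pads short rows with NA on the right without warning;
# remove = FALSE keeps the original codes_raw column.
wide <- separate(df, codes_raw, into = nm, sep = ",\\s*",
                 fill = "right", remove = FALSE)
wide
```

Because nm is recomputed from the data each time, the call keeps working if the data frame is later updated with rows that carry more codes.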
