R - Split a character vector of i comma-separated IDs into i discrete vectors of a data frame
The data frame df contains two character vectors. Here are the first 10 rows:
rowid codes_raw
a 15-1132, 15-1133
b 21-1091, 21-1094, 21-1099
c 25-9011, 25-9021, 25-9031, 25-9099
d 31-9093, 31-9099
e 33-9092, 33-9099
f 37-2011, 37-2019
g 39-4011, 39-4021
h 47-5051, 47-5099
i 49-2094, 49-2095
j 49-9041
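df$codes_raw contains between 1 and i discrete identifiers for a given row. These identifiers need to be spread across new vectors in the same data frame. The result should look like this: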
rowid codes_raw code_1 code_2 code_3 code_4
a 15-1132, 15-1133 15-1132 15-1133
b 21-1091, 21-1094, 21-1099 21-1091 21-1094 21-1099
c 25-9011, 25-9021, 25-9031, 25-9099 25-9011 25-9021 25-9031 25-9099
d 31-9093, 31-9099 31-9093 31-9099
e 33-9092, 33-9099 33-9092 33-9099
f 37-2011, 37-2019 37-2011 37-2019
g 39-4011, 39-4021 39-4011 39-4021
h 47-5051, 47-5099 47-5051 47-5099
i 49-2094, 49-2095 49-2094 49-2095
j 49-9041 49-9041
My current solution is to call if_else() separately for each part of the string, which is clunky. For example:
df$code_2 <- if_else(
grepl(',', df$codes_raw),
sub('.*,\\s*', '', df$codes_raw),
' ')
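A quick check of the sub() call above, as a minimal base R sketch on two sample values copied from codes_raw (the variable names x, last_code and second_code are illustrative only). Because .* is greedy, the pattern strips everything up to the last comma, so the expression returns the final code in a row rather than the second one whenever the row holds three or more codes. Splitting on the delimiter and indexing avoids that.

```r
# Two sample values copied from codes_raw
x <- c("15-1132, 15-1133", "21-1091, 21-1094, 21-1099")

# Greedy .* consumes everything up to the LAST comma,
# so this yields the last code of each row
last_code <- sub('.*,\\s*', '', x)
print(last_code)

# Splitting on the delimiter and taking element 2 yields the second code
second_code <- vapply(strsplit(x, ",\\s*"), function(p) p[2], character(1))
print(second_code)
```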
Use separate() from tidyr:
library(tidyr)
To generate the column names automatically, I suggest this:
library(tidyverse)
df %>%
separate_rows(codes_raw, sep = ", ") %>%
group_by(rowid) %>%
mutate(id_cols = row_number()) %>%
pivot_wider(id_cols = rowid, names_from = id_cols, values_from = codes_raw, names_prefix = "code_") %>%
ungroup()
# A tibble: 10 x 5
rowid code_1 code_2 code_3 code_4
<chr> <chr> <chr> <chr> <chr>
1 a 15-1132 15-1133 NA NA
2 b 21-1091 21-1094 21-1099 NA
3 c 25-9011 25-9021 25-9031 25-9099
4 d 31-9093 31-9099 NA NA
5 e 33-9092 33-9099 NA NA
6 f 37-2011 37-2019 NA NA
7 g 39-4011 39-4021 NA NA
8 h 47-5051 47-5099 NA NA
9 i 49-2094 49-2095 NA NA
10 j 49-9041 NA NA NA
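Or:
nm <- paste0("code_", seq_len(max(str_count(df$codes_raw, pattern = ",")) + 1))
df %>%
separate(
codes_raw,
into = nm,
sep = ", ")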
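You said the maximum number of columns is 20, so one way to do this is with a regular expression containing capture groups (using library(namedCapture)).

Do it dynamically like this (creating the column names as you go). This works for any number of strings concatenated together.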
df
#>    rowid                          codes_raw
#> 1      a                   15-1132, 15-1133
#> 2      b          21-1091, 21-1094, 21-1099
#> 3      c 25-9011, 25-9021, 25-9031, 25-9099
#> 4      d                   31-9093, 31-9099
#> 5      e                   33-9092, 33-9099
#> 6      f                   37-2011, 37-2019
#> 7      g                   39-4011, 39-4021
#> 8      h                   47-5051, 47-5099
#> 9      i                   49-2094, 49-2095
#> 10     j                            49-9041
library(tidyr)
library(stringr)
df %>% separate(codes_raw, into = paste0('code_', seq_len(1 + max(str_count(df$codes_raw, ',')))),
                remove = FALSE, sep = ', ')
#> Warning: Expected 4 pieces. Missing pieces filled with `NA` in 9 rows [1, 2, 4,
#> 5, 6, 7, 8, 9, 10].
#>    rowid                          codes_raw  code_1  code_2  code_3  code_4
#> 1      a                   15-1132, 15-1133 15-1132 15-1133    <NA>    <NA>
#> 2      b          21-1091, 21-1094, 21-1099 21-1091 21-1094 21-1099    <NA>
#> 3      c 25-9011, 25-9021, 25-9031, 25-9099 25-9011 25-9021 25-9031 25-9099
#> 4      d                   31-9093, 31-9099 31-9093 31-9099    <NA>    <NA>
#> 5      e                   33-9092, 33-9099 33-9092 33-9099    <NA>    <NA>
#> 6      f                   37-2011, 37-2019 37-2011 37-2019    <NA>    <NA>
#> 7      g                   39-4011, 39-4021 39-4011 39-4021    <NA>    <NA>
#> 8      h                   47-5051, 47-5099 47-5051 47-5099    <NA>    <NA>
#> 9      i                   49-2094, 49-2095 49-2094 49-2095    <NA>    <NA>
#> 10     j                            49-9041 49-9041    <NA>    <NA>    <NA>
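Created on 2021-05-25 by the reprex package (v2.0.0)

You can use str_split() from the stringr library to split the codes into a list, then convert the list of vectors (of unequal length) into a matrix, and then use mutate() to join it to the original data frame. Here is an example: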
# Your example data
df <- data.frame(
  rowid = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  codes_raw = c("15-1132, 15-1133", "21-1091, 21-1094, 21-1099",
                "25-9011, 25-9021, 25-9031, 25-9099", "31-9093, 31-9099",
                "33-9092, 33-9099", "37-2011, 37-2019", "39-4011, 39-4021",
                "47-5051, 47-5099", "49-2094, 49-2095", "49-9041"))
library(stringr)
library(dplyr)
# Split codes_raw on commas, dropping the whitespace that follows each comma
l <- str_split(df$codes_raw, ",\\s*")
# Count the codes in each row
n.codes <- sapply(l, length)
# Find the largest number of codes and make a sequence from 1 to that number
seq.max <- seq_len(max(n.codes))
# Index each vector by the full sequence so short rows are padded with NA,
# then transpose into a matrix and convert to a data frame
codes_in_columns <- t(sapply(l, "[", i = seq.max)) %>%
  data.frame()
# Set the desired column names
names(codes_in_columns) <- paste0("code_", seq.max)
# Combine the original data with the separated codes
df <- df %>% mutate(codes_in_columns)
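The same split, pad and bind idea can also be sketched in base R alone, for anyone who would rather not load stringr or dplyr. This is a minimal sketch on a three-row subset of the example data; the names parts, n_max, padded, codes and result are illustrative and not from the answers above.

```r
# Small subset of the example data (three rows, differing code counts)
df <- data.frame(
  rowid = c("a", "b", "j"),
  codes_raw = c("15-1132, 15-1133", "21-1091, 21-1094, 21-1099", "49-9041"),
  stringsAsFactors = FALSE
)

# Split each string on a comma followed by optional whitespace
parts <- strsplit(df$codes_raw, ",\\s*")

# Largest number of codes found in any row
n_max <- max(lengths(parts))

# Extend every vector to n_max; `length<-` pads the tail with NA
padded <- lapply(parts, `length<-`, n_max)

# Stack the padded vectors row-wise and name the resulting columns
codes <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(codes) <- paste0("code_", seq_len(n_max))

# Attach the new columns to the original data frame
result <- cbind(df, codes)
print(result)
```

Because the number of columns is derived from the data, nothing needs to be specified by hand when the data frame is updated.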
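This solution works well, but (a) it requires manually specifying the number of into variables, and (b) it produces a series of warnings because of the NA values in the into vectors. A suppressWarnings() wrapper obviously takes care of that, but it introduces an obvious problem: if the original data frame is updated...

You can set the argument "extra" to do what you want. See the documentation.

What if the number of columns these values must be split into is unknown?

In that case I suggest naming the columns dynamically. I have edited the code, please check: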
nm <- paste0("code_", seq_len(max(str_count(df$codes_raw, pattern = ",")) + 1))
df %>%
separate(
codes_raw,
into = nm,
sep = ", ")
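On the warnings raised in the comments: in tidyr's separate(), the "Missing pieces filled with NA" warning is governed by the fill argument, while extra covers rows with too many pieces. The following is a minimal sketch, assuming tidyr is installed and using a three-row subset of the example data; n_max and out are illustrative names. It combines dynamic column names with fill = "right" so no suppressWarnings() wrapper is needed.

```r
library(tidyr)

# Small subset of the example data
df <- data.frame(
  rowid = c("a", "c", "j"),
  codes_raw = c("15-1132, 15-1133",
                "25-9011, 25-9021, 25-9031, 25-9099",
                "49-9041"),
  stringsAsFactors = FALSE
)

# Derive the number of columns from the data instead of hard-coding it
n_max <- max(lengths(strsplit(df$codes_raw, ", ", fixed = TRUE)))
nm <- paste0("code_", seq_len(n_max))

# fill = "right" pads short rows with NA on the right without warning;
# remove = FALSE keeps the original codes_raw column
out <- separate(df, codes_raw, into = nm, sep = ", ",
                fill = "right", remove = FALSE)
print(out)
```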