R: Split a character vector of i comma-separated IDs into i discrete vectors of a data frame


The data frame `df` contains two character vectors. Here are the first 10 rows:

rowid  codes_raw                            
a      15-1132, 15-1133                     
b      21-1091, 21-1094, 21-1099            
c      25-9011, 25-9021, 25-9031, 25-9099   
d      31-9093, 31-9099                     
e      33-9092, 33-9099                     
f      37-2011, 37-2019                     
g      39-4011, 39-4021                     
h      47-5051, 47-5099                     
i      49-2094, 49-2095                     
j      49-9041                    
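`df$codes_raw` contains between 1 and i discrete identifiers for a given row. These identifiers need to be spread across new vectors in the same data frame. The result should look like this: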

rowid codes_raw                            code_1     code_2     code_3     code_4
a     15-1132, 15-1133                     15-1132    15-1133
b     21-1091, 21-1094, 21-1099            21-1091    21-1094    21-1099
c     25-9011, 25-9021, 25-9031, 25-9099   25-9011    25-9021    25-9031    25-9099
d     31-9093, 31-9099                     31-9093    31-9099
e     33-9092, 33-9099                     33-9092    33-9099
f     37-2011, 37-2019                     37-2011    37-2019
g     39-4011, 39-4021                     39-4011    39-4021
h     47-5051, 47-5099                     47-5051    47-5099
i     49-2094, 49-2095                     49-2094    49-2095
j     49-9041                              49-9041
My current solution is to call `if_else()` separately for each part of the string, which is clunky. For example:

df$code_2 <- if_else(
  grepl(',', df$codes_raw),
  sub('.*,\\s*', '', df$codes_raw),
  ' ')

Use `separate()`:

library(tidyr)

To generate the column names automatically, I suggest this:

library(tidyverse)
df %>% 
  separate_rows(codes_raw, sep = ", ") %>% 
  group_by(rowid) %>% 
  mutate(id_cols = row_number()) %>% 
  pivot_wider(id_cols = rowid, names_from = id_cols, values_from = codes_raw, names_prefix = "code_") %>% 
  ungroup()

# A tibble: 10 x 5
   rowid code_1  code_2  code_3  code_4 
   <chr> <chr>   <chr>   <chr>   <chr>  
 1 a     15-1132 15-1133 NA      NA     
 2 b     21-1091 21-1094 21-1099 NA     
 3 c     25-9011 25-9021 25-9031 25-9099
 4 d     31-9093 31-9099 NA      NA     
 5 e     33-9092 33-9099 NA      NA     
 6 f     37-2011 37-2019 NA      NA     
 7 g     39-4011 39-4021 NA      NA     
 8 h     47-5051 47-5099 NA      NA     
 9 i     49-2094 49-2095 NA      NA     
10 j     49-9041 NA      NA      NA 

You said the maximum number of columns is 20, so one way to do this is with a regular expression containing capture groups, using `library(namedCapture)`.
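
The capture-group code from this answer did not survive in the text here, and no `namedCapture` call is reproduced below. As a rough stand-in, this sketch uses only base R and swaps in a simpler technique: it finds every occurrence of a single-code pattern with `gregexpr()` and pads short rows with `NA`. The pattern (two digits, a hyphen, four digits) is an assumption based on the sample data, and the names `pattern`, `matches`, `padded`, and `result` are illustrative.

```r
# A few rows of the sample data from the question
df <- data.frame(
  rowid = c("a", "b", "c", "j"),
  codes_raw = c("15-1132, 15-1133",
                "21-1091, 21-1094, 21-1099",
                "25-9011, 25-9021, 25-9031, 25-9099",
                "49-9041"),
  stringsAsFactors = FALSE
)

# Assumed shape of one code: two digits, a hyphen, four digits
pattern <- "[0-9]{2}-[0-9]{4}"

# One character vector of matched codes per row
matches <- regmatches(df$codes_raw, gregexpr(pattern, df$codes_raw))

# Largest number of codes found in any row
max_n <- max(lengths(matches))

# Indexing past the end of a vector yields NA, which pads the short rows
padded <- do.call(rbind, lapply(matches, function(x) x[seq_len(max_n)]))
colnames(padded) <- paste0("code_", seq_len(max_n))

result <- cbind(df, as.data.frame(padded, stringsAsFactors = FALSE))
print(result)
```

Because the codes are matched rather than split, stray whitespace around the commas does not matter, but any identifier that does not fit the assumed pattern would be silently dropped.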

You can do this dynamically, creating the column names on the fly. This works for any number of strings concatenated together:

df
#>    rowid                          codes_raw
#> 1      a                   15-1132, 15-1133
#> 2      b          21-1091, 21-1094, 21-1099
#> 3      c 25-9011, 25-9021, 25-9031, 25-9099
#> 4      d                   31-9093, 31-9099
#> 5      e                   33-9092, 33-9099
#> 6      f                   37-2011, 37-2019
#> 7      g                   39-4011, 39-4021
#> 8      h                   47-5051, 47-5099
#> 9      i                   49-2094, 49-2095
#> 10     j                            49-9041

library(tidyr)
library(stringr)

df %>% separate(codes_raw, into = paste0('code_', seq_len(1 + max(str_count(df$codes_raw, ',')))),
                remove = F, sep = ', ')
#> Warning: Expected 4 pieces. Missing pieces filled with `NA` in 9 rows [1, 2, 4,
#> 5, 6, 7, 8, 9, 10].
#>    rowid                          codes_raw  code_1  code_2  code_3  code_4
#> 1      a                   15-1132, 15-1133 15-1132 15-1133    <NA>    <NA>
#> 2      b          21-1091, 21-1094, 21-1099 21-1091 21-1094 21-1099    <NA>
#> 3      c 25-9011, 25-9021, 25-9031, 25-9099 25-9011 25-9021 25-9031 25-9099
#> 4      d                   31-9093, 31-9099 31-9093 31-9099    <NA>    <NA>
#> 5      e                   33-9092, 33-9099 33-9092 33-9099    <NA>    <NA>
#> 6      f                   37-2011, 37-2019 37-2011 37-2019    <NA>    <NA>
#> 7      g                   39-4011, 39-4021 39-4011 39-4021    <NA>    <NA>
#> 8      h                   47-5051, 47-5099 47-5051 47-5099    <NA>    <NA>
#> 9      i                   49-2094, 49-2095 49-2094 49-2095    <NA>    <NA>
#> 10     j                            49-9041 49-9041    <NA>    <NA>    <NA>

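Created on 2021-05-25 by the reprex package (v2.0.0)

You can use `str_split()` from the `stringr` library to split the codes into a list, then convert the list of vectors (of unequal length) into a matrix, and then use `mutate()` to join it to the original data frame. Here is an example: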

# Your example data
df <- data.frame(
  rowid = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  codes_raw = c("15-1132, 15-1133", "21-1091, 21-1094, 21-1099",
                "25-9011, 25-9021, 25-9031, 25-9099", "31-9093, 31-9099",
                "33-9092, 33-9099", "37-2011, 37-2019", "39-4011, 39-4021",
                "47-5051, 47-5099", "49-2094, 49-2095", "49-9041"))

library(stringr)
library(dplyr)
# Split codes_raw on commas, dropping the whitespace that follows each comma
l <- str_split(df$codes_raw, ",\\s*")
# Number of codes in each row
n.codes <- lengths(l)
# Sequence from 1 to the largest number of codes in any row
seq.max <- seq_len(max(n.codes))
# Indexing past the end of a vector returns NA, so short rows are padded with NA
# while the list is turned into a matrix, then into a data frame
codes_in_columns <- t(sapply(l, "[", i = seq.max)) %>%
  data.frame()
# Set the desired column names
names(codes_in_columns) <- paste0("code_", seq.max)
# Add the separated codes to the original data frame
df <- df %>% mutate(codes_in_columns)
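
As a shorter variation on the same idea (a sketch, not part of the original answer): `str_split()` accepts `simplify = TRUE`, which returns a character matrix directly and pads short rows with empty strings. The example below uses a few rows of the sample data and converts those empty strings to `NA`; the names `m` and `result` are illustrative.

```r
library(stringr)

# A few rows of the sample data from the question
df <- data.frame(
  rowid = c("a", "b", "j"),
  codes_raw = c("15-1132, 15-1133",
                "21-1091, 21-1094, 21-1099",
                "49-9041"),
  stringsAsFactors = FALSE
)

# simplify = TRUE returns a character matrix; short rows are padded with ""
m <- str_split(df$codes_raw, ",\\s*", simplify = TRUE)

# Turn the padding into proper missing values
m[m == ""] <- NA_character_

# One column name per position: code_1, code_2, ...
colnames(m) <- paste0("code_", seq_len(ncol(m)))

result <- cbind(df, as.data.frame(m, stringsAsFactors = FALSE))
print(result)
```

This avoids the `sapply()` and transpose step, at the cost of treating a genuinely empty code as missing.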
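
The `separate()` solution works well, but (a) it requires manually specifying the number of variables, and (b) it produces a series of warnings because of the NA values in the vectors. A `suppressWarnings()` wrapper obviously silences those, but it introduces an obvious problem if the original data frame is updated.

You can set the `extra` argument to do what you want. See the documentation.

What if the number of columns these values must be split into is unknown?

That is why I suggested naming the columns dynamically. I have edited the code, please check:
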
library(tidyr)
library(stringr)

# One name per code position: the most commas in any row, plus one
nm <- paste0("code_", seq_len(max(str_count(df$codes_raw, pattern = ",")) + 1))

df %>% 
  separate(
    codes_raw, 
    into = nm, 
    sep = ", ")
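
On the warnings raised in the comments: in `separate()`, the `extra` argument governs rows with too many pieces, while `fill` governs rows with too few, which is the situation here. A minimal sketch, assuming a few rows of the sample data from the question, that sets `fill = "right"` so missing pieces become `NA` without a warning (the name `result` is illustrative):

```r
library(tidyr)
library(stringr)

# A few rows of the sample data from the question
df <- data.frame(
  rowid = c("a", "b", "c", "j"),
  codes_raw = c("15-1132, 15-1133",
                "21-1091, 21-1094, 21-1099",
                "25-9011, 25-9021, 25-9031, 25-9099",
                "49-9041"),
  stringsAsFactors = FALSE
)

# Column names built from the largest number of codes in any row
nm <- paste0("code_", seq_len(max(str_count(df$codes_raw, ",")) + 1))

# fill = "right" pads short rows with NA on the right and does not warn;
# remove = FALSE keeps the original codes_raw column
result <- separate(df, codes_raw, into = nm, sep = ",\\s*",
                   remove = FALSE, fill = "right")
print(result)
```

Since `nm` is recomputed from the data each time, the call should keep working if an updated data frame contains rows with more codes.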