Replace values in a dataset based on an index of values in another dataset, using base R


I have the data frame below, and I want to replace the values in its diagnosis columns with the values from an index:

structure(list(ID = c(123, 5345, 234, 453, 3656, 345), diagnosis_1 = c("B657", 
"B658", "B659", "B660", "B661", "B662"), diagnosis_2 = c("F8827", 
"G432", NA, "B657", NA, "H8940"), diagnosis_3 = c(NA, "B657", 
NA, NA, NA, "G432"), diagnosis_4 = c(NA, NA, NA, NA, NA, "B657"
), diagnosis_5 = c(NA, NA, NA, NA, NA, NA), diagnosis_6 = c(NA, 
NA, NA, NA, NA, NA), diagnosis_7 = c(NA, NA, NA, NA, NA, NA), 
    diagnosis_8 = c(NA, NA, NA, NA, NA, NA), diagnosis_9 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_10 = c(NA, NA, NA, NA, NA, 
    NA), diagnosis_11 = c(NA, NA, NA, NA, NA, NA), diagnosis_12 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_13 = c(NA, NA, NA, NA, NA, 
    NA), age = c(54, 65, 23, 22, 33, 77)), row.names = c(NA, 
-6L), class = "data.frame")
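The index looks like this:

B657    1
B658    2
B659    3
B660    4
B661    5
B662    1
F8827   3
G432    3
H8940   4

The desired output looks like this:

    ID diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4 diagnosis_5 diagnosis_6 diagnosis_7 diagnosis_8 diagnosis_9 diagnosis_10 diagnosis_11 diagnosis_12 diagnosis_13 age
1  123           1           3          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  54
2 5345           2           3           1          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  65
3  234           3          NA          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  23
4  453           4           1          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  22
5 3656           5          NA          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  33
6  345           1           4           3           1          NA          NA          NA          NA          NA           NA           NA           NA           NA  77

In reality the table has thousands of rows, and other tables I work with have a variable number of diagnosis columns, so the ideal solution would be agnostic to the number of columns. The index is also a few hundred entries long.

If the index table were instead grouped by value, like this:

1 B657, B662
2 B658
3 B659, F8827, G432 
4 B660 H8940    
5 B661

would that change the way it is coded? Many thanks.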

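One possible solution is to first construct a named vector tab_vec, with the old values as the names and the new values as the actual values. We can then use the recode function from the dplyr package (version >= 1.0.0) and apply it across the variables whose names start with the string diagnosis.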
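The code for this answer did not survive on this page, so the following is only a sketch of the approach it describes, pieced together from the call quoted in the comments. The toy data frame `df` and the values in `tab_vec` are made up for illustration. The `as.character()` wrapper is there because an all-NA column is of class logical, which `recode` has no method for.

```r
library(dplyr)

# Toy data (hypothetical, much smaller than the table in the question).
df <- data.frame(
  ID = c(1, 2, 3),
  diagnosis_1 = c("B657", "B658", "B662"),
  diagnosis_2 = c("G432", NA, "B657"),
  diagnosis_3 = c(NA, NA, NA),   # all-NA column: class logical, not character
  stringsAsFactors = FALSE
)

# Named vector: the old codes are the names, the new values are the elements.
tab_vec <- c(B657 = 1L, B658 = 2L, B662 = 1L, G432 = 3L)

# Recode every column whose name starts with "diagnosis",
# converting to character first so that all-NA columns are accepted.
out <- df %>%
  mutate(across(starts_with("diagnosis"),
                ~ recode(as.character(.x), !!!tab_vec)))
```

Because the columns are selected by name prefix, the same call should work unchanged for tables with a different number of diagnosis columns.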

You can use match to change the contents using a lookup table:

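# Logical index of the diagnosis columns, however many there are
i <- startsWith(colnames(x), "diagnosis_")
# Find each code's row in y (column 1) and take the new value from column 2
x[,i] <- y[match(unlist(x[,i]), y[,1]),2]
x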
#    ID diagnosis_1 diagnosis_2 diagnosis_3 diagnosis_4 diagnosis_5 diagnosis_6 diagnosis_7 diagnosis_8 diagnosis_9 diagnosis_10 diagnosis_11 diagnosis_12 diagnosis_13 age
#1  123           1           3          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  54
#2 5345           2           3           1          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  65
#3  234           3          NA          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  23
#4  453           4           1          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  22
#5 3656           5          NA          NA          NA          NA          NA          NA          NA          NA           NA           NA           NA           NA  33
#6  345           1           4           3           1          NA          NA          NA          NA          NA           NA           NA           NA           NA  77
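To see what the match call above does, here is a minimal sketch on a plain vector. The `lookup` table and the `codes` vector are toy values assumed for illustration, not the question's data.

```r
# Toy lookup table (hypothetical): column 1 holds the codes, column 2 the new values.
lookup <- data.frame(code = c("B657", "B658", "G432"),
                     value = c(1L, 2L, 3L),
                     stringsAsFactors = FALSE)

codes <- c("B657", "G432", NA, "B658", "Z999")

# match() returns, for each code, the row of the lookup where it is found
# (NA when the code is missing or absent from the lookup).
pos <- match(codes, lookup$code)

# Indexing the value column by those positions yields the replacement values.
new_values <- lookup$value[pos]
```

Codes that are NA or not present in the lookup end up as NA, which is why the empty diagnosis cells stay NA in the output.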
If the lookup has the different (grouped) structure given in the question:

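# Split each line of z into its codes (separated by commas and/or spaces)
zz <- strsplit(z, "[, ]+")
# Named vector: each code is a name, its value is the number of the line (group) it came from
zz <- setNames(rep(seq_along(zz), lengths(zz)), unlist(zz))
i <- startsWith(colnames(x), "diagnosis_")
# Look up every code by name; NA or unknown codes become NA
x[,i] <- zz[unlist(x[,i])]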
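The intermediate named vector is easier to follow on a small input. The sketch below uses a shortened, made-up version of the grouped lookup (`z_demo`), so the group numbers differ from those in the question.

```r
# Shortened grouped lookup (hypothetical): line n lists the codes that map to n.
z_demo <- c("B657, B662", "B658", "B660 H8940")

# One character vector of codes per line; separators are commas and/or spaces.
groups <- strsplit(z_demo, "[, ]+")

# Repeat each line number once per code on that line and name it by the code.
lookup_vec <- setNames(rep(seq_along(groups), lengths(groups)), unlist(groups))

# Indexing by name performs the lookup; NA stays NA.
recoded <- unname(lookup_vec[c("B662", NA, "H8940", "B658")])
```

Here `lookup_vec` maps B657 and B662 to 1, B658 to 2, and B660 and H8940 to 3.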
If a code is not found and you do not want to set it to NA:

i <- startsWith(colnames(x), "diagnosis_")
j <- match(unlist(x[,i]), y[,1])
# Only overwrite the cells whose code was found in y; the rest keep their old value
k <- !is.na(j)
tt <- unlist(x[,i])
tt[k] <- y[j[k],2]
x[,i] <- tt
rm(i, j, k, tt)

Data:

x <- structure(list(ID = c(123, 5345, 234, 453, 3656, 345), diagnosis_1 = c("B657", 
"B658", "B659", "B660", "B661", "B662"), diagnosis_2 = c("F8827", 
"G432", NA, "B657", NA, "H8940"), diagnosis_3 = c(NA, "B657", 
NA, NA, NA, "G432"), diagnosis_4 = c(NA, NA, NA, NA, NA, "B657"
), diagnosis_5 = c(NA, NA, NA, NA, NA, NA), diagnosis_6 = c(NA, 
NA, NA, NA, NA, NA), diagnosis_7 = c(NA, NA, NA, NA, NA, NA), 
    diagnosis_8 = c(NA, NA, NA, NA, NA, NA), diagnosis_9 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_10 = c(NA, NA, NA, NA, NA, 
    NA), diagnosis_11 = c(NA, NA, NA, NA, NA, NA), diagnosis_12 = c(NA, 
    NA, NA, NA, NA, NA), diagnosis_13 = c(NA, NA, NA, NA, NA, 
    NA), age = c(54, 65, 23, 22, 33, 77)), row.names = c(NA, 
                                                         -6L), class = "data.frame")
y <- read.table(text="B657    1
B658    2
B659    3
B660    4
B661    5
B662    1
F8827   3
G432    3
H8940   4")
z <- readLines(con=textConnection("B657, B662
B658
B659, F8827, G432
B660 H8940
B661"))

Comments:

With your code I get an error: no applicable method for 'recode' applied to an object of class "logical". This is because columns that are entirely NA (e.g. diagnosis_5) are of class logical. Better to use ~recode(as.character(.), !!!tab_vec)

Yes, I had converted the types beforehand but did not include that step in the code. I will edit my answer with your suggestion, which I prefer. Thanks.

Sorry, I get the error: Evaluation error: could not find function "across". I have run library(dplyr) without any problems…

@tacrolimus across is a new function; as of dplyr version 1.0.0 it supersedes all the scoped functions *_at, *_if, *_all. You may need to update the dplyr package.

@tacrolimus Yes, if your dplyr version is < 1.0.0 you can use the scoped variant mutate_at with the helper function vars, like this: dplyr::mutate_at(df, vars(starts_with("diagnosis")), ~recode(as.character(.), !!!tab_vec))

Sorry, I get this error: Error in `[.data.table`(x, , i) : j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.

@tacrolimus I don't use data.table. Did you combine or modify the answers?

That worked! Many thanks.

The 2 at the end means that it should pick the second column from y. The rows it picks come from match.

I added "In case a code is not found and you don't want to set…"

Thanks for taking the time to answer. This works well.
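Pulling the pieces of the match answer together, the lookup and the "keep codes that are not found" variant could be wrapped in one small function. This is only a sketch: the name `recode_by_lookup` and its arguments are hypothetical and not part of the original answer, and it assumes a two-column lookup (codes first, new values second) and character diagnosis columns.

```r
# Hypothetical helper, not part of the original answer.
recode_by_lookup <- function(df, lookup, prefix = "diagnosis_",
                             keep_unmatched = TRUE) {
  cols <- startsWith(colnames(df), prefix)    # any number of matching columns
  old  <- unlist(df[, cols, drop = FALSE], use.names = FALSE)
  pos  <- match(old, lookup[[1]])             # row in the lookup, NA if absent
  new  <- lookup[[2]][pos]
  if (keep_unmatched) {
    miss <- is.na(pos) & !is.na(old)          # real codes missing from the lookup
    new  <- as.character(new)                 # a mixed result has to be character
    new[miss] <- old[miss]
  }
  df[, cols] <- new                           # the vector is refilled column by column
  df
}

# Toy data (hypothetical): "Z999" is deliberately absent from the lookup.
df <- data.frame(ID = c(1, 2),
                 diagnosis_1 = c("B657", "Z999"),
                 diagnosis_2 = c(NA, "G432"),
                 stringsAsFactors = FALSE)
lookup <- data.frame(code = c("B657", "G432"), value = c(1L, 3L),
                     stringsAsFactors = FALSE)

kept    <- recode_by_lookup(df, lookup)                          # "Z999" is kept
dropped <- recode_by_lookup(df, lookup, keep_unmatched = FALSE)  # "Z999" becomes NA
```

Note the trade-off: keeping unmatched codes turns the diagnosis columns into character columns, whereas setting them to NA keeps the columns in the type of the lookup values.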