Match and replace in R
I have a dataset with many rows and columns. Below is a snapshot of some of the rows and columns:
ID Date Gender Age Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
10 2015-10-14 F 68 345.50 884.2 008.69 202.18 189.8 435.2 084.7 757
93 2002-07-22 F 87 242.80 710.9 345.50 884.2 008.69 202.18 189.8 435.2
14 2004-07-28 M 92 084.7 757 242.80 710.9 427.2 530.10 567.89 227.9
41 2011-02-24 M 39 714.0 084.7 757 242.80 710.9 427.2 530.10 567.89
64 2002-03-14 F 39 227.9 714.0 V58.49 906.7 800.35 V88.0 349.31 289.84
22 2015-11-21 F 68 324.0 V65.44 411.8 200.41 187.7 E869.3 041.04 170.4
36 2003-09-17 F 75 389.1 176.3 788.37 E936.3 277.82 812.12 E816.7 663.90
11 2000-10-07 M 74 716.90 396.3 482.1 E816.7 663.90 716.90 396.3 482.1
45 2001-07-14 F 31 614.2 945.44 799.4 864.05 371.31 268 626.2 780.72
60 1999-02-23 M 45 674 645.2 006.5 V68.2 V67.00 665.24 434.00 914.3
I also have another dataset, a lookup table, that contains short descriptions of the codes in Col1, Col2, Col3, Col4, Col5, Col6, Col7 and Col8, as shown below:
Code Short_Description
345.50 interStellar
884.2 indispensable
008.69 hallucination
202.18 flow
189.8 categorizing
435.2 choppiness
084.7 chieftain
757 substantiating
V58.49 unbridled
V88.0 polish
324.0 stumble
V65.44 hoopster
411.8 overtrimmed
E869.3 overbrutalizing
041.04 choric
E936.3 busera
277.82 subdelegating
E816.7 baton
663.90 Space
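My question is how to match the codes in the first dataset against the codes in the second (lookup) dataset and replace each matched code with its short description.
The expected output below shows, for example, code 345.50 matched and replaced with interStellar, and V58.49 matched and replaced with unbridled.
I would like an output in which all codes are matched and replaced with their descriptions. I know how to do this with if-then-else, but that would be very inefficient, and I think there should be a simpler way. Any help is much appreciated. Thanks in advance.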
ID Date Gender Age Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
10 2015-10-14 F 68 interStellar 884.2 008.69 202.18 189.8 435.2 084.7 757
93 2002-07-22 F 87 242.80 710.9 interStellar 884.2 008.69 202.18 189.8 435.2
14 2004-07-28 M 92 084.7 757 242.80 710.9 427.2 530.10 567.89 227.9
41 2011-02-24 M 39 714.0 084.7 757 242.80 710.9 427.2 530.10 567.89
64 2002-03-14 F 39 227.9 714.0 unbridled 906.7 800.35 V88.0 349.31 289.84
22 2015-11-21 F 68 324.0 hoopster 411.8 200.41 187.7 E869.3 041.04 170.4
36 2003-09-17 F 75 389.1 176.3 788.37 E936.3 277.82 812.12 baton 663.90
11 2000-10-07 M 74 716.90 396.3 482.1 baton 663.90 716.90 396.3 482.1
45 2001-07-14 F 31 614.2 945.44 799.4 864.05 371.31 268 626.2 780.72
60 1999-02-23 M 45 674 645.2 006.5 V68.2 V67.00 665.24 434.00 914.3
==================== Reproducible dataset used in this example ====================
df1 = structure(list(ID = c(10L, 93L, 14L, 41L, 64L, 22L, 36L, 11L,
45L, 60L), Date = c("10/14/2015", "7/22/2002", "7/28/2004", "2/24/2011",
"3/14/2002", "11/21/2015", "9/17/2003", "10/7/2000", "7/14/2001",
"2/23/1999"), Gender = c("F", "F", "M", "M", "F", "F", "F", "M",
"F", "M"), Age = c(68L, 87L, 92L, 39L, 39L, 68L, 75L, 74L, 31L,
45L), Col1 = c(345.5, 242.8, 84.7, 714, 227.9, 324, 389.1, 716.9,
614.2, 674), Col2 = c("884.2", "710.9", "757", "84.7", "714",
"V65.44", "176.3", "396.3", "945.44", "645.2"), Col3 = c("8.69",
"345.5", "242.8", "757", "V58.49", "411.8", "788.37", "482.1",
"799.4", "6.5"), Col4 = c("202.18", "884.2", "710.9", "242.8",
"906.7", "200.41", "E936.3", "E816.7", "864.05", "V68.2"), Col5 = c("189.8",
"8.69", "427.2", "710.9", "800.35", "187.7", "277.82", "663.9",
"371.31", "V67.00"), Col6 = c("435.2", "202.18", "530.1", "427.2",
"V88.0", "E869.3", "812.12", "716.9", "268", "665.24"), Col7 = c("84.7",
"189.8", "567.89", "530.1", "349.31", "41.04", "E816.7", "396.3",
"626.2", "434"), Col8 = c(757, 435.2, 227.9, 567.89, 289.84,
170.4, 663.9, 482.1, 780.72, 914.3)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), .Names = c("ID", "Date",
"Gender", "Age", "Col1", "Col2", "Col3", "Col4", "Col5", "Col6",
"Col7", "Col8"), spec = structure(list(cols = structure(list(
ID = structure(list(), class = c("collector_integer", "collector"
)), Date = structure(list(), class = c("collector_character",
"collector")), Gender = structure(list(), class = c("collector_character",
"collector")), Age = structure(list(), class = c("collector_integer",
"collector")), Col1 = structure(list(), class = c("collector_double",
"collector")), Col2 = structure(list(), class = c("collector_character",
"collector")), Col3 = structure(list(), class = c("collector_character",
"collector")), Col4 = structure(list(), class = c("collector_character",
"collector")), Col5 = structure(list(), class = c("collector_character",
"collector")), Col6 = structure(list(), class = c("collector_character",
"collector")), Col7 = structure(list(), class = c("collector_character",
"collector")), Col8 = structure(list(), class = c("collector_double",
"collector"))), .Names = c("ID", "Date", "Gender", "Age",
"Col1", "Col2", "Col3", "Col4", "Col5", "Col6", "Col7", "Col8"
)), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
lookup_table = structure(list(Code = c("345.5", "884.2", "8.69", "202.18", "189.8",
"435.2", "84.7", "757", "V58.49", "V88.0", "324", "V65.44", "411.8",
"E869.3", "41.04", "E936.3", "277.82", "E816.7", "63.9"), Short_Description = c("interStellar",
"indispensable", "hallucination", "flow", "\tcategorizing", "choppiness",
"chieftain", "\tsubstantiating", "unbridled", "polish", "stumble",
"hoopster", "overtrimmed", "overbrutalizing", "choric", "busera",
"subdelegating", "baton\t", "Space")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -19L), .Names = c("Code", "Short_Description"
), spec = structure(list(cols = structure(list(Code = structure(list(), class = c("collector_character",
"collector")), Short_Description = structure(list(), class = c("collector_character",
"collector"))), .Names = c("Code", "Short_Description")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
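We can reshape the 'wide' dataset to 'long' format and back with gather/spread: first gather the code columns, then left_join with 'lookup_table', mutate 'Code' by replacing its elements with 'Short_Description' wherever a description is not missing, select the needed columns (dropping 'Short_Description'), and finally spread back to 'wide' format.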
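The steps described above can be sketched roughly as follows. This is a minimal illustration on toy data, not necessarily the exact code of the original answer: `df` and `lookup` are hypothetical stand-ins for `df1` and `lookup_table`, and the sketch assumes the code columns are already character (in `df1`, Col1 and Col8 are numeric and would need `as.character()` first) and that each code appears only once in the lookup table.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for df1 and lookup_table (hypothetical, much smaller)
df <- tibble(
  ID   = c(10L, 64L),
  Col1 = c("345.5", "227.9"),
  Col2 = c("884.2", "V58.49")
)
lookup <- tibble(
  Code = c("345.5", "V58.49"),
  Short_Description = c("interStellar", "unbridled")
)

out <- df %>%
  gather(key, Code, Col1:Col2) %>%        # wide -> long: one row per (ID, column)
  left_join(lookup, by = "Code") %>%      # attach a description where the code is known
  mutate(Code = ifelse(is.na(Short_Description), Code, Short_Description)) %>%
  select(-Short_Description) %>%          # drop the helper column
  spread(key, Code)                       # long -> wide again

out
```

Codes without an entry in the lookup table are left unchanged, which matches the expected output in the question.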
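For large datasets, another option is set from data.table. Create a numeric index of the column names that contain 'Col' as a substring ('nm1'). Convert the 'data.frame' to a 'data.table' (setDT(df1)), loop through the 'nm1' columns after specifying them in .SDcols, and convert them to character (because the expected output will contain character strings from the 'Short_Description' column). Then use a for loop with set to change the 'value' of the column and the rows specified in 'i' (the replacement values are found with match).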
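library(data.table)
# numeric positions of the columns whose names contain "Col"
nm1 <- grep("Col", names(df1))
# convert to data.table by reference and make the code columns character
setDT(df1)[, (nm1) := lapply(.SD, as.character), .SDcols= nm1]
for(j in nm1){
  # i: rows of column j whose code is in the lookup table
  # value: the matching descriptions (nomatch=0 drops unmatched entries)
  set(df1, i = which(df1[[j]] %chin% lookup_table$Code), j = j,
      value = lookup_table$Short_Description[match(df1[[j]], lookup_table$Code, nomatch=0)])
}
df1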
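To see why the row index and the replacement values in the loop above stay aligned, here is a small base R sketch on a single toy vector. The vectors `codes`, `lookup_code` and `lookup_desc` are hypothetical and only illustrate the `match(..., nomatch = 0)` idiom: indexing with 0 selects nothing, so unmatched codes contribute no value.

```r
# Toy data: one "column" of codes and a two-entry lookup (hypothetical)
codes       <- c("345.5", "242.8", "V58.49")
lookup_code <- c("345.5", "V58.49")
lookup_desc <- c("interStellar", "unbridled")

# Positions in `codes` that have an entry in the lookup
rows <- which(codes %in% lookup_code)

# match() gives the lookup position for each code; unmatched codes get 0,
# and a 0 index is silently dropped when subsetting
vals <- lookup_desc[match(codes, lookup_code, nomatch = 0)]

# Both have one element per matched code, in the same order
stopifnot(length(rows) == length(vals))

codes[rows] <- vals
codes
```

Because only the matched positions are overwritten, no second copy of the column is needed, which is the point of using `set` on large data.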
You can use lookup from the qdap package in a loop:
library(qdap)
df1[,13:20] <- NA
for(i in 1:dim(df1)[1]){
for(j in 1:8){
df1[i,j+12] <- lookup(df1[i,j+4], lookup_table)
}
}
head(df1)
ID Date Gender Age Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 V13 V14 V15 V16 V17 V18 V19 V20
1 10 10/14/2015 F 68 345.5 884.2 8.69 202.18 189.8 435.2 84.7 757 interStellar indispensable hallucination flow \tcategorizing choppiness chieftain \tsubstantiating
2 93 7/22/2002 F 87 242.8 710.9 345.5 884.2 8.69 202.18 189.8 435.2 <NA> <NA> interStellar indispensable hallucination flow \tcategorizing choppiness
3 14 7/28/2004 M 92 84.7 757 242.8 710.9 427.2 530.1 567.89 227.9 chieftain \tsubstantiating <NA> <NA> <NA> <NA> <NA> <NA>
4 41 2/24/2011 M 39 714 84.7 757 242.8 710.9 427.2 530.1 567.89 <NA> chieftain \tsubstantiating <NA> <NA> <NA> <NA> <NA>
5 64 3/14/2002 F 39 227.9 714 V58.49 906.7 800.35 V88.0 349.31 289.84 <NA> <NA> unbridled <NA> <NA> polish <NA> <NA>
6 22 11/21/2015 F 68 324 V65.44 411.8 200.41 187.7 E869.3 41.04 170.4 stumble hoopster overtrimmed <NA> <NA> overbrutalizing choric <NA>
library(dplyr); df1 %>% mutate_at(vars(starts_with('Col')), funs(ifelse(. %in% lookup_table$Code, lookup_table$Short_Description[match(., lookup_table$Code)], .)))
@alistaire That worked. How can I change the vars(starts_with('Col')) part of the code so that I can select variables that start with either Col or Secondary?
vars(matches('Col|Secondary')), maybe. Docs: ?dplyr::select_helpers
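A minimal sketch of selecting columns by a name pattern is shown below. It assumes dplyr 1.0 or later (for `across()`), and the `Secondary1` column and the toy data are hypothetical. Note that `matches('Col|Secondary')` matches the pattern anywhere in a column name; anchoring the regular expression with `^` restricts it to names that start with either prefix.

```r
library(dplyr)

# Toy stand-ins for df1 and lookup_table (hypothetical)
df <- tibble(
  ID         = c(10L, 64L),
  Col1       = c("345.5", "227.9"),
  Secondary1 = c("V58.49", "714")
)
lookup <- tibble(
  Code = c("345.5", "V58.49"),
  Short_Description = c("interStellar", "unbridled")
)

# Replace a code with its description where one exists, otherwise keep the code
replace_codes <- function(x) {
  x   <- as.character(x)
  idx <- match(x, lookup$Code)
  ifelse(is.na(idx), x, lookup$Short_Description[idx])
}

out <- df %>%
  mutate(across(matches("^(Col|Secondary)"), replace_codes))

out
```

The ID column is untouched because its name does not match the pattern.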
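Nice solution, but I get an out-of-memory error, mainly because I am working with a large dataset that has many columns and at least a million rows. However, alistaire's solution worked.
@KimJenkins Thanks for the comment. I updated the data.table option. Could you check my earlier comment?
The latest version is like a knife through butter. It is smooth*100.
I appreciate this solution, but I am not even going to try this option because, as I mentioned earlier, my dataset has hundreds of columns and millions of rows, and with your approach it looks like I would be adding twice the number of columns, which would cause memory problems.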