Match and replace in R


I have a dataset with many rows and columns. Below is a snapshot of some of the rows and columns:

  ID  Date        Gender Age  Col1    Col2    Col3    Col4    Col5    Col6    Col7    Col8 
  10  2015-10-14  F      68   345.50  884.2   008.69  202.18  189.8   435.2   084.7   757
  93  2002-07-22  F      87   242.80  710.9   345.50  884.2   008.69  202.18  189.8   435.2  
  14  2004-07-28  M      92   084.7   757     242.80  710.9   427.2   530.10  567.89  227.9
  41  2011-02-24  M      39   714.0   084.7   757     242.80  710.9   427.2   530.10  567.89
  64  2002-03-14  F      39   227.9   714.0   V58.49  906.7   800.35  V88.0   349.31  289.84 
  22  2015-11-21  F      68   324.0   V65.44  411.8   200.41  187.7   E869.3  041.04  170.4
  36  2003-09-17  F      75   389.1   176.3   788.37  E936.3  277.82  812.12  E816.7  663.90
  11  2000-10-07  M      74   716.90  396.3   482.1   E816.7  663.90  716.90  396.3   482.1 
  45  2001-07-14  F      31   614.2   945.44  799.4   864.05  371.31  268     626.2   780.72
  60  1999-02-23  M      45   674     645.2   006.5   V68.2   V67.00  665.24  434.00  914.3

I also have another dataset, a lookup table, which contains short descriptions of the codes found in Col1, Col2, Col3, Col4, Col5, Col6, Col7 and Col8, as shown below:

 Code       Short_Description
 345.50     interStellar
 884.2      indispensable
 008.69     hallucination
 202.18     flow
 189.8      categorizing
 435.2      choppiness
 084.7      chieftain
 757        substantiating
 V58.49     unbridled
 V88.0      polish
 324.0      stumble
 V65.44     hoopster
 411.8      overtrimmed
 E869.3     overbrutalizing
 041.04     choric
 E936.3     busera
 277.82     subdelegating
 E816.7     baton   
 663.90     Space

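My question is how to match the codes in the first dataset against the codes in the lookup dataset and replace each matched code with its short description.

The expected output below shows the code 345.50 matched and replaced with interStellar, and V58.49 matched and replaced with unbridled.

I am hoping to get an output in which all codes are matched and replaced with their descriptions. I know how to do this with if-then-else, but that would be very inefficient, and I think there should be a simpler way. Any help is much appreciated. Thanks in advance.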

  ID  Date        Gender Age  Col1    Col2    Col3    Col4    Col5    Col6    Col7    Col8 
  10  2015-10-14  F      68   interStellar  884.2   008.69  202.18  189.8   435.2   084.7   757
  93  2002-07-22  F      87   242.80  710.9   interStellar  884.2   008.69  202.18  189.8   435.2  
  14  2004-07-28  M      92   084.7   757     242.80  710.9   427.2   530.10  567.89  227.9
  41  2011-02-24  M      39   714.0   084.7   757     242.80  710.9   427.2   530.10  567.89
  64  2002-03-14  F      39   227.9   714.0   unbridled  906.7   800.35  V88.0   349.31  289.84 
  22  2015-11-21  F      68   324.0  hoopster  411.8   200.41  187.7   E869.3  041.04  170.4
  36  2003-09-17  F      75   389.1   176.3   788.37  E936.3  277.82  812.12  baton  663.90
  11  2000-10-07  M      74   716.90  396.3   482.1   baton  663.90  716.90  396.3   482.1 
  45  2001-07-14  F      31   614.2   945.44  799.4   864.05  371.31  268     626.2   780.72
  60  1999-02-23  M      45   674     645.2   006.5   V68.2   V67.00  665.24  434.00  914.3

==================== Reproducible dataset used in this example ========================

df1 = structure(list(ID = c(10L, 93L, 14L, 41L, 64L, 22L, 36L, 11L, 
45L, 60L), Date = c("10/14/2015", "7/22/2002", "7/28/2004", "2/24/2011", 
"3/14/2002", "11/21/2015", "9/17/2003", "10/7/2000", "7/14/2001", 
"2/23/1999"), Gender = c("F", "F", "M", "M", "F", "F", "F", "M", 
"F", "M"), Age = c(68L, 87L, 92L, 39L, 39L, 68L, 75L, 74L, 31L, 
45L), Col1 = c(345.5, 242.8, 84.7, 714, 227.9, 324, 389.1, 716.9, 
614.2, 674), Col2 = c("884.2", "710.9", "757", "84.7", "714", 
"V65.44", "176.3", "396.3", "945.44", "645.2"), Col3 = c("8.69", 
"345.5", "242.8", "757", "V58.49", "411.8", "788.37", "482.1", 
"799.4", "6.5"), Col4 = c("202.18", "884.2", "710.9", "242.8", 
"906.7", "200.41", "E936.3", "E816.7", "864.05", "V68.2"), Col5 = c("189.8", 
"8.69", "427.2", "710.9", "800.35", "187.7", "277.82", "663.9", 
"371.31", "V67.00"), Col6 = c("435.2", "202.18", "530.1", "427.2", 
"V88.0", "E869.3", "812.12", "716.9", "268", "665.24"), Col7 = c("84.7", 
"189.8", "567.89", "530.1", "349.31", "41.04", "E816.7", "396.3", 
"626.2", "434"), Col8 = c(757, 435.2, 227.9, 567.89, 289.84, 
170.4, 663.9, 482.1, 780.72, 914.3)), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -10L), .Names = c("ID", "Date", 
"Gender", "Age", "Col1", "Col2", "Col3", "Col4", "Col5", "Col6", 
"Col7", "Col8"), spec = structure(list(cols = structure(list(
    ID = structure(list(), class = c("collector_integer", "collector"
    )), Date = structure(list(), class = c("collector_character", 
    "collector")), Gender = structure(list(), class = c("collector_character", 
    "collector")), Age = structure(list(), class = c("collector_integer", 
    "collector")), Col1 = structure(list(), class = c("collector_double", 
    "collector")), Col2 = structure(list(), class = c("collector_character", 
    "collector")), Col3 = structure(list(), class = c("collector_character", 
    "collector")), Col4 = structure(list(), class = c("collector_character", 
    "collector")), Col5 = structure(list(), class = c("collector_character", 
    "collector")), Col6 = structure(list(), class = c("collector_character", 
    "collector")), Col7 = structure(list(), class = c("collector_character", 
    "collector")), Col8 = structure(list(), class = c("collector_double", 
    "collector"))), .Names = c("ID", "Date", "Gender", "Age", 
"Col1", "Col2", "Col3", "Col4", "Col5", "Col6", "Col7", "Col8"
)), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))


lookup_table = structure(list(Code = c("345.5", "884.2", "8.69", "202.18", "189.8", 
"435.2", "84.7", "757", "V58.49", "V88.0", "324", "V65.44", "411.8", 
"E869.3", "41.04", "E936.3", "277.82", "E816.7", "63.9"), Short_Description = c("interStellar", 
"indispensable", "hallucination", "flow", "\tcategorizing", "choppiness", 
"chieftain", "\tsubstantiating", "unbridled", "polish", "stumble", 
"hoopster", "overtrimmed", "overbrutalizing", "choric", "busera", 
"subdelegating", "baton\t", "Space")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -19L), .Names = c("Code", "Short_Description"
), spec = structure(list(cols = structure(list(Code = structure(list(), class = c("collector_character", 
"collector")), Short_Description = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Code", "Short_Description")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

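We can reshape the "wide" dataset to "long" format with gather/spread. First use gather, then do a left_join with 'lookup_table', mutate 'Code' by replacing its elements with 'Short_Description' (where the description is not missing), and, after selecting the required columns (dropping 'Short_Description'), spread back to "wide" format.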
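The reshape, join, replace and reshape-back steps described above can be sketched roughly as follows. This is a minimal illustration, not the answer's original code: it assumes the dplyr and tidyr packages are installed, and the data frames `wide` and `lookup` are hypothetical toy data with all code columns already stored as character.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, much smaller than the question's df1
wide <- data.frame(ID   = c(10L, 93L),
                   Col1 = c("345.5", "242.8"),
                   Col2 = c("884.2", "V58.49"),
                   stringsAsFactors = FALSE)
lookup <- data.frame(Code              = c("345.5", "V58.49"),
                     Short_Description = c("interStellar", "unbridled"),
                     stringsAsFactors  = FALSE)

result <- wide %>%
  gather(key, Code, Col1:Col2) %>%       # wide -> long: one row per ID/column pair
  left_join(lookup, by = "Code") %>%     # attach a description where the code matches
  mutate(Code = ifelse(is.na(Short_Description), Code, Short_Description)) %>%
  select(-Short_Description) %>%         # drop the helper column
  spread(key, Code)                      # long -> wide again

result
```

Codes without an entry in the lookup table are left unchanged. Because the long format holds one row per cell, this approach may need considerably more memory than in-place replacement on very large data.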

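For large datasets, another option is set from data.table. Create a numeric index ('nm1') of the columns whose names contain "Col" as a substring. Convert the 'data.frame' to a 'data.table' (setDT(df1)), then loop through the 'nm1' columns, specified in .SDcols, and convert them to character (because the expected output will contain character strings from the 'Short_Description' column). Finally, in a for loop, use set to change the 'value' of the columns given in 'j' and the rows given in 'i' (found using match).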
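library(data.table)

# Numeric positions of the columns whose names contain "Col"
nm1 <- grep("Col", names(df1))

# Convert to data.table by reference and make the code columns character
setDT(df1)[, (nm1) := lapply(.SD, as.character), .SDcols = nm1]

# For each code column, overwrite only the rows whose code is in the lookup table
for (j in nm1) {
  set(df1, i = which(df1[[j]] %chin% lookup_table$Code), j = j,
      value = lookup_table$Short_Description[match(df1[[j]], lookup_table$Code, nomatch = 0)])
}

df1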
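
To see why the value passed to set lines up with the rows selected in i, it may help to run the same indexing on a plain vector. The sketch below uses base R only; the vectors `codes`, `lookup_code` and `lookup_desc` are hypothetical toy data, and %in% stands in for data.table's %chin%.

```r
# Hypothetical toy vectors, not the question's data
codes       <- c("345.5", "242.8", "V58.49", "710.9")
lookup_code <- c("345.5", "V58.49")
lookup_desc <- c("interStellar", "unbridled")

# Rows whose code appears in the lookup table
hit_rows <- which(codes %in% lookup_code)

# Position of each code in the lookup table, 0 where there is no match
pos <- match(codes, lookup_code, nomatch = 0)

# Indexing with 0 selects nothing, so only the matched descriptions remain,
# in the same order as hit_rows
new_vals <- lookup_desc[pos]

codes[hit_rows] <- new_vals
codes
```

Since zero indices are dropped, new_vals has exactly one element per matched row, which is what allows the assignment to touch only those rows.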

You can use lookup from the qdap package in a loop:

library(qdap)

df1[,13:20] <- NA

for(i in 1:dim(df1)[1]){
  for(j in 1:8){
    df1[i,j+12] <- lookup(df1[i,j+4], lookup_table)
  }
}

head(df1)

  ID       Date Gender Age  Col1   Col2   Col3   Col4   Col5   Col6   Col7   Col8          V13              V14              V15           V16            V17             V18            V19              V20
1 10 10/14/2015      F  68 345.5  884.2   8.69 202.18  189.8  435.2   84.7    757 interStellar    indispensable    hallucination          flow \tcategorizing      choppiness      chieftain \tsubstantiating
2 93  7/22/2002      F  87 242.8  710.9  345.5  884.2   8.69 202.18  189.8  435.2         <NA>             <NA>     interStellar indispensable  hallucination            flow \tcategorizing       choppiness
3 14  7/28/2004      M  92  84.7    757  242.8  710.9  427.2  530.1 567.89  227.9    chieftain \tsubstantiating             <NA>          <NA>           <NA>            <NA>           <NA>             <NA>
4 41  2/24/2011      M  39   714   84.7    757  242.8  710.9  427.2  530.1 567.89         <NA>        chieftain \tsubstantiating          <NA>           <NA>            <NA>           <NA>             <NA>
5 64  3/14/2002      F  39 227.9    714 V58.49  906.7 800.35  V88.0 349.31 289.84         <NA>             <NA>        unbridled          <NA>           <NA>          polish           <NA>             <NA>
6 22 11/21/2015      F  68   324 V65.44  411.8 200.41  187.7 E869.3  41.04  170.4      stumble         hoopster      overtrimmed          <NA>           <NA> overbrutalizing         choric             <NA>

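library(dplyr); df1 %>% mutate_at(vars(starts_with('Col')), funs(ifelse(. %in% lookup_table$Code, lookup_table$Short_Description[match(., lookup_table$Code)], .)))

@alistaire That worked. How can I change the vars(starts_with('Col')) part of the code so that I can select variables that start with Col or Secondary?

vars(matches('Col|Secondary')), maybe. Documentation: ?dplyr::select_helpers

Nice solution, but I get an out-of-memory error, mainly because I am working with a large dataset with many columns and at least a million rows. However, alistaire's solution worked.

@KimJenkins Thanks for the comment. I updated with a data.table option. Could you check my previous comment?

The latest version is like a knife through butter. It is smooth*100.

I appreciate this solution, but I am not even going to try this option because, as I mentioned earlier, my dataset has hundreds of columns and millions of rows. If I tried your approach, it looks like I would be adding twice the number of columns, which would cause memory issues.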
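
The two concerns raised in the comments, selecting columns that start with Col or Secondary and avoiding extra columns, can also be sketched in base R. This is only an illustration under stated assumptions: the data frames `df` and `lookup` are hypothetical toy data, and the `Secondary1` column is invented for the example.

```r
# Hypothetical toy data; the Secondary1 column is an assumption for illustration
df <- data.frame(ID         = c(10L, 64L),
                 Col1       = c("345.5", "227.9"),
                 Secondary1 = c("V58.49", "084.7"),
                 stringsAsFactors = FALSE)
lookup <- data.frame(Code              = c("345.5", "V58.49"),
                     Short_Description = c("interStellar", "unbridled"),
                     stringsAsFactors  = FALSE)

# Names of the columns that start with "Col" or "Secondary"
target <- grep("^(Col|Secondary)", names(df), value = TRUE)

# Overwrite the selected columns in place; unmatched codes are kept as they are
df[target] <- lapply(df[target], function(x) {
  x   <- as.character(x)
  pos <- match(x, lookup$Code)
  ifelse(is.na(pos), x, lookup$Short_Description[pos])
})

df
```

No new columns are created, so the column count stays the same. Whether this is fast enough for millions of rows is untested here; the data.table set approach is aimed at that case.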