R: replace column values by looking them up in a dictionary


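In this question I need to look up values for the columns of a data frame based not on a single attribute only, but on additional attributes and ranges that are compared against a dictionary. (Yes, this is actually the continuation of an earlier story.)

For people who know R this should be a simple problem, because I am providing a working solution for basic indexing that only needs to be upgraded, probably easily... but it is hard for me, because I am still learning R.

Where to start: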

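When I wanted to replace the missing values in the columns testcolnames of a (big) table df1 with the default values from a column of a (small) dictionary testdefs (the row is selected by matching testdefs$LABMET_ID against the column name from testcolnames), I used the following code: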

testcolnames <- c("80", "116")  # ...result of regexp on colnames(df1), originally much longer

df1[, testcolnames] <- lapply(testcolnames, function(x) {
  tmpcol <- df1[, x]
  # fill NAs with the default from the dictionary row whose LABMET_ID equals the column name
  tmpcol[is.na(tmpcol)] <- testdefs$default[match(x, testdefs$LABMET_ID)]
  tmpcol
})
df1:

"rngvalue","80","116"
36,NA,NA
600000,NA,NA
367,5,NA
90,NA,6

to be transformed into:

"rngvalue","80","116"
36,0.03,0.135                   #col80 is always replaced by 0.03
600000,0.03,0.105               #col116 needs to be decided on range, this value is bigger than everything in dictionary so take the last one
367,5,0.11                      #5 not replaced, but second column nicely looks up to 0.11
90,0.03,6                       #6 not replaced

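Since the intervals have no gaps, you can use findInterval. I would use dlply from plyr to turn the lookup table into a list that holds the breaks and the default values for each ID.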

## Transform lookup table to a list with breaks for intervals
library(plyr)
lookup <- dlply(testdefs, .(LABMET_ID), function(x)
    # breaks = every lower bound, followed by the last upper bound
    list(breaks=c(x$lower, x$upper[length(x$upper)]),
         default=x$default))
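To see what the breaks are used for, here is a small sketch of the findInterval step on its own. The breaks and defaults are hypothetical: they are inferred from the intervals for "116" mentioned in the comments (31-365, 366-5475, 5476-54750) and from the desired output, not taken from the real testdefs.

```r
# Hypothetical breaks/defaults for ID "116" (assumed, not the real testdefs)
breaks   <- c(31, 366, 5476, 54750)
defaults <- c(0.135, 0.110, 0.105)

rngvalue <- c(36, 600000, 367, 90, 5)

# findInterval() returns 0 below the first break and length(breaks)
# at or above the last break, so both ends need clamping
idx <- findInterval(rngvalue, breaks)

# Clamp the index into 1..length(defaults): values outside the
# dictionary fall back to the first or the last interval
idx_clamped <- pmax(pmin(length(breaks) - 1, idx), 1)

result <- data.frame(rngvalue, idx, idx_clamped,
                     default = defaults[idx_clamped])
print(result)
```

The clamping is what makes a value such as 600000, which is larger than everything in the dictionary, pick up the default of the last interval.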
You can then perform the lookup with:

testcolnames=c("80","116")

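df1[,testcolnames] <- lapply(testcolnames, function(x) {
    tmpcol <- df1[,x]
    # default for every row, chosen by the interval that rngvalue falls into;
    # the index is clamped so out-of-range values use the first/last interval
    defaults <- with(lookup[[x]], {
        default[pmax(pmin(length(breaks)-1, findInterval(df1$rngvalue, breaks)), 1)]
    })
    # only missing values are replaced
    tmpcol[is.na(tmpcol)] <- defaults[is.na(tmpcol)]
    tmpcol
})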

#   rngvalue   80   116
# 1       36 0.03 0.135
# 2   600000 0.03 0.105
# 3      367 5.00 0.110
# 4       90 0.03 6.000
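Are the intervals always contiguous, as for "116", i.e. (31-365, 366-5475, 5476-54750, etc.), with no gaps?

Yes! Sorry, I forgot to mention that :)

Thank you! Works like a charm! I only had to write lookup[[x]] out explicitly (not sure why "with" did not work), and I needed to add an ifelse for the case where the dictionary has no entry for the column being replaced.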
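A self-contained sketch of the whole lookup, including a guard for columns that have no dictionary entry, might look like the following. It makes several assumptions: the testdefs rows are hypothetical (only the "116" intervals and the defaults are inferred from the thread; the bounds for "80" are made up), the column "999" is invented to exercise the guard, base R's split() stands in for dlply, and a plain if on is.null() is used instead of the ifelse mentioned above.

```r
# df1 as shown in the question, plus a hypothetical column "999"
# that has no entry in the dictionary
df1 <- data.frame(rngvalue = c(36, 600000, 367, 90),
                  `80`  = c(NA, NA, 5, NA),
                  `116` = c(NA, NA, NA, 6),
                  `999` = c(NA, 1, NA, NA),
                  check.names = FALSE)

# Hypothetical dictionary (assumed values, not the real testdefs)
testdefs <- data.frame(
  LABMET_ID = c("80", "116", "116", "116"),
  lower     = c(0, 31, 366, 5476),
  upper     = c(1e6, 365, 5475, 54750),
  default   = c(0.03, 0.135, 0.110, 0.105),
  stringsAsFactors = FALSE)

# Base-R stand-in for the dlply() step: one list entry per LABMET_ID
lookup <- lapply(split(testdefs, testdefs$LABMET_ID), function(x)
  list(breaks = c(x$lower, x$upper[nrow(x)]), default = x$default))

testcolnames <- c("80", "116", "999")

df1[testcolnames] <- lapply(testcolnames, function(x) {
  tmpcol <- df1[[x]]
  entry  <- lookup[[x]]
  # no dictionary entry for this column: leave it unchanged
  if (is.null(entry)) return(tmpcol)
  # interval index per row, clamped to the first/last interval
  idx <- findInterval(df1$rngvalue, entry$breaks)
  idx <- pmax(pmin(length(entry$breaks) - 1, idx), 1)
  # replace only the missing values
  fill <- is.na(tmpcol)
  tmpcol[fill] <- entry$default[idx[fill]]
  tmpcol
})

print(df1)
```

Accessing lookup[[x]] through a local variable also sidesteps the with() call, which may help if with() misbehaves in a particular setup.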