R data.frame manipulation: convert values to NA after a specific column


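I have a large data.frame and need to apply a row-wise transformation. My goal: when a specific character appears in a column, convert all values after it in that row to NA.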

For example, here is a small sample from the real data set:

sample_df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"),  d = c("V","V","I","V"))


result_df <- data.frame( a = c("V","I","V","V"), b = c("I",NA,"V","V"), c = c(NA,NA,"I","V"), d = c(NA,NA,NA,"V"))
Try this:

Find the "I" values

Example 2:

sample_df <- data.frame( a = c("V","I","I","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
sample_df
  a b c d
1 V I V V
2 I V V V
3 I V I I
4 V V V V
Output:
     a   b   c   d  
[1,] "V" "I" NA  NA 
[2,] "I" NA  NA  NA 
[3,] "I" NA  NA  NA 
[4,] "V" "V" "V" "V"
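
A minimal sketch that reproduces the matrix output above from the Example 2 data. It is not necessarily the code of the original answer; na_after_first_I is a hypothetical helper name introduced for illustration:

```r
sample_df <- data.frame(
  a = c("V", "I", "I", "V"),
  b = c("I", "V", "V", "V"),
  c = c("V", "V", "I", "V"),
  d = c("V", "V", "I", "V"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: set everything after the first "I" in one row to NA
na_after_first_I <- function(x) {
  hits <- which(x == "I")  # find the "I" values
  # Act only if an "I" exists and it is not already in the last position
  if (length(hits) > 0 && hits[1] < length(x)) {
    x[(hits[1] + 1):length(x)] <- NA
  }
  x
}

# apply() passes each row as a character vector and returns one column
# per input row, so the result is transposed back with t()
output <- t(apply(sample_df, 1, na_after_first_I))
output
```

Because apply() converts the data frame to a character matrix, the result is a matrix and not a data.frame, which matches the printed output above.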

Here is a brute-force approach. It is probably the easiest one to think of, but also the least popular. Anyway, here it is:

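df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"),  d = c("V","V","I","V"), stringsAsFactors=FALSE)
rowlength <- ncol(df)
for (i in seq_len(nrow(df))) {
   # Column positions of "I" in the current row
   hits <- which(as.character(df[i, ]) == "I")
   # Only act when an "I" exists and is not already in the last column;
   # otherwise (rowlength + 1):rowlength would count backwards
   if (length(hits) > 0 && hits[1] < rowlength) {
      df[i, (hits[1] + 1):rowlength] <- NA
   }
}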

Here is a possible answer using ddply from the plyr package:

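library(plyr)
ddply(sample_df, .(a, b, c, d), function(x) {
  # Identical rows end up in the same group, so inspect the first row only
  idx <- which(x[1, ] == "I")[1] + 1    # column after the first "I"
  if (!is.na(idx) && idx <= ncol(x)) {  # found, and not out of bounds
    x[, idx:ncol(x)] <- NA
  }
  x
})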

A plyr approach:

plyr::adply(sample_df, 1L, function(x) { 
  if (all(x != "I")) 
    return(x)
  x[1L:min(which(x == "I"))]
})

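You have to use the if because, for rows without at least one I, which(x == "I") returns a zero-length result, and min() of it cannot be used as an index.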

My solution:

Following @Julien Navarre's recommendation, I first created the toNA() function:

toNA <- function(x) {
  # Positions of the values containing the marker (works for any string)
  temp <- grep("INVALID", unlist(x))
  lt <- length(x)
  # First position after the first match; lt + 1 when nothing matches
  loc <- min(temp, lt) + 1
  if (loc <= lt) {
    x[loc:lt] <- NA
  }
  x
}
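1000 rows take only 0.2 seconds.

I am not sure about the programming style, but this solution works for me for now.

Thanks for all your suggestions.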

A pure base R solution: we build a logical matrix with == "I", and a double cumulative sum along each row then tells us where the NAs have to go:

result_df <- sample_df
is.na(result_df) <- t(apply(sample_df == "I",1,function(x) cumsum(cumsum(x)))) >1

result_df 
#   a    b    c    d
# 1 V    I <NA> <NA>
# 2 I <NA> <NA> <NA>
# 3 V    V    I <NA>
# 4 V    V    V    V
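
To see why the double cumulative sum works, and that it also copes with several "I" values in one row, it may help to trace a single row by hand. The vector x below is a made-up row of the logical matrix, not taken from the data:

```r
# Made-up row: "I" appears in the 2nd and 4th position
x <- c(FALSE, TRUE, FALSE, TRUE)

first  <- cumsum(x)      # 0 1 1 2 : how many "I" have been seen so far
second <- cumsum(first)  # 0 1 2 4 : equals 1 only at the first "I" itself
mask   <- second > 1     # FALSE FALSE TRUE TRUE

mask
```

The first "I" is kept because the running total is exactly 1 there; every later position exceeds 1 and is flagged for NA, whether or not it holds another "I".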

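Comments:

This solution works for the sample data, but I think it will not work if a row contains more than one "I".

Thanks for your reply. My actual values contain "INVALID | |……", so I use grepl or str_detect to look for the "INVALID" marker. I tried grepl("INVALID", real_data_set), but it only returns a vector. How can I extend the first step to grepl? Second, the solution provided by @Antonis does not work correctly on sample_df; please check this example.

@DenizTopcu Is the example above real?

I wrote this toNA function, but it is too slow because my data frame has 3 million rows and 20 columns. Do you have any suggestions? Thanks.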
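
toNA is applied row by row like this:

library(tibble)
# apply() returns one column per input row, so transpose back with t()
as_tibble(t(apply(result, 1, toNA)))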
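
For the 3-million-row case raised in the comments, one option is to avoid the per-row apply() call and loop over the columns instead, keeping one running flag per row. This is only a sketch under assumptions: mask_after_first is a hypothetical helper name, the sample strings are made up, and the marker is matched literally with grepl(fixed = TRUE). It has not been timed on the real data.

```r
# Hypothetical helper: set every cell after the first match in a row to NA.
# The loop runs once per column (about 20 times), not once per row.
mask_after_first <- function(df, pattern) {
  seen <- rep(FALSE, nrow(df))  # TRUE once a row has had a match
  for (j in seq_along(df)) {
    # Detect matches in this column before any of its cells are masked
    hit <- grepl(pattern, df[[j]], fixed = TRUE)
    # Cells in rows that matched in an earlier column become NA
    df[[j]][seen] <- NA
    seen <- seen | hit
  }
  df
}

# Made-up sample values standing in for the real "INVALID" strings
demo_df <- data.frame(
  a = c("ok", "INVALID||x", "ok"),
  b = c("INVALID||y", "ok", "ok"),
  c = c("ok", "ok", "ok"),
  stringsAsFactors = FALSE
)

masked <- mask_after_first(demo_df, "INVALID")
masked
```

The cell holding the marker itself is kept, as in the expected result from the question, and the data stays a data.frame throughout, so no transpose or conversion back from a matrix is needed.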