R data.frame manipulation: convert values to NA after a specific column
I have a large data.frame on which I need to perform some row-wise transformations. My goal is to convert values in a row to NA once a specific character appears in a column; that is, every value after the first "I" in a row should become NA.
For example, here is a small sample from my real data set:
sample_df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
The desired result is:
result_df <- data.frame( a = c("V","I","V","V"), b = c("I",NA,"V","V"), c = c(NA,NA,"I","V"), d = c(NA,NA,NA,"V"))
Try this: find the "I" values in each row.
Example 2:
sample_df <- data.frame( a = c("V","I","I","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
sample_df
  a b c d
1 V I V V
2 I V V V
3 I V I I
4 V V V V
Output:
     a   b   c   d
[1,] "V" "I" NA  NA
[2,] "I" NA  NA  NA
[3,] "I" NA  NA  NA
[4,] "V" "V" "V" "V"
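A minimal sketch of one way to reproduce the Example 2 output with apply(). It is an illustration only, not necessarily the code this answer originally used; the helper name na_after_first and its marker argument are hypothetical.

```r
sample_df <- data.frame(
  a = c("V", "I", "I", "V"),
  b = c("I", "V", "V", "V"),
  c = c("V", "V", "I", "V"),
  d = c("V", "V", "I", "V"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: blank out everything after the first `marker` in a vector.
na_after_first <- function(x, marker = "I") {
  hit <- match(marker, x)  # index of the first marker, NA if there is none
  if (!is.na(hit) && hit < length(x)) {
    x[(hit + 1):length(x)] <- NA
  }
  x
}

# apply() passes each row as a character vector and stacks the results
# column-wise, hence the t() to get back one row per original row.
output <- t(apply(sample_df, 1, na_after_first))
output
```

Because apply() converts the data frame to a matrix first, the result is a character matrix rather than a data.frame; wrap it in as.data.frame() if a data frame is needed.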
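Here is a brute-force approach; it is probably the easiest to come up with, but also the least elegant. Anyway, here it is:
df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"), stringsAsFactors=FALSE)
rowlength <- ncol(df)
for (i in seq_len(nrow(df))) {
  hits <- which(as.character(df[i, ]) == "I")
  # Only overwrite when an "I" exists and is not already in the last column;
  # otherwise first:rowlength would count backwards and index out of bounds
  if (length(hits) > 0 && hits[1] < rowlength) {
    first <- hits[1] + 1
    df[i, first:rowlength] <- NA
  }
}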
df
Here is a possible answer using ddply from the plyr package:
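library(plyr)
ddply(sample_df, .(a, b, c, d), function(x) {
  # All rows in a group are identical, so inspect the first one only
  idx <- which(x[1, ] == "I")[1] + 1  # column index after the first "I"
  if (!is.na(idx)) {                  # check that an "I" was found
    if (idx <= ncol(x)) {             # prevent out-of-bounds indexing
      x[, idx:ncol(x)] <- NA
    }
  }
  x
})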
Another plyr approach, using adply:
plyr::adply(sample_df, 1L, function(x) {
  if (all(x != "I"))
    return(x)
  x[1L:min(which(x == "I"))]
})
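You need the if guard because, for rows that contain no "I", which(x == "I") is empty, so min() returns Inf (with a warning) and the index 1L:Inf fails.
My solution:
Following @Julien Navarre's recommendation, I first created the toNA() function: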
toNA <- function(x) {
  temp <- grep("INVALID", unlist(x))  # can be generalized to any string
  lt <- length(x)
  loc <- min(temp, 100) + 1  # 100 is an arbitrary number bigger than the actual column count
  # print(lt)  # for debugging
  if (loc < lt + 1) {
    x[loc:lt] <- NA
  }
  x
}
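A small usage sketch for toNA(), assuming a made-up data frame (demo_df and its values are illustrative, not taken from the real data set). The function is repeated so the block runs on its own, and base as.data.frame() is used here to avoid extra dependencies.

```r
toNA <- function(x) {
  temp <- grep("INVALID", unlist(x))  # positions of values containing the marker
  lt <- length(x)
  loc <- min(temp, 100) + 1           # first position to blank; 101 if no marker
  if (loc < lt + 1) {
    x[loc:lt] <- NA
  }
  x
}

# Illustrative data: one marker per row, in a different column each time.
demo_df <- data.frame(
  a = c("ok", "INVALID||x", "ok"),
  b = c("INVALID||y", "ok", "ok"),
  c = c("ok", "ok", "INVALID||z"),
  stringsAsFactors = FALSE
)

# Apply toNA() to every row, then transpose back to one row per record.
cleaned <- as.data.frame(t(apply(demo_df, 1, toNA)), stringsAsFactors = FALSE)
cleaned
```

A marker in the last column leaves the row unchanged, because there is nothing after it to blank out.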
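1000 rows take only 0.2 seconds.
I am not sure about the programming style, but this solution works for me for now.
Thanks for all your suggestions.
A pure base R solution: we build a boolean matrix with == "I"; then, by taking a double cumulative sum along each row, we can find where the NAs must be placed: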
result_df <- sample_df
is.na(result_df) <- t(apply(sample_df == "I",1,function(x) cumsum(cumsum(x)))) >1
result_df
# a b c d
# 1 V I <NA> <NA>
# 2 I <NA> <NA> <NA>
# 3 V V I <NA>
# 4 V V V V
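To see why the double cumulative sum works, it may help to trace a single row by hand. The first cumsum() is 0 before the first "I" and at least 1 from it onward; the second is exactly 1 at the first "I" and greater than 1 everywhere after it. The sketch below (variable names such as hits, ex2 and mask are illustrative) also reruns the approach on the Example 2 data, whose third row contains several "I" values.

```r
# One row with two "I" values, e.g. V I V I
hits  <- c(FALSE, TRUE, FALSE, TRUE)
once  <- cumsum(hits)   # 0 1 1 2
twice <- cumsum(once)   # 0 1 2 4
twice > 1               # FALSE FALSE TRUE TRUE: only positions after the first "I"

# Same idea on the Example 2 data frame
ex2 <- data.frame(
  a = c("V", "I", "I", "V"),
  b = c("I", "V", "V", "V"),
  c = c("V", "V", "I", "V"),
  d = c("V", "V", "I", "V"),
  stringsAsFactors = FALSE
)
mask <- t(apply(ex2 == "I", 1, function(x) cumsum(cumsum(x)))) > 1
res <- ex2
is.na(res) <- mask
res
```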
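Comments:
This solution works for the sample data, but I think it will not work if a row contains more than one "I".
Thanks for your reply. My actual values contain "INVALID||……", so I use grepl or str_detect to look for the "INVALID" marker. I tried grepl("INVALID", real_data_set), but it only returns a vector. How can I extend the first step to grepl? Second, the solution provided by @Antonis does not work correctly; please check this example: sample_df
@DenizTopcu Is the example above real?
I wrote this toNA function, but it is too slow because my data frame has 3 million rows and 20 columns. Do you have any suggestions? Thanks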
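On the grepl() question raised in the comments: grepl() works on one vector at a time, so one option is to call it per column and bind the results into a logical matrix. The following is only a sketch under assumptions (real_df and its contents are made up, since the real data were not shown). It swaps the row-wise apply() for max.col(), which may help on large data, although no timing is claimed here.

```r
# Illustrative stand-in for the real data set
real_df <- data.frame(
  a = c("ok", "INVALID||x", "ok"),
  b = c("INVALID||y", "ok", "ok"),
  c = c("ok", "ok", "ok"),
  stringsAsFactors = FALSE
)

# One grepl() call per column gives an nrow x ncol logical matrix
# (assumes more than one row; with a single row vapply() returns a vector).
hits <- vapply(
  real_df,
  function(col) grepl("INVALID", col, fixed = TRUE),
  logical(nrow(real_df))
)

# Column of the first hit in each row; rows without a hit get the last
# column so that nothing in them is blanked.
first <- max.col(hits * 1, ties.method = "first")
first[rowSums(hits) == 0] <- ncol(hits)

# Blank every cell that lies to the right of the first hit in its row.
real_df[col(hits) > first] <- NA
real_df
```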
The toNA() function was applied to each row as follows (result being the data to clean):
as.tibble(t(apply(result, 1, toNA)))