R: replace NA's using data from other columns (and check for a reversed solution)
I merged two datasets, one with the original data (x) and one with reference data (y). In some cases the data is missing values in particular columns, but in all cases this information can be retrieved from the reference. Columns 1-4 are the data, columns 5-8 are the reference.

Consider a (fictional) dataset. Row 5 is a bit different, because here ALLELE1 and ALLELE2 are reversed between x and y. The replacement should therefore also be reversed:

       EFFECT_ALLELE.x  NON_EFFECT_ALLELE.x  ALLELE1.x  ALLELE2.x  EFFECT_ALLELE.y  NON_EFFECT_ALLELE.y  ALLELE1.y  ALLELE2.y
    5  R                I                    T          TG         I                R                    TG         T
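Working code (but too slow)

I wrote a script myself that checks this row by row. Unsurprisingly, it is very slow.
Checking 50 rows takes about 0.12 seconds, which means checking the ~1 million rows in my file is simply not feasible.
It does work, however, so here it is: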
    ALLELE_CHECK_LENGTH <- 3

    # Only loop over the rows if there is at least one missing value to fill.
    if (anyNA(data$EFFECT_ALLELE.x)) {
      # Pass 1: ALLELE1/ALLELE2 are in the same order in x and y.
      for (z in seq_along(data$EFFECT_ALLELE.x)) {
        if (is.na(data$EFFECT_ALLELE.x[z]) &&
            is.na(data$NON_EFFECT_ALLELE.x[z]) &&
            !is.na(data$ALLELE1.x[z]) &&
            !is.na(data$ALLELE2.x[z]) &&
            !is.na(data$ALLELE1.y[z]) &&
            !is.na(data$ALLELE2.y[z]) &&
            substr(data$ALLELE1.x[z], 1, ALLELE_CHECK_LENGTH) == substr(data$ALLELE1.y[z], 1, ALLELE_CHECK_LENGTH) &&
            substr(data$ALLELE2.x[z], 1, ALLELE_CHECK_LENGTH) == substr(data$ALLELE2.y[z], 1, ALLELE_CHECK_LENGTH)) {
          data$EFFECT_ALLELE.x[z] <- data$EFFECT_ALLELE.y[z]
          data$NON_EFFECT_ALLELE.x[z] <- data$NON_EFFECT_ALLELE.y[z]
        }
      }
      # Pass 2: ALLELE1/ALLELE2 are reversed between x and y, so swap the fill.
      for (z in seq_along(data$EFFECT_ALLELE.x)) {
        if (is.na(data$EFFECT_ALLELE.x[z]) &&
            is.na(data$NON_EFFECT_ALLELE.x[z]) &&
            !is.na(data$ALLELE1.x[z]) &&
            !is.na(data$ALLELE2.x[z]) &&
            !is.na(data$ALLELE1.y[z]) &&
            !is.na(data$ALLELE2.y[z]) &&
            substr(data$ALLELE1.x[z], 1, ALLELE_CHECK_LENGTH) == substr(data$ALLELE2.y[z], 1, ALLELE_CHECK_LENGTH) &&
            substr(data$ALLELE2.x[z], 1, ALLELE_CHECK_LENGTH) == substr(data$ALLELE1.y[z], 1, ALLELE_CHECK_LENGTH)) {
          data$EFFECT_ALLELE.x[z] <- data$NON_EFFECT_ALLELE.y[z]
          data$NON_EFFECT_ALLELE.x[z] <- data$EFFECT_ALLELE.y[z]
        }
      }
    }
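- However: check whether the replacement should be "reversed"
- Performance is an issue (I would like to check ~1M rows in as little time as possible)

Any help with this problem is greatly appreciated! Of course, if this has been asked before (I could not find it), I will also accept a link to that question as an answer.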
    w <- which(is.na(data$EFFECT_ALLELE.x) &
               is.na(data$NON_EFFECT_ALLELE.x) &
               !is.na(data$ALLELE1.x) &
               !is.na(data$ALLELE2.x) &
               !is.na(data$ALLELE1.y) &
               !is.na(data$ALLELE2.y) &
               (substr(data$ALLELE1.x, 1, ALLELE_CHECK_LENGTH) == substr(data$ALLELE2.y, 1, ALLELE_CHECK_LENGTH)) &
               (substr(data$ALLELE2.x, 1, ALLELE_CHECK_LENGTH) == substr(data$ALLELE1.y, 1, ALLELE_CHECK_LENGTH)))
    data$EFFECT_ALLELE.x[w] <- data$NON_EFFECT_ALLELE.y[w]
    data$NON_EFFECT_ALLELE.x[w] <- data$EFFECT_ALLELE.y[w]
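
The `which()` snippet covers the reversed case only. As a rough sketch, both orientations could be handled in one vectorized pass along the following lines. The helper name `fill_alleles` and the toy data frame are made up for illustration, and the columns are assumed to be character vectors rather than factors. Rows that match in both orientations are treated as same-orientation here.

```r
# Hypothetical helper (not from the original post): fill missing effect
# alleles from the reference columns, in the same or the reversed orientation.
fill_alleles <- function(data, check_length = 3) {
  # Compare only the first `check_length` characters of two allele columns.
  same_prefix <- function(a, b) {
    substr(a, 1, check_length) == substr(b, 1, check_length)
  }

  # Rows that need filling and have all four allele columns available.
  fillable <- is.na(data$EFFECT_ALLELE.x) & is.na(data$NON_EFFECT_ALLELE.x) &
    !is.na(data$ALLELE1.x) & !is.na(data$ALLELE2.x) &
    !is.na(data$ALLELE1.y) & !is.na(data$ALLELE2.y)

  # Work out both index sets before changing any column.
  same <- which(fillable &
                  same_prefix(data$ALLELE1.x, data$ALLELE1.y) &
                  same_prefix(data$ALLELE2.x, data$ALLELE2.y))
  reversed <- which(fillable &
                      same_prefix(data$ALLELE1.x, data$ALLELE2.y) &
                      same_prefix(data$ALLELE2.x, data$ALLELE1.y))
  reversed <- setdiff(reversed, same)

  # Same orientation: copy straight across.
  data$EFFECT_ALLELE.x[same] <- data$EFFECT_ALLELE.y[same]
  data$NON_EFFECT_ALLELE.x[same] <- data$NON_EFFECT_ALLELE.y[same]
  # Reversed orientation: swap effect and non-effect allele.
  data$EFFECT_ALLELE.x[reversed] <- data$NON_EFFECT_ALLELE.y[reversed]
  data$NON_EFFECT_ALLELE.x[reversed] <- data$EFFECT_ALLELE.y[reversed]
  data
}

# Toy data (made up): row 1 has the same orientation, row 2 is reversed.
toy <- data.frame(
  EFFECT_ALLELE.x     = c(NA_character_, NA_character_),
  NON_EFFECT_ALLELE.x = c(NA_character_, NA_character_),
  ALLELE1.x           = c("T", "T"),
  ALLELE2.x           = c("TTTCG", "TG"),
  EFFECT_ALLELE.y     = c("I", "I"),
  NON_EFFECT_ALLELE.y = c("R", "R"),
  ALLELE1.y           = c("T", "TG"),
  ALLELE2.y           = c("TTTCG", "T"),
  stringsAsFactors    = FALSE
)
filled <- fill_alleles(toy)
```

Computing both index vectors before any assignment keeps the two cases from interfering with each other, which the two sequential loops achieve implicitly by filling the same-orientation rows first.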
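This gives me the indices of the rows I need, but R gives a warning message for the last two lines (the data$EFFECT_ALLELE.x[w] assignments).

Update: converting the columns with as.character did the trick, and it works now. I am now testing whether it is fast enough. It runs very fast (less than 2 seconds for the whole file) and seems to do a very good job (it still needs the second part, but generating that from the code you provided is no problem). Thanks a lot.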
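
The warning text is not shown, so the following is an assumption: a warning on those assignment lines that goes away after `as.character` is typical of factor columns, where assigning a value that is not among the target's levels produces `NA` and a warning. A minimal sketch on made-up vectors:

```r
# Made-up vectors for illustration; both are factors with different levels.
x <- factor(c("I", NA, NA))     # levels: "I"
y <- factor(c("I", "R", "I"))   # levels: "I", "R"
w <- which(is.na(x))

# As factors: "R" is not a level of x, so R warns ("invalid factor level,
# NA generated") and leaves NA in that position. The warning is suppressed
# here only to keep the demo quiet.
as_factor <- x
suppressWarnings(as_factor[w] <- y[w])

# As character vectors: plain string assignment, no levels involved.
x_chr <- as.character(x)
y_chr <- as.character(y)
x_chr[w] <- y_chr[w]
```

Reading the file with `stringsAsFactors = FALSE` would avoid the conversion step altogether, assuming the data comes in through `read.table` or a similar function.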
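Example rows as input (first two, with NA) and as desired output (last two):

    EFFECT_ALLELE.x  NON_EFFECT_ALLELE.x  ALLELE1.x  ALLELE2.x  EFFECT_ALLELE.y  NON_EFFECT_ALLELE.y  ALLELE1.y  ALLELE2.y
    <NA>             <NA>                 T          TTTCG      I                R                    T          TTTCG
    <NA>             <NA>                 T          TG         I                R                    TG         T
    I                R                    T          TTTCG      I                R                    T          TTTCG
    R                I                    T          TG         I                R                    TG         T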