R: Specifying new columns in a data.table using a variable


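I have a `data.table` that I want to process. As a first step, I want to set up a new `data.table` for the columns. I wrote a loop over the columns of interest and tried to assign `NA`/0 to them, which fails or misbehaves as described below.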

library(data.table)    
 
input_allele <- data.table(FID= paste0("gid",1:10),IID=paste0("IID",11:20),PAT=c(1:10),MAT=c(rep(0,10)),SEX=c(rep(1,10)),PHENOTYPE =c(rep(1,10)),
SNP1=(c(rep(1,5), rep(0,5))),SNP2=(c(rep(1,6),rep(0,3),NA)),SNP3=(c(rep(NA,6),rep(1,4))),SNP4=(c(rep(NA,6),rep(0,4))),SNP5=(c(rep(1,6),rep(0,4)))  )


multiplied_value<-input_allele[,c(1:6)]

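# Attempt from the question (this is the part that does not work):
# quote(temp_snp) yields the symbol `temp_snp`, not the column name it holds
for (temp_snp in (colnames(input_allele[, .SD, .SDcols = c(7:11)]))) {
  temp_snpquote <- quote(temp_snp)
  multiplied_value[, (temp_snpquote) := 0]
}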

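I would like to understand: 1) how to set the new columns to `NA` or 0; 2) why using `eval` causes the `multiplied_value` data.table to be printed twice.

R version 4.0.0 (2020-04-24), data.table 1.13.4

Unix, Debian distribution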

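From `?set`, you can see that the overhead of repeatedly calling `[.data.table` can add up. In that situation you can try `set` instead.

In addition, any `set*` function should be followed by `[]` to print the output.

So here are two options: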

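# copy() gives three independent tables; a chained `<-` alone would bind all
# three names to the same object, which `set` and `:=` modify by reference
copy1 <- input_allele[, 1:6]
copy2 <- copy(copy1)
copy3 <- copy(copy1)
new <- names(input_allele)[7:11]  # names of the SNP columns to add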

## Using `set` :

for (i in new) {
  set(copy1, j = i, value = 0)[]
}
head(copy1)
##     FID   IID PAT MAT SEX PHENOTYPE SNP1 SNP2 SNP3 SNP4 SNP5
## 1: gid1 IID11   1   0   1         1    0    0    0    0    0
## 2: gid2 IID12   2   0   1         1    0    0    0    0    0
## 3: gid3 IID13   3   0   1         1    0    0    0    0    0
## 4: gid4 IID14   4   0   1         1    0    0    0    0    0
## 5: gid5 IID15   5   0   1         1    0    0    0    0    0
## 6: gid6 IID16   6   0   1         1    0    0    0    0    0
   
## Using `:=` :

for (i in new) {
  copy2[, (i) := 0][]
}
head(copy2)
##     FID   IID PAT MAT SEX PHENOTYPE SNP1 SNP2 SNP3 SNP4 SNP5
## 1: gid1 IID11   1   0   1         1    0    0    0    0    0
## 2: gid2 IID12   2   0   1         1    0    0    0    0    0
## 3: gid3 IID13   3   0   1         1    0    0    0    0    0
## 4: gid4 IID14   4   0   1         1    0    0    0    0    0
## 5: gid5 IID15   5   0   1         1    0    0    0    0    0
## 6: gid6 IID16   6   0   1         1    0    0    0    0    0

Note that neither of these needs `quote` or `eval`.
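To make the point about `quote` concrete, and to cover the `NA` case from the question, here is a minimal sketch. It uses a hypothetical toy table `dt` and a hypothetical vector `new_cols` rather than the original `input_allele`, and `NA_real_` is chosen on the assumption that the new columns should be numeric.

```r
library(data.table)

# Hypothetical toy table, not the original input_allele
dt <- data.table(id = 1:3)
new_cols <- c("SNP1", "SNP2")   # column names held in a character vector

for (nm in new_cols) {
  # (nm) on the left of := is evaluated, so the new column is named by the
  # string stored in nm rather than literally "nm"
  dt[, (nm) := NA_real_]        # a typed NA keeps the column numeric
}

# quote() returns the symbol `nm` itself, not the string it holds,
# so it cannot stand in for a column name
class(quote(nm))

# set() takes the column name directly through j
set(dt, j = "SNP3", value = 0)
dt[]
```

Using a plain `NA` instead of `NA_real_` would create logical columns, which may or may not be what later steps expect.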

Even with this small dataset, the performance difference between `set` and `:=` used in a loop is measurable:

fun1 <- function() { for (i in new) { set(copy1, j = i, value = 0)[] }; copy1 }
fun2 <- function() { for (i in new) { copy2[, (i) := 0][] } ; copy2 }
fun3 <- function() copy3[, (new) := as.list(rep(0, length(new)))][]

bench::mark(fun1(), fun2(), fun3())
## # A tibble: 3 x 13
##   expression     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr> <bch:t> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 fun1()      64.9µs  69.63µs    13932.        0B     4.17  6689     2
## 2 fun2()       993µs   1.07ms      910.   377.6KB     4.23   430     2
## 3 fun3()     241.9µs 255.12µs     3793.    16.4KB     4.30  1763     2
## # … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
## #   time <list>, gc <list>

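I would use `set` instead of `:=`. Something like: `for (i in colnames(input_allele[, .SD, .SDcols = c(7:11)])) set(multiplied_value, j = i, value = 0); multiplied_value[]`. But you could also do: `for (temp_snp in colnames(input_allele[, .SD, .SDcols = c(7:11)])) multiplied_value[, (temp_snp) := 0]`

I see, I should use `[]`. Here I have to type the variable name twice: `for (temp_snp in colnames(input_allele[, .SD, .SDcols = c(7:11)])) multiplied_value[, (temp_snp) := 0][]`. I cannot follow your first snippet because of the variables (`j` and `i`). How does `set` work?

If you look at the help page `?set` (which also documents `:=`), near the end you will find timings for different ways of adding multiple columns to a data.table. The trailing `[]` prints the table after any in-place modification.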
## Using `:=` once for all columns, without a loop:

copy3[, (new) := as.list(rep(0, length(new)))][]
##       FID   IID PAT MAT SEX PHENOTYPE SNP1 SNP2 SNP3 SNP4 SNP5
##  1:  gid1 IID11   1   0   1         1    0    0    0    0    0
##  2:  gid2 IID12   2   0   1         1    0    0    0    0    0
##  3:  gid3 IID13   3   0   1         1    0    0    0    0    0
##  4:  gid4 IID14   4   0   1         1    0    0    0    0    0
##  5:  gid5 IID15   5   0   1         1    0    0    0    0    0
##  6:  gid6 IID16   6   0   1         1    0    0    0    0    0
##  7:  gid7 IID17   7   0   1         1    0    0    0    0    0
##  8:  gid8 IID18   8   0   1         1    0    0    0    0    0
##  9:  gid9 IID19   9   0   1         1    0    0    0    0    0
## 10: gid10 IID20  10   0   1         1    0    0    0    0    0
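Because `set` and `:=` modify a data.table in place, it matters whether two names refer to the same table or to separate copies. The sketch below illustrates this with a hypothetical toy table; the names `dt`, `alias`, `indep` and the column `flag` are made up for the example.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
dt <- data.table(id = 1:3)
alias <- dt          # a second name for the same table, no copy is made
indep <- copy(dt)    # an independent copy

set(dt, j = "flag", value = 0)   # adds the column in place

names(alias)   # the alias sees the new column
names(indep)   # the copy is unchanged
dt[]           # trailing [] prints the modified table
```

When several variants are compared side by side, as in the benchmark, starting each one from its own `copy()` keeps them from affecting each other.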
