Applying a custom function to a data.table by row returns incorrect values

Tags: r, data.table, genomicranges

I am fairly new to data.table. I have a table containing genomic DNA coordinates, like this:

       chrom   pause strand coverage
    1:     1 3025794      +        1
    2:     1 3102057      +        2
    3:     1 3102058      +        2
    4:     1 3102078      +        1
    5:     1 3108840      -        1
    6:     1 3133041      +        1

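I wrote a custom function that I want to apply to every row of my table, which has roughly 2 million rows. It uses mapToTranscripts from GenomicFeatures to retrieve two related values: a transcript ID as a string and a new coordinate. I want to add them to my table as two new columns, like this: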

       chrom   pause strand coverage       transcriptID CDS
    1:     1 3025794      +        1 ENSMUST00000116652 196
    2:     1 3102057      +        2 ENSMUST00000116652  35
    3:     1 3102058      +        2 ENSMUST00000156816 888
    4:     1 3102078      +        1 ENSMUST00000156816 883
    5:     1 3108840      -        1 ENSMUST00000156816 882
    6:     1 3133041      +        1 ENSMUST00000156816 880

I call it like this:

    new.table <- counts[, get_feature(.SD), by = .I]

The function is as follows:

    get_feature <- function(dt){

      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) 
      hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) 
      tx_id <- tx_names[as.character(seqnames(hit))] 
      cds_coordinate <- sapply(ranges(hit), '[[', 1)

      if(length(tx_id) == 0 || length(cds_coordinate) == 0) {  
        out <- list('NaN', 0)
      } else {
        out <- list(tx_id, cds_coordinate)
      }

      return(out)
    }

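I get the following warnings, which indicate that the function returns two lists that are shorter than the original table, rather than one new element per row:

    Warning messages: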
    1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
      Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items).
    2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
      Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).

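I assumed that using .I would apply the function row by row and return one value per row. I also used an if statement to make sure the function never returns an empty value.

I then tried this mock version of the function: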

    get_feature <- function(dt) {

      return('I should be returned once for each row')

    }

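Then use mapFromTranscripts from the GenomicFeatures package to map back to the genome, so that data.table joins can be used to retrieve the information from the original table, which is the whole point of what I am trying to do.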
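The join step mentioned above could look roughly like the sketch below. It assumes a hypothetical table `hits` that holds the results of a vectorised lookup for only those positions that mapped to a transcript; the table contents are illustrative, not real output. Rows without a hit are left as NA, so a result that is shorter than the original table never has to be recycled.

```r
library(data.table)

# Toy version of the original table
counts <- data.table(chrom    = c(1L, 1L, 1L),
                     pause    = c(3025794L, 3102057L, 3108840L),
                     strand   = c("+", "+", "-"),
                     coverage = c(1L, 2L, 1L))

# Hypothetical lookup result: only two of the three positions got a hit
hits <- data.table(chrom        = c(1L, 1L),
                   pause        = c(3025794L, 3102057L),
                   strand       = c("+", "+"),
                   transcriptID = c("ENSMUST00000116652", "ENSMUST00000116652"),
                   CDS          = c(196L, 35L))

# Update join: matching rows of counts receive the new columns by reference,
# rows without a match keep NA in both columns
counts[hits, on = .(chrom, pause, strand),
       `:=`(transcriptID = i.transcriptID, CDS = i.CDS)]
```

Because the assignment is keyed on the coordinates rather than on position in the result, the order and length of `hits` do not need to match `counts`.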

When I need to apply a function to every row of a data.table, I group the table by row number:

    counts[, get_feature(.SD), by = 1:nrow(counts)]

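As explained in another answer, .I is not intended to be used in by: it is meant to return the sequence of row indices produced by the grouping. The reason by = .I does not throw an error is that data.table creates an object .I in the data.table namespace that is equal to NULL, so by = .I is equivalent to by = NULL.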
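To illustrate what .I is designed for, here is a minimal sketch on a made-up table (the names `dt`, `g` and `x` are hypothetical). Inside j, .I holds the row numbers of the current group within the original table, so it can be used to collect row indices per group.

```r
library(data.table)

# Toy table; column names are illustrative only
dt <- data.table(g = c("a", "a", "b", "b"), x = c(1, 3, 2, 5))

# Within each group, .I holds that group's row numbers in dt;
# keep the row number of the largest x in every group
idx <- dt[, .I[which.max(x)], by = g]$V1

# Use the collected indices to pull those rows from the original table
top <- dt[idx]
```

Here .I appears in j and the grouping is done by a real column, which is the usage described above.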

Note that by = 1:nrow(dt) groups by row number, so your function only has access to a single row of the data.table at a time:

    require(data.table)
    counts <- data.table(chrom = sample.int(10, size = 100, replace = TRUE),
                         pause = sample((3 * 10^6):(3.2 * 10^6), size = 100),
                         strand = sample(c('-','+'), size = 100, replace = TRUE),
                         coverage = sample.int(3, size = 100, replace = TRUE))

    get_feature <- function(dt){
        coordinate <- data.frame(dt$chrom, dt$pause, dt$strand)
        rowNum <- nrow(coordinate)
        return(list(text = 'Number of rows in dt', rowNum = rowNum))
    }

    counts[, get_feature(.SD), by = 1:nrow(counts)]

By contrast, by = NULL passes the entire data.table to the function:

    counts[, get_feature(.SD), by = NULL]

                       text rowNum
    1: Number of rows in dt    100

This is how by is intended to work.
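The same row-number grouping can also add the results as new columns with `:=`. The following is only a sketch: `toy_feature` is a hypothetical stand-in for get_feature that returns exactly one value per output column, and the column names `label` and `offset` are made up for illustration.

```r
library(data.table)

counts <- data.table(chrom  = c(1L, 1L, 2L),
                     pause  = c(3025794L, 3102057L, 3108840L),
                     strand = c("+", "+", "-"))

# Hypothetical stand-in for get_feature(): a list with one scalar per new column
toy_feature <- function(dt) {
  list(paste0("chr", dt$chrom, dt$strand), dt$pause %% 1000L)
}

# One group per row, so toy_feature() receives a one-row .SD on every call
counts[, c("label", "offset") := toy_feature(.SD), by = seq_len(nrow(counts))]
```

seq_len(nrow(counts)) is used here instead of 1:nrow(counts) because it also behaves sensibly for an empty table. Since the function is called once per row, this pattern is expected to be slow on millions of rows.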

Comments:

Nice answer, @StatLearner. Welcome to SO!

Indeed, @StatLearner, I checked with by = NULL and the result is the same. Using by = 1:nrow(dt) applies the function the way I wanted, but it is very slow, so I had to look for another workaround. I cannot use the function the way I intended, but I learned a lot about data.table today, thank you very much! PS: Interestingly, I got the by = .I idea from the first result that came up when I googled how to apply a function to each row of a data.table. I will link the answer you mentioned in case someone else gets the same idea.

You're welcome, Antonio! I agree that by = .I is counterintuitive, and it was also the first thing I tried.