R: merging on a column with multiple values


I have a data frame, cluster, one column of which, cluster$Genes, looks like this:

ENSG00000134684
ENSG00000188846, ENSG00000181163, ENSG00000114391
ENSG00000134684, ENSG00000175390
ENSG00000134684
ENSG00000134684, ENSG00000175390
...
The number of elements in each row of this column is arbitrary. I also have another data frame, expression, which looks like this:

ENSGID           a       b
ENSG00000134684  1       3
ENSG00000175390  2       0
ENSG00000000419  131.23  108.73
ENSG00000000457  7.11    8.68
ENSG00000000460  15.70   6.59
ENSG00000000938  0       0
ENSG00000000971  0.03    0.07
ENSG00000001036  59.22   58.3
...
... with about 20,000 rows. What I would like to do is:

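  • For all elements in each row of cluster$Genes, look up the corresponding a and b values in expression
  • Compute the minimum, maximum and mean of a and of b (separately) for each row of cluster$Genes
  • Create six new columns in the cluster data frame and fill them with the (min a, max a, mean a, min b, max b, mean b) values

I tried to work out a way to do this, but it has not gone well. While googling for help I figured I might use some kind of apply, and I ended up with some code. I think it is mostly gibberish and completely non-functional, and I am rather stuck. This is what I have: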

    exp.lookup = function(genes) {
      genes.split = strsplit(genes, ', ')
      exp.hct = list()
      exp.hke = list()
      for ( gene in genes.split ) {
        exp.hct = c(exp.hct, merge(gene, means$hct, all.x=TRUE))
        exp.hke = c(exp.hke, merge(gene, means$hke, all.x=TRUE))
        return(c(exp.hct, exp.hke))
      }
    }
    
    apply(cluster['Genes'], 1, FUN=exp.lookup)
    

Does anyone have a better idea that might actually work?
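For reference, the three steps listed in the question can be sketched in base R roughly as follows. This is only a minimal sketch on a cut-down copy of the example data: the helper summarise_values and the column names a.min, a.max and so on are made up for illustration, and IDs that are missing from expression are assumed to yield NA.

```r
# Toy data mirroring the question (plain data frames, base R only)
cluster <- data.frame(
  Genes = c("ENSG00000134684",
            "ENSG00000188846, ENSG00000181163, ENSG00000114391",
            "ENSG00000134684, ENSG00000175390"),
  stringsAsFactors = FALSE
)
expression <- data.frame(
  ENSGID = c("ENSG00000134684", "ENSG00000175390", "ENSG00000000419"),
  a = c(1, 2, 131.23),
  b = c(3, 0, 108.73),
  stringsAsFactors = FALSE
)

# Hypothetical helper: min, max and mean of x, or NA when nothing matched
summarise_values <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) {
    return(c(min = NA_real_, max = NA_real_, mean = NA_real_))
  }
  c(min = min(x), max = max(x), mean = mean(x))
}

# Step 1: split each row of Genes into a character vector of IDs
ids_per_row <- strsplit(cluster$Genes, ", ", fixed = TRUE)

# Steps 2 and 3: look up a and b for the IDs of each row and summarise them
stats <- t(sapply(ids_per_row, function(ids) {
  rows <- match(ids, expression$ENSGID)  # NA for IDs missing from expression
  c(a = summarise_values(expression$a[rows]),
    b = summarise_values(expression$b[rows]))
}))

# Append the six summary columns (a.min, a.max, a.mean, b.min, b.max, b.mean)
cluster <- cbind(cluster, stats)
print(cluster)
```

Because match() returns one position per ID, all IDs of a row are summarised together, which is what the question asks for.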

Assuming that each ENSGID corresponds to a unique pair of a and b values, I suggest:


  • Assign cluster$Genes to a variable (in other words, copy it into a variable outside the cluster data frame), for example new_cluster_genes.

Recreating the initial data:

    library(data.table)
    
    cluster<- as.data.table(list(Genes = c("ENSG00000134684",
                                           "ENSG00000188846, ENSG00000181163, ENSG00000114391", 
                                           "ENSG00000134684, ENSG00000175390", 
                                           "ENSG00000134684", 
                                           "ENSG00000134684, ENSG00000175390")))
    
    expression<- as.data.table(list(ENSGID = c("ENSG00000134684", "ENSG00000175390",
                                               "ENSG00000000419", "ENSG00000000457",
                                               "ENSG00000000460", "ENSG00000000938",
                                               "ENSG00000000971", "ENSG00000001036"),
                                    a = c(1,2,131.23,7.11,15.70, 0, 0.03, 59.22),
                                    b = c(3,0,108.73,8.68,6.59,0,0.07,58.3)))
    setkey(cluster, Genes)
    setkey(expression, ENSGID)
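A copy-pasteable definition like the one above can also be generated from an existing object with base R's dput(). The sketch below uses a two-row stand-in for the real data, so the values are illustrative only.

```r
# A small data frame standing in for the real data
expression <- data.frame(
  ENSGID = c("ENSG00000134684", "ENSG00000175390"),
  a = c(1, 2),
  b = c(3, 0),
  stringsAsFactors = FALSE
)

# dput() prints R code that rebuilds the object; head() keeps the output short
dput(head(expression, 2))

# The printed text can be evaluated to get an equivalent object back
rebuilt <- eval(parse(text = capture.output(dput(expression))))
print(all.equal(rebuilt, expression))
```

Pasting the dput() output into a question lets others recreate the data without retyping it.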
    
    
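Please make your example reproducible. See how to do that. You can use the dput function to share your data.

Thanks for the reply! I am not sure I understand correctly, but wouldn't doing (2) make me lose the information about which IDs (i.e. ENSGIDs) are in each row of cluster$Genes? I want the calculations on a and b to be made over all IDs of each row together, not for each ID individually. For example, for the last row: [min, max, mean] = [1, 2, 1.5] for a and [0, 3, 1.5] for b.

Thanks a lot! I don't quite understand all the details of the code, though... I don't understand the second line of the two functions; could you explain it? It would be nice not only to have working code, but also to know how it works ;-)

Sure. You may want to start with the introduction to data.table to get familiar with the basic syntax. DataCamp's data.table cheat sheet may also help a lot. I will edit my post to add more comments.

Updated now, feel free to ask questions. I may have mixed up the terminology, but I hope the main idea is still clear.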
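To make the expected numbers for the last row concrete, here is a minimal sketch of the lookup for a single value of cluster$Genes. It assumes a cut-down, keyed copy of expression; the variable names genes, ids and row_stats are chosen for illustration only.

```r
library(data.table)

# Small keyed lookup table, using values from the question
expression <- data.table(
  ENSGID = c("ENSG00000134684", "ENSG00000175390", "ENSG00000000419"),
  a = c(1, 2, 131.23),
  b = c(3, 0, 108.73)
)
setkey(expression, ENSGID)

# One value of cluster$Genes (the last row in the question)
genes <- "ENSG00000134684, ENSG00000175390"

# strsplit() returns a list; [[1]] extracts the character vector of IDs
ids <- strsplit(genes, ", ", fixed = TRUE)[[1]]

# .(ids) joins on the key, so all IDs of the row are summarised together
row_stats <- expression[.(ids),
                        .(min.a = min(a), max.a = max(a), mean.a = mean(a),
                          min.b = min(b), max.b = max(b), mean.b = mean(b))]
print(row_stats)
```

The join keeps the IDs of one row together, so the summary is computed per row of cluster rather than per individual ID.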
    library(data.table)
    
    result<- function() {
      colnames<- c("min.a", "max.a", "mean.a", "min.b", "max.b", "mean.b")
      # 1. "(colnames)" is parenthesized so that data.table evaluates the variable
      # and uses the character vector it holds as the names of the new columns
      # 2. ":=" adds new columns to an existing data.table by reference
      # 3. "count(Genes)" calls count() on the "Genes" column; because of the
      # grouping "by = Genes", count() is called once per distinct "Genes" value,
      # so it receives one character string at a time.
      cluster[,(colnames):=count(Genes), by = Genes]
    }
    
    # takes the "Genes" string of one group
    count<- function(charvector) {
      ENSGIDc<- strsplit(charvector, ", ")
      # 4. subsetting the rows of the "expression" data.table by the split "Genes"
      # IDs held in "ENSGIDc" (a join on the key "ENSGID")...
      # 5. ... and then calculating the min, max and mean of each column
      expression[ENSGIDc, .(min(a, na.rm = TRUE), max(a, na.rm = TRUE),
                            mean(a, na.rm = TRUE), min(b, na.rm = TRUE), 
                            max(b, na.rm = TRUE), mean(b, na.rm = TRUE))]
      # 6. at this point we return the resulting 1-row, 6-column data.table
      # back to the calling function, where it is added to the "cluster" data.table
    }
    
    suppressWarnings(result())
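As far as I can tell, suppressWarnings() is there because min() and max() warn when none of the IDs in a row is found in expression: after na.rm = TRUE nothing is left to summarise, and base R falls back to Inf, -Inf and NaN. If NA is preferred in that case, a guard along these lines could be used. This is a sketch only; safe_stat is a hypothetical helper, not part of data.table or of the answer above.

```r
# Values for a row whose IDs are all missing from the lookup table
no_match <- c(NA_real_, NA_real_, NA_real_)

# With na.rm = TRUE nothing is left, so base R falls back to these values
unguarded <- suppressWarnings(c(min  = min(no_match, na.rm = TRUE),
                                max  = max(no_match, na.rm = TRUE),
                                mean = mean(no_match, na.rm = TRUE)))
print(unguarded)  # Inf, -Inf, NaN

# Hypothetical helper: apply f to the non-missing values, or return NA if none
safe_stat <- function(x, f) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else f(x)
}

guarded <- c(min  = safe_stat(no_match, min),
             max  = safe_stat(no_match, max),
             mean = safe_stat(no_match, mean))
print(guarded)    # NA, NA, NA
```

Whether Inf or NA is the better placeholder depends on how the six columns are used afterwards.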