Fastest way to convert/update gene IDs in a table in R?

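Note: I am not asking a Bioconductor-specific question, but I need to include Bioconductor in the example code. Please bear with me.

I have many tab-delimited files containing various kinds of information about specific genes. One or more columns may hold gene symbol aliases that I need to update to the latest gene symbol annotation.

I am using Bioconductor's org.Hs.eg.db library (specifically the org.Hs.egALIAS2EG and org.Hs.egSYMBOL objects).

The code below does the job, but it is very slow, I guess because of the nested for loops that query the org.Hs.eg.db database on every iteration. Is there a faster/simpler/smarter way to achieve the same result?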

library(org.Hs.eg.db)

myTable <- read.table("tab_delimited_file.txt", header=TRUE, sep="\t", as.is=TRUE)

for (i in 1:nrow(myTable)) {
    for (j in 1:ncol(myTable)) {
        repl <- org.Hs.egALIAS2EG[[myTable[i,j]]][1]
        if (!is.null(repl)) {
            repl <- org.Hs.egSYMBOL[[repl]][1]
            if (!is.null(repl)) {
                myTable[i,j] <- repl
            }
        }
    }
}

write.table(myTable, file="new_tab_delimited_file", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)

You can use sapply and pass several non-vector arguments by name, such as the objects from the org.Hs.eg.db library:

library(org.Hs.eg.db)
myTable <- read.table("tab_delimited_file.txt", header=TRUE, sep="\t", as.is=TRUE)

# idx is a zero-based, row-major cell index into mytab
myfunc <- function(idx, mytab, a2e, es) {
    i <- idx %/% ncol(mytab) + 1   # row
    j <- idx %% ncol(mytab) + 1    # column
    repl <- a2e[[mytab[i, j]]][1]
    if (!is.null(repl)) {
        repl <- es[[repl]][1]
        if (!is.null(repl)) {
            return(repl)
        }
    }
    # no Entrez Gene ID or no symbol was found for this cell
    return("NA")
}

vec <- 0:(ncol(myTable) * nrow(myTable) - 1)
out <- sapply(vec, myfunc, mytab=myTable, a2e=org.Hs.egALIAS2EG, es=org.Hs.egSYMBOL)
# out is in row-major order, hence byrow=TRUE
myTable <- matrix(out, nrow=nrow(myTable), ncol=ncol(myTable), byrow=TRUE)

Use the mget function:

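# Entrez Gene IDs for the aliases in row i
eg[i, ] <- mget(as.character(myTable[i, ]), org.Hs.egALIAS2EG)
# org.Hs.egSYMBOL is keyed by Entrez Gene ID, so look up the first ID found for each alias
symbol[i, ] <- mget(sapply(eg[i, ], "[", 1), org.Hs.egSYMBOL)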
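Hmm. This assumes that we are working with a character matrix (more or less). Frankly, though, you only have a data frame with two columns, so there is no real need to automate the code as if you had hundreds of columns. Let's write a small function. (It would be even simpler if we assumed that every element of your table can be found in org.Hs.egALIAS2EG.)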
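
A small function along these lines might look like the sketch below. Only the name convert2symbol comes from the thread; the arguments alias2eg and eg2symbol, the toy lookup lists, and the example table are assumptions made for illustration, not the original author's code. The toy lists stand in for org.Hs.egALIAS2EG and org.Hs.egSYMBOL so that the example runs without the annotation package. With the real objects, something like mget(x, org.Hs.egALIAS2EG, ifnotfound=NA) would presumably take the place of the list lookup.

```r
# Toy stand-ins for org.Hs.egALIAS2EG and org.Hs.egSYMBOL (illustrative values only)
alias2eg  <- list(A1B = c("1", "6641"), ACVR2B = "93", BLR1 = "643")
eg2symbol <- list("1" = "A1BG", "6641" = "SNTB1", "93" = "ACVR2B", "643" = "CXCR5")

# Hypothetical helper: convert a character vector of aliases to symbols.
# Each distinct alias is looked up only once; values that cannot be mapped
# (and NAs) are returned unchanged.
convert2symbol <- function(x, alias2eg, eg2symbol) {
    u <- unique(x[!is.na(x)])
    new <- vapply(u, function(a) {
        eg <- alias2eg[[a]][1]              # first Entrez Gene ID, as in the question
        if (is.null(eg) || is.na(eg)) return(a)
        sym <- eg2symbol[[eg]][1]
        if (is.null(sym) || is.na(sym)) a else sym
    }, character(1))
    out <- x
    hit <- !is.na(x)
    out[hit] <- new[match(x[hit], u)]
    out
}

tab <- data.frame(ReceptorGene = c("A1B", "ACVR2B", "BLR1"),
                  LigandGene   = c("trash", "INHA", NA),
                  stringsAsFactors = FALSE)

# apply the helper column by column, keeping the data frame structure
tab[] <- lapply(tab, convert2symbol, alias2eg = alias2eg, eg2symbol = eg2symbol)
```

Working on the unique values of a whole column, rather than cell by cell, is what should cut down the number of lookups when the same gene appears in many rows.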


This is the best I could come up with.

First, write a function:

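# Convert a character vector of aliases to current gene symbols.
# Elements that are NA or cannot be mapped are left unchanged.
alias2GS <- function(x) {
    for (i in seq_along(x)) {
        if (!is.na(x[i])) {
            # first Entrez Gene ID listed for this alias
            repl <- org.Hs.egALIAS2EG[[x[i]]][1]
            if (!is.null(repl)) {
                # official symbol for that Entrez Gene ID
                repl <- org.Hs.egSYMBOL[[repl]][1]
                if (!is.null(repl)) {
                    x[i] <- repl
                }
            }
        }
    }
    return(x)
}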

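Just a quick warning: an alias can map to more than one Entrez Gene ID.

Your current solution therefore assumes that the first ID listed is the correct one (which it may not be).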
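
To get a feel for how often this matters in a given table, one could count the candidate IDs per alias before converting anything. The sketch below uses a small hand-written list in place of org.Hs.egALIAS2EG; the list contents and the variable names are assumptions for illustration only.

```r
# Toy alias -> Entrez Gene ID map (illustrative values, not queried from the database)
alias2eg <- list(A1B = c("1", "6641"), ACVR2B = "93", AMHR2 = "269")

aliases <- c("A1B", "ACVR2B", "AMHR2", "trash")

# number of candidate Entrez Gene IDs per alias (0 when the alias is unknown)
n.ids <- vapply(aliases, function(a) length(alias2eg[[a]]), integer(1))

ambiguous <- names(n.ids)[n.ids > 1]   # aliases where taking the first ID is a guess
unknown   <- names(n.ids)[n.ids == 0]  # aliases with no mapping at all
```

Aliases that end up in ambiguous are the ones where simply taking the first ID could silently pick the wrong gene.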

If you look at the help page ?org.Hs.egALIAS2EG, you will see that using aliases or symbols as primary gene identifiers is not recommended.

## From the 'Details' section of the help:
# Since gene symbols are sometimes redundantly assigned in the literature, 
# users are cautioned that this map may produce multiple matching results 
# for a single gene symbol. Users should map back from the entrez gene IDs 
# produced to determine which result is the one they want when this happens.

# Because of this problem with redundant assigment of gene symbols, 
# is it never advisable to use gene symbols as primary identifiers.
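Without manual curation, there is no way to know which ID is the "correct" one. The safest approach is therefore to retrieve all possible IDs and symbols for each alias in the table, while keeping track of which aliases are receptors and which are ligands: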

# your example subset with "A1B" and "trash" added for complexity
myTable <- data.frame(
    ReceptorGene = c("A1B", "ACVR2B", "ACVR2B", "ACVR2B", "ACVR2B", "AMHR2", "BLR1", "BMPR1A", "BMPR1A", "BMPR1A", "BMPR1A", "BMPR1A"),
    LigandGene = c("trash", "INHA", "INHBA", "INHBB", "INHBC", "AMH", "SCYB13", "BMP10", "BMP15", "BMP2", "BMP3", "BMP4"), 
    stringsAsFactors = FALSE
)

# unlist and rename
my.aliases <- unlist(myTable)
names(my.aliases) <- paste(names(my.aliases), my.aliases, sep = ".")

# determine which aliases have a corresponding Entrez Gene ID
has.key <- my.aliases %in% keys(org.Hs.egALIAS2EG)

# replace aliases with character vectors of all possible symbols, named by Entrez Gene ID
my.aliases[has.key] <- sapply(my.aliases[has.key], function(x) {
    eg.ids <- unlist(mget(x, org.Hs.egALIAS2EG))
    unlist(mget(eg.ids, org.Hs.egSYMBOL))
})

# my.aliases retains all pertinent information regarding the original alias
my.aliases[1:3]
# $ReceptorGene1.A1B
#       1    6641 
#  "A1BG" "SNTB1" 
# 
# $ReceptorGene2.ACVR2B
#       93 
# "ACVR2B" 
# 
# $ReceptorGene3.ACVR2B
#       93 
# "ACVR2B"

Thanks, your code is very elegant, but it has a small problem. In code snippet #2, v

ls(org.Hs.egALIAS2EG) should give the names of the keys in the map (the valid aliases). %in% returns TRUE if an element of the first set (v) is in the second set, and FALSE if it is not. This operation should not return a data frame. Can you give an example (something like myTable[1:10,1:10])?

You are right, but if you look at the code, you assign it as a subset to v: v

Yes, but if v is a vector, how can that happen?

Here is a subset of myTable: . To answer your question: maybe unique() converts it back to a data frame?

# Entrez Gene IDs for each receptor and ligand (NA where no ID is available)
myTable$receptor.id <- c("1", "93", "93", "93", "93", "269", "643", "657", "657", "657", "657", "657")
myTable$ligand.id   <- c(NA, "3623", "3624", "3625", "3626", "268", "10563", "27302", "9210", "650", "651", "652")

# update the receptor symbols from the IDs, skipping IDs absent from org.Hs.egSYMBOL
has.key <- myTable$receptor.id %in% keys(org.Hs.egSYMBOL)
myTable$ReceptorGene[has.key] <- unlist(mget(myTable$receptor.id[has.key], org.Hs.egSYMBOL))

# same for the ligand symbols
has.key <- myTable$ligand.id %in% keys(org.Hs.egSYMBOL)
myTable$LigandGene[has.key] <- unlist(mget(myTable$ligand.id[has.key], org.Hs.egSYMBOL))

head(myTable)
#   ReceptorGene LigandGene receptor.id ligand.id
# 1         A1BG      trash           1      <NA>
# 2       ACVR2B       INHA          93      3623
# 3       ACVR2B      INHBA          93      3624
# 4       ACVR2B      INHBB          93      3625
# 5       ACVR2B      INHBC          93      3626
# 6        AMHR2        AMH         269       268