Translation (recoding) error in R


Here is a small example:

X1 <- c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC")
X2 <- c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC")
X3 <- c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA")
mydf1 <- data.frame(X1, X2, X3)
The function:

# Function 
atgc <- function(x) {
  xlate <- c("AA" = 11, "AC" = 12, "AG" = 13, "AT" = 14,
             "CA" = 12, "CC" = 22, "CG" = 23, "CT" = 24,
             "GA" = 13, "GC" = 23, "GG" = 33, "GT" = 34,
             "TA" = 14, "TC" = 24, "TG" = 34, "TT" = 44,
             "ID" = 56, "DI" = 56, "DD" = 55, "II" = 66)
  x = xlate[x]
}
outdataframe <- sapply (mydf1, atgc)
outdataframe
   X1 X2 X3
AA 11 11 12
AA 11 11 12
AA 11 11 12
AG 13 13 12
CA 12 12 11
AC 12 13 13
AT 14 11 12
AT 14 14 14
Just use apply and transpose:

t(apply (mydf1, 1, atgc))
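
A possible explanation of why this works: apply first coerces the data frame to a matrix, and a data frame of factors becomes a character matrix, so each row reaches the lookup as strings rather than integer codes. The sketch below assumes the factor-based mydf1 from the question (stringsAsFactors = TRUE is set explicitly because R 4.0.0 and later default to FALSE). The shortened xlate table and the name res are illustrative only.

```r
# Reduced lookup table covering only the codes in the example data (assumption)
xlate <- c(AA = 11, AC = 12, AT = 14, CA = 12, CC = 22, TA = 14, TC = 24)
atgc <- function(x) xlate[x]

X1 <- c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC")
X2 <- c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC")
X3 <- c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA")
mydf1 <- data.frame(X1, X2, X3, stringsAsFactors = TRUE)

# as.matrix() on a data frame of factors yields a character matrix
is.character(as.matrix(mydf1))

# apply() works row by row and returns one column per row, hence the transpose
res <- unname(t(apply(mydf1, 1, atgc)))
res[5, ]  # row 5 of the input is TA, AT, AA
```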
To use sapply, use:

  • stringsAsFactors=FALSE when creating the data frame, i.e.

    mydf1 <- data.frame(X1, X2, X3, stringsAsFactors=FALSE)
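
With character columns, sapply hands each column to the lookup as strings, so the names in the table are matched directly. A small sketch under that assumption follows. The reduced xlate table and the name out are illustrative and not from the original post.

```r
# Reduced lookup table covering only the codes in the example data (assumption)
xlate <- c(AA = 11, AC = 12, AT = 14, CA = 12, CC = 22, TA = 14, TC = 24)
atgc <- function(x) xlate[x]

X1 <- c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC")
X2 <- c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC")
X3 <- c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA")
mydf1 <- data.frame(X1, X2, X3, stringsAsFactors = FALSE)

# Each column is a character vector, so xlate is indexed by name
out <- sapply(mydf1, atgc)

# The result inherits row names from the lookup table; drop them if unwanted
rownames(out) <- NULL
out[4:6, ]
```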
    

    The match function can take a factor argument together with a target vector of class "character".
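
One way this could look (a sketch, not necessarily the code from the original answer): match() converts a factor to character before matching, so it can locate each value among the names of the lookup table. The helper name atgc_match is hypothetical.

```r
xlate <- c(AA = 11, AC = 12, AG = 13, AT = 14,
           CA = 12, CC = 22, CG = 23, CT = 24,
           GA = 13, GC = 23, GG = 33, GT = 34,
           TA = 14, TC = 24, TG = 34, TT = 44,
           ID = 56, DI = 56, DD = 55, II = 66)

# Hypothetical helper: find each value's position among the table names,
# then return the corresponding code without names
atgc_match <- function(x) unname(xlate[match(x, names(xlate))])

X1 <- c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC")
X2 <- c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC")
X3 <- c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA")
mydf1 <- data.frame(X1, X2, X3, stringsAsFactors = TRUE)  # factor columns

out <- sapply(mydf1, atgc_match)
out[5:8, ]
```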


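    This way you only need to provide a replacement value for each letter in the matrix, and you do not have to double-check that you have covered every combination and matched it correctly, although the combinations in your example are limited.

    Define a list with the values and their replacements: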

    trans <- list(c("A","1"),c("C","2"),c("G","3"),c("T","4"),
      c("I","6"),c("D","5"))
    
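    Create a matrix with the replaced values (in this case, converting mydf1 to a matrix returns character values, as gsub() requires, but check whether that holds for any other data before proceeding).

    The object ansVec is a vector, so convert it back to a data.frame.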


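    Actually, I think you want to represent the original vectors as factors, because they represent a finite set of levels (DNA dinucleotides) rather than arbitrary character values.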

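    If they use stringsAsFactors=FALSE to avoid the factors, I think sapply would work, but I think this might be better.

    @johnck, you could also look at the recode function in the car package, which does what I think you want your atgc function to do.

    The simplest solution is probably just to edit x=xlate[x] to x=xlate[as.character(x)], since that is the bit causing the error. (x is a vector of class "factor", and indexing uses the factor's integer values rather than the associated strings.)

    Also, to get rid of the row names, just do rownames(mydf)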
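
The indexing behaviour described in the comment above can be seen directly. In this sketch (the names f and xlate_small are illustrative), a factor used as an index selects by its integer codes, while as.character() makes the lookup go by name.

```r
xlate_small <- c(AA = 11, AC = 12, AG = 13, AT = 14)
f <- factor(c("AT", "AC", "AT"))   # levels are sorted: "AC" = 1, "AT" = 2

as.integer(f)                 # the codes used for indexing: 2 1 2
xlate_small[f]                # picks elements 2, 1, 2: AC, AA, AC (wrong)
xlate_small[as.character(f)]  # looks up by name: AT, AC, AT (correct)
```

This matches the output in the question, where every AC was reported as AA = 11.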
    
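    Code for the gsub() approach described above:

    # Replace one letter (x[1]) with its digit (x[2]) throughout the data
    atgc2 <- function(myData, x) gsub(x[1], x[2], myData)
    
    # Apply every letter/digit pair in trans in turn, starting from the character matrix
    mymat <- Reduce(atgc2, trans, init = as.matrix(mydf1))
    
    # Sort the digits within each cell so that, e.g., "21" (CA) and "12" (AC) both become 12
    ansVec <- sapply( strsplit( mymat, split = ""),
      function(x) as.numeric( paste0( sort( as.numeric(x) ), collapse = "")))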
    
    ( mydf2 <- data.frame( matrix( ansVec, nrow = nrow(mydf1) ) ) )
    #   X1 X2 X3
    # 1 12 12 12
    # 2 12 12 12
    # 3 12 12 12
    # 4 12 12 12
    # 5 14 14 11
    # 6 14 12 14
    # 7 22 12 22
    # 8 22 24 12
    
    Code for the factor-based approach:

    lvls = c("AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT", "GA", "GC", 
             "GG", "GT", "TA", "TC", "TG", "TT", "ID", "DI", "DD", "II")
    X1 <- factor(c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC"), levels=lvls)
    X2 <- factor(c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC"), levels=lvls)
    X3 <- factor(c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA"), levels=lvls)
    mydf1 <- data.frame(X1, X2, X3)
    
    xlate <- c("AA" = "11", "AC" = "12", "AG" = "13", "AT" = "14",
               "CA"= "12", "CC" = "22", "CG"= "23","CT"= "24",
               "GA" = "13", "GC" = "23", "GG"= "33","GT"= "34",
               "TA"= "14",  "TC" = "24", "TG"= "34","TT"="44",
               "ID"= "56", "DI"= "56", "DD"= "55", "II"= "66")
    
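    # Recode a single factor by replacing its levels (duplicate labels merge levels)
    levels(X1) <- xlate
    
    # Recode every column of the data frame the same way
    as.data.frame(lapply(mydf1, `levels<-`, xlate))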
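
Note that replacing the levels yields factor columns whose labels are the strings "12", "14", and so on. If numeric values are needed, one possible follow-up (an addition here, not part of the original answer) is to convert through character first, since as.numeric() on a factor returns the integer codes. The reduced lvls and xlate tables and the names recoded and numeric_df are hypothetical.

```r
# Reduced level set and translation table for the example data (assumption)
lvls  <- c("AA", "AC", "AT", "CA", "CC", "TA", "TC")
xlate <- c(AA = "11", AC = "12", AT = "14", CA = "12",
           CC = "22", TA = "14", TC = "24")

X1 <- factor(c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC"), levels = lvls)
X2 <- factor(c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC"), levels = lvls)
X3 <- factor(c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA"), levels = lvls)
mydf1 <- data.frame(X1, X2, X3)

# Duplicate labels merge levels, e.g. "AC" and "CA" both become "12"
recoded <- as.data.frame(lapply(mydf1, `levels<-`, xlate))
levels(recoded$X1)

# Go through character to recover the numbers rather than the integer codes
numeric_df <- as.data.frame(
  lapply(recoded, function(f) as.numeric(as.character(f)))
)
numeric_df[5:8, ]
```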