R: dictionary-style replacement of multiple items


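I have a large data.frame of character data that I want to convert based on what other languages would usually call a dictionary.

This is how I am doing it at the moment: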

foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"), snp2 = c("AA", "AT", "AG", "AA"), snp3 = c(NA, "GG", "GG", "GC"), stringsAsFactors=FALSE)
foo <- replace(foo, foo == "AA", "0101")
foo <- replace(foo, foo == "AC", "0102")
foo <- replace(foo, foo == "AG", "0103")

Here is a simple way to get the job done:

key <- c('AA','AC','AG')
val <- c('0101','0102','0103')

lapply(1:3,FUN = function(i){foo[foo == key[i]] <<- val[i]})
foo

 snp1 snp2 snp3
1 0101 0101 <NA>
2 0103   AT   GG
3 0101 0103   GG
4 0101 0101   GC
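
The lapply call above relies on <<- to modify foo from inside the function. As a minimal sketch of the same replacement without global assignment, a plain for loop over the key positions should be enough. The data frame is rebuilt here only so that the example runs on its own.

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)
key <- c("AA", "AC", "AG")
val <- c("0101", "0102", "0103")

# The loop body runs in the calling environment,
# so ordinary assignment updates foo directly
for (i in seq_along(key)) {
  foo[foo == key[i]] <- val[i]
}
foo
```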

Here is a quick solution:

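dict <- list(AA = '0101', AC = '0102', AG = '0103')
foo2 <- foo
# dict[[i]] extracts the value itself; dict[i] would insert a list and turn the columns into list columns
for (i in seq_along(dict)) {foo2 <- replace(foo2, foo2 == names(dict)[i], dict[[i]])}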

With millions of SNPs and thousands of samples, both the matrix and the data frame variants will run into R's limit of 2^31-1 on vector size.
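
One way to stay clear of that limit is to avoid flattening the whole table into a single vector and to recode one column at a time instead. The following is a sketch under that assumption, not a benchmarked solution; the names map and recode_column are illustrative.

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

# Dictionary as a named character vector: names are keys, elements are values
map <- c(AA = "0101", AC = "0102", AG = "0103")

# Recode one character vector, leaving unmatched values and NA untouched
recode_column <- function(x, map) {
  idx <- match(x, names(map))
  hit <- !is.na(idx)
  x[hit] <- unname(map[idx[hit]])
  x
}

# Each intermediate vector is only as long as a single column
foo[] <- lapply(foo, recode_column, map = map)
foo
```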

If you are open to using packages, plyr is a very popular one, and it has a handy function that does what you need:

# mapvalues() requires an atomic vector, so apply it column by column
foo[] <- lapply(foo, plyr::mapvalues, from=c("AA", "AC", "AG"), to=c("0101", "0102", "0103"))

I used @Ramnath's answer above, but had it read the mapping (what to replace, and what to replace it with) from a file, and used gsub instead of replace.

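# Tab-separated file with two columns, `from` and `to`
hrw <- read.csv("hgWords.txt", header=TRUE, stringsAsFactors=FALSE, encoding="UTF-8", sep="\t")

# `document` is the existing character vector to be rewritten.
# seq_len() visits every row; `for (i in nrow(hrw))` would only process the last one.
for (i in seq_len(nrow(hrw)))
{
document <- gsub(hrw$from[i], hrw$to[i], document, ignore.case=TRUE)
}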
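
To try the loop without an input file, the mapping table can be built in memory. This sketch assumes two columns named from and to, as in the code above; the example words and the contents of document are made up for illustration. Note that gsub treats each from entry as a regular expression, so entries containing special characters would need to be escaped.

```r
# Hypothetical mapping table standing in for the contents of the file
hrw <- data.frame(from = c("colour", "centre"),
                  to = c("color", "center"),
                  stringsAsFactors = FALSE)

# Hypothetical text to rewrite
document <- c("The Colour of the centre", "No change here")

# Apply every from/to pair in turn, ignoring case
for (i in seq_len(nrow(hrw))) {
  document <- gsub(hrw$from[i], hrw$to[i], document, ignore.case = TRUE)
}
document
```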

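Note: this answer started out as an attempt to solve a simpler problem posted elsewhere. Unfortunately, that question was closed as a duplicate of this one. So I will try to suggest a solution based on replacing factor levels for both cases, as follows.

If there is only one vector (or one data frame column) whose values need to be replaced, and there is no objection to using factors, we can coerce the vector to a factor and change the factor levels as needed: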

x <- c(1, 1, 4, 4, 5, 5, 1, 1, 2)
x <- factor(x)
x
#[1] 1 1 4 4 5 5 1 1 2
#Levels: 1 2 4 5
replacement_vec <- c("A", "T", "C", "G")
levels(x) <- replacement_vec
x
#[1] A A C C G G A A T
#Levels: A T C G
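
Assigning a plain vector to levels(x) depends on the replacement values being listed in the same order as the sorted levels. A sketch that states each pairing explicitly uses the named-list form of the levels replacement function, where each name is the new level and each element is the old one:

```r
x <- factor(c(1, 1, 4, 4, 5, 5, 1, 1, 2))

# new level = old level, so the order of the entries no longer matters
levels(x) <- list(A = "1", T = "2", C = "4", G = "5")
as.character(x)
```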

If all values across multiple columns of a data frame need to be replaced, the approach can be extended:

foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"), 
                  snp2 = c("AA", "AT", "AG", "AA"), 
                  snp3 = c(NA, "GG", "GG", "GC"), 
                  stringsAsFactors=FALSE)

level_vec <- c("AA", "AC", "AG", "AT", "GC", "GG")
replacement_vec <- c("0101", "0102", "0103", "0104", "0302", "0303")
foo[] <- lapply(foo, function(x) forcats::lvls_revalue(factor(x, levels = level_vec), 
                                                       replacement_vec))
foo
#  snp1 snp2 snp3
#1 0101 0101 <NA>
#2 0103 0104 0303
#3 0101 0103 0303
#4 0101 0101 0302
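
If adding forcats as a dependency is not desired, the labels argument of base R's factor() performs the same level-to-value mapping. A sketch, converting back to character so that the columns remain character vectors:

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

level_vec <- c("AA", "AC", "AG", "AT", "GC", "GG")
replacement_vec <- c("0101", "0102", "0103", "0104", "0302", "0303")

# labels[i] replaces levels[i]; values outside level_vec become NA
foo[] <- lapply(foo, function(x) {
  as.character(factor(x, levels = level_vec, labels = replacement_vec))
})
foo
```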

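Since it has been a few years since the last answer, and a new question on this topic came up tonight only to be closed by a moderator, I will add it here. The poster has a large data frame containing 0, 1, and 2, and wants to change them to AA, AB, and BB respectively.

Use plyr: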

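> df <- data.frame(matrix(sample(c(NA, c("0","1","2")), 100, replace = TRUE), 10))
> df
     X1   X2   X3 X4   X5   X6   X7   X8   X9  X10
1     1    2 <NA>  2    1    2    0    2    0    2
2     0    2    1  1    2    1    1    0    0    1
3     1    0    2  2    1    0 <NA>    0    1 <NA>
4     1    2 <NA>  2    2    2    1    1    0    1
... to 10th row

> df[] <- lapply(df, as.character)
> library(plyr)
> apply(df, 2, function(x) {x <- revalue(x, c("0"="AA","1"="AB","2"="BB")); x})
      X1   X2   X3   X4   X5   X6   X7   X8   X9   X10 
 [1,] "AB" "BB" NA   "BB" "AB" "BB" "AA" "BB" "AA" "BB"
 [2,] "AA" "BB" "AB" "AB" "BB" "AB" "AB" "AA" "AA" "AB"
 [3,] "AB" "AA" "BB" "BB" "AB" "AA" NA   "AA" "AB" NA  
 [4,] "AB" "BB" NA   "BB" "BB" "BB" "AB" "AB" "AA" "AB"
... and so on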
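
For keys such as 0, 1 and 2 it matters that the lookup goes by name rather than by position, since 0 is not a valid positional index. A base R sketch with a small hand-written data frame (the values are illustrative, not the poster's data; map is an assumed name):

```r
df <- data.frame(X1 = c("1", "0", NA, "2"),
                 X2 = c("2", "2", "0", "1"),
                 stringsAsFactors = FALSE)

# Character names make "0" an ordinary key
map <- c("0" = "AA", "1" = "AB", "2" = "BB")

# Index the dictionary by name; NA stays NA, and the result is still a data frame
df[] <- lapply(df, function(x) unname(map[as.character(x)]))
df
```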

Using dplyr::recode:

library(dplyr)

mutate_all(foo, ~recode(., "AA" = "0101", "AC" = "0102", "AG" = "0103",
                        .default = NA_character_))

#   snp1 snp2 snp3
# 1 0101 0101 <NA>
# 2 0103 <NA> <NA>
# 3 0101 0103 <NA>
# 4 0101 0101 <NA>
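
The recode call above spells the mapping out inline. Where the dictionary already exists as a named vector, a sketch along the following lines may be more convenient. It assumes dplyr 1.0 or later for across(), the names dict and recoded are illustrative, and because .default is not set, unmatched values are kept rather than turned into NA.

```r
library(dplyr)

foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

# Dictionary as a named character vector
dict <- c(AA = "0101", AC = "0102", AG = "0103")

# !!! splices the dictionary entries into recode() as named arguments
recoded <- foo %>%
  mutate(across(everything(), ~ recode(.x, !!!dict)))
recoded
```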

library(dplyr)

foo %>%
   mutate_all(~case_when(. == "AA" ~ "0101", 
                         . == "AC" ~ "0102", 
                         . == "AG" ~ "0103", 
                         TRUE ~ .))

#  snp1 snp2 snp3
#1 0101 0101 <NA>
#2 0103   AT   GG
#3 0101 0103   GG
#4 0101 0101   GC
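
Without the final TRUE ~ . clause, any value that matches none of the conditions is changed to NA:

foo %>%
  mutate_all(~case_when(. == "AA" ~ "0101", 
                        . == "AC" ~ "0102", 
                        . == "AG" ~ "0103"))

#  snp1 snp2 snp3
#1 0101 0101 <NA>
#2 0103 <NA> <NA>
#3 0101 0103 <NA>
#4 0101 0101 <NA>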


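Another option using only base R is to create a lookup data frame with the old and new values, unlist the data frame, match its values against the old values, fetch the corresponding new values, and replace: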

lookup <- data.frame(old_val = c("AA", "AC", "AG"), 
                     new_val = c("0101", "0102", "0103"))

foo[] <- lookup$new_val[match(unlist(foo), lookup$old_val)]
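
As written, this lookup turns every value that is missing from lookup$old_val (such as AT or GG) into NA. If such values should be kept instead, a small variation is to fall back to the original value wherever match() finds nothing. A sketch, with flat and idx as illustrative names:

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

lookup <- data.frame(old_val = c("AA", "AC", "AG"),
                     new_val = c("0101", "0102", "0103"),
                     stringsAsFactors = FALSE)

# Flatten all cells column by column, then find each one in the lookup
flat <- unlist(foo, use.names = FALSE)
idx <- match(flat, lookup$old_val)

# Use the new value where a match exists, otherwise keep the old one
foo[] <- ifelse(is.na(idx), flat, lookup$new_val[idx])
foo
```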

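One of the most readable ways to replace values in a string, or in a vector of strings, using a dictionary is stringr::str_replace_all, from the stringr package. The pattern that str_replace_all requires can itself be a dictionary, e.g.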

# 1. Build your dictionary
dictio_replace <- c("AA" = "0101",
                    "AC" = "0102",
                    "AG" = "0103") # short example of a dictionary

# 2. Replace all patterns according to the dictionary values (works on a single string or on one vector of strings)
foo$snp1 <- stringr::str_replace_all(string = foo$snp1,
                                     pattern = dictio_replace)  # only 'pattern' is used here; 'replacement' is unnecessary because the dictionary supplies it
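
Because str_replace_all treats the dictionary names as regular expressions and replaces every match inside each string, a longer cell such as "AAG" would be partly rewritten. If only whole-cell matches should be replaced, one option is to anchor each pattern. A sketch, applied to every column rather than to snp1 alone (anchored is an illustrative name):

```r
library(stringr)

foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

dictio_replace <- c("AA" = "0101", "AC" = "0102", "AG" = "0103")

# Wrap each key in ^...$ so that it has to match the entire string
anchored <- setNames(dictio_replace, paste0("^", names(dictio_replace), "$"))

# Apply the same dictionary to every column
foo[] <- lapply(foo, str_replace_all, pattern = anchored)
foo
```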

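Is your dictionary an R list?

Not at the moment, but it would be easy to make it one.

Maybe this question will help: , .

I would not recommend using the global assignment operator.

@Ramnath Agreed,

This is the only answer that can handle the variant where the original keys are 0:2 and the task is to convert them to the equivalent character values. The highest-voted answer fails because 0 is not an acceptable index. Ramnath's and c.gutierrez's answers also failed in my hands. (I did not test all the answers.) Here is the link to the question:

I like this answer because it keeps the keys and values together. Putting keys and values in separate character vectors means that if one of the vectors is in the wrong order, the dictionary silently mislabels every out-of-order entry. The only change I would suggest is to use R's vectorized notation on the third line, e.g.: sapply(1:3, function(i) replace(foo2, foo2 == names(dict[i]), dict[i]))

*apply functions are not the same as vectorized functions.

Unfortunately, this throws an error in plyr::mapvalues(foo, from = c("AA", "AC", "AG"), to = c("0101", : x must be an atomic vector. This is also documented in ?mapvalues.

This works really well! Thanks, c.gutierrez.

It looks like your input is a data frame and your output is a matrix. I suppose you could coerce it back at the end, though.

This looks like the best option for me, but for some reason I cannot get it to work. The output makes little sense.

FYI: if you use the tidyverse and foo is a tibble, you must coerce it to a data.frame before assigning map[unlist(foo)]; otherwise the number of rows being assigned will differ from the existing data.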