Regex: Replace parts of strings in a data frame based on a condition in R

Tags: regex, r, dna-sequence

I have a data frame like this:

df = read.table(text="REF   Alt S00001  S00002  S00003  S00004  S00005
 TAAGAAG    TAAG    TAAGAAG/TAAGAAG TAAGAAG/TAAG    TAAG/TAAG   TAAGAAG/TAAGAAG TAAGAAG/TAAGAAG
 T  TG  T/T -/- TG/TG   T/T T/T
 CAAAA  CAAA    CAAAA/CAAAA CAAAA/CAAA  CAAAA/CAAAA -/- CAAAA/CAAAA
 TTGT   TTGTGT  TTGT/TTGT   TTGT/TTGT   TTGT/TTGT   TTGTGT/TTGTGT   TTGT/TTGTGT
 GTTT   GTTTTT  GTTT/GTTTTT GTTT/GTTT   GTTT/GTTT   GTTT/GTTT   GTTTTT/GTTTTT", header=T, stringsAsFactors=F)
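I want to replace the "/"-separated elements with "D" or "I", based on the length of the strings in the "REF" and "Alt" columns. An element that matches the longer of the two strings is replaced with "I"; otherwise it is replaced with "D". A "-" is left unchanged. The expected result is therefore: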

REF Alt S00001  S00002  S00003  S00004  S00005
TAAGAAG TAAG    I/I I/D D/D I/I I/I
T   TG  D/D -/- I/I D/D D/D
CAAAA   CAAA    I/I I/D I/I -/- I/I
TTGT    TTGTGT  D/D D/D D/D I/I D/I
GTTT    GTTTTT  D/I D/D D/D D/D I/I

You can create a map from all combinations of REF and Alt to the corresponding combinations of I and D:

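# Label the longer allele of each REF/Alt pair 'I' and the shorter one 'D'
ref_longer <- nchar(df$REF) > nchar(df$Alt)
refalt <- data.frame(
    from=c(df$REF, df$Alt),
    to=c(ifelse(ref_longer, 'I', 'D'), ifelse(ref_longer, 'D', 'I')),
    stringsAsFactors=FALSE)
refalt <- rbind(refalt, c('-', '-'))  # '-' maps to itself
# Enumerate every 'x/y' pairing together with the label pair it maps to
from <- expand.grid(refalt$from, refalt$from)
to <- expand.grid(refalt$to, refalt$to)
map <- paste(to[,1], to[,2], sep='/')
names(map) <- paste(from[,1], from[,2], sep='/')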
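
The rest of this answer did not survive on this page, so how the map was meant to be applied is not shown. The following is only a minimal sketch of one way a named lookup vector like this could be used, assuming the longer string of each REF/Alt pair is labelled "I" and that REF and Alt never have equal length. The names ref_longer and samples are illustrative and not from the original answer.

```r
# Sample data from the question; colClasses keeps a lone "T" from being read as TRUE.
df <- read.table(text = "REF Alt S00001 S00002 S00003 S00004 S00005
TAAGAAG TAAG TAAGAAG/TAAGAAG TAAGAAG/TAAG TAAG/TAAG TAAGAAG/TAAGAAG TAAGAAG/TAAGAAG
T TG T/T -/- TG/TG T/T T/T
CAAAA CAAA CAAAA/CAAAA CAAAA/CAAA CAAAA/CAAAA -/- CAAAA/CAAAA
TTGT TTGTGT TTGT/TTGT TTGT/TTGT TTGT/TTGT TTGTGT/TTGTGT TTGT/TTGTGT
GTTT GTTTTT GTTT/GTTTTT GTTT/GTTT GTTT/GTTT GTTT/GTTT GTTTTT/GTTTTT",
header = TRUE, colClasses = "character")

# Assumption: the longer allele of each pair is the insertion ("I").
ref_longer <- nchar(df$REF) > nchar(df$Alt)
refalt <- data.frame(
  from = c(df$REF, df$Alt, "-"),
  to   = c(ifelse(ref_longer, "I", "D"), ifelse(ref_longer, "D", "I"), "-"),
  stringsAsFactors = FALSE)

# Every "x/y" pairing and the label pair it maps to.
from <- expand.grid(a = refalt$from, b = refalt$from, stringsAsFactors = FALSE)
to   <- expand.grid(a = refalt$to,   b = refalt$to,   stringsAsFactors = FALSE)
map  <- setNames(paste(to$a, to$b, sep = "/"), paste(from$a, from$b, sep = "/"))

# Look each genotype string up by name; unknown genotypes would become NA.
samples <- setdiff(names(df), c("REF", "Alt"))
df[samples] <- lapply(df[samples], function(col) unname(map[col]))
print(df)
```

Note that the map is global rather than per row: a sequence that is the longer allele in one row and the shorter allele in another would receive a single label, so this sketch only suits data where that does not happen, as in the sample above.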

Here is one approach. I used the stringi package because it works well with a vector of patterns and a vector of strings to search.

First, determine which string is shorter and which is longer:

short <- ifelse(nchar(df$Alt) > nchar(df$REF), df$REF, df$Alt)
long <- ifelse(nchar(df$REF) > nchar(df$Alt), df$REF, df$Alt)
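Using these, loop over the columns and assign the replacements as needed. Replace the long pattern first, to avoid problems with strings that match both the short and the long pattern: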

library(stringi)

df[,!(names(df) %in% c("REF", "Alt"))] <- # assign into original df
  lapply(1:(ncol(df) - 2), # - 2 because there are two columns we don't use
    function(ii) stri_replace_all_fixed(df[ ,ii + 2], long, "I")) # + 2 to skip first 2 columns

df[,!(names(df) %in% c("REF", "Alt"))] <- 
  lapply(1:(ncol(df) - 2),
    function(ii) stri_replace_all_fixed(df[ ,ii + 2], short, "D"))

#      REF    Alt S00001 S00002 S00003 S00004 S00005
#1 TAAGAAG   TAAG    I/I    I/D    D/D    I/I    I/I
#2       T     TG    D/D    -/-    I/I    D/D    D/D
#3   CAAAA   CAAA    I/I    I/D    I/I    -/-    I/I
#4    TTGT TTGTGT    D/D    D/D    D/D    I/I    D/I
#5    GTTT GTTTTT    D/I    D/D    D/D    D/D    I/I
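
The ordering matters because the short allele is often a substring of the long one (for example, TAAG inside TAAGAAG). As an alternative that avoids substring matching altogether, here is a minimal base R sketch that splits each genotype on "/" and compares whole alleles. The helper name label_column is hypothetical, and the sketch assumes REF and Alt never have equal length.

```r
# Sample data from the question; colClasses keeps a lone "T" from being read as TRUE.
df <- read.table(text = "REF Alt S00001 S00002 S00003 S00004 S00005
TAAGAAG TAAG TAAGAAG/TAAGAAG TAAGAAG/TAAG TAAG/TAAG TAAGAAG/TAAGAAG TAAGAAG/TAAGAAG
T TG T/T -/- TG/TG T/T T/T
CAAAA CAAA CAAAA/CAAAA CAAAA/CAAA CAAAA/CAAAA -/- CAAAA/CAAAA
TTGT TTGTGT TTGT/TTGT TTGT/TTGT TTGT/TTGT TTGTGT/TTGTGT TTGT/TTGTGT
GTTT GTTTTT GTTT/GTTTTT GTTT/GTTT GTTT/GTTT GTTT/GTTT GTTTTT/GTTTTT",
header = TRUE, colClasses = "character")

# Longer and shorter allele for each row (assumes the lengths always differ).
ref_longer <- nchar(df$REF) > nchar(df$Alt)
long  <- ifelse(ref_longer, df$REF, df$Alt)
short <- ifelse(ref_longer, df$Alt, df$REF)

# Hypothetical helper: relabel one sample column, row by row, by exact allele match.
label_column <- function(col) {
  mapply(function(genotype, lng, sht) {
    alleles <- strsplit(genotype, "/", fixed = TRUE)[[1]]
    # Anything that is neither allele (such as "-") is left unchanged.
    labels <- ifelse(alleles == lng, "I", ifelse(alleles == sht, "D", alleles))
    paste(labels, collapse = "/")
  }, col, long, short, USE.NAMES = FALSE)
}

samples <- setdiff(names(df), c("REF", "Alt"))
df[samples] <- lapply(df[samples], label_column)
print(df)
```

Because each allele is compared as a whole string, the result does not depend on the order in which the long and short alleles are handled.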