Find unique elements across multiple columns by comparing with one column in R
I want to find the unique elements in each string from column 4 to the last column. Each string has two elements separated by "/". Add a column "Alt" that holds any unique elements that differ from the value in the "REF" column, not counting the "-" character. Add another column "Num" that holds the number of those unique elements. The data frame:
CHROM POS REF sample1 sample2 sample3 sample4 sample5
Chr20 84 C C/C C/G C/A C/C C/C
Chr20 102 TAA TAA/TAA TAA/TAA TAA/TA TAA/TAA TA/TA
Chr20 104 ACCCCC ACCCCC/ACCCCCC ACCCCCC/ACCCCCC ACCCCC/ACCCCC ACCCCC/ACCCCC ACCCCC/ACCCCC
Chr20 109 C C/T C/T -/- C/T C/C
Chr20 118 AT A/AT A/A AT/AT AT/ATT AT/T
The expected result is:
CHROM POS REF sample1 sample2 sample3 sample4 sample5 Alt Num
Chr20 84 C C/C C/G C/A C/C -/- A,G 2
Chr20 102 TAA TAA/TAA TAA/TAA TAA/TA TAA/TAA TA/TA TA 1
Chr20 104 ACCCCC ACCCCC/ACCCCCC ACCCCCC/ACCCCCC ACCCCC/ACCCCC ACCCCC/ACCCCC ACCCCC/ACCCCC ACCCCCC 1
Chr20 109 C C/T C/T -/- C/T C/C T 1
Chr20 118 AT A/AT A/A AT/AT AT/ATT AT/T A,ATT,T 3
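Thanks for your help.

You can use apply with MARGIN=1 to loop over the rows. We can split the elements on "/" (strsplit(…)), compare "REF" with the "sample" columns, and extract the elements that do not match.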
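# Drop CHROM and POS so that x[1] is REF and the rest are the sample columns
df1[c('Alt', 'Num')] <- t(apply(df1[-(1:2)], 1, FUN = function(x) {
    x1 <- unlist(strsplit(x[2:length(x)], '/'))  # all alleles from the samples
    x2 <- unlist(strsplit(x[1], '/'))            # the REF allele
    x3 <- unique(x1[!x1 %in% c(x2, '-')])        # non-REF alleles, ignoring '-'
    c(toString(x3), length(x3))}))
# apply() returns a character matrix, so convert Num back to numeric
df1$Num <- as.numeric(df1$Num)
df1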
# CHROM POS REF sample1 sample2 sample3 sample4
#1 Chr20 84 C C/C C/G C/A C/C
#2 Chr20 102 TAA TAA/TAA TAA/TAA TAA/TA TAA/TAA
#3 Chr20 104 ACCCCC ACCCCC/ACCCCCC ACCCCCC/ACCCCCC ACCCCC/ACCCCC ACCCCC/ACCCCC
#4 Chr20 109 C C/T C/T -/- C/T
#5 Chr20 118 AT A/AT A/A AT/AT AT/ATT
# sample5 Alt Num
#1 C/C G, A 2
#2 TA/TA TA 1
#3 ACCCCC/ACCCCC ACCCCCC 1
#4 C/C T 1
#5 AT/T A, ATT, T 3
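To see what the anonymous function does for a single row, the steps can be traced by hand. This is only a minimal sketch using the first row of the example data; the variable names ref, samples, alleles and alt are illustrative and do not appear in the answer's code.

```r
# First row of the example data: REF plus the five sample genotypes
ref     <- "C"
samples <- c("C/C", "C/G", "C/A", "C/C", "C/C")

# Split every genotype on "/" and flatten into one character vector
alleles <- unlist(strsplit(samples, "/", fixed = TRUE))

# Drop alleles equal to REF or to the missing marker "-", then deduplicate
alt <- unique(alleles[!alleles %in% c(ref, "-")])

toString(alt)   # "G, A"
length(alt)     # 2
```

unique() keeps the alleles in order of first appearance, which is why the result reads "G, A" rather than "A, G".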
Data:
df1 <- structure(list(CHROM = c("Chr20", "Chr20", "Chr20", "Chr20",
"Chr20"), POS = c(84L, 102L, 104L, 109L, 118L), REF = c("C",
"TAA", "ACCCCC", "C", "AT"), sample1 = c("C/C", "TAA/TAA", "ACCCCC/ACCCCCC",
"C/T", "A/AT"), sample2 = c("C/G", "TAA/TAA", "ACCCCCC/ACCCCCC",
"C/T", "A/A"), sample3 = c("C/A", "TAA/TA", "ACCCCC/ACCCCC",
"-/-", "AT/AT"), sample4 = c("C/C", "TAA/TAA", "ACCCCC/ACCCCC",
"C/T", "AT/ATT"), sample5 = c("C/C", "TA/TA", "ACCCCC/ACCCCC",
"C/C", "AT/T")), .Names = c("CHROM", "POS", "REF", "sample1",
"sample2", "sample3", "sample4", "sample5"), class = "data.frame",
row.names = c(NA, -5L))
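The expected result in the question lists the alleles sorted and separated by "," without a space ("A,G"), whereas toString() keeps first-appearance order and joins with ", ". If the exact format matters, one possible variant is sketched below. It assumes the sample columns all start with "sample"; the helper find_alt and the two-row data frame df2 are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: sorted, de-duplicated non-REF alleles for one row
find_alt <- function(ref, samples) {
  alleles <- unlist(strsplit(samples, "/", fixed = TRUE))
  sort(unique(alleles[!alleles %in% c(ref, "-")]))
}

# Two rows taken from the example data (rows 1 and 5)
df2 <- data.frame(
  REF     = c("C", "AT"),
  sample1 = c("C/C", "A/AT"),
  sample2 = c("C/G", "A/A"),
  sample3 = c("C/A", "AT/AT"),
  sample4 = c("C/C", "AT/ATT"),
  sample5 = c("C/C", "AT/T"),
  stringsAsFactors = FALSE
)

# Assumption: sample columns are identified by their "sample" prefix
sample_cols <- grep("^sample", names(df2), value = TRUE)

# One character vector of alternative alleles per row
alts <- lapply(seq_len(nrow(df2)), function(i) {
  find_alt(df2$REF[i], unlist(df2[i, sample_cols]))
})

df2$Alt <- vapply(alts, paste, character(1), collapse = ",")
df2$Num <- lengths(alts)   # integer counts, no as.numeric() needed
df2[c("REF", "Alt", "Num")]
```

Keeping the alleles in a list until the last step means Num is computed as an integer directly, rather than passing through the character matrix that apply() returns.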