How to conditionally remove rows in R?
I have a data.frame that looks like this:
SNP CLST A1 A2 FRQ IMP POS CHR BVAL
1 rs2803291 Brahui C T 0.660000 0 1882185 1 878
2 rs2803291 Balochi C T 0.750000 0 1882185 1 878
3 rs2803291 Hazara C T 0.772727 0 1882185 1 878
4 rs2803291 Makrani C T 0.620000 0 1882185 1 878
5 rs2803291 Sindhi C T 0.770833 0 1882185 1 878
6 rs2803291 Pathan C T 0.681818 0 1882185 1 878
53 rs12060022 Brahui T C 0.0600000 1 3108186 1 982
54 rs12060022 Balochi T C 0.0416667 1 3108186 1 982
55 rs12060022 Hazara T C 0.0000000 1 3108186 1 982
56 rs12060022 Makrani T C 0.0200000 1 3108186 1 982
57 rs12060022 Sindhi T C 0.0625000 1 3108186 1 982
58 rs12060022 Pathan T C 1 1 3108186 1 982
105 rs870171 Brahui T G 0.2200000 0 3332664 1 976
106 rs870171 Balochi T G 0.3333330 0 3332664 1 976
107 rs870171 Hazara T G 1 0 3332664 1 976
108 rs870171 Makrani T G 1 0 3332664 1 976
109 rs870171 Sindhi T G 0.2083330 0 3332664 1 976
110 rs870171 Pathan T G 1 0 3332664 1 976
157 rs4282783 Brahui G T 1 1 4090545 1 992
158 rs4282783 Balochi G T 1 1 4090545 1 992
159 rs4282783 Hazara G T 1 1 4090545 1 992
160 rs4282783 Makrani G T 1 1 4090545 1 992
161 rs4282783 Sindhi G T 1 1 4090545 1 992
162 rs4282783 Pathan G T 1 1 4090545 1 992
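I want to remove all rows for a given SNP when every row for that SNP has a value of 1 in the FRQ column. For example, every rs4282783 row has an FRQ of 1, so I want to remove all of those rows. But I do not want to remove row 58, for example, even though its FRQ value is 1. Does anyone have any suggestions?

Here is a base R approach using subsetting and ave. ave computes the group-level (SNP-level) maximum for each observation, which is then used to subset the data: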
df[ave(df$FRQ, df$SNP, FUN=max) < 0.99999,]
SNP CLST A1 A2 FRQ IMP POS CHR BVAL
1 rs2803291 Brahui C T 0.660000 0 1882185 1 878
2 rs2803291 Balochi C T 0.750000 0 1882185 1 878
3 rs2803291 Hazara C T 0.772727 0 1882185 1 878
4 rs2803291 Makrani C T 0.620000 0 1882185 1 878
5 rs2803291 Sindhi C T 0.770833 0 1882185 1 878
6 rs2803291 Pathan C T 0.681818 0 1882185 1 878
Note that I used 0.99999 rather than 1 to avoid, or at least reduce, problems with numerical imprecision.
Data
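One point worth checking against the question: with FUN=max, any SNP that contains at least one FRQ of 1 is dropped, which is why only rs2803291 remains in the output above. If the goal is to drop only the SNPs whose FRQ values are all 1 (so that row 58 is kept), testing the group minimum appears to be the closer match. A minimal sketch on a made-up data frame (the name toy and its values are assumptions for illustration, not the OP's data):

```r
# Hypothetical toy data: SNP "a" is all 1, SNP "b" has a single 1, SNP "c" has none
toy <- data.frame(
  SNP = c("a", "a", "b", "b", "c", "c"),
  FRQ = c(1, 1, 0.06, 1, 0.66, 0.75),
  stringsAsFactors = FALSE
)

# Group-level minimum, repeated for every row of its group
grp_min <- ave(toy$FRQ, toy$SNP, FUN = min)

# Keep a SNP as soon as at least one of its FRQ values is below 1
kept <- toy[grp_min < 0.99999, ]
print(kept)
```

Here only SNP "a" is removed; SNP "b" keeps both rows, including the one whose FRQ is 1.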
df <- read.table(header=TRUE, text="SNP CLST A1 A2 FRQ IMP POS CHR BVAL
1 rs2803291 Brahui C T 0.660000 0 1882185 1 878
2 rs2803291 Balochi C T 0.750000 0 1882185 1 878
3 rs2803291 Hazara C T 0.772727 0 1882185 1 878
4 rs2803291 Makrani C T 0.620000 0 1882185 1 878
5 rs2803291 Sindhi C T 0.770833 0 1882185 1 878
6 rs2803291 Pathan C T 0.681818 0 1882185 1 878
53 rs12060022 Brahui T C 0.0600000 1 3108186 1 982
54 rs12060022 Balochi T C 0.0416667 1 3108186 1 982
55 rs12060022 Hazara T C 0.0000000 1 3108186 1 982
56 rs12060022 Makrani T C 0.0200000 1 3108186 1 982
57 rs12060022 Sindhi T C 0.0625000 1 3108186 1 982
58 rs12060022 Pathan T C 1 1 3108186 1 982
105 rs870171 Brahui T G 0.2200000 0 3332664 1 976
106 rs870171 Balochi T G 0.3333330 0 3332664 1 976
107 rs870171 Hazara T G 1 0 3332664 1 976
108 rs870171 Makrani T G 1 0 3332664 1 976
109 rs870171 Sindhi T G 0.2083330 0 3332664 1 976
110 rs870171 Pathan T G 1 0 3332664 1 976
157 rs4282783 Brahui G T 1 1 4090545 1 992
158 rs4282783 Balochi G T 1 1 4090545 1 992
159 rs4282783 Hazara G T 1 1 4090545 1 992
160 rs4282783 Makrani G T 1 1 4090545 1 992
161 rs4282783 Sindhi G T 1 1 4090545 1 992
162 rs4282783 Pathan G T 1 1 4090545 1 992")
To remove a SNP when all of its FRQ values are equal to 1, you can try:
library(dplyr)
df %>%
group_by(SNP) %>%
filter(!all(FRQ == 1))
Which gives:
# SNP CLST A1 A2 FRQ IMP POS CHR BVAL
# <fctr> <fctr> <fctr> <fctr> <dbl> <int> <int> <int> <int>
#1 rs2803291 Brahui C T 0.6600000 0 1882185 1 878
#2 rs2803291 Balochi C T 0.7500000 0 1882185 1 878
#3 rs2803291 Hazara C T 0.7727270 0 1882185 1 878
#4 rs2803291 Makrani C T 0.6200000 0 1882185 1 878
#5 rs2803291 Sindhi C T 0.7708330 0 1882185 1 878
#6 rs2803291 Pathan C T 0.6818180 0 1882185 1 878
#7 rs12060022 Brahui T C 0.0600000 1 3108186 1 982
#8 rs12060022 Balochi T C 0.0416667 1 3108186 1 982
#9 rs12060022 Hazara T C 0.0000000 1 3108186 1 982
#10 rs12060022 Makrani T C 0.0200000 1 3108186 1 982
#11 rs12060022 Sindhi T C 0.0625000 1 3108186 1 982
#12 rs12060022 Pathan T C 1.0000000 1 3108186 1 982
#13 rs870171 Brahui T G 0.2200000 0 3332664 1 976
#14 rs870171 Balochi T G 0.3333330 0 3332664 1 976
#15 rs870171 Hazara T G 1.0000000 0 3332664 1 976
#16 rs870171 Makrani T G 1.0000000 0 3332664 1 976
#17 rs870171 Sindhi T G 0.2083330 0 3332664 1 976
#18 rs870171 Pathan T G 1.0000000 0 3332664 1 976
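Before filtering, it can help to look at which SNPs the condition would flag. A small base R sketch of that check, run on a hypothetical toy data frame (the names toy, tol, all_one and to_drop are illustrative assumptions) and using a tolerance instead of exact equality:

```r
# Hypothetical toy data: only SNP "a" has FRQ equal to 1 in every row
toy <- data.frame(
  SNP = c("a", "a", "b", "b", "c", "c"),
  FRQ = c(1, 1, 0.06, 1, 0.66, 0.75),
  stringsAsFactors = FALSE
)

tol <- 1e-8  # assumed tolerance for comparing doubles with 1

# One logical per SNP: TRUE when every FRQ value is (numerically) 1
all_one <- tapply(toy$FRQ, toy$SNP, function(x) all(abs(x - 1) < tol))
print(all_one)

# Names of the SNPs that a "drop if all FRQ are 1" filter would remove
to_drop <- names(all_one)[all_one]
print(to_drop)
```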
@imo's answer is more concise, but since I had already written this, I will add it.
In my opinion the logic is slightly clearer.
# Which SNPs are always 1?
# For each SNP value, take the rows with that SNP and test whether all FRQ values are 1
# (dd is assumed to be the same data frame the other answers call df)
rmSNPs <- sapply(unique(dd$SNP), function(x) all(dd$FRQ[dd$SNP == x] == 1))
# New data is the old data minus the rows whose SNP is one of those found above;
# %in% also works when more than one SNP is flagged, which != does not
newdata <- dd[!(dd$SNP %in% unique(dd$SNP)[rmSNPs]), ]
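If more than one SNP turns out to be all 1, comparing the SNP column with != against a vector of several names recycles that vector element by element and can keep rows that should go; %in% avoids this. A sketch under assumed toy data (dd_toy, snps, rm_flags and newdata_toy are hypothetical names, and the values are made up):

```r
# Hypothetical toy data: SNPs "a" and "c" are all 1, SNP "b" is not
dd_toy <- data.frame(
  SNP = c("a", "a", "b", "b", "c", "c"),
  FRQ = c(1, 1, 0.2, 1, 1, 1),
  stringsAsFactors = FALSE
)

snps <- unique(dd_toy$SNP)

# TRUE where every FRQ value for that SNP is 1
rm_flags <- vapply(
  snps,
  function(x) all(dd_toy$FRQ[dd_toy$SNP == x] == 1),
  logical(1)
)

# Keep only the rows whose SNP is not among the flagged ones
newdata_toy <- dd_toy[!(dd_toy$SNP %in% snps[rm_flags]), ]
print(newdata_toy)
```

With these values only the two rows of SNP "b" remain.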
Here is a data.table-based approach. Convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'SNP', and if not (!) all elements of 'FRQ' are 1, return the Subset of Data.table (.SD) for that group:
library(data.table)
setDT(df)[, if(!(all(FRQ==1))) .SD , by = SNP]
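Or, if the OP meant to remove every SNP that has only one distinct 'FRQ' value, we can use uniqueN to count the unique elements and use that in the if condition, keeping only the SNPs that have more than one: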
setDT(df)[, if(uniqueN(FRQ) > 1) .SD , by = SNP]
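The two conditions are not interchangeable: a SNP whose FRQ is constant but different from 1 passes the first test and fails the second. A base R sketch of the same two tests (this is not the data.table code itself; the toy data and the names cond1, cond2, keep1 and keep2 are assumptions for illustration):

```r
# Hypothetical toy data: "a" is all 1, "b" is constant at 0.5, "c" varies
toy <- data.frame(
  SNP = c("a", "a", "b", "b", "c", "c"),
  FRQ = c(1, 1, 0.5, 0.5, 0.2, 1),
  stringsAsFactors = FALSE
)

# Condition 1: the SNP is not all 1
cond1 <- tapply(toy$FRQ, toy$SNP, function(x) !all(x == 1))
# Condition 2: the SNP has more than one distinct FRQ value
cond2 <- tapply(toy$FRQ, toy$SNP, function(x) length(unique(x)) > 1)

keep1 <- names(cond1)[cond1]  # SNPs kept by the "not all 1" test
keep2 <- names(cond2)[cond2]  # SNPs kept by the "more than one value" test
print(keep1)
print(keep2)
```

SNP "b" is kept by the first test only, so the choice between the two depends on what the OP intends for SNPs with a constant FRQ other than 1.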
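Nice. When we have very similar approaches, it may be better to leave a comment on the other answer so that we can consolidate them into a single answer. +1
@StevenBeaupré I think the data.table syntax can stand as a separate answer.
@StevenBeaupré I added another approach.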