R: remove a group if any row in the group meets a condition


Sample data:

data =data.frame(id=c(1,1,1,2,2,2,3,3,3,4,4,4),
                 score=c(5,7,6,9,8,4,NA,11,3,7,NA,10))

So in this example, if any score for an id equals 7, I want to remove every row of that id, giving a new data frame such as:

data2 =data.frame(id=c(2,2,2,3,3,3),
                     score=c(9,8,4,NA,11,3))

I tried data[data$score != 7], but this only works on individual rows, not on the whole group.

We can do:

library(dplyr)

data %>%
  anti_join(data %>% filter(score == 7), by = "id")

Output: the six rows for ids 2 and 3, matching data2 above.

With dplyr, we can group_by id and, within each group, filter on whether the score variable contains no 7s (hence the !):

library(dplyr)
data %>%
    group_by(id) %>%
    filter(!(7 %in% score))

# A tibble: 6 x 2
# Groups:   id [2]
     id score
  <dbl> <dbl>
1     2     9
2     2     8
3     2     4
4     3    NA
5     3    11
6     3     3

Use ave to keep every group for which !any(x == 7, na.rm = TRUE) is TRUE. This one-liner uses only base R.

subset(data, !ave(score, id, FUN = function(x) any(x == 7, na.rm = TRUE)))

This gives the six rows for ids 2 and 3, with their original row names 4 to 9.

If you want a solution that does not require any packages, you can try:

data[!(data$id %in% data$id[data$score == 7]) , ]


  id score
4  2     9
5  2     8
6  2     4
7  3    NA
8  3    11
9  3     3

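To explain a bit: data$id[data$score == 7] finds the ids that have a score of 7. We then use %in% to build a logical vector that is TRUE for every row of the original data frame whose id is among them, i.e. data$id %in% data$id[data$score == 7]. Finally, we wrap that in !( ) to drop those ids.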
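The steps described above can be spelled out one at a time. This is a minimal sketch, not the answerer's code: the names bad_ids, drop_row and result are hypothetical, and which() is swapped in to discard the NA comparisons explicitly instead of relying on %in% ignoring them.

```r
data <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
                   score = c(5, 7, 6, 9, 8, 4, NA, 11, 3, 7, NA, 10))

# Step 1: ids that have at least one score equal to 7.
# which() keeps only the positions where the comparison is TRUE,
# so rows with an NA score are skipped.
bad_ids <- unique(data$id[which(data$score == 7)])

# Step 2: logical vector, TRUE for every row belonging to one of those ids.
drop_row <- data$id %in% bad_ids

# Step 3: keep the rows that are not flagged.
result <- data[!drop_row, ]
result
```

Storing the intermediate vectors makes it easy to inspect which ids were flagged before any rows are removed.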

This may be a ridiculous level of overkill, but we can benchmark all of the options proposed so far:

library(dplyr)
library(microbenchmark)

microbenchmark(`G. Grothendieck` = subset(data, !ave(score, id, FUN = function(x) any(x == 7, na.rm = TRUE))), 
           `Nick Criswell` = data[!(data$id %in% data$id[data$score == 7]) , ],
           divibisan = data %>%
             group_by(id) %>%
             filter(!(7 %in% score)),
           arg0naut = data %>%
             anti_join(data %>% filter(score == 7), by = "id"),
           tmfmnk = data %>%
             group_by(id) %>%
             filter(!any(score == 7, na.rm = TRUE)),
           `d.b` = data[!data$id %in% split(data$id, data$score)$`7`,])


     Unit: microseconds
            expr     min      lq     mean   median       uq       max neval
 G. Grothendieck 160.001 177.455 189.4648 185.4545 195.6370   269.576   100
   Nick Criswell  37.819  45.091  52.2820  53.8190  57.2130    93.576   100
       divibisan 443.636 456.000 480.1211 464.0000 489.4545   904.726   100
        arg0naut 733.091 757.818 806.7143 766.0600 805.3325  1543.755   100
          tmfmnk 444.121 457.939 704.8916 463.0300 479.5150 22332.079   100
             d.b 103.759 114.424 125.3291 122.1825 131.8800   202.182   100
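Before reading much into the timings, it may be worth confirming that the candidates return the same rows. The sketch below does this for the three base R variants only, so that it runs without extra packages; the names res_ave, res_in and res_split are hypothetical.

```r
data <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
                   score = c(5, 7, 6, 9, 8, 4, NA, 11, 3, 7, NA, 10))

# ave() broadcasts the per-group result of any() back to every row.
res_ave <- subset(data, !ave(score, id, FUN = function(x) any(x == 7, na.rm = TRUE)))

# %in% against the ids that have a score of 7.
res_in <- data[!(data$id %in% data$id[data$score == 7]), ]

# split() groups the ids by score; the element named "7" holds the ids to drop.
res_split <- data[!data$id %in% split(data$id, data$score)$`7`, ]

# Compare the results pairwise.
all.equal(res_ave, res_in)
all.equal(res_in, res_split)
```

With only 12 rows, the measured differences probably reflect fixed per-call overhead more than how each approach scales, so a rerun on a larger data frame would be needed before drawing conclusions.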

Another possibility with dplyr:

data %>%
 group_by(id) %>%
 filter(!any(score == 7, na.rm = TRUE))

     id score
  <dbl> <dbl>
1     2     9
2     2     8
3     2     4
4     3    NA
5     3    11
6     3     3
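The na.rm = TRUE in this variant matters because of how any() treats missing values, whereas %in% never returns NA. A small base R sketch of the difference, using the scores of id 3 (the vector name x is just for illustration):

```r
x <- c(NA, 11, 3)   # scores of id 3 in the sample data

in_result     <- 7 %in% x                     # FALSE
any_result    <- any(x == 7)                  # NA, because one comparison is NA
any_rm_result <- any(x == 7, na.rm = TRUE)    # FALSE

in_result
any_result
any_rm_result
```

Without na.rm = TRUE the condition for id 3 would be NA, and since filter() keeps only rows where the condition is TRUE, that group would be dropped along with ids 1 and 4.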

Or the same with base R:

data[!ave(data$score, data$id, FUN = function(x) any(cumsum(ifelse(is.na(x), 0, x) == 7) >= 1)), ]

  id score
4  2     9
5  2     8
6  2     4
7  3    NA
8  3    11
9  3     3

Or a possibility similar to @G.Grothendieck's, but without subset:

data[!ave(data$score, data$id, FUN = function(x) any(x == 7, na.rm = TRUE)), ]

Comments:

Oh, I edited your post by mistake instead of mine. Very sorry!

No problem, it happens.

Say you only wanted to create an "indicator" variable rather than subsetting the data?
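On the question raised in the comments about creating an indicator variable instead of subsetting, one option is to store the ave() result as a column. This is only a sketch under that reading of the question; the column name has_7 is hypothetical.

```r
data <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
                   score = c(5, 7, 6, 9, 8, 4, NA, 11, 3, 7, NA, 10))

# ave() returns one value per row (1 if the row's group contains a 7, else 0);
# as.logical() turns that into TRUE/FALSE.
data$has_7 <- as.logical(
  ave(data$score, data$id, FUN = function(x) any(x == 7, na.rm = TRUE))
)
data
```

The original rows are all kept, and the subset can still be recovered later with data[!data$has_7, ].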

data[!data$id %in% split(data$id, data$score)$`7`,]
#  id score
#4  2     9
#5  2     8
#6  2     4
#7  3    NA
#8  3    11
#9  3     3