Select groups with dplyr based on the proportion of items found in a list (R, dplyr)

Hello, I have a data frame such as:

Groups COL1
G1 horse
G1 donkey
G1 unknown
G1 snake 
G1 horse 
G2 dog
G2 dog
G2 unknown
G2 unknown
G3 donkey
G3 dog
G4 Mule
G4 dog
G4 cat 
G4 cat
G5 mule
G5 donkey
G5 mule
and a vector:

list_not_accepted=c("horse","donkey","mule")

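So basically, the idea is to keep only the groups where the number of COL1 elements found in list_not_accepted, divided by the total number of COL1 values, is below a threshold (the answers below use 0.6).

I would suggest the following approach: join the original data with small summary data frames, compute the ratio, and filter on it. A left_join merges the number of rows per group with the number of rows per group whose COL1 is in list_not_accepted. From those two counts the Ratio variable is computed and the groups are filtered against the threshold. An inner join then brings the surviving groups back to the original rows, which gives the expected output. The code: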

library(tidyverse)
# Vector of values that are not accepted
list_not_accepted=c("horse","donkey","mule")
# Keep only the rows of mydf whose group passes the ratio filter
mydf %>% 
  inner_join(mydf %>%
              # Count all observations per group
              group_by(Groups) %>%
              summarise(N=n()) %>% 
              # Left join with the per-group count of values in list_not_accepted
              left_join(mydf %>% 
                          filter(COL1 %in% list_not_accepted) %>%
                          group_by(Groups) %>%
                          summarise(N1=n()), by="Groups") %>%
              # Groups with no match get NA from the join; set them to 0
              replace(is.na(.),0) %>%
              # Compute the ratio and keep groups below the threshold
              mutate(Ratio=N1/N) %>%
              filter(Ratio<0.6), by="Groups") %>% select(Groups,COL1)
Output:

   Groups    COL1
1      G2     dog
2      G2     dog
3      G2 unknown
4      G2 unknown
5      G3  donkey
6      G3     dog
7      G4    Mule
8      G4     dog
9      G4     cat
10     G4     cat

Data used:

#Data
mydf <- structure(list(Groups = c("G1", "G1", "G1", "G1", "G1", "G2", 
"G2", "G2", "G2", "G3", "G3", "G4", "G4", "G4", "G4", "G5", "G5", 
"G5"), COL1 = c("horse", "donkey", "unknown", "snake", "horse", 
"dog", "dog", "unknown", "unknown", "donkey", "dog", "Mule", 
"dog", "cat", "cat", "mule", "donkey", "mule")), class = "data.frame", row.names = c(NA, 
-18L))

A dplyr solution with filter():

The first element in G4 is Mule. By your description it should match mule in list_not_accepted, so I convert all of COL1 to lowercase before matching.

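library(dplyr)

# mydf and list_not_accepted as defined above
mydf %>%
  group_by(Groups) %>%
  # Keep a group only if its share of not-accepted values is below 0.6
  filter(sum(tolower(COL1) %in% list_not_accepted) / n() < 0.6)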

# A tibble: 10 x 2
# Groups:   Groups [3]
#    Groups COL1   
#    <chr>  <chr>  
#  1 G2     dog    
#  2 G2     dog    
#  3 G2     unknown
#  4 G2     unknown
#  5 G3     donkey 
#  6 G3     dog    
#  7 G4     Mule   
#  8 G4     dog    
#  9 G4     cat    
# 10 G4     cat
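
To see why G1 and G5 are dropped, it can help to look at the per-group ratios before filtering. The following is only a minimal sketch of that check, not part of either answer: it rebuilds the same data, and the column names n_bad, n_all and ratio are illustrative choices, not anything from the original posts.

```r
library(dplyr)

# Same data as in the question, rebuilt here so the block runs on its own
mydf <- data.frame(
  Groups = c("G1", "G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2",
             "G3", "G3", "G4", "G4", "G4", "G4", "G5", "G5", "G5"),
  COL1 = c("horse", "donkey", "unknown", "snake", "horse",
           "dog", "dog", "unknown", "unknown", "donkey", "dog",
           "Mule", "dog", "cat", "cat", "mule", "donkey", "mule")
)
list_not_accepted <- c("horse", "donkey", "mule")

# One row per group: matches, total rows, and their ratio
# (n_bad, n_all and ratio are hypothetical names used for illustration)
ratios <- mydf %>%
  group_by(Groups) %>%
  summarise(
    n_bad = sum(tolower(COL1) %in% list_not_accepted),
    n_all = n(),
    ratio = n_bad / n_all
  )

ratios
```

With case-insensitive matching the ratios come out as 0.6 for G1, 0 for G2, 0.5 for G3, 0.25 for G4 and 1 for G5. G1 sits exactly on the threshold, so the strict comparison < 0.6 excludes it; using <= 0.6 instead would keep it.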