R: Group by one variable, then count/filter by the number of occurrences of another variable


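I have a data frame with a categorical variable and a state column. For each state, I want to find the most common value of the categorical variable and filter out the rest.

For example:

For Alabama, cat_variable_2 is the most common value, so the rows containing cat_variable_2 would be the only rows left for Alabama in the data frame. The same would apply to every state.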

1  Alabama   cat_variable_2
2  Alabama   cat_variable_2

Thank you.

You can filter for the value of variable that occurs most often within each state:

library(dplyr)
df %>% group_by(state) %>% filter(variable == names(which.max(table(variable))))

#   state   variable      
#  <chr>   <chr>         
#1 Alabama cat_variable_2
#2 Alabama cat_variable_2
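
As a rough illustration of what the filter condition computes inside each group, here is a base R sketch. The vectors `variable` and `tied` are made-up stand-ins for one state's column, not data from the question. It also shows that `which.max()` returns only the first maximum, so on a tie a single value wins; the last lines show one possible way to keep every tied value instead.

```r
# Hypothetical vector standing in for one state's `variable` column.
variable <- c("cat_variable_1", "cat_variable_2", "cat_variable_2", "cat_variable_3")

counts <- table(variable)          # frequency of each distinct value
top <- names(which.max(counts))    # name of the first maximum count
print(top)

# With a tie, which.max() returns only the first maximum; table() orders
# character values by sort order, so the earliest value wins.
tied <- c("b", "a", "b", "a")
print(names(which.max(table(tied))))

# To keep every tied value, compare each count against the maximum instead.
tied_counts <- table(tied)
all_top <- names(tied_counts)[tied_counts == max(tied_counts)]
print(all_top)
```

If ties matter for your data, the same comparison against the maximum count could replace `which.max()` in the grouped filter.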

The same logic with data.table:

library(data.table)
setDT(df)[, .SD[variable == names(which.max(table(variable)))], state]

Data:

df <- structure(list(state = c("Alabama", "Alabama", "Alabama", "Alabama"
), variable = c("cat_variable_1", "cat_variable_2", "cat_variable_2", 
"cat_variable_3")), row.names = c(NA, -4L), class = "data.frame")

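One approach is to create a new df containing the desired state/category combinations, then use dplyr::inner_join against the original df to keep only those combinations.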

library(dplyr)

## An example df with two "states" with different most common cat_var.
df <- tibble(
  state = gl(2, 50, labels = c("AL", "NY")),
  cat_var = case_when(
    state == "AL" ~ sample(1:3, 100, TRUE, prob = c(.2, .3, .5)),
    state == "NY" ~ sample(1:3, 100, TRUE, prob = c(.5, .3, .2))
  ),
  y = rnorm(100)
)

## Keeps the cat_var in each state that is most common, giving a df
## with each state--cat_var comb that we can filter against.
state_vars <-
  df %>%
  count(state, cat_var, sort = TRUE) %>%
  group_by(state) %>%
  slice(1) %>%
  ungroup()

## Use `inner_join` to only keep those comb in `state_vars`.
inner_join(df, state_vars, by = c("state", "cat_var"))
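
Note that `inner_join` also carries over the `n` column from `state_vars`. If only the original columns are wanted, one possible variation is `semi_join`, which filters by matching keys without adding columns. The sketch below uses a small made-up data frame (the values are hypothetical, chosen so the result is deterministic) rather than the random example above.

```r
library(dplyr)

# Hypothetical toy data: "B" is most common in AL, "A" in NY.
df <- tibble(
  state   = c("AL", "AL", "AL", "NY", "NY", "NY"),
  cat_var = c("A", "B", "B", "A", "A", "C"),
  y       = 1:6
)

# Most common cat_var per state, same steps as in the answer above.
state_vars <- df %>%
  count(state, cat_var, sort = TRUE) %>%
  group_by(state) %>%
  slice(1) %>%
  ungroup()

# semi_join() keeps the rows of df whose keys appear in state_vars,
# without adding the `n` column from state_vars.
result <- semi_join(df, state_vars, by = c("state", "cat_var"))
print(result)
```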