高效的dplyr风格操作
我有两个玩具示例表: 表1: 表2:高效的dplyr风格操作,r,dplyr,R,Dplyr,我有两个玩具示例表: 表1: 表2: all_students <- data.frame(student_id = c("RA123","RB123","RC123","RA456","RB456",'RC456'), school_id = c(1,1,1,1,1,2), grade_level = c(10,10,9,9,11,11),
all_students <- data.frame(student_id = c("RA123","RB123","RC123","RA456","RB456",'RC456'),
school_id = c(1,1,1,1,1,2),
grade_level = c(10,10,9,9,11,11),
date_of_birth = c("1990-02-02","1990-02-02","1991-01-01","1991-02-01","1989-02-02","1989-02-02"),
hometown = c("farm","farm","farm","farm","farm","city"),
stringsAsFactors = F)
> all_students
student_id school_id grade_level date_of_birth hometown
1 RA123 1 10 1990-02-02 farm
2 RB123 1 10 1990-02-02 farm
3 RC123 1 9 1991-01-01 farm
4 RA456 1 9 1991-02-01 farm
5 RB456 1 11 1989-02-02 farm
6 RC456 2 11 1989-02-02 city
如果学生当天缺席,则出勤事件的出勤率为0
我的问题是,在R中,找出2020-02-01和2020-02-02之间入学率下降幅度最大的年级的最有效方法是什么
我的代码是:
#Only include absences because it will be a smaller dataset
att_ws_alt <- inner_join(attendance_events, all_students[,c("student_id","grade_level")], by = "student_id") %>%
filter(attendance == 0)
#Set days to check between
date_from <- "2020-02-01"
date_to <- "2020-02-02"
#Continously pipe to not have to store and reference(?)
att_drop_alt <- att_ws_alt %>%
filter(dates %in% c(date_from, date_to)) %>%
group_by(grade_level,dates) %>%
summarize(absence_bydate = n()) %>%
dcast(grade_level ~ dates) %>%
sapply(FUN = function(x) { x[is.na(x)] <- 0; x}) %>%
as.data.frame() %>%
mutate("absence_change" = .[,3] - .[,2]) %>%
select(grade_level, absence_change) %>%
arrange(desc(absence_change))
>att_drop_alt
grade_level absence_change
1 10 2
2 11 1
3 9 0
然而,对于一个看起来相当简单的问题来说,这感觉有点复杂。我想看看R程序员回答这个问题的其他方法,理想情况下可以获得更好的性能,但即使是可读性也很好
谢谢社区 我想这更简洁一点: left\u Join出席\u活动,所有学生,by=student\u id%>% 按等级分组,日期%>% SummarseAttainance=SummarseAttainment%>% 组(按等级)级别%>% 汇总趋势变化=差异出席 >一个tibble:3x2 >年级\级别出勤\变更 > > 1 9 0 > 2 10 -2 > 3 11 -1 当然,如果你想计算缺勤率而不是出勤率,只需在最后一行的差异前加一个负号。如果这不能准确回答你的问题,很抱歉,但我不想不公平地指责学生比他们缺席的时间多
library(dplyr)
all_students %>%
left_join(attendance_events) %>%
mutate(dates = as.Date(dates)) %>%
group_by(grade_level, dates) %>%
summarise(NAbs = sum(ifelse(attendance == 0, 1, 0)),
N = n(),
pctAbs = NAbs / n() * 100) %>%
arrange(dates) %>%
mutate(change = pctAbs - lag(pctAbs)) %>%
ungroup() %>%
arrange(change)
# A tibble: 6 x 6
dates grade_level NAbs N pctAbs change
<date> <dbl> <dbl> <int> <dbl> <dbl>
1 2020-02-02 9 1 2 50 0
2 2020-02-02 11 1 2 50 50
3 2020-02-02 10 2 2 100 100
4 2020-02-01 9 1 2 50 NA
5 2020-02-01 10 0 2 0 NA
6 2020-02-01 11 0 2 0 NA
使用data.table
您不想评估相对出勤/缺勤的变化吗?我的意思是,如果你为了演示而极端假设,只需将当年的学生人数增加三倍,并将相对缺课人数减半,那么总缺课人数还会增加吗?对不起,如果我误解了什么是的,我同意,但这只是个问题。我想反点是,如果你需要做出的决定是基于学生,那么你关心的是绝对数字而不是百分比。是的,这很好,我真的没有想到像那样垂直使用diff。我想我只有在同时得到我想要的两个号码后才能得到零钱。将等待更多的答案,然后再标记为正确我想这一定是这个代码的最短版本!不幸的是,在我的回答中,我无法让diff工作。。。还是不确定why@dario我只使用了diff,因为我使用了summary,其中每个组只有两个日期,所以它给出了一个值。此方法无法跨多个日期工作。明白!谢谢你的精彩解释!现在有一个更短的data.table答案。我希望有一天能掌握它。但在那之前,代码在我看来就像ascii码的艺术,需要永远阅读;是的,这比我的版本@dario更健壮。我试着保持简短,但也许它只是需要额外的努力。
library(dplyr)
all_students %>%
left_join(attendance_events) %>%
mutate(dates = as.Date(dates)) %>%
group_by(grade_level, dates) %>%
summarise(NAbs = sum(ifelse(attendance == 0, 1, 0)),
N = n(),
pctAbs = NAbs / n() * 100) %>%
arrange(dates) %>%
mutate(change = pctAbs - lag(pctAbs)) %>%
ungroup() %>%
arrange(change)
# A tibble: 6 x 6
dates grade_level NAbs N pctAbs change
<date> <dbl> <dbl> <int> <dbl> <dbl>
1 2020-02-02 9 1 2 50 0
2 2020-02-02 11 1 2 50 50
3 2020-02-02 10 2 2 100 100
4 2020-02-01 9 1 2 50 NA
5 2020-02-01 10 0 2 0 NA
6 2020-02-01 11 0 2 0 NA
library(data.table)
setDT(attendance_events)[all_students, .SD[, .(sum(attendance)),
.(grade_level, dates)], on = .(student_id)][,
.(attendanace_change = diff(rev(V1))), .(grade_level)]
# grade_level attendanace_change
#1: 10 2
#2: 9 0
#3: 11 1