How to run an efficient group-by statement in R with dplyr
I have a dataset with many repeated IDs, and each ID can carry several different categorical values. Here is a sample dataset:
suppressMessages(library(dplyr))
DUMMY_DATA <- data.frame(ID = c(11,22,22,33,33,33,44,44,55,55,55,55),
                         CATEGORY1 = c("E","B","C","C","C","D","A","A","B","C","E","B"),
                         CATEGORY2 = c("AA","AA","BB","CC","DD","BB","AA","EE","AA","CC","BB","EE"),
                         stringsAsFactors = FALSE)
> DUMMY_DATA
ID CATEGORY1 CATEGORY2
1 11 E AA
2 22 B AA
3 22 C BB
4 33 C CC
5 33 C DD
6 33 D BB
7 44 A AA
8 44 A EE
9 55 B AA
10 55 C CC
11 55 E BB
12 55 B EE
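I want to aggregate the values for each ID using a second dataset that assigns a rank to every categorical value. The rank tables look like this: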
Category_Rank1 <- data.frame(VAR = c("A","B","C","D","E"),
RANK = c(1,2,3,4,5),stringsAsFactors = FALSE
)
> Category_Rank1
VAR RANK
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
Category_Rank2 <- data.frame(VAR = c("AA","BB","CC","DD","EE"),
RANK = c(1,2,3,4,5),stringsAsFactors = FALSE
)
> Category_Rank2
VAR RANK
1 AA 1
2 BB 2
3 CC 3
4 DD 4
5 EE 5
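For each group of IDs in DUMMY_DATA, I want to look up the rank of every category and then assign the ID the category with the best (lowest) rank. Here is my solution: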
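# Return the best-ranked category among the values in x.
# dataset: rank table with categories in column 1 and ranks in column 2.
hierarchyTransform <- function(x, dataset){
  x <- unique(x)
  # Keep only the rank rows for categories present in this group
  dataset <- dataset %>%
    filter(dataset[,1] %in% x)
  # Keep the row(s) with the lowest rank
  dataset <- dataset %>%
    filter(dataset[,2] == min(dataset[,2]))
  return(dataset[1,1])
}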
NEW_DATA <- DUMMY_DATA%>%
group_by(ID)%>%
summarise(CATEGORY1_CLEAN = hierarchyTransform(x=CATEGORY1,
dataset = Category_Rank1),
CATEGORY2_CLEAN = hierarchyTransform(x=CATEGORY2,
dataset = Category_Rank2))
I get the following result:
> NEW_DATA
# A tibble: 5 × 3
ID CATEGORY1_CLEAN CATEGORY2_CLEAN
<dbl> <chr> <chr>
1 11 E AA
2 22 B AA
3 33 C BB
4 44 A AA
5 55 B AA
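This is exactly what I want, but the operation is slow. My original dataset has about one million rows, and grouping it by ID gives about 200,000 groups. hierarchyTransform is therefore applied to 200,000 groups, which takes about 15 minutes for a single variable, and I have to repeat this for 10 other variables, which adds to the total time. Is there any way to reduce the time this operation takes?

If you know the rank order of the category levels (alphabetical in your example), you can convert each category column to a factor whose levels are ordered by the desired rank. Then sort by category, group by ID, and take the first row for each ID.

Update, in response to your comment and the updated question: the code below selects, for each ID, the best-ranked value from each category column.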
DUMMY_DATA$CATEGORY1 = factor(DUMMY_DATA$CATEGORY1, levels=LETTERS[1:5], ordered=TRUE)
DUMMY_DATA$CATEGORY2 = factor(DUMMY_DATA$CATEGORY2, levels=c("AA","BB","CC","DD","EE"), ordered=TRUE)
Now you can do either of the following:
DUMMY_DATA %>% group_by(ID) %>%
summarise(CATEGORY1 = min(CATEGORY1),
CATEGORY2 = min(CATEGORY2))
# Passing min directly avoids funs(), which newer dplyr releases deprecate
DUMMY_DATA %>% group_by(ID) %>%
  summarise_all(min)
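When there are many category columns, each with its own rank table, the factor levels do not have to be typed by hand. The following is a minimal sketch, not the answerer's code: it assumes dplyr 1.0.0 or later (for `across()`), and the `rank_tables` list and the `ranked` copy are hypothetical names introduced only for illustration. It builds the levels from each rank table and then takes the per-ID minimum of every listed column.

```r
suppressMessages(library(dplyr))

# Sample data and rank tables from the question
DUMMY_DATA <- data.frame(ID = c(11,22,22,33,33,33,44,44,55,55,55,55),
                         CATEGORY1 = c("E","B","C","C","C","D","A","A","B","C","E","B"),
                         CATEGORY2 = c("AA","AA","BB","CC","DD","BB","AA","EE","AA","CC","BB","EE"),
                         stringsAsFactors = FALSE)
Category_Rank1 <- data.frame(VAR = c("A","B","C","D","E"),
                             RANK = c(1,2,3,4,5), stringsAsFactors = FALSE)
Category_Rank2 <- data.frame(VAR = c("AA","BB","CC","DD","EE"),
                             RANK = c(1,2,3,4,5), stringsAsFactors = FALSE)

# Hypothetical mapping (assumption): column name -> its rank table
rank_tables <- list(CATEGORY1 = Category_Rank1, CATEGORY2 = Category_Rank2)

ranked <- DUMMY_DATA
for (col in names(rank_tables)) {
  tbl <- rank_tables[[col]]
  # Order the levels by RANK so that the best rank is the smallest level
  ranked[[col]] <- factor(ranked[[col]],
                          levels = tbl$VAR[order(tbl$RANK)],
                          ordered = TRUE)
}

# One grouped pass over all listed columns; min() works on ordered factors
NEW_DATA <- ranked %>%
  group_by(ID) %>%
  summarise(across(all_of(names(rank_tables)), min))
```

Adding another variable then only means adding one entry to the list, and the grouped summary itself stays unchanged.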
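What if my categorical values are age groups, c("60-70","70-75","75-80","80-85","85-90","90-95","95-120")? What can I do in that case?
Yes, just set factor(df$age.ranges, levels=c("60-70","70-75","75-80","80-85","85-90","90-95","95-120"), ordered=TRUE) when you convert that variable to a factor. Then you can sort and slice as in my answer.
I have changed the question slightly. I actually have multiple columns, and the arrange function does not give the desired result in that case.
Thanks a lot @eipi10, this is exactly what I needed, and it is fast.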
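The age-group suggestion, together with the "sort, group, take the first row" approach for a single column, could look roughly like this. It is only a sketch: the data frame `df` and its values are made up for illustration, and only the level order comes from the comment above.

```r
suppressMessages(library(dplyr))

# Level order taken from the comment thread
age_levels <- c("60-70","70-75","75-80","80-85","85-90","90-95","95-120")

# Hypothetical data (assumption): two IDs with several age groups each
df <- data.frame(ID = c(1, 1, 2, 2, 2),
                 age.ranges = c("75-80","60-70","95-120","85-90","90-95"),
                 stringsAsFactors = FALSE)

# Ordered factor: comparisons and sorting follow age_levels, not the alphabet
df$age.ranges <- factor(df$age.ranges, levels = age_levels, ordered = TRUE)

# Sort by the ordered factor, then keep the first row of each ID
lowest_per_id <- df %>%
  arrange(age.ranges) %>%
  group_by(ID) %>%
  slice(1) %>%
  ungroup()
```

Because a single row is kept per ID, this form suits one ranked column at a time. With several independently ranked columns, a per-column `min()` is the safer choice.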
Either call gives the following result:
ID CATEGORY1 CATEGORY2
1 11 E AA
2 22 B AA
3 33 C BB
4 44 A AA
5 55 B AA
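If the category columns should remain plain character vectors, a different technique from the factor approach above is to translate categories to numeric ranks with `match()` before grouping and translate back afterwards. This is a sketch under the assumption that every rank is unique within its rank table; the intermediate column names `R1` and `R2` are hypothetical, and the approach has not been benchmarked against the factor version.

```r
suppressMessages(library(dplyr))

# Sample data and rank tables from the question
DUMMY_DATA <- data.frame(ID = c(11,22,22,33,33,33,44,44,55,55,55,55),
                         CATEGORY1 = c("E","B","C","C","C","D","A","A","B","C","E","B"),
                         CATEGORY2 = c("AA","AA","BB","CC","DD","BB","AA","EE","AA","CC","BB","EE"),
                         stringsAsFactors = FALSE)
Category_Rank1 <- data.frame(VAR = c("A","B","C","D","E"),
                             RANK = c(1,2,3,4,5), stringsAsFactors = FALSE)
Category_Rank2 <- data.frame(VAR = c("AA","BB","CC","DD","EE"),
                             RANK = c(1,2,3,4,5), stringsAsFactors = FALSE)

NEW_DATA <- DUMMY_DATA %>%
  # Vectorised lookup over the whole column: category -> rank
  mutate(R1 = Category_Rank1$RANK[match(CATEGORY1, Category_Rank1$VAR)],
         R2 = Category_Rank2$RANK[match(CATEGORY2, Category_Rank2$VAR)]) %>%
  group_by(ID) %>%
  # Best (lowest) rank per ID
  summarise(R1 = min(R1), R2 = min(R2)) %>%
  # Reverse lookup: rank -> category (assumes ranks are unique per table)
  mutate(CATEGORY1_CLEAN = Category_Rank1$VAR[match(R1, Category_Rank1$RANK)],
         CATEGORY2_CLEAN = Category_Rank2$VAR[match(R2, Category_Rank2$RANK)]) %>%
  select(ID, CATEGORY1_CLEAN, CATEGORY2_CLEAN)
```

The grouped step only computes `min()` on numeric columns, so no data frame is filtered per group as in the original hierarchyTransform.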