R: partially summing two data frames
I have two data frames. For some rows of df1 there is a matching row in df2. Some columns of df1 should now be updated so that they contain the sum of their own value and the corresponding value in df2.
In the example below, the columns "count1" and "count2" should be summed, but not the column "type".
df1 <- data.frame(id = c("one_a", "two_a", "three_a", "four_a"), type = c(8,7,6,5), count1 = c(1,2,1,NA), count2 = c(NA,0,1,0), id_df2 = c("one", "two", "three", "four"))
df2 <- data.frame(id = c("one", "two", "four"), type = c(8,7,5), count1 = c(0,1,1), count2 = c(0,0,1))
result <- data.frame(id = c("one_a", "two_a", "three_a", "four_a"), type = c(8,7,6,5), count1 = c(1,3,1,1), count2 = c(0,0,1,1))
> df1
       id type count1 count2 id_df2
1   one_a    8      1     NA    one
2   two_a    7      2      0    two
3 three_a    6      1      1  three
4  four_a    5     NA      0   four
> df2
    id type count1 count2
1  one    8      0      0
2  two    7      1      0
3 four    5      1      1
> result
       id type count1 count2
1   one_a    8      1      0
2   two_a    7      3      0
3 three_a    6      1      1
4  four_a    5      1      1
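There are similar questions, and I tried to find a solution by splitting the data frames and then merging them. I am just wondering whether there is a more elegant way to do this. My original data set has about 300 columns, so I am looking for a scalable solution.
Thanks in advance
chuckmorris
You can do: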
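library(dplyr)

df1 %>%
  select(-id_df2) %>%
  bind_rows(df2) %>%
  # strip the "_a" suffix so the ids of df1 and df2 match
  mutate(id = gsub("_.*", "", id)) %>%
  # treat missing counts as 0
  replace(., is.na(.), 0) %>%
  group_by(id, type) %>%
  # funs() is deprecated in recent dplyr; passing sum directly is equivalent
  summarise_at(vars(contains("count")), sum)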
The output is:
# A tibble: 4 x 4
# Groups:   id [?]
  id     type count1 count2
  <chr> <dbl>  <dbl>  <dbl>
1 four      5      1      1
2 one       8      1      0
3 three     6      1      1
4 two       7      3      0
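If you are interested in keeping the suffix part of id, another approach is to use a join, convert to long format, and then spread back to wide format, like: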
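library(tidyverse)

df1 %>%
  left_join(df2, by = c("id_df2" = "id")) %>%
  # long format: one row per id and column
  gather(var, val, -id) %>%
  # drop the ".x"/".y" suffixes added by the join
  mutate(var = gsub("\\..*", "", var)) %>%
  # keep equal values (such as type) only once
  distinct(id, var, val) %>%
  filter(var != "id_df2") %>%
  group_by(id, var) %>%
  summarise(val = sum(as.numeric(val), na.rm = TRUE)) %>%
  # back to wide format
  spread(var, val)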
This gives:
# A tibble: 4 x 4
# Groups:   id [4]
  id      count1 count2  type
  <fct>    <dbl>  <dbl> <dbl>
1 four_a       1      1     5
2 one_a        1      0     8
3 three_a      1      1     6
4 two_a        3      0     7
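One caveat of the join-and-gather approach, shown as a minimal base R sketch: the distinct() step is what keeps type from being counted twice, but it also drops a count whenever df1 and df2 happen to hold the same value. The data frame `long` below is made up for illustration, and unique() stands in for distinct().

```r
# Hypothetical long-format rows: count1 is 1 in df1 and also 1 in df2
long <- data.frame(id  = c("x_a", "x_a"),
                   var = c("count1", "count1"),
                   val = c(1, 1))

# Sum over both rows
total_all <- sum(long$val)

# Sum after dropping the duplicated row, as distinct() would
total_dedup <- sum(unique(long)$val)

total_all    # 2
total_dedup  # 1
```

In the example data from the question no such collision occurs, so the output above is unaffected.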
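If the _a ending has a special purpose, e.g. there are also groups with _b, _c, etc. (in which case the approaches above would fail), the following is slightly less elegant but still works: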
result_2 <- df2 %>%
  # give df2 the same ids as df1
  mutate(id = paste0(id, "_a")) %>%
  bind_rows(df1) %>%
  select(-id_df2) %>%
  replace(., is.na(.), 0) %>%
  group_by(id) %>%
  # sum the counts; type is not summed
  summarise(count1 = sum(count1), count2 = sum(count2), type = max(type)) %>%
  mutate(id_df2 = as.factor(id)) %>%
  select(c(id_df2, type, count1, count2), -id)
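Can I do this using the "id_df2" column? In my original data set, some "type" columns contain different values in df1 and df2, and the "id" field originally looks like "thr_a_ee".
A possible approach has been added at the end of the post.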
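Another way to use id_df2 directly, as a minimal base R sketch. It assumes that the columns to sum can be recognised by the prefix "count" and that NA should be treated as 0, as in the expected result from the question. The matching df2 row is looked up through id_df2 with match(), so id is never modified and type is taken from df1 unchanged. The names count_cols, idx, na_to_zero and result_3 are illustrative only.

```r
df1 <- data.frame(id = c("one_a", "two_a", "three_a", "four_a"),
                  type = c(8, 7, 6, 5),
                  count1 = c(1, 2, 1, NA),
                  count2 = c(NA, 0, 1, 0),
                  id_df2 = c("one", "two", "three", "four"))
df2 <- data.frame(id = c("one", "two", "four"),
                  type = c(8, 7, 5),
                  count1 = c(0, 1, 1),
                  count2 = c(0, 0, 1))

# Assumption: every column to be summed starts with "count"
count_cols <- grep("^count", names(df1), value = TRUE)

# Row of df2 that matches each row of df1 (NA where there is no match)
idx <- match(df1$id_df2, df2$id)

# Treat NA as 0, as the expected result in the question does
na_to_zero <- function(x) ifelse(is.na(x), 0, x)

# Start from df1 without the lookup column, then overwrite the count columns
result_3 <- df1[, setdiff(names(df1), "id_df2")]
for (col in count_cols) {
  result_3[[col]] <- na_to_zero(df1[[col]]) + na_to_zero(df2[[col]][idx])
}
result_3
```

Because the columns are selected by pattern rather than listed by name, the same loop should carry over to a data set with many more count columns, provided they share a recognisable naming scheme.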