R: How do I summarize data that is split across many columns?
I have a dataset containing the answers to a "select all that apply" question, with each possible answer in its own column. So, say our question is "What color shirts are acceptable to you?" It looks like this:
id  Q3_Red  Q3_Blue  Q3_Green  Q3_Purple
 9
 8                   Green     Purple
 7                   Green
 6  Red
 5                             Purple
 4          Blue
 3          Blue               Purple
 2  Red     Blue     Green
 1  Red                        Purple
10  Red                        Purple
You can make it into an actual data frame with:
tmp <- data.frame(
  id        = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red    = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  Q3_Blue   = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  Q3_Green  = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  Q3_Purple = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple")
)
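I can count each one with something like tmp %>% count(Q3_Red) and organize the results into their own data frames, but it seems there must be a way to do this in one go with a reshaping function. I have looked at gather and spread, but I can't figure out how to combine tidyr with count.

dplyr and tidyr are your friends: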
library(dplyr)
library(tidyr)
tmp %>%
  pivot_longer(cols = -id, values_to = "response") %>% # pivot all columns but id
  filter(response != "") %>%                           # remove blanks
  group_by(response) %>%                               # group by response
  summarize(count = n())                               # count rows per response
# A tibble: 4 x 2
#   response count
#   <chr>    <int>
# 1 Blue         3
# 2 Green        3
# 3 Purple       5
# 4 Red          4
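The group_by() plus summarize(count = n()) pair in the pipeline above can also be written as a single dplyr count() call. The following is a minimal sketch under the assumption that dplyr and tidyr are installed; it rebuilds the sample data so it runs on its own, and the name result is only illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt so this sketch is self-contained
tmp <- data.frame(
  id        = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red    = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  Q3_Blue   = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  Q3_Green  = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  Q3_Purple = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple"),
  stringsAsFactors = FALSE
)

result <- tmp %>%
  pivot_longer(cols = -id, values_to = "response") %>% # one row per id/column pair
  filter(response != "") %>%                           # drop unselected answers
  count(response, name = "count")                      # group and count in one step

print(result)
```

count() groups by the given column, counts the rows, and ungroups again, so the result should match the longer pipeline.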
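In base R, we can use summary(). Note that it only tabulates the values when the columns are factors, which is the data.frame() default before R 4.0.0 or when stringsAsFactors = TRUE is set:
summary(tmp[-1])
#  Q3_Red   Q3_Blue   Q3_Green   Q3_Purple
#     :6        :7         :7          :5
#  Red:4    Blue:3    Green:3    Purple:5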
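A base R sketch that does not depend on the columns being factors: stack the answer columns with unlist(), drop the blanks, and tabulate what remains with table(). The names answers and counts are illustrative, and the sample data is rebuilt here so the block runs on its own.

```r
# Sample data from the question, with plain character columns
tmp <- data.frame(
  id        = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red    = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  Q3_Blue   = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  Q3_Green  = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  Q3_Purple = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple"),
  stringsAsFactors = FALSE
)

# Stack every answer column into one character vector
answers <- unlist(tmp[-1], use.names = FALSE)

# Drop the blanks and tabulate the remaining values
counts <- table(answers[answers != ""])

print(counts)
```

Because table() counts values rather than columns, this relies on each column holding only its own color or a blank, as in the sample data.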
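You can try this approach. First, compute the frequency of each color column:
tmp2 <- colSums(tmp[, 2:5] != "", na.rm = TRUE)
Then convert it to a data frame, move the row names into a column, and finally use a regex to strip the unneeded prefix to get the expected result:
library(dplyr)   # %>%, mutate, rename
library(stringr) # str_replace_all, regex
tmp2 <- data.frame(tmp2) %>%
  tibble::rownames_to_column(var = "Colors") %>%
  mutate(Colors = str_replace_all(Colors, regex("(^.*_)"), "")) %>% # drop the "Q3_" prefix
  rename(freq = tmp2)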
# Colors freq
# 1 Red 4
# 2 Blue 3
# 3 Green 3
# 4 Purple 5
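You can use na_if from dplyr to convert the blanks to NA, then pivot_longer from tidyr to stack all the columns that start with Q3.
Note: na_if is used so that values_drop_na = TRUE in pivot_longer works.

pivot_longer is the new gather from the tidyr package.
@BenToh Thanks for reminding me that pivot_longer comes from tidyr. I updated my answer to reference both packages.
As a side note, group_by(response) %>% summarize(count = n()) can be simplified to count(response, name = "count"), which does not need group_by.
@DarrenTsai Thanks. I was actually trying to work out whether there is a reason to use group_by and summarize(count = n()) rather than just colSums(tmp[, -1] != ""), but apparently the more formal tidy way is what was being asked for.
@BenToh Thanks. I definitely want to use this project to get the hang of this.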
library(dplyr)
library(tidyr)
tmp %>%
  mutate(across(-id, ~ na_if(.x, ""))) %>%      # turn blanks into NA
  pivot_longer(-id, values_drop_na = TRUE) %>%  # stack columns, dropping NA rows
count(value)
# A tibble: 4 x 2
#   value      n
#   <chr>  <int>
# 1 Blue       3
# 2 Green      3
# 3 Purple     5
# 4 Red        4
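To see why the blanks have to become NA first, the sketch below compares the row counts with and without na_if. It assumes dplyr and tidyr are installed; the names long_raw and long_na are illustrative, and the sample data is rebuilt so the block is self-contained.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, with plain character columns
tmp <- data.frame(
  id        = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red    = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  Q3_Blue   = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  Q3_Green  = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  Q3_Purple = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple"),
  stringsAsFactors = FALSE
)

# Without na_if: "" is an ordinary string, so values_drop_na removes nothing
long_raw <- pivot_longer(tmp, -id, values_drop_na = TRUE)

# With na_if: blanks become NA and are dropped while pivoting
long_na <- tmp %>%
  mutate(across(-id, ~ na_if(.x, ""))) %>%
  pivot_longer(-id, values_drop_na = TRUE)

print(nrow(long_raw)) # 10 ids x 4 columns = 40 rows, blanks included
print(nrow(long_na))  # 15 rows, one per selected answer
```

An equivalent route is to skip na_if and filter out the blanks after pivoting, as the first answer does.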
# Count the non-blank cells per column, then turn the named vector into a tibble
tibble::enframe(colSums(tmp[-1] != ""))
# A tibble: 4 x 2
#   name      value
#   <chr>     <dbl>
# 1 Q3_Red        4
# 2 Q3_Blue       3
# 3 Q3_Green      3
# 4 Q3_Purple     5