How do I summarize data that is split across many columns in R?

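I have a dataset containing the answers to a "select all that apply" question, with each possible answer in a separate column. So, suppose our question is "What color shirts are acceptable to you?" The data look like this: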

id    Q3_Red Q3_Blue Q3_Green    Q3_Purple
9                    
8                    Green       Purple
7                    Green     
6     Red               
5                                Purple
4            Blue          
3            Blue                Purple
2     Red    Blue    Green     
1     Red                        Purple
10    Red                        Purple
You can turn it into an actual data frame with:

tmp <- data.frame(
  "id" = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  "Q3_Red" = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  "Q3_Blue" = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  "Q3_Green" = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  "Q3_Purple" = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple")
)
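I can count each column with something like tmp %>% count(Q3_Red) and organize the results into their own data frame, but it seems there must be a way to do this in one go with a reshaping function. I have looked at gather() and spread(), but I can't figure out how to combine tidyr with count().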

dplyr and tidyr are your friends:

library(dplyr)
library(tidyr)
tmp %>% 
  pivot_longer(cols = -id, values_to = "response") %>%   # pivot all columns but id
  filter(response != "") %>%        # remove blanks
  group_by(response) %>%            # group by response
  summarize(count = n())            # summarize and count
# A tibble: 4 x 2
  response count
  <chr>    <int>
1 Blue         3
2 Green        3
3 Purple       5
4 Red          4
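
As a comment on this answer points out, group_by() followed by summarize(count = n()) can be collapsed into a single count() call. A minimal sketch of that shorter pipeline, rebuilding the question's data so it runs on its own (the name counts is introduced here for illustration only):

```r
library(dplyr)
library(tidyr)

# Same example data as in the question (ids written without leading zeros)
tmp <- data.frame(
  id        = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red    = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  Q3_Blue   = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  Q3_Green  = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  Q3_Purple = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple")
)

counts <- tmp %>%
  pivot_longer(cols = -id, values_to = "response") %>%  # stack every column except id
  filter(response != "") %>%                            # drop the blank (unselected) cells
  count(response, name = "count")                       # replaces group_by() + summarize()

counts
```

count() groups, tallies, and ungroups in one step, so the result is an ordinary tibble rather than a grouped one.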
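In base R, we can use summary(). The per-level counts shown below require factor columns, which data.frame() produced by default before R 4.0.0: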

summary(tmp[-1])
# Q3_Red  Q3_Blue   Q3_Green  Q3_Purple
#     :6       :7        :7         :5  
#  Red:4   Blue:3   Green:3   Purple:5  
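
On R 4.0.0 and later, string columns stay character by default, and summary() then reports only length, class, and mode for them. A minimal sketch of one way around that, converting each answer column to a factor first (tmp_fct is a name introduced here for illustration):

```r
# Same example data as in the question; strings stay character on R >= 4.0.0
tmp <- data.frame(
  id        = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red    = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  Q3_Blue   = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  Q3_Green  = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  Q3_Purple = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple")
)

# Convert every column except id to a factor so summary() tabulates the levels
tmp_fct <- as.data.frame(lapply(tmp[-1], factor))

summary(tmp_fct)
```

Passing stringsAsFactors = TRUE to data.frame() when building tmp should have the same effect.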

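You can try this approach.

First, count the frequency of each color column:

tmp2 <- colSums(tmp[, 2:5] != "", na.rm = TRUE)
Then convert the result to a data frame, move the row names into a column, and finally use a regex to strip the unneeded prefix to get the expected result: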

library(dplyr)    # %>%, mutate(), rename()
library(stringr)  # str_replace_all(), regex()
tmp2 <- data.frame(tmp2) %>% 
  tibble::rownames_to_column(var = "Colors") %>% 
  mutate(Colors = str_replace_all(Colors, regex("(^.*_)"), "")) %>% 
  rename(freq = tmp2)
#   Colors freq
# 1    Red    4
# 2   Blue    3
# 3  Green    3
# 4 Purple    5
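
The same three steps (count, move the names into a column, strip the prefix) can also be sketched in base R alone, assuming no extra packages are wanted. The names freq and res are introduced here for illustration:

```r
# Same example data as in the question
tmp <- data.frame(
  id        = c(9, 8, 7, 6, 5, 4, 3, 2, 1, 10),
  Q3_Red    = c("", "", "", "Red", "", "", "", "Red", "Red", "Red"),
  Q3_Blue   = c("", "", "", "", "", "Blue", "Blue", "Blue", "", ""),
  Q3_Green  = c("", "Green", "Green", "", "", "", "", "Green", "", ""),
  Q3_Purple = c("", "Purple", "", "", "Purple", "", "Purple", "", "Purple", "Purple")
)

# Step 1: count the non-empty cells in each answer column (named numeric vector)
freq <- colSums(tmp[, 2:5] != "", na.rm = TRUE)

# Steps 2 and 3: put the column names in their own column, dropping everything
# up to and including the last underscore
res <- data.frame(
  Colors = sub("^.*_", "", names(freq)),
  freq   = unname(freq)
)

res
```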
You can use na_if() from dplyr to convert the empty strings to NA, then pivot_longer() from tidyr to stack all the columns that start with Q3.

Note: na_if() is used so that values_drop_na = TRUE in pivot_longer() works.

library(dplyr)
library(tidyr)

tmp %>% 
  mutate(across(-id, ~ na_if(.x, ""))) %>%      # empty strings become NA
  pivot_longer(-id, values_drop_na = TRUE) %>%  # stack the columns, dropping NA rows
  count(value)

# # A tibble: 4 x 2
#   value      n
#   <chr>  <int>
# 1 Blue       3
# 2 Green      3
# 3 Purple     5
# 4 Red        4

Alternatively, in one line:

tibble::enframe(colSums(tmp[-1] != ""))

# # A tibble: 4 x 2
#   name      value
#   <chr>     <dbl>
# 1 Q3_Red        4
# 2 Q3_Blue       3
# 3 Q3_Green      3
# 4 Q3_Purple     5

pivot_longer is the new gather in the tidyr package.

@BenToh Thanks for reminding me that pivot_longer comes from tidyr. I updated my answer to include references to both packages.

As a supplement, group_by(response) %>% summarize(count = n()) can be simplified to count(response, name = "count"), which does not need group_by.

@DarrenTsai Thanks. I was actually trying to figure out whether there is a reason to use group_by and summarize(count = n()) rather than just colSums(tmp[, -1] != ""), but apparently there is a more formal tidy way.

@BenToh Thanks, I definitely want to use this project to get the hang of these tools.