Summarising a logical data frame with dplyr

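I'm trying to summarise a data frame by two variables. Essentially, I want to break down variable 1 by variable 2 so that I can plot the result in a 100% stacked bar chart.

I have multiple columns of type logical that fall into two main categories, and those two categories will be used to create the breakdown.

I tried using gather() from tidyr to convert the data frame to long form, but the output is not what I expected.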

library(dplyr)

topics_by_variable <- function (dataset, variable_1, variable_2) {

  #select variables columns
  variable_1_columns <- dataset[, data.table::`%like%`(names(dataset), variable_1)]
  variable_2_columns <- dataset[, data.table::`%like%`(names(dataset), variable_2)]
  #names of the variable 1 (topic) columns, used when summarising below
  topic_names <- names(variable_1_columns)
  #create new dataframe including only relevant columns
  df <- cbind(variable_1_columns, variable_2_columns)
  #transform df to long form
  new_df <- tidyr::gather(df, "variable_2", "count", dplyr::all_of(names(variable_2_columns)), factor_key=FALSE)

  #count topics
  topic_count <- function (x) {
                  t <- sum(x == TRUE)
  }
  #group by variable 2 and count
  new_df <- new_df %>%
            dplyr::group_by(variable_2) %>%
            dplyr::summarise_at(topic_names, .funs = topic_count)

  #transform new_df to longform
  final_df <- tidyr::gather(new_df, "topic", "volume", dplyr::all_of(topic_names), factor_key=FALSE)
  data.frame(final_df)
}

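The desired output would look like the table below. However, when I use gather(), the volume figure is the total number of rows and is repeated across all brands.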

variable_2       topic                volume
   <chr>            <chr>              <int>
 1 brands_co     topic_su               10
 2 brands_ne     topic_su               17
 3 brands_seg    topic_su               10 
 4 brands_sen    topic_su               18
 5 brands_st     topic_su                0
 6 brands_ta     topic_su                1
 7 brands_tc     topic_su                0
 8 brands_co     topic_so               22
 9 brands_ne     topic_so               17
10 brands_seg    topic_so               11 
11 brands_sen    topic_so               23
12 brands_st     topic_so                0
13 brands_ta     topic_so                0
14 brands_tc     topic_so                0

Assuming your dataset is dt, you can do the following:

library(dplyr)

expand.grid(brand = names(dt)[grepl("brands", names(dt))],         
            topic = names(dt)[grepl("topic", names(dt))],
            stringsAsFactors = FALSE) %>%
  rowwise() %>%
  mutate(volume = sum(dt[brand] == "TRUE" & dt[topic] == "TRUE")) %>%
  ungroup()

# # A tibble: 42 x 3
#   brand      topic    volume
#   <chr>      <chr>     <int>
# 1 brands_ne  topic_su     17
# 2 brands_st  topic_su      0
# 3 brands_co  topic_su     10
# 4 brands_seg topic_su     10
# 5 brands_sen topic_su     18
# 6 brands_ta  topic_su      1
# 7 brands_tc  topic_su      0
# 8 brands_ne  topic_so     17
# 9 brands_st  topic_so      0
#10 brands_co  topic_so     22
# # ... with 32 more rows
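
As a cross-check on the counting logic above, the same co-occurrence counts can be sketched in base R with a matrix cross-product. This is only an illustration: the toy dt and its column names (brands_a, brands_b, topic_x, topic_y) are made up, and the sketch assumes the columns really are logical rather than character.

```r
# Hypothetical toy data: three rows, two brand columns, two topic columns
dt <- data.frame(
  brands_a = c(TRUE, TRUE, FALSE),
  brands_b = c(FALSE, TRUE, TRUE),
  topic_x  = c(TRUE, FALSE, TRUE),
  topic_y  = c(TRUE, TRUE, FALSE)
)

brand_cols <- grep("^brands", names(dt), value = TRUE)
topic_cols <- grep("^topic", names(dt), value = TRUE)

# as.matrix() on all-logical columns gives a logical matrix; adding 0 makes it 0/1
brands <- as.matrix(dt[brand_cols]) + 0
topics <- as.matrix(dt[topic_cols]) + 0

# crossprod(brands, topics) is t(brands) %*% topics:
# cell [i, j] counts the rows where brand i and topic j are both TRUE
counts <- crossprod(brands, topics)

# Long form, one row per brand/topic pair
volume_long <- as.data.frame(as.table(counts), responseName = "volume",
                             stringsAsFactors = FALSE)
names(volume_long)[1:2] <- c("brand", "topic")
volume_long
```

Each cell of counts should match the volume that the rowwise() version reports for the same brand/topic pair, so the two approaches can be compared on a small sample before running either on the full data.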

Another tidyverse solution:

library(tidyverse)

## data: df has one TRUE/FALSE column for each topic_* and brands_* variable

mutate_all(df, as.logical) %>%
  gather(key = "topic", value = "topic_value", starts_with("topic")) %>%
  gather(key = "variable_2", value = "variable_2_value", -starts_with("topic")) %>%
  group_by(topic, variable_2) %>%
  summarise(volume = sum(topic_value & variable_2_value))
#> # A tibble: 42 x 3
#> # Groups:   topic [6]
#>    topic    variable_2 volume
#>    <chr>    <chr>       <int>
#>  1 topic_cl brands_co      22
#>  2 topic_cl brands_ne      16
#>  3 topic_cl brands_seg     15
#>  4 topic_cl brands_sen     15
#>  5 topic_cl brands_st       0
#>  6 topic_cl brands_ta       1
#>  7 topic_cl brands_tc       0
#>  8 topic_in brands_co      23
#>  9 topic_in brands_ne      16
#> 10 topic_in brands_seg     15
#> # … with 32 more rows

Created on 2019-06-24 by the reprex package (v0.3.0)
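
In current tidyr, gather() is superseded by pivot_longer(), so the same double reshape could be sketched as follows. The toy df and its column names are hypothetical stand-ins for the real data, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the real df
df <- tibble(
  brands_a = c(TRUE, TRUE, FALSE),
  brands_b = c(FALSE, TRUE, TRUE),
  topic_x  = c(TRUE, FALSE, TRUE),
  topic_y  = c(TRUE, TRUE, FALSE)
)

result <- df %>%
  # one row per (original row, topic column)
  pivot_longer(starts_with("topic"),
               names_to = "topic", values_to = "topic_value") %>%
  # then one row per (original row, topic column, brand column)
  pivot_longer(starts_with("brands"),
               names_to = "variable_2", values_to = "variable_2_value") %>%
  group_by(topic, variable_2) %>%
  # count the rows where both flags are TRUE
  summarise(volume = sum(topic_value & variable_2_value), .groups = "drop")

result
```

The key point is the same as in the gather() version: both sets of columns have to be in long form before grouping, so that the sum is taken over rows where the topic flag and the brand flag are TRUE together.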

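How do you get the values? Can you break down the 26 between brands_co and topic_su?

It is when you have TRUE for both the brand and the topic. For example, the 10 in the first row is the number of times topic_su and brands_co are both TRUE in the data.

You say the data frame is logical, but the data you show is character data. Another thing: your topic_count, although only one line long, contains two redundant operations that should be removed. First, == TRUE is a no-op on properly typed data and can be dropped. Second, the assignment to t has no effect outside the function. The function should therefore simply be written as topic_count <- function(x) sum(x)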
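
A quick way to check the point about topic_count is to compare the two forms on a small vector. This is a minimal sketch: the names topic_count_original and topic_count_simple are made up for the comparison.

```r
# The function as written in the question: the value of the assignment is
# still returned (invisibly), so it works, but t is never used
topic_count_original <- function (x) {
  t <- sum(x == TRUE)
}

# The simplified form suggested in the comment
topic_count_simple <- function(x) sum(x)

x <- c(TRUE, FALSE, TRUE, TRUE)
identical(topic_count_original(x), topic_count_simple(x))  # TRUE

# == TRUE only matters when the column is not logical, e.g. character data:
sum(c("TRUE", "FALSE") == TRUE)  # 1, because TRUE is coerced to "TRUE"
```

If the columns are stored as character, converting them once with as.logical() is likely cleaner than comparing against TRUE inside every summary function.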
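# alternative to rowwise(): a vectorised helper function
library(dplyr)

# counts the rows of dt where columns x and y are both "TRUE"
GetVolume <- function(x, y) sum(dt[x] == "TRUE" & dt[y] == "TRUE")
# Vectorize() lets the helper accept whole vectors of column names
GetVolume <- Vectorize(GetVolume)

expand.grid(brand = names(dt)[grepl("brands", names(dt))],
            topic = names(dt)[grepl("topic", names(dt))],
            stringsAsFactors = FALSE) %>%
  mutate(volume = GetVolume(brand, topic))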
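
Since the stated goal is a 100% stacked bar chart, a long brand/topic/volume table like the one above could be plotted along these lines. This is a sketch assuming ggplot2 is installed: volume_df and its values are hypothetical toy data, not results from the real dataset.

```r
library(ggplot2)

# Hypothetical long-form summary: one row per brand/topic pair
volume_df <- data.frame(
  brand  = rep(c("brands_a", "brands_b"), times = 2),
  topic  = rep(c("topic_x", "topic_y"), each = 2),
  volume = c(1, 1, 2, 1)
)

# position = "fill" rescales each bar to sum to 1, giving a 100% stacked chart
p <- ggplot(volume_df, aes(x = brand, y = volume, fill = topic)) +
  geom_col(position = "fill") +
  labs(y = "share of volume")

p
```

Brands whose volume is 0 for every topic have no defined share, so it may be worth filtering them out before plotting.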