Expand all unique author combinations by group in tidyr [r, dplyr, combinations, tidyr]


I have a data frame with message-board information. The data looks like this:

    require(dplyr)
    require(tidyr)
    df <- data.frame(author = c(2,4,8,16,32,64,128,256,512,1024),
             topic = c(101,101,101,101,301,301,501,501,501,501),
             time = c("2014-08-16 20:20:11", "2014-08-16 21:10:00", "2014-08-17 06:30:10",
                        "2014-08-17 10:08:32", "2014-08-20 22:23:01","2014-08-20 23:03:03",
                        "2014-08-25 17:05:01", "2014-08-25 19:15:10",  "2014-08-25 20:07:11",
                        "2014-08-25 23:59:59"))

    test <- df %>% group_by(topic) %>% expand(nesting(author), author)
    print(test, n = 20)
    # A tibble: 36 x 3
    # Groups:   topic [3]
       topic author author1
     1  101.     2.      2.
     2  101.     2.      4.
     3  101.     2.      8.
     4  101.     2.     16.
     5  101.     4.      2.
     6  101.     4.      4.
     7  101.     4.      8.
     8  101.     4.     16.
     9  101.     8.      2.
    10  101.     8.      4.
    11  101.     8.      8.
    12  101.     8.     16.
    13  101.    16.      2.
    14  101.    16.      4.
    15  101.    16.      8.
    16  101.    16.     16.
    17  301.    32.     32.
    18  301.    32.     64.
    19  301.    64.     32.
    20  301.    64.     64.

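I need help with two things:

How do I remove the swapped combinations (for example rows 2 and 5)?

For each combination I would like to have the following attributes:

`start` = the earliest post of the topic (using mutate, `min = min(time)`)

the duration of the topic (time of the topic's last post minus time of its first post, using mutate, `duration = max(time) - min(time)`)

the count of posts (using summarize)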

I partially solved my problem with:

    test <- df %>% group_by(topic) %>%
                mutate(posts=n(), start=min(time), duration=(max(time)-min(time))/3600) %>%
                expand(nesting(author), author, posts, start, duration) %>% filter(author != author1)
    test
    # A tibble: 36 x 6
    # Groups:   topic [3]
       topic author author1 posts start               duration
       <dbl>  <dbl>   <dbl> <int> <dttm>                 <dbl>
     2  101.     2.      4.     4 2014-08-16 20:20:11     13.8
     3  101.     2.      8.     4 2014-08-16 20:20:11     13.8
     4  101.     2.     16.     4 2014-08-16 20:20:11     13.8
     5  101.     4.      2.     4 2014-08-16 20:20:11     13.8
     7  101.     4.      8.     4 2014-08-16 20:20:11     13.8
     8  101.     4.     16.     4 2014-08-16 20:20:11     13.8
     9  101.     8.      2.     4 2014-08-16 20:20:11     13.8
    10  101.     8.      4.     4 2014-08-16 20:20:11     13.8
    # ... with 26 more rows

I still need to figure out how to get rid of the swapped combinations.
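One possible way to drop both the self-pairs and the swapped pairs from an expanded table is to keep only the rows where the first author sorts before the second. The following is a minimal base-R sketch, not the asker's code: `expand.grid()` stands in for the tidyr call, the vector `authors` is a hypothetical single topic, and it assumes author IDs can be compared with `<`.

```r
# Hypothetical authors of a single topic (mirrors topic 101 in the question).
authors <- c(2, 4, 8, 16)

# Full cross product: 4 * 4 = 16 ordered pairs, including (2, 2) and both
# (2, 4) and (4, 2).
pairs <- expand.grid(author = authors, author1 = authors)

# Keeping only rows with author < author1 removes self-pairs and leaves
# exactly one of each swapped pair.
unique_pairs <- pairs[pairs$author < pairs$author1, ]

print(unique_pairs)
```

In a dplyr pipeline the same idea would presumably be written as `filter(author < author1)` in place of `filter(author != author1)`.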

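You don't necessarily want to use `tidyr::expand()` (which appears to cross every value with every other value) to try to generate combinations. You seem to be getting all the permutations: in particular the unwanted self-self combinations, and also the combinations with author1 and author2 swapped. Similarly, the built-in `base::expand.grid()` produces permutations, not combinations. Use the built-in `combn()` (it lives in `utils::combn()`). There are many existing questions on `dplyr` `groupby` `combn`; a simple search will find them.

I've been trying to post working code, but I don't know `tidyr` that well, and everything I tried either doesn't work or is a syntax error. `expand` expects a data frame and then refers to its variables, so `%>% expand(author, author)` again gives all permutations, not just combinations. `%>% complete(...)` seems to be of no use. I think you need the tidyr syntax for calling `combn` on `author` at that grouping level. That will probably require a nested sub-call for each grouping level, whatever tidyr's equivalent of `do.call` is.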
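To illustrate calling `combn()` once per grouping level, here is a minimal sketch in base R only, so it does not depend on any particular tidyr or dplyr version. The toy data frame and the names `pairs_for_group` and `pairs_by_topic` are assumptions made for this example.

```r
# Hypothetical toy data with the same author/topic columns as the question.
df <- data.frame(
  author = c(2, 4, 8, 16, 32, 64),
  topic  = c(101, 101, 101, 101, 301, 301)
)

# Build the unordered author pairs for one topic's rows.
pairs_for_group <- function(g) {
  p <- t(combn(g$author, 2))   # one row per unordered pair
  data.frame(topic = g$topic[1], author1 = p[, 1], author2 = p[, 2])
}

# Split by topic, compute the pairs per group, then stack the pieces.
pairs_by_topic <- do.call(rbind, lapply(split(df, df$topic), pairs_for_group))

print(pairs_by_topic)
```

Topic 101 has four authors and contributes choose(4, 2) = 6 pairs; topic 301 has two authors and contributes one. Note that `combn(x, 2)` treats a single positive number `x` as `seq_len(x)`, so a topic with only one author would need separate handling.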

Final solution:

    # Per-topic attributes: one row per topic with its start and duration
    time <- df %>% group_by(topic) %>%
      mutate(posts = n(), start = min(time), duration = (max(time) - min(time))/3600) %>%
      distinct(topic, start, duration)
    # Unordered author pairs within each topic
    combo <- df %>% group_by(topic) %>%
      do(data.frame(t(combn(.$author, 2))))
    # Attach the topic attributes to every pair
    edges <- right_join(combo, time)
    edges
    # A tibble: 13 x 5
    # Groups:   topic [?]
       topic    X1    X2 start               duration
       <dbl> <dbl> <dbl> <dttm>              <time>
     1  101.    2.    4. 2014-08-16 20:20:11 13.8058333333333
     2  101.    2.    8. 2014-08-16 20:20:11 13.8058333333333
     3  101.    2.   16. 2014-08-16 20:20:11 13.8058333333333
     4  101.    4.    8. 2014-08-16 20:20:11 13.8058333333333
     5  101.    4.   16. 2014-08-16 20:20:11 13.8058333333333
     6  101.    8.   16. 2014-08-16 20:20:11 13.8058333333333
     7  301.   32.   64. 2014-08-20 22:23:01 0.667222222222222
     8  501.  128.  256. 2014-08-25 17:05:01 6.91611111111111
     9  501.  128.  512. 2014-08-25 17:05:01 6.91611111111111
    10  501.  128. 1024. 2014-08-25 17:05:01 6.91611111111111
    11  501.  256.  512. 2014-08-25 17:05:01 6.91611111111111
    12  501.  256. 1024. 2014-08-25 17:05:01 6.91611111111111
    13  501.  512. 1024. 2014-08-25 17:05:01 6.91611111111111
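One thing to watch when computing the duration: subtracting two `POSIXct` values in base R returns a `difftime` whose units are chosen automatically, so dividing by 3600 only yields hours if the difference happens to be expressed in seconds. The sketch below uses the first and last post of topic 101 and requests the units explicitly; the `tz = "UTC"` setting and the variable names are assumptions made for this example.

```r
# First and last post of topic 101, parsed as date-times (UTC is assumed).
first_post <- as.POSIXct("2014-08-16 20:20:11", tz = "UTC")
last_post  <- as.POSIXct("2014-08-17 10:08:32", tz = "UTC")

# Plain subtraction lets R pick the units (hours for this pair).
auto_diff <- last_post - first_post
print(units(auto_diff))

# Asking for the units explicitly avoids any guesswork.
duration_hours <- as.numeric(difftime(last_post, first_post, units = "hours"))
duration_secs  <- as.numeric(difftime(last_post, first_post, units = "secs"))

print(duration_hours)
print(duration_secs)
```

With explicit units, `duration_hours` can be used directly and no division by 3600 is needed.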

A variant that uses iterpc to build the pairs and igraph to plot the resulting network (it refers to columns `author_id`, `topic_id` and `vendor`, which are not in the sample data):

    # Node list: one row per author
    node <- df %>% distinct(author_id, vendor) %>% rename(id = author_id)

    library(iterpc)
    # Edge list: author pairs within each topic, self-pairs filtered out
    edge <- df %>% group_by(topic_id) %>%
      do(data.frame(getall(iterpc(table(.$author_id), 2, replace = TRUE)))) %>%
      filter(X1 != X2) %>%
      rename(from = X1, to = X2) %>%
      select(to, from, topic_id)

    library(igraph)
    test_net <- graph_from_data_frame(d = edge, directed = FALSE, vertices = node)
    plot(test_net)