R: Grouping many-to-many relationships from a two-column mapping (r, data.table)


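I have a SQL table that maps authors to books. I want to group together authors and books that are linked (books written by the same author, and authors who co-wrote a book) and find out how large these groups are. For example, if J.K. Rowling co-authored a book with Junot Diaz, and Junot Diaz co-authored a book with Zadie Smith, then I want all three authors to be in the same group.

Here is a toy data set (h/t Matthew Dowle) that contains some of the relationships I am talking about: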

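library(data.table)
set.seed(1)
# number of authors (1 to 3) for each of 100 books
authors <- replicate(100,sample(1:3,1))
book_id <- rep(1:100,times=authors)
# draw that many distinct author ids (out of 100) for each book
author_id <- c(lapply(authors,sample,x=1:100,replace=FALSE),recursive=TRUE)
aubk <- data.table(author_id = author_id,book_id = book_id)
aubk[order(book_id,author_id),]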

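Edit: as @Josh O'Brien and @thelatemail pointed out, my question can also be phrased as finding the connected components of a graph given as a two-column list, where each row is an edge and the two columns are the nodes it connects.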
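To make the connected-components framing concrete, here is a minimal base-R sketch using a simple union-find. It is only an illustration under stated assumptions: the `edges` toy data, the `au_`/`bk_` prefixes, and the `find_root` helper are hypothetical names introduced here and do not come from the question.

```r
# Hypothetical toy data: A and B share book 1, B and C share book 2, D is alone.
edges <- data.frame(author_id = c("A", "B", "B", "C", "D"),
                    book_id   = c(1, 1, 2, 2, 3),
                    stringsAsFactors = FALSE)

# Prefix both id spaces so an author id can never collide with a book id.
a <- paste0("au_", edges$author_id)
b <- paste0("bk_", edges$book_id)
nodes  <- unique(c(a, b))
parent <- seq_along(nodes)          # every node starts as its own root

# Follow parent pointers until reaching a node that is its own parent.
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Each row is an edge: merge the components of its two endpoints.
ai <- match(a, nodes)
bi <- match(b, nodes)
for (k in seq_along(ai)) {
  ra <- find_root(ai[k])
  rb <- find_root(bi[k])
  if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
}

# Component root of every node, then keep only the author nodes.
root      <- vapply(seq_along(nodes), find_root, integer(1))
is_author <- startsWith(nodes, "au_")
groups <- data.frame(author_id = sub("^au_", "", nodes[is_author]),
                     group = match(root[is_author], unique(root[is_author])),
                     stringsAsFactors = FALSE)
group_sizes <- table(groups$group)  # number of authors in each group
```

Here A, B and C end up in one group even though A and C never share a book, and D forms a group of its own.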

aubk[,list(author_list = list(sort(author_id))), by = book_id]

will give the list of authors for each book.

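The following creates a unique identifier for each set of authors and then returns, for each unique author set, a data.table with the number of books, a list of the book IDs, a string identifier built from the book IDs, and the number of authors.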

aubk[, list(author_list = list(sort(author_id)), 
            group_id = paste0(sort(author_id), collapse=','), 
            n_authors = .N),by =  book_id][,
        list(n_books = .N, 
             n_authors = unique(n_authors), 
             book_list = list(book_id), 
             book_ids = paste0(book_id, collapse = ', ')) ,by = group_id]

If author order matters, just remove the sort from the definitions of author_list and group_id.

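Edit: note that the above, while useful, does not perform the proper grouping, because it only groups books whose author sets are identical.

Perhaps the following will: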

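# the unique groups of authors by book, as a list of sorted author_id vectors
# (a list column is needed so that each book's authors stay together)
unique_authors <- unique(aubk[, list(author_list = list(sort(author_id))), by = book_id]$author_list)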
# some helper functions
# a filter function that allows arguments to be passed
.Filter <- function (f, x,...) 
{
  ind <- as.logical(sapply(x, f,...))
  x[!is.na(ind) & ind]
}

# any(x in y)?
`%%in%%` <- function(x,table){any(unlist(x) %in% table)}
# function to filter a list and return the unique elements from 
# flattened values
FilterList <- function(.list, table) {
  unique(unlist(.Filter(`%%in%%`, .list, table =table)))
}

# all the authors
all_authors <- unique(unlist(unique_authors))
# with names!
setattr(all_authors, 'names', all_authors)
# get for each author, the authors with whom they have
# collaborated in at least 1 book
lapply(all_authors, FilterList, .list = unique_authors)

Here is my answer to an older question, which Josh O'Brien linked in the comments. This answer uses the igraph library.
# Dummy data that might be easier to interpret to show it worked
# Authors 1,2 and 3,4 should group. author 5 is a group to themselves
aubk <- data.frame(author_id=c(1,2,3,4,5),book_id=c(1,1,2,2,5))

# identify authors with a bit of leading text to prevent clashes 
# with the book ids
aubk$author_id2 <- paste0("au",aubk$author_id)

library(igraph)
# create a graph - graph_from_edgelist() needs matrix input
# (graph.edgelist() and clusters() are the older names of these functions)
au_graph <- graph_from_edgelist(as.matrix(aubk[c("author_id2","book_id")]))
# get the ids of all vertices (authors and books), in vertex order
result <- data.frame(author_id=V(au_graph)$name,stringsAsFactors=FALSE)
# get the corresponding group membership (connected component) of each vertex
result$group <- components(au_graph)$membership

# subset to only the authors data
result <- result[substr(result$author_id,1,2)=="au",]
# make the author_id variable numeric again
result$author_id <- as.numeric(substr(result$author_id,3,nchar(result$author_id)))

> result
  author_id group
1         1     1
3         2     1
4         3     2
6         4     2
7         5     3
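
Since the question also asks how large the groups are, the membership table can be tallied. This is a small base-R sketch, not part of the original answer: `result` is re-typed from the printed output above so the snippet runs on its own, and `group_sizes` and `n_authors` are names chosen here for illustration.

```r
# `result` copied from the printed output above (assumption: same values)
result <- data.frame(author_id = c(1, 2, 3, 4, 5),
                     group     = c(1, 1, 2, 2, 3))

# Count authors per group and list the largest groups first.
group_sizes <- as.data.frame(table(group = result$group),
                             responseName = "n_authors")
group_sizes <- group_sizes[order(-group_sizes$n_authors), ]
```

For the dummy data this gives two groups of two authors and one group with a single author.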

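Converting 500K nodes into an adjacency matrix was too much for my computer's memory, so I could not use igraph. The RBGL package has not been updated for R version 2.15.1, so that was out as well.

After writing a lot of dumb code that did not seem to work, I think the following gets me the right answer: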
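aubk[,grp := author_id]
iterations <- 0
repeat {
    # snapshot of the labels before this pass (copy() because := updates in place)
    grp.old <- copy(aubk$grp)
    # pull every author's rows, then every book's rows, down to their smallest label
    aubk[,grp := min(grp),by=author_id]
    aubk[,grp := min(grp), by=book_id]
    iterations <- iterations + 1
    # stop only when a full pass changes no label; the number of distinct
    # groups can stay constant for a pass while labels are still spreading
    if(identical(grp.old, aubk$grp)) {break}
}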
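
The same min-label propagation idea can be traced by hand on a tiny chain. The sketch below is a dependency-free illustration, assuming base R's `ave()` in place of data.table grouping; the `chain` data and the names `passes` and `group_sizes` are hypothetical. It stops when a full pass changes no label, because on this chain the number of distinct labels is the same after the first and second passes even though book 3 has not been merged yet.

```r
# Hypothetical chain: authors 1 and 5 share book 1, 5 and 2 share book 2,
# 2 and 9 share book 3, so all four authors belong to one group.
chain <- data.frame(author_id = c(1, 5, 5, 2, 2, 9),
                    book_id   = c(1, 1, 2, 2, 3, 3))
chain$grp <- chain$author_id        # every author starts in its own group

passes <- 0
repeat {
  grp_old <- chain$grp
  # pull each author's rows, then each book's rows, down to the smallest label
  chain$grp <- ave(chain$grp, chain$author_id, FUN = min)
  chain$grp <- ave(chain$grp, chain$book_id,   FUN = min)
  passes <- passes + 1
  if (identical(grp_old, chain$grp)) break   # no label changed: converged
}

# Number of distinct authors in each final group.
group_sizes <- tapply(chain$author_id, chain$grp,
                      function(x) length(unique(x)))
```

The labels move one link along the chain per pass, so long chains need more passes than short ones.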

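If this question is about SQL, can you include the database engine? If not, please remove the SQL tag.

I think I will eventually need to implement this in SQL, but I can ask that separately. Removing the tag.

A solution in SQL will be harder than you think and highly dependent on the database engine.

The example data seems to have 6 unique books with 6 unique authors, a 6-row data set. Could you provide some data that contains the links you describe, along with a non-trivial example of what the result should be? E.g. perhaps randomly pick 1 to 3 authors from a list of 50 for each of 100 books, starting with set.seed(1). A data.table or hash-table solution seems like a good option (not SQL).

I would use RBGL::connectedComp() for this, as demonstrated (in a different application) elsewhere.

This is a great start! But IIUC, if A co-authored one book with B, and B co-authored another book with C, this does not return A, B, C as a group (the requirement in paragraph 1), does it?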