R: Find the most common combination between pairs

Tags: r, list, similarity, recommendation-engine

I have a list of events and the guests who attended them. It looks like this, although the real file is much larger:

event       guests
birthday    John Doe
birthday    Jane Doe
birthday    Mark White
wedding     John Doe
wedding     Jane Doe
wedding     Matthew Green
bar mitzvah Janet Black
bar mitzvah John Doe
bar mitzvah Jane Doe
bar mitzvah William Hill
retirement  Janet Black
retirement  Matthew Green
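I want to find the pair of guests who attended the most events together. In this example the answer should be John Doe and Jane Doe, because they attended the same event three times. The output should be a list of such pairs.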

Where do I start?

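From your phrase "attended the most events together", I assume that by similarity you mean the intersection of the events two guests attended.

You can use the following code to find the size of that intersection for every pair of names: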

# All names that we have
nameAll <- unique(df$guests)
# Length of names vector
N <- length(nameAll)

# Function to find intersect between names
getSimilarity <- function(nameA, nameB, type = "intersect") {
    # Subset events for name A
    eventA <- subset(df, guests == nameA)$event
    # Subset events for name B
    eventB <- subset(df, guests == nameB)$event
    # Find intersect length between events
    if (type == "intersect") {
        res <- length(intersect(eventA, eventB))
    }
    # Find Jaccard index between events
    if (type == "JC") {
        res <- length(intersect(eventA, eventB)) / length(union(eventA, eventB))
    }
    # Return result
    return(data.frame(type, value = res, nameA, nameB))
}

# Iterate over all possible combinations
# Using double loop for simpler representation    
result <- list()
for(i in 1:(N-1)) {
    for(j in (i+1):N) {
        result[[length(result) + 1]] <- getSimilarity(nameAll[i], nameAll[j])
    }
}
# Transform result to data.frame and order by similarity 
result <- do.call(rbind, result)
# Show the top pairs (head() returns the first 6 rows by default)
head(result[with(result, order(-value)), ])
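This also gives the expected result: John Doe and Jane Doe come out on top.

The data (df) is the data frame constructed at the start of the next answer.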
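The double loop above can also be written without explicit indices. The following is a minimal sketch, not the answerer's code: it assumes the example data from the question, uses base R's `split()` and `combn()`, and introduces a hypothetical helper `score_pair()` that reports both the intersection size and the Jaccard index for each pair.

```r
# Example data from the question
df <- data.frame(
  event  = c(rep("birthday", 3), rep("wedding", 3),
             rep("bar mitzvah", 4), rep("retirement", 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green"),
  stringsAsFactors = FALSE
)

# Events attended by each guest, as a named list
events_by_guest <- split(df$event, df$guests)

# All unordered pairs of guests, as a list of length-2 character vectors
pairs <- combn(names(events_by_guest), 2, simplify = FALSE)

# Hypothetical helper: intersection size and Jaccard index for one pair
score_pair <- function(p) {
  a <- events_by_guest[[p[1]]]
  b <- events_by_guest[[p[2]]]
  shared <- length(intersect(a, b))
  data.frame(nameA = p[1], nameB = p[2],
             intersect = shared,
             jaccard = shared / length(union(a, b)))
}

# One row per pair, ordered by the number of shared events
result <- do.call(rbind, lapply(pairs, score_pair))
result <- result[order(-result$intersect), ]
head(result, 3)
```

Pre-splitting the events once avoids calling `subset()` on the full data frame for every pair, which may matter when the file is large.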


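Here is a slightly different approach, from a social network / matrix algebra perspective:

Your data describe links between individuals through shared membership of events. This is an affiliation matrix, and we can compute the matrix of connections between individuals $i$ and $j$ as follows: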

# Load as a data frame
df <- data.frame(event = c(rep("birthday", 3), 
                           rep("wedding", 3), 
                           rep("bar mitzvah", 4), 
                           rep("retirement", 2)), 
                  guests = c("John Doe", "Jane Doe", "Mark White", 
                             "John Doe", "Jane Doe", "Matthew Green",   
                              "Janet Black", "John Doe", "Jane Doe",
                              "William Hill", "Janet Black", "Matthew Green"))

# You can represent who attended which event as a matrix
M <- table(df$guests, df$event)
# Now we can compute how many times each individual appeared at an
# event with another with a simple matrix product
admat <- M %*% t(M)
admat


  ##################Jane Doe Janet Black John Doe Mark White Matthew Green William Hill
  #Jane Doe             3           1        3          1             1            1
  #Janet Black          1           2        1          0             1            1
  #John Doe             3           1        3          1             1            1
  #Mark White           1           0        1          1             0            0
  #Matthew Green        1           1        1          0             2            0
  #William Hill         1           1        1          0             0            1
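In matrix notation, the product computed above can be read as follows. This is a sketch of the underlying formula; the symbol $A$ is introduced here for illustration (the code calls it admat), and $M_{ik}$ is assumed to be 1 if guest $i$ attended event $k$ and 0 otherwise.

```latex
A = M M^{\top}, \qquad A_{ij} = \sum_{k} M_{ik} M_{jk}
```

Each term $M_{ik} M_{jk}$ equals 1 only when both guests attended event $k$, so the off-diagonal entry $A_{ij}$ counts the events that $i$ and $j$ attended together, and the diagonal entry $A_{ii}$ counts the events that guest $i$ attended.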
Obviously, you can tidy up the output by renaming the variables of interest, and so on.


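This general approach, by which I mean recognizing that your data describe a social network, may be of interest to you for further analysis (for example, people may be meaningfully connected if they attend parties with many of the same people, even if never with each other). If your dataset is really large, you can make the matrix algebra somewhat faster by using sparse matrices, or by loading the igraph package and using its functions for declaring social networks.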
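As a rough illustration of the sparse-matrix suggestion, here is a minimal sketch that assumes the Matrix package (one of R's recommended packages) and the example data from the question. It builds the guests-by-events matrix with `sparseMatrix()` and uses `tcrossprod()` in place of `M %*% t(M)`. It is one possible setup, not a benchmarked recommendation.

```r
library(Matrix)

# Example data from the question
df <- data.frame(
  event  = c(rep("birthday", 3), rep("wedding", 3),
             rep("bar mitzvah", 4), rep("retirement", 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green")
)

# Integer codes for guests (rows) and events (columns)
g <- factor(df$guests)
e <- factor(df$event)

# Sparse guests-by-events attendance matrix: 1 where a guest attended an event
M <- sparseMatrix(i = as.integer(g), j = as.integer(e), x = 1,
                  dims = c(nlevels(g), nlevels(e)),
                  dimnames = list(levels(g), levels(e)))

# Co-attendance counts, equivalent to M %*% t(M)
admat <- tcrossprod(M)

# Number of events John Doe and Jane Doe attended together
admat["John Doe", "Jane Doe"]
```

Only the nonzero attendance entries are stored, so memory use grows with the number of rows in the file rather than with guests times events.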

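I think the answers here are good. I just want to share some thoughts. If you are working with a large dataset containing many guests or many events, many scenarios are possible. For example, more than two guests may all have attended the most events together, or two groups of guests may have attended different events but the same number of them. If so, finding only the top pair of guests may not be enough.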
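One simple way to handle the tie case is to return every pair that reaches the maximum count instead of only the first one. The following is a minimal base R sketch under that assumption, using the example data from the question; the names `top_pairs`, `guestA`, `guestB`, and `shared` are chosen here for illustration.

```r
# Example data from the question
df <- data.frame(
  event  = c(rep("birthday", 3), rep("wedding", 3),
             rep("bar mitzvah", 4), rep("retirement", 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green")
)

# Guests-by-events attendance counts as a plain matrix
m <- unclass(table(df$guests, df$event))

# Co-attendance counts for every pair of guests
admat <- m %*% t(m)

# Keep each unordered pair once: blank out the diagonal and upper triangle
admat[upper.tri(admat, diag = TRUE)] <- NA

# Every pair that reaches the maximum count, not just the first one found
best <- max(admat, na.rm = TRUE)
idx <- which(admat == best, arr.ind = TRUE)
top_pairs <- data.frame(guestA = rownames(admat)[idx[, "row"]],
                        guestB = colnames(admat)[idx[, "col"]],
                        shared = best)
top_pairs
```

With the example data only one pair reaches the maximum, but with a larger file `top_pairs` would contain one row per tied pair.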

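Here I would like to demonstrate how to use hierarchical clustering to find similar guests or groups.

We can first construct a matrix of 1s and 0s, where 1 indicates attendance and 0 indicates absence.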

library(tidyverse)
library(vegan)

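# Assumption: dat is the same event/guests data frame that is built as df in the answer above
dat <- df

dat_m <- dat %>%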
  mutate(value = 1) %>%
  spread(event, value, fill = 0) %>%
  column_to_rownames(var = "guests") %>%
  as.matrix()

dat_m
#               bar mitzvah birthday retirement wedding
# Jane Doe                1        1          0       1
# Janet Black             1        0          1       0
# John Doe                1        1          0       1
# Mark White              0        1          0       0
# Matthew Green           0        0          1       1
# William Hill            1        0          0       0
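Then we can compute a distance matrix between guests, perform hierarchical clustering, and look at the result.

# Distance between guests based on binary (attended / did not attend) data
dat_dist <- vegdist(dat_m, binary = TRUE)
hc <- hclust(dat_dist)
plot(hc)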

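Again, I think the other answers are more direct and give you the output for this example dataset, but if you are working with a larger dataset, hierarchical clustering may be an option.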
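To get groups out of the tree rather than only a plot, one option is `cutree()`. The sketch below is a base R variant, not the code from this answer: it swaps in `dist(method = "binary")`, which is the Jaccard distance, for vegan's `vegdist()`, and it cuts the tree at height 0 so that only guests with identical attendance end up in the same group. The cut height is an assumption that you would tune for real data.

```r
# Example data from the question
df <- data.frame(
  event  = c(rep("birthday", 3), rep("wedding", 3),
             rep("bar mitzvah", 4), rep("retirement", 2)),
  guests = c("John Doe", "Jane Doe", "Mark White",
             "John Doe", "Jane Doe", "Matthew Green",
             "Janet Black", "John Doe", "Jane Doe",
             "William Hill", "Janet Black", "Matthew Green")
)

# Guests-by-events attendance matrix (nonzero means attended)
m <- unclass(table(df$guests, df$event))

# Jaccard distance between guests, using base R instead of vegan
d <- dist(m, method = "binary")

# Hierarchical clustering (complete linkage, the hclust default)
hc <- hclust(d)

# Cut at height 0: only guests with identical attendance share a group
grp <- cutree(hc, h = 0)

# Guests listed by group
split(names(grp), grp)
```

Raising `h` merges guests whose attendance is similar but not identical, which is where clustering may help with the tie and group scenarios described above.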

Very nice approach!
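# Tidy the co-attendance matrix (admat, computed above) into a ranked list of pairs
# Zero the diagonal (a guest matched with themselves)
diag(admat) <- 0
# Zero the upper triangle so that each pair is counted only once
admat[upper.tri(admat)] <- 0
library(reshape2)
# Reshape the matrix to long format: one row per pair of guests
dfmatches <- unique(melt(admat))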
# Drop all the zero matches
dfmatches <- dfmatches[dfmatches$value !=0,]
# order it descending
dfmatches <- dfmatches[order(-dfmatches$value),]
dfmatches

#            Var1        Var2 value
#3       John Doe    Jane Doe     3
#2    Janet Black    Jane Doe     1
#4     Mark White    Jane Doe     1
#5  Matthew Green    Jane Doe     1
#6   William Hill    Jane Doe     1
#9       John Doe Janet Black     1
#11 Matthew Green Janet Black     1
#12  William Hill Janet Black     1
#16    Mark White    John Doe     1
#17 Matthew Green    John Doe     1
#18  William Hill    John Doe     1
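# Distance matrix between guests used for the hierarchical clustering
# (vegdist() from vegan, on presence/absence data; 0 means identical attendance)
dat_dist <- vegdist(dat_m, binary = TRUE)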

dat_dist
#                Jane Doe Janet Black  John Doe Mark White Matthew Green
# Janet Black   0.6000000                                               
# John Doe      0.0000000   0.6000000                                   
# Mark White    0.5000000   1.0000000 0.5000000                         
# Matthew Green 0.6000000   0.5000000 0.6000000  1.0000000              
# William Hill  0.5000000   0.3333333 0.5000000  1.0000000     1.0000000
# Number of events each guest attended
rowSums(dat_m)
# Jane Doe   Janet Black      John Doe    Mark White Matthew Green  William Hill 
#        3             2             3             1             2             1