R: Finding one-to-one, one-to-many, and many-to-one relationships between columns

Consider the following data frame:

 first_name last_name
1         Al     Smith
2         Al     Jones
3       Jeff  Thompson
4      Scott  Thompson
5      Terry    Dactil
6       Pete       Zah

data <- data.frame(first_name=c("Al","Al","Jeff","Scott","Terry","Pete"),
                   last_name=c("Smith","Jones","Thompson","Thompson","Dactil","Zah"))
One to many:

  first_name last_name
1         Al     Smith
2         Al     Jones
Many to one:

   first_name last_name
1       Jeff  Thompson
2      Scott  Thompson

I would like to do this with the dplyr package.

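In general, you can use the `duplicated` function to check whether a value is repeated (as @RichardScriven mentioned in a comment on your question). By default, however, this function does not flag the first occurrence of a repeated element as a duplicate: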

duplicated(c(1, 1, 1, 2))
# [1] FALSE  TRUE  TRUE FALSE
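Since you also want to select those first occurrences, you generally need to run `duplicated` twice on each vector, once from the front and once from the back: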

duplicated(c(1, 1, 1, 2)) | duplicated(c(1, 1, 1, 2), fromLast=TRUE)
# [1]  TRUE  TRUE  TRUE FALSE
I find that to be a lot of typing, so I'll define a helper function that checks whether an element appears more than once:

d <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)
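As a minimal sketch of how the helper could be applied with dplyr (this usage is an illustration rather than something shown in the text above, and `one_to_many`, `many_to_one`, and `one_to_one` are names chosen here), each relationship can be written as a combination of `d()` on the two columns:

```r
library(dplyr)

data <- data.frame(first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete"),
                   last_name = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah"))

# TRUE for every element that occurs more than once, including the first occurrence
d <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

# First name repeated, last name unique: Al Smith, Al Jones
one_to_many <- data %>% filter(d(first_name) & !d(last_name))

# First name unique, last name repeated: Jeff Thompson, Scott Thompson
many_to_one <- data %>% filter(!d(first_name) & d(last_name))

# Neither column repeated: Terry Dactil, Pete Zah
one_to_one <- data %>% filter(!d(first_name) & !d(last_name))
```

The labels follow the convention of the expected output in the question, where a first name shared by several last names counts as one to many.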
Note that you can also define `d` with the `table` function, without the help of `duplicated`:

d <- function(x) table(x)[x] > 1

While this alternative definition is slightly more concise, I also find it less readable.
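One caveat that seems worth checking before relying on the `table` version: `table(x)[x]` looks values up by name only when `x` is a character vector, while a numeric `x` is used as a position index. A hedged sketch of a variant that always indexes by name (the names `d_table` and `d_dup` are hypothetical, and missing values are assumed to be absent):

```r
# Variant of the table-based helper that converts x to character first,
# so the lookup into the table is by name rather than by position.
d_table <- function(x) as.vector(table(x)[as.character(x)] > 1)

# The duplicated()-based helper, for comparison
d_dup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

x <- c(5, 5, 6)
same_result <- identical(d_table(x), d_dup(x))
```

`as.vector()` drops the names and dimensions that the table lookup carries along, so the result is a plain logical vector like the one `duplicated` returns.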

Using the approach suggested by @josliber, I built a function that takes two vectors and returns their relationship type:

library(dplyr)
library(tidyr)  # for drop_na()

relationship_type <- function(x1, x2, na.rm = FALSE) {

  df <- tibble(x1 = x1, x2 = x2)

  if (na.rm) {
    df <- df %>%
      drop_na()
  }

  res <- c()

  counts <- df %>%
    count(x1, x2) %>%
    ungroup() %>%
    select(-n) %>%
    count(x1, x2)

  if (any(is.na(counts$x2))) {
    res <- c(res, "one to zero")
  }

  if (any(is.na(counts$x1))) {
    res <- c(res, "zero to one")
  }

  if (anyDuplicated(counts$x1) == 0 & anyDuplicated(counts$x2) == 0) {
    res <- c(res, "one to one")
  }

  if (anyDuplicated(counts$x1) > 0 & anyDuplicated(counts$x2) == 0) {
    res <- c(res, "one to many")
  }

  if (anyDuplicated(counts$x1) == 0 & anyDuplicated(counts$x2) > 0) {
    res <- c(res, "many to one")
  }

  if (anyDuplicated(counts$x1) > 0 & anyDuplicated(counts$x2) > 0) {
    res <- c(res, "many to many")
  }

  res
}
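The two `count()` calls appear to serve mainly to reduce the input to its distinct `(x1, x2)` pairs. A compact sketch of the same idea, assuming no missing values and using `distinct()` instead (the name `relationship_sketch` is hypothetical, and the "one to zero" and "zero to one" cases of the function above are deliberately left out):

```r
library(dplyr)

relationship_sketch <- function(x1, x2) {
  # Keep each (x1, x2) pair once so repeated rows do not count as extra partners
  pairs <- distinct(tibble(x1 = x1, x2 = x2))

  dup1 <- anyDuplicated(pairs$x1) > 0  # some x1 value pairs with several x2 values
  dup2 <- anyDuplicated(pairs$x2) > 0  # some x2 value pairs with several x1 values

  if (dup1 && dup2) {
    "many to many"
  } else if (dup1) {
    "one to many"
  } else if (dup2) {
    "many to one"
  } else {
    "one to one"
  }
}

relationship_sketch(c("a", "b", "c"), c("x", "y", "z"))  # "one to one"
relationship_sketch(c("a", "a", "b"), c("x", "y", "z"))  # "one to many"
relationship_sketch(c("a", "b", "c"), c("x", "x", "z"))  # "many to one"
```

Note that this describes the two vectors as a whole. Applied to the question's data it returns "many to many", because both a repeated first name (Al) and a repeated last name (Thompson) are present.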
Here is a pure dplyr approach that uses the same logic as josliber's answer and adds a new count column for each variable:

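library(dplyr)

# Count how often each first name and each last name occurs
data <- data %>% 
  add_count(first_name, name="first_name_n") %>%
  add_count(last_name, name="last_name_n")

# one-to-one
data %>% filter(first_name_n == 1 & last_name_n == 1)

# one-to-many: a first name that occurs on several rows (Al), as in the question
data %>% filter(first_name_n > 1 & last_name_n == 1)

# many-to-one: a last name that occurs on several rows (Thompson), as in the question
data %>% filter(first_name_n == 1 & last_name_n > 1)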
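Instead of filtering three times, the counts can also be turned into a label per row. The following is a sketch under assumptions: the `relationship` column and the `labelled` name are invented for illustration, the labels follow the question's convention (a first name shared by several last names is one to many), and the data is assumed to contain no repeated rows.

```r
library(dplyr)

data <- data.frame(first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete"),
                   last_name = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah"))

labelled <- data %>%
  add_count(first_name, name = "first_name_n") %>%
  add_count(last_name, name = "last_name_n") %>%
  mutate(relationship = case_when(
    first_name_n == 1 & last_name_n == 1 ~ "one to one",
    first_name_n > 1 & last_name_n == 1 ~ "one to many",
    first_name_n == 1 & last_name_n > 1 ~ "many to one",
    TRUE ~ "many to many"  # both the first name and the last name repeat
  ))
```

A single labelled table also makes the many-to-many case visible, which none of the three filters selects.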

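Any solution that uses the `duplicated` function will not work if the data contains duplicate rows, because each repeated row is counted as if it were a separate pairing.

For example: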

data1 = data.frame(first_name=c("Al","Al","Jeff","Scott","Terry","Pete", "Jeff","Scott","Terry","Pete"),
                   last_name=c("Smith","Jones","Thompson","Thompson","Dactil","Zah", "Smith","Jones", "Dactil","Zah"))
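To make the failure concrete, here is a small base R sketch that applies the `duplicated`-based helper `d` from the earlier answer to `data1`. The names `naive_one_to_one` and `deduped_one_to_one` are hypothetical, and the `unique()` step is only one possible workaround, not something proposed in the thread.

```r
data1 <- data.frame(first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete", "Jeff", "Scott", "Terry", "Pete"),
                    last_name = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah", "Smith", "Jones", "Dactil", "Zah"))

d <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

# Every name occurs at least twice in data1, so every row is flagged in both columns
naive_one_to_one <- data1[!d(data1$first_name) & !d(data1$last_name), ]
nrow(naive_one_to_one)  # 0, although Terry Dactil and Pete Zah are one to one

# Dropping repeated rows first restores the expected result
deduped <- unique(data1)
deduped_one_to_one <- deduped[!d(deduped$first_name) & !d(deduped$last_name), ]
nrow(deduped_one_to_one)  # 2: Terry Dactil and Pete Zah
```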
Here is a solution using `data.table` that works both for the case above and for the OP's original data:

library(data.table)
setDT(data1)

# One to one
temp1 = data1[ , uniqueN(last_name), by = 'first_name'][V1 == 1]
temp2 = data1[ , uniqueN(first_name), by = 'last_name'][V1 == 1]
data1[first_name %in% temp1$first_name & last_name %in% temp2$last_name]

#    first_name last_name
# 1:      Terry    Dactil
# 2:       Pete       Zah
# 3:      Terry    Dactil
# 4:       Pete       Zah


# One to many
temp3 = data1[ , uniqueN(last_name), by = 'first_name'][V1 > 1]
data1[first_name %in% temp3$first_name][order(first_name)]

#    first_name last_name
# 1:         Al     Smith
# 2:         Al     Jones
# 3:       Jeff  Thompson
# 4:       Jeff     Smith
# 5:      Scott  Thompson
# 6:      Scott     Jones


# Many to one
temp4 = data1[ , uniqueN(first_name), by = 'last_name'][V1 > 1]
data1[last_name %in% temp4$last_name][order(last_name)]

#    first_name last_name
# 1:         Al     Jones
# 2:      Scott     Jones
# 3:         Al     Smith
# 4:       Jeff     Smith
# 5:       Jeff  Thompson
# 6:      Scott  Thompson
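Since the question asks for dplyr, the same idea of counting distinct partners can be sketched with `n_distinct()`, which plays the role of `uniqueN()` here. This is an assumption-laden translation rather than part of the answer above, and the column names `n_last` and `n_first` are invented for illustration.

```r
library(dplyr)

data1 <- data.frame(first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete", "Jeff", "Scott", "Terry", "Pete"),
                    last_name = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah", "Smith", "Jones", "Dactil", "Zah"))

flagged <- data1 %>%
  group_by(first_name) %>%
  mutate(n_last = n_distinct(last_name)) %>%    # distinct last names per first name
  group_by(last_name) %>%
  mutate(n_first = n_distinct(first_name)) %>%  # distinct first names per last name
  ungroup()

one_to_one  <- flagged %>% filter(n_last == 1, n_first == 1)
one_to_many <- flagged %>% filter(n_last > 1) %>% arrange(first_name)
many_to_one <- flagged %>% filter(n_first > 1) %>% arrange(last_name)
```

Because distinct partners are counted instead of rows, repeated rows such as the second Terry Dactil do not change the classification.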

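You want the `duplicated()` function.

There is some sample code that uses `duplicated()`, but I think it would be cool if you could give us something concrete here, @RichardScriven. I'm not grokking it well enough to solve this problem. ty.

I kind of wish my name was Terry Dactil now.

@RichardScriven I would have expected "Terry Dactil".

Yes, that would be a great conversation starter!

Wouldn't R treat `function(x) duplicated(x)` and `duplicated(x, fromLast = TRUE)` as two separate parts?

@aviasaRJ The function body would be all of `duplicated(x) | duplicated(x, fromLast = TRUE)`, just as in the second function the body would be all of `table(x)[x] > 1`. If in doubt, you can add `{` `}` to enclose the body.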