R: finding one-to-one, one-to-many, and many-to-one relationships between columns


Consider the following data frame:

 first_name last_name
1         Al     Smith
2         Al     Jones
3       Jeff  Thompson
4      Scott  Thompson
5      Terry    Dactil
6       Pete       Zah

data <- data.frame(first_name=c("Al","Al","Jeff","Scott","Terry","Pete"),
                   last_name=c("Smith","Jones","Thompson","Thompson","Dactil","Zah"))

One to many

  first_name last_name
1         Al     Smith
2         Al     Jones

Many to one

   first_name last_name
1       Jeff  Thompson
2      Scott  Thompson

I would like to do this with the dplyr package.

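In general, you can use the duplicated function to check whether a value is repeated (as @RichardScriven mentioned in a comment on your question). By default, however, this function does not flag the first instance of an element that occurs several times: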

duplicated(c(1, 1, 1, 2))
# [1] FALSE  TRUE  TRUE FALSE

Since you want to select those first instances as well, you generally need to run duplicated twice on each vector, once forward and once backward:

duplicated(c(1, 1, 1, 2)) | duplicated(c(1, 1, 1, 2), fromLast=TRUE)
# [1]  TRUE  TRUE  TRUE FALSE

I find this a lot of typing, so I'll define a helper function that checks whether an element occurs more than once:

d <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)

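Note that you could also define d with the table function, without the help of duplicated:

# note: this lookup assumes x is a character vector or a factor;
# a numeric x would be used as positions into the table rather than as names
d <- function(x) table(x)[x] > 1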


While this alternative definition is slightly more concise, I also find it less readable.
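The helper above can be combined with dplyr's filter to pull out each kind of relationship. The following is only a sketch of how that might look on the question's data; the result names one_to_one, one_to_many and many_to_one are chosen here for illustration and do not come from the original answer.

```r
library(dplyr)

data <- data.frame(
  first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete"),
  last_name  = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah")
)

# TRUE for every element that occurs more than once, first occurrence included
d <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

# One to one: neither the first name nor the last name is repeated
one_to_one <- data %>% filter(!d(first_name) & !d(last_name))

# One to many: the first name is repeated, the last name is not
one_to_many <- data %>% filter(d(first_name) & !d(last_name))

# Many to one: the last name is repeated, the first name is not
many_to_one <- data %>% filter(!d(first_name) & d(last_name))
```

On this data, one_to_many should hold the two Al rows and many_to_one the two Thompson rows, matching the output requested in the question.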
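Using the approach suggested by @josliber, I constructed a function that takes two vectors and returns their relationship type:
library(dplyr)  # tibble(), count(), select(), %>%
library(tidyr)  # drop_na()

relationship_type <- function(x1, x2, na.rm = FALSE) {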

  df <- tibble(x1 = x1, x2 = x2)

  if (na.rm) {
    df <- df %>%
      drop_na()
  }

  res <- c()

  # keep each (x1, x2) pair once so that repeated rows are not counted as duplicates
  counts <- df %>%
    distinct(x1, x2)

  if (any(is.na(counts$x2))) {
    res <- c(res, "one to zero")
  }

  if (any(is.na(counts$x1))) {
    res <- c(res, "zero to one")
  }

  if (anyDuplicated(counts$x1) == 0 && anyDuplicated(counts$x2) == 0) {
    res <- c(res, "one to one")
  }

  if (anyDuplicated(counts$x1) > 0 && anyDuplicated(counts$x2) == 0) {
    res <- c(res, "one to many")
  }

  if (anyDuplicated(counts$x1) == 0 && anyDuplicated(counts$x2) > 0) {
    res <- c(res, "many to one")
  }

  if (anyDuplicated(counts$x1) > 0 && anyDuplicated(counts$x2) > 0) {
    res <- c(res, "many to many")
  }

  res
}

Here is a pure dplyr approach, using the same logic as josliber, that adds a new count column for each variable:
data <- data %>% 
  add_count(first_name, name="first_name_n") %>%
  add_count(last_name, name="last_name_n")

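# one-to-one: both names appear only once
data %>% filter(first_name_n == 1 & last_name_n == 1)

# one-to-many: one first name paired with several last names (e.g. Al)
data %>% filter(first_name_n > 1 & last_name_n == 1)

# many-to-one: several first names sharing one last name (e.g. Thompson)
data %>% filter(first_name_n == 1 & last_name_n > 1)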

If the data contains duplicate rows, any solution that relies on the duplicated function will not work.
For example:
data1 = data.frame(first_name=c("Al","Al","Jeff","Scott","Terry","Pete", "Jeff","Scott","Terry","Pete"),
                   last_name=c("Smith","Jones","Thompson","Thompson","Dactil","Zah", "Smith","Jones", "Dactil","Zah"))
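To make the failure concrete, here is a small sketch that applies the d helper from the earlier answer to data1. The variable names naive_one_to_one and dedup_one_to_one are made up for this illustration. Because every first name and every last name occurs at least twice in data1, the helper flags all rows, and the naive one-to-one filter comes back empty even though Terry Dactil and Pete Zah are one-to-one pairs.

```r
data1 <- data.frame(
  first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete",
                 "Jeff", "Scott", "Terry", "Pete"),
  last_name  = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah",
                 "Smith", "Jones", "Dactil", "Zah")
)

# Helper from the earlier answer: TRUE for every element that occurs more than once
d <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

# Naive one-to-one filter: returns no rows, because the repeated
# "Terry Dactil" and "Pete Zah" rows make both of their names look duplicated
naive_one_to_one <- data1[!d(data1$first_name) & !d(data1$last_name), ]

# Dropping exact duplicate rows first recovers the one-to-one pairs in this data
deduped <- unique(data1)
dedup_one_to_one <- deduped[!d(deduped$first_name) & !d(deduped$last_name), ]
```

Deduplicating first is enough for the one-to-one case here, but it is shown only as one possible workaround, not as a general fix.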

Here is a solution using data.table that works both for the case above and for the OP's original data:
library(data.table)
setDT(data1)

# One to one
temp1 = data1[ , uniqueN(last_name), by = 'first_name'][V1 == 1]
temp2 = data1[ , uniqueN(first_name), by = 'last_name'][V1 == 1]
data1[first_name %in% temp1$first_name & last_name %in% temp2$last_name]

#    first_name last_name
# 1:      Terry    Dactil
# 2:       Pete       Zah
# 3:      Terry    Dactil
# 4:       Pete       Zah


# One to many
temp3 = data1[ , uniqueN(last_name), by = 'first_name'][V1 > 1]
data1[first_name %in% temp3$first_name][order(first_name)]

#    first_name last_name
# 1:         Al     Smith
# 2:         Al     Jones
# 3:       Jeff  Thompson
# 4:       Jeff     Smith
# 5:      Scott  Thompson
# 6:      Scott     Jones


# Many to one
temp4 = data1[ , uniqueN(first_name), by = 'last_name'][V1 > 1]
data1[last_name %in% temp4$last_name][order(last_name)]

#    first_name last_name
# 1:         Al     Jones
# 2:      Scott     Jones
# 3:         Al     Smith
# 4:       Jeff     Smith
# 5:       Jeff  Thompson
# 6:      Scott  Thompson
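Since the question asked for dplyr, the same idea (counting distinct partners per name instead of counting rows) can probably be expressed with group_by and n_distinct. This is a sketch under that assumption rather than part of the original answer; the column names n_last and n_first and the result names are invented for the example.

```r
library(dplyr)

data1 <- data.frame(
  first_name = c("Al", "Al", "Jeff", "Scott", "Terry", "Pete",
                 "Jeff", "Scott", "Terry", "Pete"),
  last_name  = c("Smith", "Jones", "Thompson", "Thompson", "Dactil", "Zah",
                 "Smith", "Jones", "Dactil", "Zah")
)

# For every row, count how many distinct partners each of its names has
counted <- data1 %>%
  group_by(first_name) %>%
  mutate(n_last = n_distinct(last_name)) %>%
  group_by(last_name) %>%
  mutate(n_first = n_distinct(first_name)) %>%
  ungroup()

# One to one: the first name has a single last name and vice versa
one_to_one <- counted %>% filter(n_last == 1, n_first == 1)

# One to many: the first name is paired with more than one last name
one_to_many <- counted %>% filter(n_last > 1) %>% arrange(first_name)

# Many to one: the last name is paired with more than one first name
many_to_one <- counted %>% filter(n_first > 1) %>% arrange(last_name)
```

As in the data.table version, a row such as Al Smith can appear in both the one-to-many and the many-to-one result, because each result looks at one direction only.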

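Comments:
You want duplicated().
There is some example code that uses duplicated(), but I think it would be cool if you could give us something concrete here, @RichardScriven. I'm not quite seeing how to apply it to this problem. ty.
I kind of wish my name was Terry Dactil now.
@RichardScriven I would expect "Terry Dactil". Yes, that would be a great conversation starter!
Won't R treat function(x) duplicated(x) and duplicated(x, fromLast=TRUE) as two separate parts?
@aviasaRJ The function body will be all of duplicated(x) | duplicated(x, fromLast=TRUE), just as in the second function the body will be all of table(x)[x] > 1. If in doubt, you can add { and } to enclose the body.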