R: How do I perform a join that uses multiple columns as different string conditions?

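I want to perform a complex join that treats multiple columns as different kinds of conditions.

I want to assign each fruit a category based on which strings it contains, which strings it may contain, and which strings it must not contain.

I have a vector of fruits: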

head(fruit) 
[1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper" "bilberry" 
The assignment criteria for each fruit are as follows:

 fruitAssignment <- data.frame(assignment = c('Apple','Berry','Black','Melon','Melon','Melon','Currant'),
       contains = c('apple','berry','black','honeydew','melon','cantaloupe','currant'),
       mayContain = c(NA,'black',NA,NA,NA,NA,NA),
       doesNotContain = c(NA,NA,'berry',NA,NA,NA,NA))

  assignment   contains mayContain doesNotContain
1      Apple      apple       <NA>           <NA>
2      Berry      berry      black           <NA>
3      Black      black       <NA>          berry
4      Melon   honeydew       <NA>           <NA>
5      Melon      melon       <NA>           <NA>
6      Melon cantaloupe       <NA>           <NA>
7    Currant    currant       <NA>           <NA>

Any package is fine for achieving this.

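I don't think a join is the right tool here; this is more of a classification task. Use regular expressions to find matches between the search terms and the classification table: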

fruit <- c("redcurrant", "blackcurrant", "pineapple", "blackberry", "coconut")

fruitAssignment <- data.frame(assignment = c('Apple','Berry','Black','Melon','Melon','Melon','Currant'),
                              contains = c('apple','berry','black','honeydew','melon','cantaloupe','currant'),
                              mayContain = c(NA,'black',NA,NA,NA,NA,NA),
                              doesNotContain = c(NA,NA,'berry',NA,NA,NA,NA),
                              stringsAsFactors = FALSE)

library(dplyr)
library(tibble)

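# Classify a single fruit name against the criteria table
fun <- function(fruit, fruitAssignment) {

  # Replace each pattern in columns 2:4 with TRUE/FALSE: does it occur in `fruit`?
  fruitAssignment[,2:4] <- apply(fruitAssignment[,2:4],
                                 2,
                                 function(x, fruit) sapply(x, grepl, fruit, ignore.case = TRUE),
                                 fruit = fruit)
  # An NA pattern yields NA; treat it as "no match"
  fruitAssignment[is.na(fruitAssignment)] <- FALSE

  # Keep rows whose exclusion did not match and whose contains or mayContain did
  x <- fruitAssignment %>%
    filter(!doesNotContain, contains | mayContain)

  # Exactly one surviving row gives the category; otherwise fall back to "Fruit"
  if (nrow(x) == 1)
    return(x$assignment)
  "Fruit"

}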

sapply(fruit, fun, fruitAssignment) %>%
  enframe() %>%
  setNames(c("fruit", "assignment"))

# A tibble: 5 x 2
  fruit        assignment
  <chr>        <chr>     
1 redcurrant   Currant   
2 blackcurrant Fruit     
3 pineapple    Apple     
4 blackberry   Berry     
5 coconut      Fruit 
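
For comparison, here is a minimal base R sketch of the same classification idea without dplyr. It is not the code above. It assumes a stricter reading of the criteria: a row applies only when its contains pattern matches and its doesNotContain pattern does not, while mayContain is treated as purely optional and therefore ignored. The names classify_fruit, hit, rules, and result are hypothetical and exist only for this illustration.

```r
fruit <- c("redcurrant", "blackcurrant", "pineapple", "blackberry", "coconut")

rules <- data.frame(
  assignment     = c("Apple", "Berry", "Black", "Melon", "Melon", "Melon", "Currant"),
  contains       = c("apple", "berry", "black", "honeydew", "melon", "cantaloupe", "currant"),
  mayContain     = c(NA, "black", NA, NA, NA, NA, NA),
  doesNotContain = c(NA, NA, "berry", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: classify one fruit name under the stricter reading.
classify_fruit <- function(fruit, rules, default = "Fruit") {
  # TRUE where a pattern occurs in `fruit`; an NA pattern counts as "no match"
  hit <- function(patterns) {
    vapply(patterns,
           function(p) !is.na(p) && grepl(p, fruit, ignore.case = TRUE),
           logical(1),
           USE.NAMES = FALSE)
  }

  # Assumption: mayContain neither includes nor excludes a row, so it is unused
  keep <- hit(rules$contains) & !hit(rules$doesNotContain)

  # Several matching rows may point at the same category, so compare categories
  matched <- unique(rules$assignment[keep])
  if (length(matched) == 1) matched else default
}

result <- vapply(fruit, classify_fruit, character(1), rules = rules)
data.frame(fruit = names(result), assignment = unname(result))
```

On the five sample fruits this gives the same assignments as the output shown above. One deliberate difference: because it compares distinct categories rather than counting rows, a string such as "honeydew melon" that matches two Melon rows resolves to Melon instead of falling back to the default.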

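"Fruit" and "Apple" do not appear in fruitAssignment as written, and the values in the assignment column start with a capital letter. Please specify the output you want precisely, including a correct sample output. I just need each fruit assigned according to the criteria, matched case-insensitively. Let me know if you need more clarification.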
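
As a small illustration of the case-insensitive matching mentioned here, the sketch below compares two ways of ignoring case with grepl. The variable names are hypothetical, and "PineApple" is a made-up input chosen only to show the difference.

```r
# Regex match that ignores case
ci_regex <- grepl("apple", "PineApple", ignore.case = TRUE)

# Default matching is case-sensitive, so this one does not match
cs_regex <- grepl("apple", "PineApple")

# Alternative: lower-case the input and match the pattern literally
ci_fixed <- grepl("apple", tolower("PineApple"), fixed = TRUE)

c(ci_regex = ci_regex, cs_regex = cs_regex, ci_fixed = ci_fixed)
```

The third form treats the pattern as a literal string rather than a regular expression, which may be preferable if a criterion could ever contain regex metacharacters. It relies on the patterns in the criteria table already being lower case.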