R: compare columns and put the output in an additional column


Let's start with the example data:

structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L, 
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple", 
"Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L, 
4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange", 
"Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(2L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"), 
    P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair", 
    "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"), 
    P2_location_subacon = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge", 
    "Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L, 
    3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed", 
    "Table,Shelf,Fridge"), class = "factor")), .Names = c("P1", 
"P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon", 
"P2_location_all_predictors"), class = "data.frame", row.names = c(NA, 
-20L))
I would like to compare two pairs of columns. The first pair is P1_location_subacon with P2_location_subacon. The second pair is P1_location_all_predictors with P2_location_all_predictors.

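How do I want to compare them? Each column holds the "locations" of a fruit/vegetable. So:

  • If the locations in the first pair (P1/P2_location_subacon) are the same, I would like to put the number 2 in the additional column.

  • If the locations in the second pair (P1/P2_location_all_predictors) match, I would like to put the number 1 in the additional column. This one is a bit more complicated, because not all locations have to be the same: the two fruits/vegetables only need to have at least one location in common.

  • If they differ in both cases, put 0. You won't see this case in the example data.

To sum up, this is the output I would like to achieve: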

    structure(list(P1 = structure(c(1L, 1L, 3L, 3L, 5L, 5L, 5L, 5L, 
    4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 2L, 2L), .Label = c("Apple", 
    "Grape", "Orange", "Peach", "Tomato"), class = "factor"), P2 = structure(c(4L, 
    4L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
    1L, 6L, 6L), .Label = c("Banana", "Cucumber", "Lemon", "Orange", 
    "Potato", "Tomato"), class = "factor"), P1_location_subacon = structure(c(2L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), .Label = c("Fridge", "Table"), class = "factor"), 
        P1_location_all_predictors = structure(c(2L, 2L, 3L, 3L, 
        3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
        3L), .Label = c("Table,Desk,Bag,Fridge,Bed,Shelf,Chair", 
        "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge"), class = "factor"), 
        P2_location_subacon = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
        2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Fridge", 
        "Shelf"), class = "factor"), P2_location_all_predictors = structure(c(3L, 
        3L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 
        3L, 3L, 3L, 3L), .Label = c("Shelf,Fridge", "Shelf,Fridge,Bed", 
        "Table,Shelf,Fridge"), class = "factor"), X = c(NA, NA, NA, 
        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
        NA, NA), Correct = c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
        1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L)), .Names = c("P1", 
    "P2", "P1_location_subacon", "P1_location_all_predictors", "P2_location_subacon", 
    "P2_location_all_predictors", "X", "Correct"), class = "data.frame", row.names = c(NA, 
    -20L))
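The three rules above can be sketched for a single row in base R. This is only an illustration: `score_row` is a hypothetical helper name that does not appear in the question, and it assumes that a match on the subacon pair takes priority over a match on the all_predictors pair, which is what the desired `Correct` column suggests.

```r
# Hypothetical helper (not from the original post): score one row.
score_row <- function(p1_sub, p2_sub, p1_all, p2_all) {
  # Split the comma-separated location strings into character vectors.
  p1_all_set <- strsplit(p1_all, ",", fixed = TRUE)[[1]]
  p2_all_set <- strsplit(p2_all, ",", fixed = TRUE)[[1]]
  if (p1_sub == p2_sub) {
    2L  # rule 1: the subacon locations are identical
  } else if (length(intersect(p1_all_set, p2_all_set)) > 0) {
    1L  # rule 2: at least one shared location among all predictors
  } else {
    0L  # rule 3: no match in either pair
  }
}

# Row 1 of the example data (Apple vs Orange): subacon differs, lists overlap.
score_row("Table", "Fridge", "Table,Shelf,Cupboard,Bed,Fridge", "Table,Shelf,Fridge")  # 1
# Row 3 of the example data (Orange vs Lemon): subacon is "Fridge" for both.
score_row("Fridge", "Fridge", "Table,Shelf,Fridge", "Shelf,Fridge,Bed")                # 2
# Made-up row with nothing in common.
score_row("Table", "Fridge", "Desk,Bag", "Shelf,Bed")                                  # 0
```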
    

    Edit: using the feedback here, I have improved my answer.

    Where DT is your table:

    library(data.table)
    setDT(DT)
    # Convert every column (including factors) to character.
    DT <- data.table(sapply(DT,as.character))
    
    # Turn the comma-separated P1 lists into regex alternations ("A|B|C").
    DT[, P1_location_all_predictors := gsub(",","|",P1_location_all_predictors)]
    DT[, P1_location_subacon := gsub(",","|",P1_location_subacon)]
    
    # Grouping by the pattern column makes it length 1 inside j, so grepl gets a single pattern per group.
    DT[, match_all_pred := grepl(P1_location_all_predictors, P2_location_all_predictors) + 0, by = P1_location_all_predictors]
    DT[, match_subacon := grepl(P1_location_subacon, P2_location_subacon), by = P1_location_subacon]
    
    
    # Restore the original comma separators.
    DT[, P1_location_all_predictors := gsub("\\|",",",P1_location_all_predictors)]
    DT[, P1_location_subacon := gsub("\\|",",",P1_location_subacon)]
    
    This assumes that a subacon match supersedes an all-locations match.
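The code above produces two separate match columns rather than the single `Correct` column from the question. One possible way to collapse them, sketched here in base R on a small made-up table, is a nested `ifelse` in which the subacon match wins. The `flags` data frame and its values are invented for illustration only.

```r
# Made-up stand-in for the two match columns computed above.
flags <- data.frame(
  match_all_pred = c(1, 1, 0),            # numeric 0/1, as produced by `grepl(...) + 0`
  match_subacon  = c(FALSE, TRUE, FALSE)  # logical, as produced by plain `grepl(...)`
)

# A subacon match scores 2; otherwise an all_predictors match scores 1; otherwise 0.
flags$Correct <- ifelse(flags$match_subacon, 2L,
                        ifelse(flags$match_all_pred > 0, 1L, 0L))

flags$Correct
# [1] 1 2 0
```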

    Here is another way:

    myData <- data.frame(sapply(myData, as.character), stringsAsFactors=FALSE)
    
    doesIntersect <- function(setA, setB) {length(intersect(setA,setB)) > 0}
    
    myData$Correct <- 0
    myData$Correct[mapply(doesIntersect, strsplit(myData$P1_location_all_predictors, ","), strsplit(myData$P2_location_all_predictors, ","))] <- 1
    myData$Correct[mapply(setequal, strsplit(myData$P1_location_subacon, ","), strsplit(myData$P2_location_subacon, ","))] <- 2
    
    > myData$Correct
    [1] 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
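To see why this answer uses `intersect` for one pair and `setequal` for the other, it may help to look at how the two behave on a single pair of location lists. The vectors below are taken from the example data or made up. `doesIntersect` is redefined so that the snippet runs on its own.

```r
doesIntersect <- function(setA, setB) {length(intersect(setA, setB)) > 0}

# strsplit returns a list with one character vector per input string.
a <- strsplit("Table,Shelf,Cupboard,Bed,Fridge", ",")[[1]]
b <- strsplit("Shelf,Fridge", ",")[[1]]

doesIntersect(a, b)  # TRUE: "Shelf" and "Fridge" appear in both
setequal(a, b)       # FALSE: the two sets are not identical

# setequal ignores order, so multi-valued entries compare as sets.
setequal(c("Fridge", "Table"), c("Table", "Fridge"))  # TRUE

# mapply walks the two lists in parallel, one row at a time.
mapply(doesIntersect, list(a, "Desk"), list(b, "Bag"))  # TRUE FALSE
```

Because `setequal` requires the full sets to match, a row whose subacon entries only partly overlap would not be scored 2 by this approach.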
    

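    It looks like it works for the example data, but it doesn't want to work for my main data... I don't know why yet. Is it necessary to put commas between all the letters in one of the columns? Can they be removed later? This is the error I get when I try the code on my main data:
    Error in [ ... 0; e.g., NA_integer_ ... If you are trying to change the column type to be an empty list column then, as with all column type changes, provide a full length RHS vector such as vector('list',nrow(DT)); i.e., 'plonk' in the new column.

    @ShaxiLiver, I made several silly mistakes in the code, including how it returned results (it flagged the wrong rows) and how I changed the | characters back to commas. Edited now.

    Works great! I have one more question, to make my data more reliable. I have found that in a few cases (about 50 rows out of 10k) I have two or three possible locations in P1/P2_location_subacon instead of one. As I said, I can leave it like this, but then I have to skip those 50 rows. Maybe you still have some time and would like to write code that covers this case. It is mainly about this code: DT[P1_location_subacon == P2_location_subacon, MyCol := 2]. The other locations are separated by commas... Thx!!

    @ShaxiLiver see above. As I stated in my answer, the reason it is slow is that the grep function relies on a loop, because it is not vectorized. For anyone wishing to help answer, I think the solution would be to replace those two lines. That said, the requirement is fairly computation-heavy.

    My answer should be much faster now.

    Your desired output contains an X column that is entirely empty (all NA).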
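For the follow-up case raised in the comments, where a subacon column may hold two or three comma-separated locations, one option is to treat both pairs as sets. The sketch below is base R on made-up rows. `overlaps` and `score_locations` are hypothetical names, and the choice that any shared subacon location scores 2 is an assumption. If the sets must match exactly, `setequal` could be swapped in for the subacon pair.

```r
# Hypothetical helpers (not from the original thread).
overlaps <- function(a, b) length(intersect(a, b)) > 0

score_locations <- function(df) {
  # Split each comma-separated cell into a character vector.
  split_col <- function(x) strsplit(as.character(x), ",", fixed = TRUE)
  sub_hit <- mapply(overlaps,
                    split_col(df$P1_location_subacon),
                    split_col(df$P2_location_subacon))
  all_hit <- mapply(overlaps,
                    split_col(df$P1_location_all_predictors),
                    split_col(df$P2_location_all_predictors))
  # Assumption: a subacon hit takes priority over an all_predictors hit.
  ifelse(sub_hit, 2L, ifelse(all_hit, 1L, 0L))
}

# Made-up rows; the first has two subacon locations for P1.
toy <- data.frame(
  P1_location_subacon        = c("Fridge,Table", "Table", "Desk"),
  P2_location_subacon        = c("Table", "Fridge", "Shelf"),
  P1_location_all_predictors = c("Table,Shelf", "Table,Shelf,Fridge", "Desk,Bag"),
  P2_location_all_predictors = c("Shelf,Fridge", "Shelf,Fridge", "Shelf,Bed"),
  stringsAsFactors = FALSE
)

toy$Correct <- score_locations(toy)
toy$Correct
# [1] 2 1 0
```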