R: If a column value is duplicated, keep rows based on multiple conditions; otherwise keep the row (r, dataframe, filter, data.table)

r dataframe filter

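I want to subset a data table based on the date of each record and on conditions on two other columns (an id and a type variable), so that only the matching records are included. However, if only one record exists for an id, that record should be kept regardless of its date or the values in the other condition columns.

My example data looks like this: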

dt <- data.table(badge = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"), location = c("training", "test", "training", "training", "test", "test", "training", "training"), date = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15", "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10")), score = as.numeric(c(3,5,-1,0,1,3,-2,1)))

> dt
   badge location       date score
1:  1001 training 2014-09-21     3
2:  1001     test 2014-10-01     5
3:  1002 training 2014-09-20    -1
4:  1003 training 2014-09-15     0
5:  1003     test 2014-11-01     1
6:  1003     test 2014-12-10     3
7:  1004 training 2014-09-09    -2
8:  1004 training 2014-09-10     1

I have tried different variations of dplyr chains and subset().

dt %>% group_by(badge) %>% filter(location == "test") %>% filter(date == min(date))

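is the closest I have got, since it gives me the earliest test score per badge, but it drops all training records, whether or not that badge has a test score. I can see why this code does not work, since I am asking it to be selective, but I do not know how to make it more nuanced so that it produces the result I want.

I think this is the logic you want: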

library(data.table)
myfunc <- function(x) {
 if (!'test' %in% x$location) {
  out <- setorder(x, -date)
 } else {
  out <- setorder(x, location, date)
 }
 out[1, ]
}

dt[, myfunc(.SD), by = 'badge']
#   badge location       date score
#1:  1003     test 2014-11-01     1
#2:  1001     test 2014-10-01     5
#3:  1002 training 2014-09-20    -1
#4:  1004 training 2014-09-10     1

Another possible solution with dplyr is to use filter, the join functions, and union_all:

library(data.table)
library(dplyr)

dt <- data.table(
  badge = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
  location = c("training", "test", "training", "training", "test", "test", "training", "training"),
  date = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15", "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10")),
  score = as.numeric(c(3,5,-1,0,1,3,-2,1)))

# Rows with badge having both "test" and "training". Data with "test" is preferred
df_test <- dt %>% filter(location == "test") %>%
  inner_join(filter(dt, location == "training"), by = "badge") %>%
  select(badge, location = location.x, date = date.x, score = score.x)

# Data for badge with only "training" records
df_training <- dt %>% filter(location == "training") %>%
  anti_join(filter(dt, location == "test"), by = "badge")

# combine both
union_all(df_test, df_training)

# The result will look like:
> union_all(df_test, df_training)
  badge location       date score
1  1001     test 2014-10-01     5
2  1003     test 2014-11-01     1
3  1003     test 2014-12-10     3
4  1002 training 2014-09-20    -1
5  1004 training 2014-09-09    -2
6  1004 training 2014-09-10     1

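I am not sure whether the OP wants to keep duplicate records for the same location. If duplicate records are not needed, they can be filtered out with distinct.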
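
For comparison, the per-badge rule used by the data.table answers on this page (the earliest "test" record if a badge has one, otherwise the latest record) could probably also be written as a single grouped dplyr filter. The following is only a sketch under that assumed rule, not the answerer's code; the result name `res`, the use of a plain data.frame, and the explicit `tz = "UTC"` are choices made here for illustration.

```r
library(dplyr)

# Sample data from the question, rebuilt as a plain data.frame (assumption:
# a fixed time zone is used so the dates do not depend on the local setting)
dt <- data.frame(
  badge = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
  location = c("training", "test", "training", "training", "test", "test",
               "training", "training"),
  date = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15",
                      "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10"),
                    tz = "UTC"),
  score = c(3, 5, -1, 0, 1, 3, -2, 1)
)

res <- dt %>%
  group_by(badge) %>%
  arrange(date, .by_group = TRUE) %>%   # sort by date within each badge
  filter(
    if (any(location == "test")) {
      # badge has a test record: keep the earliest one
      row_number() == which(location == "test")[1]
    } else {
      # no test record: keep the latest record of the badge
      row_number() == n()
    }
  ) %>%
  ungroup()

res
```

A badge with a single record is kept automatically, because that record is both the first and the last row of its group.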

Here is an alternative solution which orders the data only once, to avoid repeated reordering within each group:

library(data.table)
tmp <- dt[order(date), if (any(location == "test")) 
  first(.I[location == "test"]) else last(.I), keyby = badge]
dt[tmp$V1]

For better explanation I have introduced tmp, although it is not essential. tmp holds the indices of the selected records in V1:
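
Since tmp is described as optional, the same selection can presumably be written without the intermediate table by extracting V1 directly. This is a minimal sketch of that idea using the sample data from the question; the names `idx` and `res` and the explicit `tz = "UTC"` are assumptions added here, not part of the original answer.

```r
library(data.table)

# Sample data from the question (time zone fixed here for reproducibility)
dt <- data.table(
  badge = c("1001", "1001", "1002", "1003", "1003", "1003", "1004", "1004"),
  location = c("training", "test", "training", "training", "test", "test",
               "training", "training"),
  date = as.POSIXct(c("2014-09-21", "2014-10-01", "2014-09-20", "2014-09-15",
                      "2014-11-01", "2014-12-10", "2014-09-09", "2014-09-10"),
                    tz = "UTC"),
  score = as.numeric(c(3, 5, -1, 0, 1, 3, -2, 1))
)

# For each badge, after ordering by date once:
#   - if the badge has any "test" row, take the index of the earliest one
#   - otherwise take the index of the latest row
# .I holds row numbers in the original dt, so idx can be used to subset dt
idx <- dt[order(date),
          if (any(location == "test")) first(.I[location == "test"]) else last(.I),
          keyby = badge]$V1

res <- dt[idx]
res
```

Because `keyby = badge` sorts the groups, the selected rows come back in badge order.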
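
Please check your answer, as it does not return the expected result. In particular, the OP explained that duplicate entries should be treated differently in the test and training cases, so it takes more than simply using distinct().

I chose this answer because it is simple and clear. Thank you very much.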