R data manipulation: range conditions in data.table/dplyr

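In R, I am analysing `df1`, but I also need to pull data from the more detailed records/observations in `df2` and attach it to `df1` based on specific conditions.

This is sample data comparable to my own: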

df1 <- data.frame(id=c(1,2,3,3,3,4,4,5),
                  location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ),
                  actiontime=c("2020-03-10" , "2020-02-17" , "2020-04-22" , "2020-04-19" , "2020-04-20" , "2020-04-22" , "2020-03-02" , "2020-05-07" ) )

df2 <- data.frame(id=c(1,1,1, 2,2,2, 3,3,3,3,3,3,3,3,3,3,3, 4,4,4,4,4, 5,5,5) , 
                  observation=c( "2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00",  "2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00",  "2020-04-16 14:30",  "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" , "2020-04-18 10:00" , "2020-04-19 10:30",  "2020-04-20 12:00", "2020-04-21 12:00" , "2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30", "2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" ,  "2020-03-15 16:45" ,  "2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
                  var1=round(runif(25, min=10, max=60),0) ,
                  var2=c("Red" , "Blue" , "Yellow" , NA , "Yellow" , "Blue" , "Red" , "Yellow" , NA , NA , "Yellow" , NA , NA , NA , NA , NA , "Blue", NA , "Blue" , "Yellow" , NA , "Blue" , "Yellow" , "Red" , "Blue") )

There are three questions here, which I will try to answer one by one.

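Question 1: If I understand correctly, the OP wants to identify the highest-ranked colour in `var2` for each `id` and copy that colour into a new column in `df1` for the matching `id`s.

This can be solved by converting `var2` to an ordered factor, aggregating `df2` by `id`, and adding the result to `df1` by means of an update join.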

The result is rather trivial, because every `id` group contains `Blue` in `var2`. It can be appended to `df1` by an update join.
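Question 2: If I understand correctly, the OP wants to filter `df2` so that only those rows are kept where the `observation` date in `df2` is exactly one day before an `actiontime` in `df1` (for the same `id`). This intermediate result is then processed in the same way as `df2` in question 1 above.

The filtering is done by a join operation, but the character dates in `actiontime` and the character date-times in `observation` first need to be coerced to a numeric date type so that date arithmetic is possible: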

df1[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L] 
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]

keep_df2_rows
`keep_df2_rows` contains the row numbers of those rows in `df2` for which the `observation` took place exactly one day before an `actiontime` in `df1` (for the same `id`).
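
To make the mechanics of this filter easier to follow, here is a minimal sketch on toy data. The tables `obs` and `act` are hypothetical stand-ins for `df2` and `df1`. The sketch assumes, as in the code above, that `as.IDate()` keeps only the date part of the date-time string, and it relies on `which = TRUE` returning row numbers of the outer table instead of the joined rows.

```r
library(data.table)

# Hypothetical toy data (stand-ins for df2 and df1)
obs <- data.table(
  id          = c(1, 1, 2, 2),
  observation = c("2020-03-09 01:00", "2020-03-10 05:00",
                  "2020-02-15 08:00", "2020-02-16 09:00")
)
act <- data.table(id = c(1, 2), actiontime = c("2020-03-10", "2020-02-17"))

# Coerce to integer-based dates so that "+ 1L" means "one day later"
act[, actiontime := as.IDate(actiontime)]
obs[, action_day := as.IDate(observation) + 1L]

# Equi-join on id and shifted date; which = TRUE returns row numbers of obs,
# nomatch = NULL drops rows of act without a match
rows <- obs[act, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]

print(rows)
# [1] 1 4
```

Shifting the observation date forward by one day turns the "one day before" range condition into a plain equality join.
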

Now we can use the code from question 1, but with `df2` filtered by `keep_df2_rows`:

df1[df2[keep_df2_rows, min(var2, na.rm = TRUE), by = id]
  , on = .(id), color := V1][]
Output of the final listing, `df2` with the new `color` column:

    id      observation var1   var2 action_day  color
 1:  1 2020-03-09 01:00   23    Red 2020-03-10   Blue
 2:  1 2020-03-09 10:00   29   Blue 2020-03-10   <NA>
 3:  1 2020-03-10 05:00   39 Yellow 2020-03-11   <NA>
 4:  2 2020-02-15 08:00   55   <NA> 2020-02-16 Yellow
 5:  2 2020-02-16 09:00   20 Yellow 2020-02-17   <NA>
 6:  2 2020-02-17 08:00   55   Blue 2020-02-18   <NA>
 7:  3 2020-04-16 14:30   57    Red 2020-04-17   <NA>
 8:  3 2020-04-16 07:30   43 Yellow 2020-04-17 Yellow
 9:  3 2020-04-17 15:00   41   <NA> 2020-04-18   <NA>
10:  3 2020-04-25 07:20   13   <NA> 2020-04-26   <NA>
11:  3 2020-04-18 10:00   20 Yellow 2020-04-19   <NA>
12:  3 2020-04-19 10:30   19   <NA> 2020-04-20   <NA>
13:  3 2020-04-20 12:00   44   <NA> 2020-04-21   <NA>
14:  3 2020-04-21 12:00   29   <NA> 2020-04-22   <NA>
15:  3 2020-04-22 09:30   48   <NA> 2020-04-23   <NA>
16:  3 2020-04-24 23:00   35   <NA> 2020-04-25   <NA>
17:  3 2020-04-23 17:30   46   Blue 2020-04-24   <NA>
18:  4 2020-03-01 08:00   60   <NA> 2020-03-02   <NA>
19:  4 2020-03-02 08:00   29   Blue 2020-03-03   <NA>
20:  4 2020-03-03 08:00   49 Yellow 2020-03-04   <NA>
21:  4 2020-03-15 16:45   57   <NA> 2020-03-16   <NA>
22:  4 2020-03-16 08:00   21   Blue 2020-03-17   <NA>
23:  5 2020-05-05 13:45   43 Yellow 2020-05-06    Red
24:  5 2020-05-06 08:00   16    Red 2020-05-07   <NA>
25:  5 2020-05-07 11:00   23   Blue 2020-05-08   <NA>
    id      observation var1   var2 action_day  color
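Note that the result differs from the sample table posted by the OP, because the OP's definition of `df2` differs from that sample table.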


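Also note that I had to modify the computation of `agg1` because of the unexpected behaviour of `min(var2, na.rm = TRUE)` when an `id` group consists only of `NA`. (To reproduce the issue, try `min(ordered(NA), na.rm = TRUE)` vs `min(ordered(NA))`.)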
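
The workaround mentioned here amounts to dropping `NA` rows before aggregating instead of relying on `na.rm = TRUE`. It can be sketched as follows. The table `dt` and the column `color` are hypothetical. The pattern mirrors the `agg1` line in the final listing, where an `id` group that contains only `NA` simply does not appear in the aggregate.

```r
library(data.table)

# Hypothetical toy data: id 2 has only NA in var2
dt <- data.table(
  id   = c(1, 1, 2),
  var2 = ordered(c("Red", NA, NA), levels = c("Blue", "Red", "Yellow"))
)

# Filter out NA first, then take the minimum per id;
# the all-NA group (id 2) drops out of the result entirely
agg <- dt[!is.na(var2), .(color = min(var2)), by = id]

print(agg)
```

In a subsequent update join, `id`s missing from `agg` are left as `NA` in the target column, which is how the `<NA>` entries for `id` 4 arise in the question 2 result.
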

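Thanx for the hint about set.seed; I will use it next time. This time only var1 is affected by the random numbers, and they do not affect the output/solution. var1 is passive and is included only to show that df2 contains other variables that need to be joined to df1, in addition to the variables used in the conditions.
Thanks for your response. In "df1", I don't see how you are matching on anything other than "id".
Thanks. It is correct that in df1 only id is the matching variable.
Very much looking forward to episode 2 :)
Thanx! Thank you very much for your skills and time!
Thanx for answering all 3 questions :) Very helpful!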
Classes ‘data.table’ and 'data.frame':    25 obs. of  5 variables:
 $ id         : num  1 1 1 2 2 2 3 3 3 3 ...
 $ observation: chr  "2020-03-09 01:00" "2020-03-09 10:00" "2020-03-10 05:00" "2020-02-15 08:00" ...
 $ var1       : num  15 58 12 35 11 25 24 54 14 15 ...
 $ var2       : Ord.factor w/ 4 levels "Blue"<"Red"<"Yellow"<..: 2 1 3 4 3 1 2 3 4 4 ...
 $ action_day : IDate, format: "2020-03-10" "2020-03-10" "2020-03-11" "2020-02-16" ...
 - attr(*, ".internal.selfref")=<externalptr>
df2[, min(var2, na.rm = TRUE), by = id]
   id   V1
1:  1 Blue
2:  2 Blue
3:  3 Blue
4:  4 Blue
5:  5 Blue
df1[df2[, min(var2, na.rm = TRUE), by = id], on = .(id), color := V1][]
   id location actiontime color
1:  1        a 2020-03-10  Blue
2:  2        a 2020-02-17  Blue
3:  3        a 2020-04-22  Blue
4:  3        b 2020-04-19  Blue
5:  3        b 2020-04-20  Blue
6:  4        a 2020-04-22  Blue
7:  4        a 2020-03-02  Blue
8:  5        a 2020-05-07  Blue
df1[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L] 
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]

keep_df2_rows
[1]  1  2  5 14 11 12 18 24
df1[df2[keep_df2_rows, min(var2, na.rm = TRUE), by = id]
  , on = .(id), color := V1][]
   id location actiontime  color
1:  1        a 2020-03-10   Blue
2:  2        a 2020-02-17 Yellow
3:  3        a 2020-04-22 Yellow
4:  3        b 2020-04-19 Yellow
5:  3        b 2020-04-20 Yellow
6:  4        a 2020-04-22   <NA>
7:  4        a 2020-03-02   <NA>
8:  5        a 2020-05-07    Red
library(data.table)
setDT(df2)[, var2 := ordered(var2, levels = c("Blue", "Red", "Yellow"))]
setDT(df1)[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L]
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]

agg1 <- df2[keep_df2_rows][!is.na(var2), min(var2), by = id]
agg2 <- df2[, .(observation = min(observation)), by = id]
lut <- merge(agg1, agg2, by = "id")
df2[lut, on = .(id, observation), color := as.character(V1)][]
    id      observation var1   var2 action_day  color
 1:  1 2020-03-09 01:00   23    Red 2020-03-10   Blue
 2:  1 2020-03-09 10:00   29   Blue 2020-03-10   <NA>
 3:  1 2020-03-10 05:00   39 Yellow 2020-03-11   <NA>
 4:  2 2020-02-15 08:00   55   <NA> 2020-02-16 Yellow
 5:  2 2020-02-16 09:00   20 Yellow 2020-02-17   <NA>
 6:  2 2020-02-17 08:00   55   Blue 2020-02-18   <NA>
 7:  3 2020-04-16 14:30   57    Red 2020-04-17   <NA>
 8:  3 2020-04-16 07:30   43 Yellow 2020-04-17 Yellow
 9:  3 2020-04-17 15:00   41   <NA> 2020-04-18   <NA>
10:  3 2020-04-25 07:20   13   <NA> 2020-04-26   <NA>
11:  3 2020-04-18 10:00   20 Yellow 2020-04-19   <NA>
12:  3 2020-04-19 10:30   19   <NA> 2020-04-20   <NA>
13:  3 2020-04-20 12:00   44   <NA> 2020-04-21   <NA>
14:  3 2020-04-21 12:00   29   <NA> 2020-04-22   <NA>
15:  3 2020-04-22 09:30   48   <NA> 2020-04-23   <NA>
16:  3 2020-04-24 23:00   35   <NA> 2020-04-25   <NA>
17:  3 2020-04-23 17:30   46   Blue 2020-04-24   <NA>
18:  4 2020-03-01 08:00   60   <NA> 2020-03-02   <NA>
19:  4 2020-03-02 08:00   29   Blue 2020-03-03   <NA>
20:  4 2020-03-03 08:00   49 Yellow 2020-03-04   <NA>
21:  4 2020-03-15 16:45   57   <NA> 2020-03-16   <NA>
22:  4 2020-03-16 08:00   21   Blue 2020-03-17   <NA>
23:  5 2020-05-05 13:45   43 Yellow 2020-05-06    Red
24:  5 2020-05-06 08:00   16    Red 2020-05-07   <NA>
25:  5 2020-05-07 11:00   23   Blue 2020-05-08   <NA>
    id      observation var1   var2 action_day  color