R data manipulation: range conditions in data.table/dplyr
In R, I am analysing df1, but I also need to pull data from the more detailed records/observations in df2 and attach it to df1 based on specific conditions.
This is sample data comparable to my own:
df1 <- data.frame(id=c(1,2,3,3,3,4,4,5),
location=c("a", "a" , "a", "b" , "b" , "a", "a" ,"a" ),
actiontime=c("2020-03-10" , "2020-02-17" , "2020-04-22" , "2020-04-19" , "2020-04-20" , "2020-04-22" , "2020-03-02" , "2020-05-07" ) )
df2 <- data.frame(id=c(1,1,1, 2,2,2, 3,3,3,3,3,3,3,3,3,3,3, 4,4,4,4,4, 5,5,5) ,
observation=c( "2020-03-09 01:00" , "2020-03-09 10:00" , "2020-03-10 05:00", "2020-02-15 08:00" , "2020-02-16 09:00" , "2020-02-17 08:00", "2020-04-16 14:30", "2020-04-16 07:30" , "2020-04-17 15:00" , "2020-04-25 07:20" , "2020-04-18 10:00" , "2020-04-19 10:30", "2020-04-20 12:00", "2020-04-21 12:00" , "2020-04-22 09:30" , "2020-04-24 23:00", "2020-04-23 17:30", "2020-03-01 08:00" , "2020-03-02 08:00" , "2020-03-03 08:00" , "2020-03-15 16:45" , "2020-03-16 08:00" , "2020-05-05 13:45" , "2020-05-06 08:00" , "2020-05-07 11:00") ,
var1=round(runif(25, min=10, max=60),0) ,
var2=c("Red" , "Blue" , "Yellow" , NA , "Yellow" , "Blue" , "Red" , "Yellow" , NA , NA , "Yellow" , NA , NA , NA , NA , NA , "Blue", NA , "Blue" , "Yellow" , NA , "Blue" , "Yellow" , "Red" , "Blue") )
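There are three questions in there, which I will try to answer one by one.
Question 1
If I understand correctly, the OP wants to identify the highest-ranking colour in var2 for each id and copy that colour into a new column of df1 for the matching ids.
This can be solved by converting var2 to an ordered factor, aggregating df2 by id, and adding the result to df1 with an update join.
The aggregation is straightforward here because every id group contains Blue in var2, so the per-id result is Blue throughout. It is then attached to df1 via the update join.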
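The three steps for question 1 can be sketched on a small toy example. The tables obs and actions and the column color are hypothetical names used only for illustration; they are not the OP's df1 and df2. The level order Blue < Red < Yellow is the one used in the answer's code.

```r
library(data.table)

# Hypothetical toy tables, not the OP's data
actions <- data.table(id = c(1, 2, 2), location = c("a", "a", "b"))
obs <- data.table(
  id   = c(1, 1, 2, 2, 2),
  var2 = c("Red", "Yellow", NA, "Blue", "Yellow")
)

# Step 1: turn var2 into an ordered factor so that min() is meaningful
obs[, var2 := ordered(var2, levels = c("Blue", "Red", "Yellow"))]

# Step 2: aggregate by id, taking the first-ranked non-missing colour
agg <- obs[, .(color = as.character(min(var2, na.rm = TRUE))), by = id]

# Step 3: update join, which adds the colour to every matching row by reference
actions[agg, on = .(id), color := i.color]
actions
```

Every row of actions with the same id receives the same colour, because id is the only join column.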
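Question 2
If I understand correctly, the OP wants to filter df2 so that only those rows are kept where the observation date in df2 is exactly one day before the actiontime in df1 (for the same id). This intermediate result is then processed in the same way as df2 in question 1 above.
The filtering is done by a join operation, but the character date actiontime and the character date-time observation must first be coerced to a numeric date type so that date arithmetic is possible: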
df1[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L]
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]
keep_df2_rows
keep_df2_rows contains the row numbers of those rows in df2 where the observation took place exactly one day before the actiontime in df1 (for the same id).
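As a minimal sketch of this filtering join, the following uses two hypothetical toy tables (obs and actions, not the OP's data) where the matching rows are easy to verify by eye.

```r
library(data.table)

# Hypothetical toy tables, not the OP's data
actions <- data.table(id = c(1, 2),
                      actiontime = c("2020-03-10", "2020-02-17"))
obs <- data.table(
  id = c(1, 1, 2, 2),
  observation = c("2020-03-09 01:00", "2020-03-10 05:00",
                  "2020-02-15 08:00", "2020-02-16 09:00")
)

# Coerce character dates to IDate (integer-backed); the time of day is dropped
actions[, actiontime := as.IDate(actiontime)]

# The day after each observation; it equals actiontime exactly when the
# observation happened one day before the action
obs[, action_day := as.IDate(observation) + 1L]

# nomatch = NULL keeps matches only; which = TRUE returns row numbers of obs
keep <- obs[actions, on = .(id, action_day = actiontime),
            nomatch = NULL, which = TRUE]
keep
obs[keep]
```

Here rows 1 and 4 of obs qualify: the observation on 2020-03-09 precedes the action on 2020-03-10 for id 1, and the observation on 2020-02-16 precedes the action on 2020-02-17 for id 2.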
Now we can reuse the code from question 1, but filter df2 with keep_df2_rows:
df1[df2[keep_df2_rows, min(var2, na.rm = TRUE), by = id]
, on = .(id), color := V1][]
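Please note that the result differs from the sample table posted by the OP because the OP's definition of df2 differs from that sample table.
Also note that I had to modify the computation of agg1 because min(var2, na.rm = TRUE) behaves unexpectedly when an id group consists only of NA. (To reproduce the issue, try min(ordered(NA), na.rm = TRUE) vs min(ordered(NA)).)
Thanks for the hint about set.seed, I will use it next time. This time only var1 is affected by the random numbers, and they do not affect the output/solution. var1 is passive and is included only to show that df2 has other variables that need to be joined to df1 along with the variables used in the conditions.
Thanks for your response. In "df1", apart from "id", I don't understand what you are matching on.
Thanks. It is correct that id is the only matching variable in df1. Really looking forward to episode 2 :)
Thanks! Much appreciated, both your skill and your time!
Thanks for answering all 3 questions :) Very helpful!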
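The all-NA caveat mentioned in the answer can be sidestepped by dropping missing values before aggregating, which is the approach the agg1 line takes. A minimal sketch on hypothetical toy tables (obs, actions, and color are illustration names only):

```r
library(data.table)

# Hypothetical toy data: id 2 has no non-missing colour at all
obs <- data.table(
  id   = c(1, 1, 2, 2),
  var2 = ordered(c("Red", "Blue", NA, NA),
                 levels = c("Blue", "Red", "Yellow"))
)
actions <- data.table(id = c(1, 2))

# Filter out NA first, so min() never sees an all-NA group;
# groups without any colour simply do not appear in agg
agg <- obs[!is.na(var2), .(color = as.character(min(var2))), by = id]

# The update join leaves unmatched ids (here id 2) as NA
actions[agg, on = .(id), color := i.color]
actions
```

With this variant an id without any recorded colour ends up with a missing color rather than depending on how min() treats an empty set of values.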
Classes ‘data.table’ and 'data.frame': 25 obs. of 5 variables:
$ id : num 1 1 1 2 2 2 3 3 3 3 ...
$ observation: chr "2020-03-09 01:00" "2020-03-09 10:00" "2020-03-10 05:00" "2020-02-15 08:00" ...
$ var1 : num 15 58 12 35 11 25 24 54 14 15 ...
$ var2 : Ord.factor w/ 4 levels "Blue"<"Red"<"Yellow"<..: 2 1 3 4 3 1 2 3 4 4 ...
$ action_day : IDate, format: "2020-03-10" "2020-03-10" "2020-03-11" "2020-02-16" ...
- attr(*, ".internal.selfref")=<externalptr>
df2[, min(var2, na.rm = TRUE), by = id]
id V1
1: 1 Blue
2: 2 Blue
3: 3 Blue
4: 4 Blue
5: 5 Blue
df1[df2[, min(var2, na.rm = TRUE), by = id], on = .(id), color := V1][]
id location actiontime color
1: 1 a 2020-03-10 Blue
2: 2 a 2020-02-17 Blue
3: 3 a 2020-04-22 Blue
4: 3 b 2020-04-19 Blue
5: 3 b 2020-04-20 Blue
6: 4 a 2020-04-22 Blue
7: 4 a 2020-03-02 Blue
8: 5 a 2020-05-07 Blue
df1[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L]
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]
keep_df2_rows
[1] 1 2 5 14 11 12 18 24
df1[df2[keep_df2_rows, min(var2, na.rm = TRUE), by = id]
, on = .(id), color := V1][]
id location actiontime color
1: 1 a 2020-03-10 Blue
2: 2 a 2020-02-17 Yellow
3: 3 a 2020-04-22 Yellow
4: 3 b 2020-04-19 Yellow
5: 3 b 2020-04-20 Yellow
6: 4 a 2020-04-22 <NA>
7: 4 a 2020-03-02 <NA>
8: 5 a 2020-05-07 Red
library(data.table)
setDT(df2)[, var2 := ordered(var2, levels = c("Blue", "Red", "Yellow"))]
setDT(df1)[, actiontime := as.IDate(actiontime)]
df2[, action_day := as.IDate(observation) + 1L]
keep_df2_rows <- df2[df1, on = .(id, action_day = actiontime), nomatch = NULL, which = TRUE]
agg1 <- df2[keep_df2_rows][!is.na(var2), min(var2), by = id]
agg2 <- df2[, .(observation = min(observation)), by = id]
lut <- merge(agg1, agg2, by = "id")
df2[lut, on = .(id, observation), color := as.character(V1)][]
id observation var1 var2 action_day color
1: 1 2020-03-09 01:00 23 Red 2020-03-10 Blue
2: 1 2020-03-09 10:00 29 Blue 2020-03-10 <NA>
3: 1 2020-03-10 05:00 39 Yellow 2020-03-11 <NA>
4: 2 2020-02-15 08:00 55 <NA> 2020-02-16 Yellow
5: 2 2020-02-16 09:00 20 Yellow 2020-02-17 <NA>
6: 2 2020-02-17 08:00 55 Blue 2020-02-18 <NA>
7: 3 2020-04-16 14:30 57 Red 2020-04-17 <NA>
8: 3 2020-04-16 07:30 43 Yellow 2020-04-17 Yellow
9: 3 2020-04-17 15:00 41 <NA> 2020-04-18 <NA>
10: 3 2020-04-25 07:20 13 <NA> 2020-04-26 <NA>
11: 3 2020-04-18 10:00 20 Yellow 2020-04-19 <NA>
12: 3 2020-04-19 10:30 19 <NA> 2020-04-20 <NA>
13: 3 2020-04-20 12:00 44 <NA> 2020-04-21 <NA>
14: 3 2020-04-21 12:00 29 <NA> 2020-04-22 <NA>
15: 3 2020-04-22 09:30 48 <NA> 2020-04-23 <NA>
16: 3 2020-04-24 23:00 35 <NA> 2020-04-25 <NA>
17: 3 2020-04-23 17:30 46 Blue 2020-04-24 <NA>
18: 4 2020-03-01 08:00 60 <NA> 2020-03-02 <NA>
19: 4 2020-03-02 08:00 29 Blue 2020-03-03 <NA>
20: 4 2020-03-03 08:00 49 Yellow 2020-03-04 <NA>
21: 4 2020-03-15 16:45 57 <NA> 2020-03-16 <NA>
22: 4 2020-03-16 08:00 21 Blue 2020-03-17 <NA>
23: 5 2020-05-05 13:45 43 Yellow 2020-05-06 Red
24: 5 2020-05-06 08:00 16 Red 2020-05-07 <NA>
25: 5 2020-05-07 11:00 23 Blue 2020-05-08 <NA>
id observation var1 var2 action_day color
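The last listing builds a lookup table keyed on both id and observation, so that the colour lands only on the earliest observation of each id. A minimal sketch of that pattern, assuming a hypothetical per-id result agg_color and a toy table obs (neither is the OP's data):

```r
library(data.table)

# Hypothetical toy data; timestamps are zero-padded strings, so the
# alphabetical minimum is also the earliest timestamp
obs <- data.table(
  id = c(1, 1, 2, 2),
  observation = c("2020-03-09 10:00", "2020-03-09 01:00",
                  "2020-02-15 08:00", "2020-02-16 09:00")
)
# Hypothetical per-id colours, standing in for an aggregate like agg1
agg_color <- data.table(id = c(1, 2), color = c("Blue", "Yellow"))

# Earliest observation per id
first_obs <- obs[, .(observation = min(observation)), by = id]

# Lookup table: one row per id with its colour and its earliest observation
lut <- merge(agg_color, first_obs, by = "id")

# Update join on both keys: only the earliest row of each id gets a colour
obs[lut, on = .(id, observation), color := i.color]
obs
```

All other rows keep NA in color, which matches the shape of the output table above.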