How can I rewrite code from tidyverse to data.table?
I wrote the following code:
df_all <- df1 %>%
  mutate(type = factor(type, levels = df3$type)) %>%
  group_by(ID, date) %>%
  complete(type, fill = list(value = 0)) %>%
  left_join(df3)
I get the data frame from a database, but not every "type" is present in it. Here is a sample of the table:
ID date type value
a1 2020-09-01 enter 18
a1 2020-09-01 close 15
a1 2020-09-02 enter 4
a2 2020-09-01 close 10
b1 2020-09-02 update 10
As you can see, ID a1 has only two types: enter and close. a2 has only close, and b1 has only update.
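For anyone who wants to reproduce this, here is a minimal sketch of the sample data. df1 is copied from the table above. df3 is not shown in the question, so its contents are an assumption reconstructed from the expected output further down; storing date as character is also an assumption.

```r
library(data.table)

# Sample data copied from the table above
df1 <- data.table(
  ID    = c("a1", "a1", "a1", "a2", "b1"),
  date  = c("2020-09-01", "2020-09-01", "2020-09-02", "2020-09-01", "2020-09-02"),
  type  = c("enter", "close", "enter", "close", "update"),
  value = c(18L, 15L, 4L, 10L, 10L)
)

# ASSUMPTION: df3 is not shown in the question; it is inferred from the
# "type" and "comment" columns of the expected output
df3 <- data.table(
  type    = c("enter", "open", "close", "update", "delete"),
  comment = c("used", "used", "used", "used", "not_used")
)
```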
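I want to combine the two tables so that every "type" that is missing from the table gets a value of zero for each ID and date. So how can I combine these two tables to get: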
comment ID date type value
used a1 2020-09-01 enter 18
used a1 2020-09-01 open 0
used a1 2020-09-01 close 15
used a1 2020-09-01 update 0
not_used a1 2020-09-01 delete 0
used a1 2020-09-02 enter 4
used a1 2020-09-02 open 0
used a1 2020-09-02 close 0
used a1 2020-09-02 update 0
not_used a1 2020-09-02 delete 0
used a2 2020-09-01 enter 0
used a2 2020-09-01 open 0
used a2 2020-09-01 close 10
used a2 2020-09-01 update 0
not_used a2 2020-09-01 delete 0
used b1 2020-09-02 enter 0
used b1 2020-09-02 open 0
used b1 2020-09-02 close 0
used b1 2020-09-02 update 10
not_used b1 2020-09-02 delete 0
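As you can see, I also keep the "comment" column. How can I rewrite this code with data.table?

We can convert type to a factor and use CJ (cross join) to expand over ID, date and type: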
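library(data.table)
setDT(df1)[, type := factor(type, levels = unique(df3$type))]
# CJ() builds every combination of ID, date and type level
# (this includes ID/date pairs that do not occur in df1)
df1[, CJ(ID, date, type = levels(type), unique = TRUE)][df1,
    value := i.value, on = .(ID, date, type)][is.na(value),
    value := 0][df3, on = .(type)]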
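To see what CJ() contributes on its own, here is a minimal sketch with made-up vectors (the names ids, types and grid are illustrative only). It returns one row per combination of its arguments, and with unique = TRUE duplicated input values are dropped first.

```r
library(data.table)

ids   <- c("a1", "a1", "a2")   # hypothetical input containing a duplicate
types <- c("enter", "close")

# One row for each combination of the unique IDs and the types
grid <- CJ(ID = ids, type = types, unique = TRUE)
print(grid)
nrow(grid)   # 2 unique IDs x 2 types
```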
Alternatively, split the data by ID and date and expand each group separately:
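setDT(df1)[, type := factor(type, levels = unique(df3$type))]
# expand each ID/date group to all type levels, fill missing values with 0,
# then join df3 inside the group so the rows stay grouped by ID and date
rbindlist(lapply(split(df1, df1[, .(ID, date)], drop = TRUE),
  function(x) x[, CJ(ID, date, type = levels(x$type), unique = TRUE)][x,
    value := value, on = .(ID, date, type)][is.na(value), value := 0][df3, on = .(type)]))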
Output:
# ID date type value comment
# 1: a1 2020-09-01 enter 18 used
# 2: a1 2020-09-01 open 0 used
# 3: a1 2020-09-01 close 15 used
# 4: a1 2020-09-01 update 0 used
# 5: a1 2020-09-01 delete 0 not_used
# 6: a2 2020-09-01 enter 0 used
# 7: a2 2020-09-01 open 0 used
# 8: a2 2020-09-01 close 10 used
# 9: a2 2020-09-01 update 0 used
#10: a2 2020-09-01 delete 0 not_used
#11: a1 2020-09-02 enter 4 used
#12: a1 2020-09-02 open 0 used
#13: a1 2020-09-02 close 0 used
#14: a1 2020-09-02 update 0 used
#15: a1 2020-09-02 delete 0 not_used
#16: b1 2020-09-02 enter 0 used
#17: b1 2020-09-02 open 0 used
#18: b1 2020-09-02 close 0 used
#19: b1 2020-09-02 update 10 used
#20: b1 2020-09-02 delete 0 not_used

Here is a concise data.table version of the OP's code:
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)]
This returns the expected result (for clarity, the NAs in the value column are replaced in a second step, see below).
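For each unique combination of ID and date, the subset of df1's rows is right-joined with df3 on type, which fills in the rows that are missing from each subset. Because a right join is used instead of tidyr::complete(), there is no need to convert type to a factor with all levels. In addition, data.table preserves the row order of df3 during the join.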
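The per-group right join can be looked at in isolation. The sketch below applies it to a single ID/date subset, which is roughly what .SD holds inside by = .(ID, date). df1 follows the sample table in the question; df3 is an assumption inferred from the expected output, and the names one_group and res are illustrative.

```r
library(data.table)

df1 <- data.table(
  ID    = c("a1", "a1", "a1", "a2", "b1"),
  date  = c("2020-09-01", "2020-09-01", "2020-09-02", "2020-09-01", "2020-09-02"),
  type  = c("enter", "close", "enter", "close", "update"),
  value = c(18L, 15L, 4L, 10L, 10L)
)
# ASSUMPTION: df3 reconstructed from the expected output
df3 <- data.table(
  type    = c("enter", "open", "close", "update", "delete"),
  comment = c("used", "used", "used", "used", "not_used")
)

# One ID/date subset without the grouping columns, like .SD in a grouped call
one_group <- df1[ID == "a2" & date == "2020-09-01", .(type, value)]

# Right join: one row per row of df3, in df3's order; types that are
# missing from the subset get NA in value
res <- one_group[df3, on = .(type)]
print(res)
```

Only the close row carries a value here; the other four types come back with NA, which is what the second step then replaces with 0.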
To replace the NAs in the value column, four different approaches are available, all of which return the same result:
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][is.na(value), value := 0L][]
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][, value := fcoalesce(value, 0L)][]
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][, value := nafill(value, fill = 0L)][]
setnafill(setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)], fill = 0L, cols = "value")[]
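As a quick sanity check of the claim that the four variants agree, the sketch below runs each of them on a copy of the joined table and compares the resulting value columns. df1 follows the sample table in the question; df3 is an assumption inferred from the expected output, and the names joined and r1 to r4 are illustrative.

```r
library(data.table)

df1 <- data.table(
  ID    = c("a1", "a1", "a1", "a2", "b1"),
  date  = c("2020-09-01", "2020-09-01", "2020-09-02", "2020-09-01", "2020-09-02"),
  type  = c("enter", "close", "enter", "close", "update"),
  value = c(18L, 15L, 4L, 10L, 10L)
)
# ASSUMPTION: df3 reconstructed from the expected output
df3 <- data.table(
  type    = c("enter", "open", "close", "update", "delete"),
  comment = c("used", "used", "used", "used", "not_used")
)

joined <- df1[, .SD[df3, on = .(type)], by = .(ID, date)]

# copy() keeps the by-reference updates from touching `joined`
r1 <- copy(joined)[is.na(value), value := 0L]
r2 <- copy(joined)[, value := fcoalesce(value, 0L)]
r3 <- copy(joined)[, value := nafill(value, fill = 0L)]
r4 <- setnafill(copy(joined), fill = 0L, cols = "value")

all_same <- identical(r1$value, r2$value) &&
  identical(r1$value, r3$value) &&
  identical(r1$value, r4$value)
print(all_same)
```

Note that fcoalesce() and nafill() are used here with an integer fill (0L) because value is an integer column in this sketch; with a double column, 0 would be the matching fill.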
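
Comments:
It gives an error in `[.data.table`(rbindlist(list(df1,df2))[,: When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
@french_fries It must come from the last step of the CJ. I didn't test it because there was no example. Can you post an example for testing? Thanks.
I added an example and an explanation.
@french_fries Are you sure the dplyr syntax works with rbind? Does your code give the desired result with the sample data frame? Your code has a count column that is not in the example.
@akrun I edited it. This is the first data frame.
@akrun I removed the first rbind part; now it is just df1 from the example.
Is the second example "df1" and the first one "df3"?
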
Output of setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)] before the NAs are replaced:
ID date type value comment
1: a1 2020-09-01 enter 18 used
2: a1 2020-09-01 open NA used
3: a1 2020-09-01 close 15 used
4: a1 2020-09-01 update NA used
5: a1 2020-09-01 delete NA not_used
6: a1 2020-09-02 enter 4 used
7: a1 2020-09-02 open NA used
8: a1 2020-09-02 close NA used
9: a1 2020-09-02 update NA used
10: a1 2020-09-02 delete NA not_used
11: a2 2020-09-01 enter NA used
12: a2 2020-09-01 open NA used
13: a2 2020-09-01 close 10 used
14: a2 2020-09-01 update NA used
15: a2 2020-09-01 delete NA not_used
16: b1 2020-09-02 enter NA used
17: b1 2020-09-02 open NA used
18: b1 2020-09-02 close NA used
19: b1 2020-09-02 update 10 used
20: b1 2020-09-02 delete NA not_used
Output of each of the four NA-replacement variants:
ID date type value comment
1: a1 2020-09-01 enter 18 used
2: a1 2020-09-01 open 0 used
3: a1 2020-09-01 close 15 used
4: a1 2020-09-01 update 0 used
5: a1 2020-09-01 delete 0 not_used
6: a1 2020-09-02 enter 4 used
7: a1 2020-09-02 open 0 used
8: a1 2020-09-02 close 0 used
9: a1 2020-09-02 update 0 used
10: a1 2020-09-02 delete 0 not_used
11: a2 2020-09-01 enter 0 used
12: a2 2020-09-01 open 0 used
13: a2 2020-09-01 close 10 used
14: a2 2020-09-01 update 0 used
15: a2 2020-09-01 delete 0 not_used
16: b1 2020-09-02 enter 0 used
17: b1 2020-09-02 open 0 used
18: b1 2020-09-02 close 0 used
19: b1 2020-09-02 update 10 used
20: b1 2020-09-02 delete 0 not_used