How can I rewrite tidyverse code in data.table?


I wrote this code:

library(dplyr)
library(tidyr)

df_all <- df1 %>%
  mutate(type = factor(type, levels = df3$type)) %>%
  group_by(ID, date) %>%
  complete(type, fill = list(value = 0)) %>%
  left_join(df3)
I get the data frame df1 from a database, but not every "type" occurs in it. Here is a sample of the table:

ID    date            type           value
a1    2020-09-01       enter          18
a1    2020-09-01       close          15
a1    2020-09-02       enter          4
a2    2020-09-01       close          10
b1    2020-09-02       update         10
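As you can see, ID a1 has only two types, enter and close; a2 has only close, and b1 has only update.

I want to combine the two tables so that every "type" that is missing for an ID and date appears with a value of zero. So how do I combine the two tables to get: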

comment            ID    date            type           value
used               a1    2020-09-01       enter          18
used               a1    2020-09-01       open           0
used               a1    2020-09-01       close          15
used               a1    2020-09-01       update         0
not_used           a1    2020-09-01       delete         0
used               a1    2020-09-02       enter          4
used               a1    2020-09-02       open           0
used               a1    2020-09-02       close          0
used               a1    2020-09-02       update         0
not_used           a1    2020-09-02       delete         0
used               a2    2020-09-01       enter          0
used               a2    2020-09-01       open           0
used               a2    2020-09-01       close          10
used               a2    2020-09-01       update         0
not_used           a2    2020-09-01       delete         0
used               b1    2020-09-02       enter          0
used               b1    2020-09-02       open           0
used               b1    2020-09-02       close          0
used               b1    2020-09-02       update         10
not_used           b1    2020-09-02       delete         0

As you can see, I also keep the "comment" column. How can I rewrite this code in data.table?
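For anyone who wants to try the answers, here is a minimal sketch of the sample data. df1 is copied from the table above. df3 is not shown in the question, so the version here is an assumption reconstructed from the type and comment columns of the expected output. Dates are kept as character strings for simplicity.

```r
library(data.table)

# df1: the sample rows from the question
df1 <- data.table(
  ID    = c("a1", "a1", "a1", "a2", "b1"),
  date  = c("2020-09-01", "2020-09-01", "2020-09-02", "2020-09-01", "2020-09-02"),
  type  = c("enter", "close", "enter", "close", "update"),
  value = c(18L, 15L, 4L, 10L, 10L)
)

# df3: ASSUMED lookup table, inferred from the expected output
df3 <- data.table(
  type    = c("enter", "open", "close", "update", "delete"),
  comment = c("used", "used", "used", "used", "not_used")
)

# Types listed in df3 that never occur in df1
missing_types <- setdiff(df3$type, df1$type)
print(missing_types)
```

With this data, open and delete never occur in df1 at all, which matters for any approach that builds the grid from the observed values of type.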

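We can convert type to a factor and use CJ (cross join) to expand by ID, date, and type:

library(data.table)
# convert type to a factor whose levels are all the types listed in df3
setDT(df1)[, type := factor(type, levels = unique(df3$type))]
# cross join ID x date x every level of type (levels(), not just the observed
# values), copy value over from df1, zero-fill the gaps, then attach comment
# note: CJ pairs every ID with every date
df1[, CJ(ID, date, type = levels(type), unique = TRUE)][df1,
    value := value, on = .(ID, date, type)][is.na(value),
     value := 0][df3, on = .(type)]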
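Because CJ pairs every ID with every date, the cross join can produce ID/date pairs that never occur in df1 (a2 on 2020-09-02 in the sample), whereas group_by() followed by complete() only fills in existing groups. One possible way to restrict the grid to existing pairs is sketched below. It is not part of the answer above, and df3 is again an assumed reconstruction from the expected output.

```r
library(data.table)

# Sample data; df3 is ASSUMED (reconstructed from the expected output)
df1 <- data.table(
  ID    = c("a1", "a1", "a1", "a2", "b1"),
  date  = c("2020-09-01", "2020-09-01", "2020-09-02", "2020-09-01", "2020-09-02"),
  type  = c("enter", "close", "enter", "close", "update"),
  value = c(18L, 15L, 4L, 10L, 10L)
)
df3 <- data.table(
  type    = c("enter", "open", "close", "update", "delete"),
  comment = c("used", "used", "used", "used", "not_used")
)

# One full set of df3 rows for each ID/date pair that exists in df1
grid <- unique(df1[, .(ID, date)])[
  , .(type = df3$type, comment = df3$comment), by = .(ID, date)]

# Update join: copy observed values into the grid; unmatched rows get NA
grid[df1, value := i.value, on = .(ID, date, type)]

# Replace the NAs left by the join with zero
grid[is.na(value), value := 0L]
print(grid)
```

The grid has 5 rows per existing ID/date pair, so no extra pairs are introduced.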

Alternatively, do it per group with split:

setDT(df1)[, type := factor(type, levels = unique(df3$type))]
# split by ID/date, expand each piece to all type levels, fill value, join df3
rbindlist(lapply(split(df1, df1[, .(ID, date)], drop = TRUE),
   function(x) x[, CJ(ID, date, type = levels(x$type), unique = TRUE)][x,
     value := value, on = .(ID, date, type)][is.na(value), value := 0][df3, on = .(type)]))
Output:

#    ID       date   type value  comment
# 1: a1 2020-09-01  enter    18     used
# 2: a1 2020-09-01   open     0     used
# 3: a1 2020-09-01  close    15     used
# 4: a1 2020-09-01 update     0     used
# 5: a1 2020-09-01 delete     0 not_used
# 6: a2 2020-09-01  enter     0     used
# 7: a2 2020-09-01   open     0     used
# 8: a2 2020-09-01  close    10     used
# 9: a2 2020-09-01 update     0     used
#10: a2 2020-09-01 delete     0 not_used
#11: a1 2020-09-02  enter     4     used
#12: a1 2020-09-02   open     0     used
#13: a1 2020-09-02  close     0     used
#14: a1 2020-09-02 update     0     used
#15: a1 2020-09-02 delete     0 not_used
#16: b1 2020-09-02  enter     0     used
#17: b1 2020-09-02   open     0     used
#18: b1 2020-09-02  close     0     used
#19: b1 2020-09-02 update    10     used
#20: b1 2020-09-02 delete     0 not_used

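Here is a concise data.table version of the OP's code:

setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)]
It returns the expected result (for clarity, the NAs in the value column are converted in a second step, see below).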

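For each unique combination of ID and date, the subset of df1 rows is right joined with df3 on type, which fills in the rows missing from each subset. Because a right join is used instead of tidyr::complete(), there is no need to convert type to a factor with all factor levels. In addition, data.table preserves the row order of df3 during the join.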
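To see what the right join does inside a single group, the sketch below runs it by hand for one ID/date pair. The sample df1 is from the question; df3 is an assumption reconstructed from the expected output, and the variable names one_group and completed are only for illustration.

```r
library(data.table)

# Sample data; df3 is ASSUMED (reconstructed from the expected output)
df1 <- data.table(
  ID    = c("a1", "a1", "a1", "a2", "b1"),
  date  = c("2020-09-01", "2020-09-01", "2020-09-02", "2020-09-01", "2020-09-02"),
  type  = c("enter", "close", "enter", "close", "update"),
  value = c(18L, 15L, 4L, 10L, 10L)
)
df3 <- data.table(
  type    = c("enter", "open", "close", "update", "delete"),
  comment = c("used", "used", "used", "used", "not_used")
)

# Roughly what .SD holds for the group ID == "a2", date == "2020-09-01"
one_group <- df1[ID == "a2" & date == "2020-09-01", .(type, value)]

# x[i] keeps every row of i (df3), in df3's order; types missing from x get NA
completed <- one_group[df3, on = .(type)]
print(completed)
```

The group has a single row (close), yet the join returns one row per type in df3, with value set to NA wherever the group had no matching row.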

There are 4 different ways to convert the NAs in the value column; they all return the same result:

setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][is.na(value), value := 0L][]
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][, value := fcoalesce(value, 0L)][]
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][, value := nafill(value, fill = 0L)][]
setnafill(setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)], fill = 0L, cols = "value")[]
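The replacement step can be tried in isolation on a plain vector. This is a small sketch with made-up values, assuming value is an integer column (which is why the fill is written as 0L rather than 0). It uses base R's replace() as a stand-in for the `[is.na(value), value := 0L]` form, alongside fcoalesce() and nafill().

```r
library(data.table)

# Illustrative integer vector with gaps (not taken from the real data)
value <- c(18L, NA, 15L, NA)

by_replace   <- replace(value, is.na(value), 0L)  # base R equivalent of the := form
by_fcoalesce <- fcoalesce(value, 0L)              # first non-NA of value and 0L
by_nafill    <- nafill(value, fill = 0L)          # constant fill (default type)

print(by_fcoalesce)
```

setnafill() does the same as nafill() but modifies the named columns of a data.table by reference instead of returning a copy.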

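It gives an error: Error in `[.data.table`(rbindlist(list(df1,df2))[, : When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

@french_fries It must come from the last step of `CJ`. I didn't test it because there was no example. Could you post an example to test with? Thanks.

I added an example and an explanation.

@french_fries Are you sure the dplyr syntax works with rbind? Does your code give the desired result with the example data frames? Your code has a count column, which is not in the example.

@akrun I edited it. That is the first data frame.

@akrun I removed the first rbind part; now it is just df1 from the example.

Is the second example "df1" and the first one "df3"?
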
Result of the concise version, before the NAs are replaced:

    ID       date   type value  comment
 1: a1 2020-09-01  enter    18     used
 2: a1 2020-09-01   open    NA     used
 3: a1 2020-09-01  close    15     used
 4: a1 2020-09-01 update    NA     used
 5: a1 2020-09-01 delete    NA not_used
 6: a1 2020-09-02  enter     4     used
 7: a1 2020-09-02   open    NA     used
 8: a1 2020-09-02  close    NA     used
 9: a1 2020-09-02 update    NA     used
10: a1 2020-09-02 delete    NA not_used
11: a2 2020-09-01  enter    NA     used
12: a2 2020-09-01   open    NA     used
13: a2 2020-09-01  close    10     used
14: a2 2020-09-01 update    NA     used
15: a2 2020-09-01 delete    NA not_used
16: b1 2020-09-02  enter    NA     used
17: b1 2020-09-02   open    NA     used
18: b1 2020-09-02  close    NA     used
19: b1 2020-09-02 update    10     used
20: b1 2020-09-02 delete    NA not_used

Result after replacing the NAs with any of the 4 methods:

    ID       date   type value  comment
 1: a1 2020-09-01  enter    18     used
 2: a1 2020-09-01   open     0     used
 3: a1 2020-09-01  close    15     used
 4: a1 2020-09-01 update     0     used
 5: a1 2020-09-01 delete     0 not_used
 6: a1 2020-09-02  enter     4     used
 7: a1 2020-09-02   open     0     used
 8: a1 2020-09-02  close     0     used
 9: a1 2020-09-02 update     0     used
10: a1 2020-09-02 delete     0 not_used
11: a2 2020-09-01  enter     0     used
12: a2 2020-09-01   open     0     used
13: a2 2020-09-01  close    10     used
14: a2 2020-09-01 update     0     used
15: a2 2020-09-01 delete     0 not_used
16: b1 2020-09-02  enter     0     used
17: b1 2020-09-02   open     0     used
18: b1 2020-09-02  close     0     used
19: b1 2020-09-02 update    10     used
20: b1 2020-09-02 delete     0 not_used