Create a matrix from transaction rows in R (tags: r, data.table)


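I have transaction data from which I need to create an association matrix. I tried a self-join, but nothing seemed to work. Sample code and the desired output are below. I am looking for a solution that uses data.table in R.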

> library(data.table)
> TP <- data.table(
  Tr = c("T1","T1","T2","T2","T2", "T3", "T4"),
  Pr = c("P1","P2","P3","P1","P4", "P2", "P9")
)


> TP
   Tr Pr
1: T1 P1
2: T1 P2
3: T2 P3
4: T2 P1
5: T2 P4
6: T3 P2
7: T4 P9

If possible, I would like to get something like this:

   Pr T1 T2 T3 T4
1: P1  1  1  0  0
2: P2  1  0  1  0
3: P3  0  1  0  0
4: P4  0  1  0  0
5: P9  0  0  0  1

This should work:

dcast(TP, Pr ~ Tr, fun.aggregate = function(x){(length(x) > 0) * 1})

Using 'Pr' as value column. Use 'value.var' to override
   Pr T1 T2 T3 T4
1: P1  1  1  0  0
2: P2  1  0  1  0
3: P3  0  1  0  0
4: P4  0  1  0  0
5: P9  0  0  0  1

If there are no duplicate associations, @David Arenburg's suggestion is cleaner:

dcast(TP, Pr ~ Tr, length)
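
To make the difference between the two aggregators concrete, here is a minimal sketch. It assumes a hypothetical table `TP_dup` (not part of the question) that repeats the (T1, P1) row. `value.var = "Pr"` is passed only to silence the "Using 'Pr' as value column" message.

```r
library(data.table)

# Hypothetical data: the sample table plus a repeated (T1, P1) row
TP_dup <- data.table(
  Tr = c("T1", "T1", "T1", "T2", "T2", "T2", "T3", "T4"),
  Pr = c("P1", "P1", "P2", "P3", "P1", "P4", "P2", "P9")
)

# Counting occurrences: the repeated pair is counted twice
counts <- dcast(TP_dup, Pr ~ Tr, fun.aggregate = length, value.var = "Pr")

# Presence/absence: any positive count collapses to 1
flags <- dcast(TP_dup, Pr ~ Tr,
               fun.aggregate = function(x) (length(x) > 0) * 1,
               value.var = "Pr")

counts[Pr == "P1", T1]  # 2
flags[Pr == "P1", T1]   # 1
```

If the data cannot contain repeated (Tr, Pr) pairs, both calls return the same 0/1 values and `length` is the simpler choice.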

I would use dcast() as David Arenburg suggested, but here is a (quick) alternative, just for fun:

TP[, data.table(unclass(table(Pr, Tr)), keep.rownames = "Pr")]
   Pr T1 T2 T3 T4
1: P1  1  1  0  0
2: P2  1  0  1  0
3: P3  0  1  0  0
4: P4  0  1  0  0
5: P9  0  0  0  1
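
Both approaches above return a data.table with `Pr` as an ordinary column. If an actual matrix object is needed, one possible sketch is to move `Pr` into the row names with the `rownames` argument of data.table's `as.matrix` method. The names `wide` and `m` are illustrative only.

```r
library(data.table)

TP <- data.table(
  Tr = c("T1", "T1", "T2", "T2", "T2", "T3", "T4"),
  Pr = c("P1", "P2", "P3", "P1", "P4", "P2", "P9")
)

# Wide data.table with one row per product
wide <- dcast(TP, Pr ~ Tr, fun.aggregate = length, value.var = "Pr")

# Move the Pr column into the row names to get an integer matrix
m <- as.matrix(wide, rownames = "Pr")

m["P3", "T2"]  # 1
dim(m)         # 5 4
```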

Benchmark:

With millions of rows, dcast() appears to be faster, while table() wins on the smaller table:

TP1 <- data.table(
  Tr = paste0("T", sample(1:10, size = 1e5, replace = TRUE)),
  Pr = paste0("P", sample(1:1e4, size = 1e5, replace = TRUE))
)

TP_huge <- data.table(
  Tr = paste0("T", sample(1:10, size = 1e7, replace = TRUE)),
  Pr = paste0("P", sample(1:1e4, size = 1e7, replace = TRUE))
)

microbenchmark::microbenchmark(
  table1 = TP1[, data.table(unclass(table(Pr, Tr)), keep.rownames = "Pr")],
  dcast1 = dcast(TP1, Pr ~ Tr, length, value.var = "Pr"),
  table_huge = TP_huge[, data.table(unclass(table(Pr, Tr)), keep.rownames = "Pr")],
  dcast_huge = dcast(TP_huge, Pr ~ Tr, length, value.var = "Pr"),
  times = 5
)
Unit: milliseconds
       expr        min        lq      mean    median        uq       max neval  cld
     table1   92.71867  105.8366  127.4707  124.4188  150.0642  164.3155     5 a   
     dcast1  255.53793  271.5194  292.2005  301.4840  302.5010  329.9600     5  b  
 table_huge 1719.83678 1732.1086 1771.0142 1733.8847 1771.5087 1897.7325     5    d
 dcast_huge  917.94755  927.1657  971.4084  986.1038  998.1780 1027.6468     5   c
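
Before comparing timings, it can be worth confirming that the two approaches produce the same counts. The following is a minimal sketch on a smaller, hypothetical table `TP_small`. The seed and sizes are arbitrary, and rows and columns are aligned explicitly in case the two methods order labels differently.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only for reproducibility
TP_small <- data.table(
  Tr = paste0("T", sample(1:10, size = 1e4, replace = TRUE)),
  Pr = paste0("P", sample(1:100, size = 1e4, replace = TRUE))
)

via_table <- TP_small[, data.table(unclass(table(Pr, Tr)), keep.rownames = "Pr")]
via_dcast <- dcast(TP_small, Pr ~ Tr, fun.aggregate = length, value.var = "Pr")

# Align row and column order before comparing
setkey(via_table, Pr)
setkey(via_dcast, Pr)
setcolorder(via_table, names(via_dcast))

# Compare the counts only, ignoring attribute differences
same <- isTRUE(all.equal(
  as.matrix(via_table, rownames = "Pr"),
  as.matrix(via_dcast, rownames = "Pr"),
  check.attributes = FALSE
))
same
```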

I don't really follow the logic of that function. Why can't you just do dcast(TP, Pr ~ Tr, length)?

My thinking was that if TP contains duplicate rows, we still want a value of 1. If there are no duplicates, yours works too.

Possibly this is faster: TP[, setDT(as.data.frame.matrix(table(Pr, Tr)), keep.rownames = "Pr")][]. I opened a new thread to explore that possibility.

I am a bit skeptical that table would be faster than dcast on a big data set. I would like to see a benchmark.

@DavidArenburg You are right. I also tried 100 million rows, and dcast() took about half the time. table() still wins on small data.