Create a matrix from transaction rows in R
I have transaction data from which I need to create an incidence matrix. I tried a self-join, but nothing seemed to work. Sample code and the desired output are below. I am looking for a solution that uses data.table in R.
> TP <- data.table(
Tr = c("T1","T1","T2","T2","T2", "T3", "T4"),
Pr = c("P1","P2","P3","P1","P4", "P2", "P9")
)
> TP
Tr Pr
1: T1 P1
2: T1 P2
3: T2 P3
4: T2 P1
5: T2 P4
6: T3 P2
7: T4 P9
Ideally, I would like to get something like this:
Pr T1 T2 T3 T4
1: P1 1 1 0 0
2: P2 1 0 1 0
3: P3 0 1 0 0
4: P4 0 1 0 0
5: P9 0 0 0 1
This should work:
dcast(TP, Pr ~ Tr, fun.aggregate = function(x){(length(x) > 0) * 1})
Using 'Pr' as value column. Use 'value.var' to override
Pr T1 T2 T3 T4
1: P1 1 1 0 0
2: P2 1 0 1 0
3: P3 0 1 0 0
4: P4 0 1 0 0
5: P9 0 0 0 1
If there are no duplicated associations, @David Arenburg's suggestion is cleaner:
dcast(TP, Pr ~ Tr, length)
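To make the difference between the two calls concrete, here is a minimal sketch on a hypothetical table, TP_dup, in which the pair (T1, P1) appears twice. TP_dup is not part of the original question. The sketch also uses unique() before counting as an alternative way to get 0/1 values; this is a swapped-in variant, not the aggregator from the answer above.

```r
library(data.table)

# Hypothetical data (not from the question): the pair (T1, P1) occurs twice.
TP_dup <- data.table(
  Tr = c("T1", "T1", "T1", "T2"),
  Pr = c("P1", "P1", "P2", "P1")
)

# Plain counts: the duplicated pair shows up as 2.
counts <- dcast(TP_dup, Pr ~ Tr, fun.aggregate = length, value.var = "Pr")

# 0/1 indicator: drop duplicated rows first, then count.
indicator <- dcast(unique(TP_dup), Pr ~ Tr, fun.aggregate = length, value.var = "Pr")

print(counts)
print(indicator)
```

Passing value.var explicitly also silences the "Using 'Pr' as value column" message.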
I would use dcast() as David Arenburg suggested, but here is a (fast) alternative, just for fun:
TP[, data.table(unclass(table(Pr, Tr)), keep.rownames = "Pr")]
Pr T1 T2 T3 T4
1: P1 1 1 0 0
2: P2 1 0 1 0
3: P3 0 1 0 0
4: P4 0 1 0 0
5: P9 0 0 0 1
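Both approaches return a data.table. If a base R matrix is needed, as the title suggests, one possible sketch is to convert the wide result with as.matrix() and use the Pr column as row names. The names wide and m are illustrative only, and the rownames argument assumes a reasonably recent data.table version.

```r
library(data.table)

TP <- data.table(
  Tr = c("T1", "T1", "T2", "T2", "T2", "T3", "T4"),
  Pr = c("P1", "P2", "P3", "P1", "P4", "P2", "P9")
)

# Wide 0/1 table, one row per product and one column per transaction.
wide <- dcast(TP, Pr ~ Tr, fun.aggregate = length, value.var = "Pr")

# Move the Pr column into the row names to get an integer matrix.
m <- as.matrix(wide, rownames = "Pr")

# The same counts straight from base R, for comparison.
m_table <- unclass(table(TP$Pr, TP$Tr))

print(m)
```

The result can then be indexed by name, for example m["P3", "T2"].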
Benchmark:
When dealing with millions of rows, dcast() appears to be faster:
TP1 <- data.table(
Tr = paste0("T", sample(1:10, size = 1e5, replace = TRUE)),
Pr = paste0("P", sample(1:1e4, size = 1e5, replace = TRUE))
)
TP_huge <- data.table(
Tr = paste0("T", sample(1:10, size = 1e7, replace = TRUE)),
Pr = paste0("P", sample(1:1e4, size = 1e7, replace = TRUE))
)
microbenchmark::microbenchmark(
table1 = TP1[, data.table(unclass(table(Pr, Tr)), keep.rownames = "Pr")],
dcast1 = dcast(TP1, Pr ~ Tr, length, value.var = "Pr"),
table_huge = TP_huge[, data.table(unclass(table(Pr, Tr)), keep.rownames = "Pr")],
dcast_huge = dcast(TP_huge, Pr ~ Tr, length, value.var = "Pr"),
times = 5
)
Unit: milliseconds
expr min lq mean median uq max neval cld
table1 92.71867 105.8366 127.4707 124.4188 150.0642 164.3155 5 a
dcast1 255.53793 271.5194 292.2005 301.4840 302.5010 329.9600 5 b
table_huge 1719.83678 1732.1086 1771.0142 1733.8847 1771.5087 1897.7325 5 d
dcast_huge 917.94755 927.1657 971.4084 986.1038 998.1780 1027.6468 5 c
Comments:
I didn't really follow the logic of that function. Why can't you just do dcast(TP, Pr ~ Tr, length)?
My thinking was that if TP contains duplicated rows, we still want a value of 1. If there are no dups, yours works too.
Maybe this is faster: TP[, setDT(as.data.frame.matrix(table(Pr, Tr)), keep.rownames = "Pr")][]
I opened a new question to explore this possibility.
I kind of doubt that table would be faster than dcast on a big data set. I'd love to see a benchmark.
@DavidArenburg You're right. I also tried 100 million rows, and dcast() took about half the time. Still, table() wins on small data.