R: How to programmatically create binary columns based on a categorical variable in a data.table?


I have a large (12 million rows) data.table that looks like this:

library(data.table)
set.seed(123)
dt
    id y
 1:  1 b
 2:  1 d
 3:  1 c
 4:  1 e
 5:  1 e
 6:  2 a
 7:  2 c
 8:  2 e
 9:  2 c
10:  2 c
11:  3 e
12:  3 c
13:  3 d
14:  3 c
15:  3 a

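I want to create a new data.table containing my variable id (which will be the unique key of this new data.table) and 5 other binary variables, one for each category of y, each taking the value 1 if the id has that value of y and 0 otherwise. The output data.table should look like this:

   id a b c d e
1:  1 0 1 1 1 1
2:  2 1 0 1 0 1
3:  3 1 0 1 1 1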

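I tried doing this in a loop, but it is very slow, and I don't know how to pass the binary variable names programmatically, since they depend on the variable I am trying to "split".

EDIT: As @mtoto pointed out, a similar question has already been asked and answered, but the solution there uses the reshape2 package.
I would like to know whether there is another (faster) way to do this, perhaps using the := operator in data.table, since I have a huge dataset and I work with this package a lot.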

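EDIT2: Benchmark of the functions posted by @Arun on my data (~12 million rows, ~3.5 million distinct ids and 490 distinct labels for the y variable, resulting in 490 dummy variables):

system.time(ans1

If you already know the range of rows (in your example you know there are no more than 3) and you know the columns, you can start from an array of zeros and use the apply function to update the values in that auxiliary table.
The function passed to apply could also contain conditions that add the necessary rows and columns as needed.
My R is a bit rusty, so I am reluctant to write it out right now, but I think this is the way to approach the problem.
If you are looking for something more plug-and-play, I found this little blurb: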
There are two sets of methods that are explained below:

gather() and spread() from the tidyr package. This is a newer interface to the reshape2 package.

melt() and dcast() from the reshape2 package.

There are a number of other methods which aren’t covered here, since they are not as easy to use:

The reshape() function, which is confusingly not part of the reshape2 package; it is part of the base install of R.

stack() and unstack()

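From here:
If I were more proficient in R, I would tell you how these different methods handle conflicts when going from long to wide format.
You can also view the same site as above through my personal comment wrapper :p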
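The "start from an array of zeros and update it" idea described at the top of this answer can be sketched roughly as follows. This is only an illustration under assumptions: it uses base R matrix indexing in place of the apply call the answer mentions, the names ids, cats, m and result are made up for the example, and the data is the 15-row sample from the question.

```r
# Sample data copied from the question (ids 1..3, categories a..e)
id <- rep(1:3, each = 5)
y  <- c("b", "d", "c", "e", "e",
        "a", "c", "e", "c", "c",
        "e", "c", "d", "c", "a")

ids  <- sort(unique(id))   # one output row per id
cats <- sort(unique(y))    # one output column per category

# Preallocate an all-zero integer matrix
m <- matrix(0L, nrow = length(ids), ncol = length(cats),
            dimnames = list(NULL, cats))

# Indexing with a two-column (row, column) matrix sets every
# observed (id, y) pair to 1 in a single vectorised assignment
m[cbind(match(id, ids), match(y, cats))] <- 1L

result <- data.frame(id = ids, m)
print(result)
```

Because the category names are taken from unique(y), nothing has to be hard-coded; whether this scales to millions of ids depends on whether the dense matrix fits in memory.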
data.table has its own dcast implementation, which uses data.table's internals and should be fast. Give it a try:
dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L)
#    id a b c d e
# 1:  1 0 1 1 1 1
# 2:  2 1 0 1 0 1
# 3:  3 1 0 1 1 1
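For anyone who wants to run this without the original dt, here is a self-contained variant on the sample data from the question. It is a sketch, not the answer's exact call: it drops repeated (id, y) pairs with unique() first and then aggregates with length, so each cell is either 0 or 1. The name wide is just for illustration.

```r
library(data.table)

# Sample data copied from the question
dt <- data.table(
  id = rep(1:3, each = 5),
  y  = c("b", "d", "c", "e", "e",
         "a", "c", "e", "c", "c",
         "e", "c", "d", "c", "a")
)

# After unique(), every (id, y) pair occurs at most once, so length()
# yields 1 for observed pairs; unobserved pairs are filled with 0L
wide <- dcast(unique(dt), id ~ y,
              fun.aggregate = length, value.var = "y", fill = 0L)
print(wide)
```

The column names come from the values of y, so nothing needs to be passed in by hand.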


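Just thought of another way to handle this: preallocate and update by reference (perhaps dcast's logic should do this to avoid intermediates).

ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]

All that is left is to fill the existing combinations with 1L: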
dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
ans
#    id b d c e a
# 1:  1 1 1 1 1 0
# 2:  2 0 0 1 1 1
# 3:  3 0 1 1 1 1


OK, I went ahead and benchmarked on data of the OP's dimensions, with about 10 million rows and 10 columns.
require(data.table)
set.seed(45L)
y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))

system.time(ans1 <- AnsFunction())   # 2.3s
system.time(ans2 <- dcastFunction()) # 2.2s
system.time(ans3 <- TableFunction()) # 6.2s

setcolorder(ans1, names(ans2))
setcolorder(ans3, names(ans2))
setorder(ans1, id)
setkey(ans2, NULL)
setorder(ans3, id)

identical(ans1, ans2) # TRUE
identical(ans1, ans3) # TRUE

For small datasets the table function seems to be more efficient, but for large datasets dcast seems to be the most efficient and convenient option.
TableFunction <- function(){
    df <- as.data.frame.matrix(table(dt$id, dt$y))
    df[df > 1] <- 1
    df <- cbind(id = as.numeric(row.names(df)), df)
    setDT(df)
}


AnsFunction <- function(){
    ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
    dt[, {set(ans, i=id, j=unique(y), value=1L); NULL}, by=id]
}

dcastFunction <- function(){
    df <-dcast.data.table(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L,value.var = "y")

}

library(data.table)
library(microbenchmark)
set.seed(123)
N = 10000
dt <- data.table(id=rep(1:N, each=5),y=sample(letters[1 : 5], N*5, replace = T)) 


microbenchmark(
    "dcast" = dcastFunction(),
    "Table" = TableFunction(),
    "Ans"   = AnsFunction()
    )


 Unit: milliseconds
  expr       min        lq      mean    median        uq       max neval cld
 dcast  42.48367  45.39793  47.56898  46.83755  49.33388  60.72327   100  b 
 Table  28.32704  28.74579  29.14043  29.00010  29.23320  35.16723   100 a  
   Ans 120.80609 123.95895 127.35880 126.85018 130.12491 156.53289   100   c

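I notice there are similar rows, such as rows four and five. Could you explain the data a bit better? As far as I can tell, if (2 > 0) then data[1][e] = 1, else 0, but that seems a bit strange.

Possibly a duplicate.

@kpie I edited the second data.table, so it should be clearer now: id no. 1 has the distinct y values b, c, d, e, but not a. This explains why its row in the second data.table has a 1 in every column except a.

@mtoto Thank you for your answer. It would solve my problem, but with such massive data I wonder whether there is another way to do the same thing within data.table, perhaps using the := operator.

If you want to use data.table, you can use dcast(): dcast(dt, id ~ y, fun.aggregate = function(x) (length(x) > 0) + 0). You could also consider putting your 1/0 values in a "matrix", possibly a sparse one for a chance of saving some memory: uy = unique(dt$y); m = matrix(0L, max(dt$id), length(uy), dimnames = list(NULL, uy)); m[cbind(dt$id, match(dt$y, uy))] = 1L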
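Your approach looks like exactly what I wanted. However, when I run the code of your second approach on dt, it doesn't work; I get "Empty data.table (0 rows) of 1 col: id".

@helter, could you edit your Q to show a benchmark of the run times of the two approaches above on your original data?

That's no problem at all. I just couldn't do it earlier, and I thought @Tobias's benchmark was sufficient. I have just added the benchmark to the question.

Great, thanks. I plan to improve dcast for the next release. This definitely helps to know how not to improve dcast().

I think the slowest part of TableFunction is table(dt$id, dt$y). In fact, while working with this dataset I noticed that table() is very slow, probably because I have so many ids. So in general I tend to use data.table's .N operator in the j argument, grouped with by=id. Changing that bit in TableFunction might improve performance (?), but I don't know how to get the same output as the first line of TableFunction without table().

I have added a benchmark on larger data to my post.

I am not sure whether you are running the following: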
AnsFunction <- function() {
    ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
    dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
    ans
    # reorder columns outside
}

dcastFunction <- function() {
    # no need to load reshape2. data.table has its own dcast as well
    # no need for setDT
    df <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L,value.var = "y")
}

TableFunction <- function() {
    # need to return integer results for identical results
    # fixed 1 -> 1L; as.numeric -> as.integer
    df <- as.data.frame.matrix(table(dt$id, dt$y))
    df[df > 1L] <- 1L
    df <- cbind(id = as.integer(row.names(df)), df)
    setDT(df)
}
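On the question raised in the comments of how to get the counts from the first line of TableFunction without table(): one possible route is to count with .N and then reshape. The sketch below is an assumption-laden illustration on the question's sample data, not something benchmarked in this thread; counts, wide and cols are made-up names.

```r
library(data.table)

# Sample data copied from the question
dt <- data.table(
  id = rep(1:3, each = 5),
  y  = c("b", "d", "c", "e", "e",
         "a", "c", "e", "c", "c",
         "e", "c", "d", "c", "a")
)

# Count rows per (id, y) pair with .N instead of table()
counts <- dt[, .N, by = .(id, y)]

# Each (id, y) pair is now unique, so dcast needs no aggregation;
# pairs that never occur are filled with 0L
wide <- dcast(counts, id ~ y, value.var = "N", fill = 0L)

# Cap the counts at 1, mirroring df[df > 1L] <- 1L in TableFunction
cols <- setdiff(names(wide), "id")
wide[, (cols) := lapply(.SD, pmin, 1L), .SDcols = cols]
print(wide)
```

Whether this is actually faster than table() on millions of ids would need to be measured on the real data.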


> all(test1 == test2)
[1] TRUE
> all(test1 == test3)
[1] TRUE

y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))

microbenchmark(
    "dcast" = dcastFunction(),
    "Table" = TableFunction(),
    "Ans"   = AnsFunction()
)
Unit: seconds
  expr      min       lq     mean   median       uq      max neval cld
 dcast 1.985969 2.064964 2.189764 2.216138 2.266959 2.643231   100 a  
 Table 5.022388 5.403263 5.605012 5.580228 5.830414 6.318729   100   c
   Ans 2.234636 2.414224 2.586727 2.599156 2.645717 2.982311   100  b
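One of the comments suggests keeping the 1/0 values in a possibly sparse matrix to save memory. A rough sketch of that idea follows, assuming the Matrix package is available (it normally ships with R as a recommended package); the names ids, cats, pairs and m are illustrative only, and the data is the question's sample.

```r
library(Matrix)

# Sample data copied from the question
id <- rep(1:3, each = 5)
y  <- c("b", "d", "c", "e", "e",
        "a", "c", "e", "c", "c",
        "e", "c", "d", "c", "a")

ids  <- sort(unique(id))
cats <- sort(unique(y))

# sparseMatrix() adds up the x values of repeated (i, j) pairs,
# so deduplicate the pairs first to keep every entry at 0 or 1
pairs <- unique(data.frame(i = match(id, ids), j = match(y, cats)))

m <- sparseMatrix(i = pairs$i, j = pairs$j, x = 1,
                  dims = c(length(ids), length(cats)),
                  dimnames = list(as.character(ids), cats))
print(m)
```

Only the observed combinations are stored, which can matter with millions of ids and hundreds of categories; the trade-off is that the result is a matrix rather than a data.table.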