R 如何使用data.table对因子变量进行热编码?
对于那些不熟悉的人来说,一个热编码只是指将一列类别(即一个因子)转换为多列二进制指示符变量,其中每个新列对应于原始列的一个类。这个例子将更好地解释这一点:R 如何使用data.table对因子变量进行热编码?,r,data.table,R,Data.table,对于那些不熟悉的人来说,一个热编码只是指将一列类别(即一个因子)转换为多列二进制指示符变量,其中每个新列对应于原始列的一个类。这个例子将更好地解释这一点: dt <- data.table( ID=1:5, Color=factor(c("green", "red", "red", "blue", "green"), levels=c("blue", "green", "red", "purple")), Shape=factor(c("square", "triangle"
dt <- data.table(
ID=1:5,
Color=factor(c("green", "red", "red", "blue", "green"), levels=c("blue", "green", "red", "purple")),
Shape=factor(c("square", "triangle", "square", "triangle", "cirlce"))
)
dt
ID Color Shape
1: 1 green square
2: 2 red triangle
3: 3 red square
4: 4 blue triangle
5: 5 green cirlce
# one hot encode the colors
color.binarized <- dcast(dt[, list(V1=1, ID, Color)], ID ~ Color, fun=sum, value.var="V1", drop=c(TRUE, FALSE))
# Prepend Color_ in front of each one-hot-encoded feature
setnames(color.binarized, setdiff(colnames(color.binarized), "ID"), paste0("Color_", setdiff(colnames(color.binarized), "ID")))
# one hot encode the shapes
shape.binarized <- dcast(dt[, list(V1=1, ID, Shape)], ID ~ Shape, fun=sum, value.var="V1", drop=c(TRUE, FALSE))
# Prepend Shape_ in front of each one-hot-encoded feature
setnames(shape.binarized, setdiff(colnames(shape.binarized), "ID"), paste0("Shape_", setdiff(colnames(shape.binarized), "ID")))
# Join one-hot tables with original dataset
dt <- dt[color.binarized, on="ID"]
dt <- dt[shape.binarized, on="ID"]
dt
ID Color Shape Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
1: 1 green square 0 1 0 0 0 1 0
2: 2 red triangle 0 0 1 0 0 0 1
3: 3 red square 0 0 1 0 0 1 0
4: 4 blue triangle 1 0 0 0 0 0 1
5: 5 green cirlce 0 1 0 0 1 0 0
我得到了列组合
ID blue_cirlce blue_square blue_triangle green_cirlce green_square green_triangle red_cirlce red_square red_triangle purple_cirlce purple_square purple_triangle
1: 1 0 0 0 0 1 0 0 0 0 0 0 0
2: 2 0 0 0 0 0 0 0 0 1 0 0 0
3: 3 0 0 0 0 0 0 0 1 0 0 0 0
4: 4 0 0 1 0 0 0 0 0 0 0 0 0
5: 5 0 0 0 1 0 0 0 0 0 0 0 0
如果每次都没有人用一种干净的方式手写出来,您可以制作一个函数/宏:
OHE <- function(dt, grp, encodeCols) {
grpSymb = as.symbol(grp)
for (col in encodeCols) {
colSymb = as.symbol(col)
eval(bquote(
dt[, .SD
][, V1 := 1
][, dcast(.SD, .(grpSymb) ~ .(colSymb), fun=sum, value.var='V1')
][, setnames(.SD, setdiff(colnames(.SD), grp), sprintf("%s_%s", col, setdiff(colnames(.SD), grp)))
][, dt <<- dt[.SD, on=grp]
]
))
}
dt
}
dtOHE = OHE(dt, 'ID', c('Color', 'Shape'))
dtOHE
ID Color Shape Color_blue Color_green Color_red Shape_cirlce Shape_square Shape_triangle
1: 1 green square 0 1 0 0 1 0
2: 2 red triangle 0 0 1 0 0 1
3: 3 red square 0 0 1 0 1 0
4: 4 blue triangle 1 0 0 0 0 1
5: 5 green cirlce 0 1 0 1 0 0
OHE使用model.matrix
:
> cbind(dt[, .(ID)], model.matrix(~ Color + Shape, dt))
ID (Intercept) Colorgreen Colorred Colorpurple Shapesquare Shapetriangle
1: 1 1 1 0 0 1 0
2: 2 1 0 1 0 0 1
3: 3 1 0 1 0 1 0
4: 4 1 0 0 0 0 1
5: 5 1 1 0 0 0 0
如果你在做模特,这是最有意义的
如果要抑制截距(并恢复第一个变量的别名列):
给你:
dcast(melt(dt, id.vars='ID'), ID ~ variable + value, fun = length)
# ID Color_blue Color_green Color_red Shape_cirlce Shape_square Shape_triangle
#1: 1 0 1 0 0 1 0
#2: 2 0 0 1 0 0 1
#3: 3 0 0 1 0 1 0
#4: 4 1 0 0 0 0 1
#5: 5 0 1 0 1 0 0
要获取缺少的因素,可以执行以下操作:
res = dcast(melt(dt, id = 'ID', value.factor = T), ID ~ value, drop = F, fun = length)
setnames(res, c("ID", unlist(lapply(2:ncol(dt),
function(i) paste(names(dt)[i], levels(dt[[i]]), sep = "_")))))
res
# ID Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
#1: 1 0 1 0 0 0 1 0
#2: 2 0 0 1 0 0 0 1
#3: 3 0 0 1 0 0 1 0
#4: 4 1 0 0 0 0 0 1
#5: 5 0 1 0 0 1 0 0
以下是eddi解决方案的更一般化版本:
one_hot <- function(dt, cols="auto", dropCols=TRUE, dropUnusedLevels=FALSE){
# One-Hot-Encode unordered factors in a data.table
# If cols = "auto", each unordered factor column in dt will be encoded. (Or specifcy a vector of column names to encode)
# If dropCols=TRUE, the original factor columns are dropped
# If dropUnusedLevels = TRUE, unused factor levels are dropped
# Automatically get the unordered factor columns
if(cols[1] == "auto") cols <- colnames(dt)[which(sapply(dt, function(x) is.factor(x) & !is.ordered(x)))]
# Build tempDT containing and ID column and 'cols' columns
tempDT <- dt[, cols, with=FALSE]
tempDT[, ID := .I]
setcolorder(tempDT, unique(c("ID", colnames(tempDT))))
for(col in cols) set(tempDT, j=col, value=factor(paste(col, tempDT[[col]], sep="_"), levels=paste(col, levels(tempDT[[col]]), sep="_")))
# One-hot-encode
if(dropUnusedLevels == TRUE){
newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = T, fun = length)
} else{
newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = F, fun = length)
}
# Combine binarized columns with the original dataset
result <- cbind(dt, newCols[, !"ID"])
# If dropCols = TRUE, remove the original factor columns
if(dropCols == TRUE){
result <- result[, !cols, with=FALSE]
}
return(result)
}
one\u hot只需几行就可以解决此问题:
library(tidyverse)
dt2 <- spread(dt,Color,Shape)
dt3 <- spread(dt,Shape,Color)
df <- cbind(dt2,dt3)
df2 <- apply(df, 2, function(x){sapply(x, function(y){
ifelse(is.na(y),0,1)
})})
df2 <- as.data.frame(df2)
df <- cbind(dt,df2[,-1])
库(tidyverse)
dt2对于OHE,最好使用稀疏矩阵。@Davidernburg感谢您的快速响应。在生产模型中,我通常会这样做,但当我修补新想法并在小数据集上进行测试时,我喜欢使用data.table,因为它更容易查看/绘图/subsetOk,然后我只需执行dcast(melt(dt,1),ID~值,长度)
。可能在某个地方有一个dupe?model.matrix
matrix::sparse.model.matrix
会更好。Shapecirlce
缺失…?@eddiShapecircle
可以从Shapesquare
和Shapetriangle
的值推断出来。表示n个级别通常需要n-1列。@HongOoi虽然这是事实,但机器学习模型经常使用一种热编码,对列的子集进行随机采样(例如随机森林、梯度增强等)。对于这些模型,通常最好包含所有数据列,因为在获取列子集后无法推断缺少的一列。@Ben我不确定我是否理解您关于使用n列而不是n-1Ah的评论,这看起来非常优雅,但不幸的是缺少紫色(未使用的颜色级别).看来我是越狱了。不幸的是,只有当每个因子列的级别完全不同时,这才有效。不过我很确定我能修好它。你给了我90%的机会。@Ben也许可以从这个开始,然后你以后就不需要重新命名了:newdt=setDT(lapply(1:ncol(dt),function(i)if(is.factor(dt[[i]]){factor(paste(names(dt)[i]],levels(dt[[i]]),sep=“[u”)[dt i]]}否则{dt i[[i]}
你的函数对我来说非常有用,但由于它在n列中转换n个因子级别,因此对于创建对多重共线性敏感的模型是没有用的。是否有一个调整后的函数版本可以为每个因子列生成n-1个伪列?@Constantin否,但您可以在编码后删除其中一个列。
one_hot <- function(dt, cols="auto", dropCols=TRUE, dropUnusedLevels=FALSE){
# One-Hot-Encode unordered factors in a data.table
# If cols = "auto", each unordered factor column in dt will be encoded. (Or specifcy a vector of column names to encode)
# If dropCols=TRUE, the original factor columns are dropped
# If dropUnusedLevels = TRUE, unused factor levels are dropped
# Automatically get the unordered factor columns
if(cols[1] == "auto") cols <- colnames(dt)[which(sapply(dt, function(x) is.factor(x) & !is.ordered(x)))]
# Build tempDT containing and ID column and 'cols' columns
tempDT <- dt[, cols, with=FALSE]
tempDT[, ID := .I]
setcolorder(tempDT, unique(c("ID", colnames(tempDT))))
for(col in cols) set(tempDT, j=col, value=factor(paste(col, tempDT[[col]], sep="_"), levels=paste(col, levels(tempDT[[col]]), sep="_")))
# One-hot-encode
if(dropUnusedLevels == TRUE){
newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = T, fun = length)
} else{
newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = F, fun = length)
}
# Combine binarized columns with the original dataset
result <- cbind(dt, newCols[, !"ID"])
# If dropCols = TRUE, remove the original factor columns
if(dropCols == TRUE){
result <- result[, !cols, with=FALSE]
}
return(result)
}
library(tidyverse)
dt2 <- spread(dt,Color,Shape)
dt3 <- spread(dt,Shape,Color)
df <- cbind(dt2,dt3)
df2 <- apply(df, 2, function(x){sapply(x, function(y){
ifelse(is.na(y),0,1)
})})
df2 <- as.data.frame(df2)
df <- cbind(dt,df2[,-1])