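Reduce the number of levels of each factor: dplyr approach
I am trying to reduce the number of levels of each factor variable in my data. I want to do this with two operations:
1. If the number of levels is larger than a cutoff, replace the less frequent levels with a new level until the number of levels reaches the cutoff.
2. Replace the levels of a factor that have too few observations with a new level.
I wrote a function that works fine, but I do not like the code. It does not matter if the REMAIN level itself ends up with too few observations. I would prefer a dplyr approach.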
library(data.table)  # the function relies on data.table's := syntax

ReplaceFactor <- function(data, max_levels, min_values_factor){
  # First make sure that there are not too many levels in a factor
  for(i in colnames(data)){
    if(is.factor(data[[i]])){
      if(length(levels(data[[i]])) > max_levels){
        # Keep the (max_levels - 1) most frequent levels, lump the rest
        levels_keep <- names(sort(table(data[[i]]), decreasing = TRUE))[1:(max_levels - 1)]
        data[!get(i) %in% levels_keep, (i) := "REMAIN"]
        data[[i]] <- as.factor(as.character(data[[i]]))
      }
    }
  }
  # Now make sure that each level has enough observations
  for(i in colnames(data)){
    if(is.factor(data[[i]])){
      if(min(table(data[[i]])) < min_values_factor){
        levels_replace <- table(data[[i]])[table(data[[i]]) < min_values_factor]
        data[get(i) %in% names(levels_replace), (i) := "REMAIN"]
        data[[i]] <- as.factor(as.character(data[[i]]))
      }
    }
  }
  return(data)
}
# stringsAsFactors = TRUE is needed in R >= 4.0 to get factor columns
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E", "E", "E","E","E", "E", "E"),
                 stringsAsFactors = TRUE)
str(df)
'data.frame': 9 obs. of 4 variables:
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
$ B: int 1 2 3 4 5 6 7 8 9
$ C: Factor w/ 4 levels "A","B","C","D": 1 1 2 2 3 3 3 4 4
$ D: Factor w/ 3 levels "A","B","E": 1 2 3 3 3 3 3 3 3
dt2 <- ReplaceFactor(data = data.table(df),
max_levels = 3,
min_values_factor = 2)
str(dt2)
Classes ‘data.table’ and 'data.frame': 9 obs. of 4 variables:
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
$ B: int 1 2 3 4 5 6 7 8 9
$ C: Factor w/ 3 levels "A","C","REMAIN": 1 1 3 3 2 2 2 3 3
$ D: Factor w/ 2 levels "E","REMAIN": 2 2 1 1 1 1 1 1 1
- attr(*, ".internal.selfref")=<externalptr>
dt2
A B C D
1: A 1 A REMAIN
2: A 2 A REMAIN
3: B 3 REMAIN E
4: B 4 REMAIN E
5: C 5 C E
6: C 6 C E
7: C 7 C E
8: C 8 REMAIN E
9: C 9 REMAIN E
An approach using forcats:
library(dplyr)
library(forcats)
max_levels <- 3
min_values_factor <- 2
df %>%
  mutate_if(is.factor, fct_lump, n = max_levels,
            other_level = "REMAIN", ties.method = "first") %>%
  mutate_if(is.factor, fct_lump, prop = (min_values_factor - 1) / nrow(.),
            other_level = "REMAIN")
# A B C D
# 1 A 1 A REMAIN
# 2 A 2 A REMAIN
# 3 B 3 B E
# 4 B 4 B E
# 5 C 5 C E
# 6 C 6 C E
# 7 C 7 C E
# 8 C 8 REMAIN E
# 9 C 9 REMAIN E
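The same pipeline can probably be written with across(), which supersedes mutate_if() in dplyr 1.0.0 and later, and with the more specific helpers fct_lump_n() and fct_lump_min() from forcats 0.5.0 and later. This is only a sketch under those version assumptions. fct_lump_min() lumps levels that occur fewer than min times, so it maps directly onto min_values_factor. How ties and a single leftover level are treated in the first step may differ between forcats versions, so check the result on your own data.

```r
library(dplyr)
library(forcats)

# Example data from the question (stringsAsFactors = TRUE for R >= 4.0)
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E","E","E","E","E","E","E"),
                 stringsAsFactors = TRUE)

max_levels <- 3
min_values_factor <- 2

res <- df %>%
  # Step 1: keep at most max_levels of the most frequent levels
  mutate(across(where(is.factor),
                ~ fct_lump_n(.x, n = max_levels, other_level = "REMAIN",
                             ties.method = "first"))) %>%
  # Step 2: lump levels with fewer than min_values_factor observations
  mutate(across(where(is.factor),
                ~ fct_lump_min(.x, min = min_values_factor,
                               other_level = "REMAIN")))

str(res)
```

Using a count threshold in the second step avoids converting min_values_factor into a proportion of nrow(.).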
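(I could not reproduce the exact behavior of your function, but by adjusting ties.method and subtracting 1 from max_levels you may get the result you want.) I suggest taking a look at the forcats package, which has good functions for this kind of task; fct_lump, for example, may help.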
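If the exact behavior of the original function matters, one option is to keep its logic in a small per-column helper and let dplyr apply it to every factor column. The sketch below is an illustration, not part of forcats or dplyr: the helper name lump_like_question and its arguments are made up here, and it assumes dplyr 1.0.0 or later for across().

```r
library(dplyr)

# Hypothetical helper: mirrors the two steps of ReplaceFactor for one factor
lump_like_question <- function(f, max_levels, min_count, other = "REMAIN") {
  x <- as.character(f)
  # Step 1: if there are too many levels, keep the (max_levels - 1) most
  # frequent ones; ties keep the original level order, as in the question
  if (nlevels(f) > max_levels) {
    counts <- sort(table(f), decreasing = TRUE)
    keep <- names(counts)[seq_len(max_levels - 1)]
    x[!x %in% keep] <- other
  }
  # Step 2: replace levels that have fewer than min_count observations
  counts <- table(x)
  rare <- names(counts)[as.vector(counts) < min_count]
  x[x %in% rare] <- other
  factor(x)
}

# Example data from the question (stringsAsFactors = TRUE for R >= 4.0)
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E","E","E","E","E","E","E"),
                 stringsAsFactors = TRUE)

res <- df %>%
  mutate(across(where(is.factor),
                ~ lump_like_question(.x, max_levels = 3, min_count = 2)))

str(res)
```

On the example data this should give the same factor levels as the data.table version in the question: C becomes "A", "C", "REMAIN" and D becomes "E", "REMAIN".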