R decision tree party package prediction error - levels do not match
I am building a CART regression tree model in R with the party package, but when I apply the model to my test dataset I get an error saying that the levels do not match.

I have spent the past week reading forum threads and still cannot find the right solution, so I am asking again here with a made-up example. Can someone help explain the error message and offer a solution?

My training dataset has about 1,000 records and my test dataset about 150. Neither dataset contains NA or blank fields.

My CART model, built with ctree from the party package, is stored in mytree. Example data:

Rate  Bank  Product  Salary
1.5   A     aaa      100000
0.6   B     abc      60000
3     C     bac      10000
2.1   D     cba      50000
1.1   E     cca      80000

You can try rebuilding the factors with comparable levels instead of assigning new levels to the existing factors. Here is an example:
# start the party
library(party)
# create training data sample
# stringsAsFactors = TRUE is needed on R >= 4.0.0, where data.frame() no longer
# converts strings to factors and levels() would return NULL below
data_train <- data.frame(Rate = c(1.5, 0.6, 3, 2.1, 1.1),
                         Bank = c("A", "B", "C", "D", "E"),
                         Product = c("aaa", "abc", "bac", "cba", "cca"),
                         Salary = c(100000, 60000, 10000, 50000, 80000),
                         stringsAsFactors = TRUE)
# create testing data sample
data_test <- data.frame(Rate = c(2.0, 0.5, 0.8, 2.1),
                        Bank = c("A", "D", "E", "C"),
                        Product = c("cba", "cca", "cba", "abc"),
                        Salary = c(80000, 250000, 120000, 65000),
                        stringsAsFactors = TRUE)
# get the union of levels between train and test for Bank and Product
bank_levels <- union(levels(data_test$Bank), levels(data_train$Bank))
product_levels <- union(levels(data_test$Product), levels(data_train$Product))
# rebuild Bank with union of levels
data_test$Bank <- with(data_test, factor(Bank, levels = bank_levels))
data_train$Bank <- with(data_train, factor(Bank, levels = bank_levels))
# rebuild Product with union of levels
data_test$Product <- with(data_test, factor(Product, levels = product_levels))
data_train$Product <- with(data_train, factor(Product, levels = product_levels))
# fit the model
mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_train)
# generate predictions
fit1 <- predict(mytree, newdata = data_test)
> fit1
Rate
[1,] 1.66
[2,] 1.66
[3,] 1.66
[4,] 1.66
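The same rebuild can be applied to every categorical column at once. The sketch below is one possible way to do that; `harmonize_levels` is a hypothetical helper name (not part of party), and it assumes both inputs are plain data frames that share column names.

```r
# Hypothetical helper: rebuild every shared factor/character column in both
# data frames on the union of their levels (train levels first).
harmonize_levels <- function(train, test) {
  all_levels <- function(x) {
    if (is.factor(x)) levels(x) else sort(unique(as.character(x)))
  }
  for (col in intersect(names(train), names(test))) {
    is_cat <- is.factor(train[[col]]) || is.character(train[[col]])
    if (!is_cat) next
    lv <- union(all_levels(train[[col]]), all_levels(test[[col]]))
    # factor(..., levels = lv) keeps the values and only changes the level set
    train[[col]] <- factor(as.character(train[[col]]), levels = lv)
    test[[col]]  <- factor(as.character(test[[col]]),  levels = lv)
  }
  list(train = train, test = test)
}

train <- data.frame(Bank = c("A", "B", "C", "D", "E"),
                    Product = c("aaa", "abc", "bac", "cba", "cca"),
                    stringsAsFactors = TRUE)
test <- data.frame(Bank = c("A", "D", "E", "C"),
                   Product = c("cba", "cca", "cba", "abc"),
                   stringsAsFactors = TRUE)

aligned <- harmonize_levels(train, test)
identical(levels(aligned$train$Product), levels(aligned$test$Product))
```

Note that a level occurring only in the test data was never seen when the tree was fitted, so aligning the level sets makes predict() run but does not give the model any information about that level.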
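I used ctree as the example, but this is really just careful use of factors, so it applies to any algorithm that depends strictly on factor levels (randomForest and so on).

It all comes down to understanding how R stores and uses factor levels. If the test data uses the same factor levels as the training data, in the same order, a pre-trained ctree model can predict on it (yes, even without combining the training data with the test data).

In fact, you do not need to combine the training and test data to predict with the ctree (party) package. This matters because, when you use a pre-trained model in production at run time, you may not have much memory or processing power available. A pre-trained model spares you from rebuilding the model from a large training set in the production environment.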
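As a small illustration of the point above, the sketch below rebuilds a test column on the training levels alone, without any combined data. The vectors are made up for the example; values that never occurred in training become NA, which is one way to spot rows the model cannot score.

```r
train_bank <- factor(c("A", "B", "C", "D", "E"))
test_bank  <- c("A", "D", "E", "Z")   # "Z" never occurs in training (made up)

# Rebuild the test column on the training levels, in the training order
test_bank_aligned <- factor(test_bank, levels = levels(train_bank))

# Same levels, same order as the training factor
identical(levels(test_bank_aligned), levels(train_bank))

# Rows whose value was not seen in training turn into NA
unseen <- is.na(test_bank_aligned) & !is.na(test_bank)
which(unseen)
```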
Step 1: While building the model, store the factor levels of each column in the training data (wherever applicable).
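I am offering a workaround here: when training the model, instead of partitioning the data, I gave the test observations a weight of 0. ctree ignores those observations during training but keeps their factor information.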
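A minimal sketch of that workaround, assuming the made-up data from this thread and the `weights` argument of party's ctree. The variable names `combined`, `w` and `is_test` are illustrative, and the fit is only attempted if party is installed.

```r
data_train <- data.frame(Rate = c(1.5, 0.6, 3, 2.1, 1.1),
                         Bank = c("A", "B", "C", "D", "E"),
                         Product = c("aaa", "abc", "bac", "cba", "cca"),
                         Salary = c(100000, 60000, 10000, 50000, 80000),
                         stringsAsFactors = TRUE)
data_test <- data.frame(Rate = c(2.0, 0.5, 0.8, 2.1),
                        Bank = c("A", "D", "E", "C"),
                        Product = c("cba", "cca", "cba", "abc"),
                        Salary = c(80000, 250000, 120000, 65000),
                        stringsAsFactors = TRUE)

# Stack train and test; rbind() on data frames merges the factor levels
combined <- rbind(data_train, data_test)

# Weight 1 for training rows, weight 0 for test rows
w <- c(rep(1, nrow(data_train)), rep(0, nrow(data_test)))
is_test <- w == 0

if (requireNamespace("party", quietly = TRUE)) {
  library(party)
  # Zero-weight rows are meant to be ignored in fitting (per the answer above)
  mytree <- ctree(Rate ~ Bank + Product + Salary, data = combined, weights = w)
  # The test rows already carry the same factor levels as the fitted data
  fit1 <- predict(mytree, newdata = combined[is_test, ])
}
```

The trade-off is that the test rows must be available at training time, which may not hold in a production setting.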
The attempt from the question, which produces the error:

> is.factor(data_test$Bank)
[1] TRUE
(Made sure Bank and Product are factors in both datasets)
> levels(data_test$Bank) <- union(levels(data_test$Bank), levels(data_train$Bank))
> levels(data_test$Product) <- union(levels(data_test$Product), levels(data_train$Product))
> fit1 <- predict(mytree, newdata = data_test)
Error in checkData(oldData, RET) :
  Levels in factors of new data do not match original data
Bank values in the test data, altered versus original:

Rate  Bank (altered)  Bank (original)
2.0   A               A
0.5   B               D
0.8   C               E
2.1   D               C
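One way values can end up altered like this: assigning to `levels(x)` renames the existing levels by position, whereas `factor(x, levels = ...)` keeps the values and only changes the level set. The sketch below uses a made-up vector to show the difference; it is an illustration, not a reconstruction of the exact table above.

```r
bank <- factor(c("A", "D", "E", "C"))        # levels are sorted: A, C, D, E
train_levels <- c("A", "B", "C", "D", "E")

# Assigning levels renames by position: C -> B, D -> C, E -> D
relabelled <- bank
levels(relabelled) <- train_levels
as.character(relabelled)   # "A" "C" "D" "B"

# Rebuilding the factor keeps every value and adopts the training levels
rebuilt <- factor(as.character(bank), levels = train_levels)
as.character(rebuilt)      # "A" "D" "E" "C"
```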
# Step 1 (training phase): store the factor levels of every character column
# The output directory must exist before write.csv() can write into it
dir.create(file.path(getwd(), 'Output CSVs'), showWarnings = FALSE)
var_list <- colnames(dtrain)
for (var in var_list)
{
  if (is.character(dtrain[, var]))
  {
    print(var)
    # Fill blanks with "None" to keep the factor levels consistent
    dtrain[dtrain[, var] == '', var] <- 'None'
    # Sort the unique values so the level order is deterministic
    col_name_levels <- sort(unique(dtrain[, var]), decreasing = FALSE)
    # Convert to an ordered factor with exactly these levels
    dtrain[, var] <- factor(dtrain[, var], levels = col_name_levels, ordered = TRUE)
    print(levels(dtrain[, var]))
    # This is the trick: store the exact levels in a CSV, which is much easier
    # to load in the prediction phase than the whole training data
    write.csv(levels(dtrain[, var]), paste0(getwd(), '/Output CSVs/', var, '_levels.csv'), row.names = FALSE)
  }
}
# Also store the column names and data types for detecting later
col_name_type_list <- NULL
for (col_name in colnames(dtrain))
{
  abc <- data.frame(col_name = col_name,
                    class_colname = paste(class(dtrain[, col_name]), collapse = ' '))
  col_name_type_list <- rbind(col_name_type_list, abc)
}
# Store for checking later (filepath is a placeholder; set it to your own CSV path)
write.csv(col_name_type_list, filepath, row.names = FALSE)
############### Now in test prediction ###########################
# Read the column list of train data (stored earlier)
col_name_type_list_dtrain <- read.csv(filepath, header = TRUE, stringsAsFactors = FALSE)
for (i in seq_len(nrow(col_name_type_list_dtrain)))
{
  col_name <- col_name_type_list_dtrain$col_name[i]
  class_colname <- col_name_type_list_dtrain$class_colname[i]
  if (class_colname == 'numeric')
  {
    dtest[, col_name] <- as.numeric(dtest[, col_name])
  }
  if (class_colname == 'ordered factor')
  {
    # Read the levels stored for this column during training
    col_name_levels <- read.csv(paste0(getwd(), '/Output CSVs/', col_name, '_levels.csv'),
                                header = TRUE, stringsAsFactors = FALSE)
    factor_check_flag <- TRUE
    col_name_levels <- as.character(col_name_levels$x)
    print(col_name)
    print('Pre-existing levels detected')
    print(NROW(col_name_levels))
    # Drop test rows whose value is not in train; the model can't predict for them
    rows_before_dropping <- nrow(dtest)
    print('Adjusting levels to train......')
    dtest <- dtest[dtest[, col_name] %in% col_name_levels, ]
    rows_after_dropping <- nrow(dtest)
    cat('\nDropped Rows for adjusting ', col_name, ': ', (rows_before_dropping - rows_after_dropping), '\n')
    # Convert to an ordered factor with the training levels, in the training order
    dtest[, col_name] <- factor(dtest[, col_name], levels = col_name_levels, ordered = TRUE)
    print(dtest[, col_name])
  }
}
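As a possible variation on the per-column CSV files above, all levels can be kept in one named list and saved with saveRDS, which preserves the level order as well. This is a sketch with made-up data and a temporary file path, not part of the original answer; `train_levels`, `levels_file` and `stored` are illustrative names.

```r
# Training phase: collect the levels of every factor column in one named list
dtrain <- data.frame(Bank = c("A", "B", "C"),
                     Product = c("aaa", "abc", "bac"),
                     Salary = c(1, 2, 3),
                     stringsAsFactors = TRUE)
is_fac <- vapply(dtrain, is.factor, logical(1))
train_levels <- lapply(dtrain[is_fac], levels)
levels_file <- tempfile(fileext = ".rds")   # placeholder path for the example
saveRDS(train_levels, levels_file)

# Prediction phase: only the small levels file is needed, not the training data
stored <- readRDS(levels_file)
dtest <- data.frame(Bank = c("C", "A"),
                    Product = c("abc", "zzz"),   # "zzz" is unseen in training
                    Salary = c(5, 6),
                    stringsAsFactors = FALSE)
for (col in names(stored)) {
  dtest[[col]] <- factor(as.character(dtest[[col]]), levels = stored[[col]])
}
# Drop rows with a value that was not seen in training (now NA)
dtest <- dtest[complete.cases(dtest[names(stored)]), , drop = FALSE]
```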