R: How to apply a Naive Bayes model to new data

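I asked a question this morning, but I deleted it and am reposting it here with better wording.

I built my first machine learning model using training and test data. I got a confusion matrix back and looked at some summary statistics.

I now want to apply this model to new data to make predictions, but I don't know how.

Background: predicting monthly customer "churn" (cancellations). The target variable is churned, which has two possible labels, "churned" and "not churned".

Here is my training and testing code: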

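library("klaR")
library("caret")

# import data
# stringsAsFactors = TRUE is an assumption added here: since R 4.0, read.csv()
# no longer converts strings to factors by default, and the rest of this
# code treats nvk_medium, org_type and churned as factors
test_data_imp <- read.csv("tdata.csv", stringsAsFactors = TRUE)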

# subset only required vars
# had to remove "revenue" since all churned records are 0 (need last price point)
variables <- c("months_subscription", "nvk_medium", "org_type", "churned")
tdata <- test_data_imp[variables]

#training
rn_train <- sample(nrow(tdata),
                   floor(nrow(tdata)*0.75))
train <- tdata[rn_train,]
test <- tdata[-rn_train,]
model <- NaiveBayes(churned ~., data=train)

# testing
predictions <- predict(model, test)
# caret's confusionMatrix() takes the predictions first, the reference labels second
confusionMatrix(predictions$class, test$churned)

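This is most likely caused by a mismatch between the factor encoding of the training data (variable tdata) and that of the new data passed to the predict function (variable pdata). Typically, the test (new) data contains factor levels that are not present in the training data. You have to enforce the consistency of the feature encoding yourself, because the predict function does not check it. I therefore suggest double-checking the levels of the features nvk_medium and org_type in both variables.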
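A minimal sketch of such a check, using base R only. The helper name compare_factor_levels and the two small example data frames are made up for illustration; they stand in for tdata and pdata and are not taken from the question.

```r
# Hypothetical helper (not part of klaR or caret): for every factor column
# in the training data that also exists in the new data, report whether
# both columns have exactly the same levels in the same order.
compare_factor_levels <- function(train_data, new_data) {
  shared <- intersect(names(train_data), names(new_data))
  is_fac <- vapply(train_data[shared], is.factor, logical(1))
  factor_cols <- shared[is_fac]
  vapply(factor_cols, function(col) {
    identical(levels(train_data[[col]]), levels(new_data[[col]]))
  }, logical(1))
}

# Made-up example data standing in for tdata and pdata
train_example <- data.frame(
  org_type = factor(c("Community", "Sports clubs", "Community")),
  months_subscription = c(26, 8, 30)
)
new_example <- data.frame(
  org_type = factor(c("Community", "Charity")),
  months_subscription = c(19, 16)
)

# FALSE for a column means its levels differ between the two data frames
levels_match <- compare_factor_levels(train_example, new_example)
print(levels_match)

# Levels that occur only in the new data
unseen <- setdiff(levels(new_example$org_type), levels(train_example$org_type))
print(unseen)
```

Any column reported as FALSE is a candidate for the error discussed here, either because the new data has levels the model never saw or because the same levels are stored in a different order.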

The error message:

 Error in object$tables[[v]][, nd] : subscript out of bounds

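is raised while evaluating a given feature (the v-th feature) of a data point, where nd is the numeric value of the factor level for that feature. You are also getting warnings indicating that the posterior probabilities of all classes are zero for data points ("observations") 1, 2 and 3, but it is not clear whether this is also related to the factor encoding.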
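The mechanism can be illustrated without klaR at all. The sketch below assumes the per-feature table has one column per factor level seen during training (the matrix cond_table and its numbers are invented for illustration) and shows what happens when the integer code of a level in the new data points past the last column.

```r
# Stand-in for one per-feature table: rows are classes, columns are the
# factor levels seen during training (assumed layout, invented values)
cond_table <- matrix(
  c(0.2, 0.5, 0.2, 0.1,
    0.4, 0.1, 0.4, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("0", "1"), c("Miss", "Mr", "Mrs", "Nothing"))
)

# New data whose factor declares an extra level, so "Nothing" gets code 5
new_title <- factor("Nothing",
                    levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing"))
nd <- as.integer(new_title)  # integer code of the level in the new data

# Indexing by the integer code fails because the table has only 4 columns
result <- tryCatch(
  cond_table[, nd],
  error = function(e) conditionMessage(e)
)
print(result)
```

The lookup is by position, not by level name, which is why a level that exists in both data sets can still trigger the error if its integer code differs.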

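To reproduce the error you are seeing, consider the following toy data (from http://amunategui.github.io/binary-outcome-modeling/), which has a set of features similar to yours: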

# Data setup
# From http://amunategui.github.io/binary-outcome-modeling/
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep='\t')
titanicDF$Title <- as.factor(ifelse(grepl('Mr ',titanicDF$Name),'Mr',ifelse(grepl('Mrs ',titanicDF$Name),'Mrs',ifelse(grepl('Miss',titanicDF$Name),'Miss','Nothing'))) )
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm=T)
titanicDF$Survived <- as.factor(titanicDF$Survived)
titanicDF <- titanicDF[c('PClass', 'Age',    'Sex',   'Title', 'Survived')]

# Separate into training and test data
inds_train <- sample(1:nrow(titanicDF), round(0.5 * nrow(titanicDF)), replace = FALSE)
Data_train <- titanicDF[inds_train, , drop = FALSE]
Data_test <- titanicDF[-inds_train, , drop = FALSE]

Then everything works as expected:

model <- NaiveBayes(Survived ~ ., data = Data_train)

# This will work
pred_1 <- predict(model, Data_test)

> str(pred_1)
List of 2
 $ class    : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ...
  ..- attr(*, "names")= chr [1:657] "6" "7" "8" "9" ...
 $ posterior: num [1:657, 1:2] 0.8352 0.0216 0.8683 0.0204 0.0435 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:657] "6" "7" "8" "9" ...
  .. ..$ : chr [1:2] "0" "1"

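What does the data in the variable pdata look like? Could you please add the result of head(pdata)?

Hi @tguzella, it is exactly the same as tdata, except that every instance of churned says "not churned" (since I want to predict which ones will churn).

Well, given the error, I am inclined to think the data is not the same as tdata... The error seems to be triggered when processing a feature that is a factor. But unless you show the data, it is basically impossible to know what is going wrong.

Hi @tguzella, I was on my phone earlier so I couldn't add the data. I have now added the head and str of pdata. Any pointers or help would be very welcome.

Thanks a lot, that makes sense. I looked at the medium and org type features and found a long tail of low-count levels, so I grouped them into higher-level categories, reducing the number of distinct values (levels?) to 6. Now everything works as expected! Thanks.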
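The grouping described in the last comment could look roughly like the sketch below. It is base R only; the helper lump_rare_levels, the "Other" label, the cutoff of two levels and the sample values are all assumptions made for illustration, not the asker's actual code.

```r
# Hypothetical helper: keep the n_keep most frequent levels and lump
# everything else into a single catch-all level
lump_rare_levels <- function(x, n_keep = 5, other = "Other") {
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n_keep, length(counts)))]
  x_chr <- as.character(x)
  x_chr[!is.na(x_chr) & !(x_chr %in% keep)] <- other
  factor(x_chr, levels = c(keep, other))
}

# Made-up training values with a tail of rare levels
train_medium <- factor(c("none", "none", "none", "unknown", "unknown",
                         "cloned", "CommunityIcon"))
lumped_train <- lump_rare_levels(train_medium, n_keep = 2)
print(levels(lumped_train))

# Apply the SAME grouping to new data: anything that is not one of the
# kept training levels becomes "Other", and the level order is reused
new_medium <- c("none", "cloned", "brand_new_value")
kept <- setdiff(levels(lumped_train), "Other")
new_chr <- ifelse(new_medium %in% kept, new_medium, "Other")
lumped_new <- factor(new_chr, levels = levels(lumped_train))
print(lumped_new)
```

The important design choice is that the kept levels are decided on the training data only and then reused for the new data, so both factors end up with identical levels.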
> head(pdata)
  months_subscription nvk_medium                                org_type     churned
1                  26       none                               Community not churned
2                   8       none                            Sports clubs not churned
3                  30       none                            Sports clubs not churned
4                  19    unknown Religious congregations and communities not churned
5                  16       none              Association - Professional not churned
6                  10       none              Association - Professional not churned
> str(pdata)
'data.frame':   6433 obs. of  4 variables:
 $ months_subscription: int  26 8 30 19 16 10 3 5 14 2 ...
 $ nvk_medium         : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
 $ org_type           : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
 $ churned            : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...

 Error in object$tables[[v]][, nd] : subscript out of bounds

> str(Data_train)

'data.frame':   656 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 3 3 3 1 1 3 3 3 3 ...
 $ Age     : num  35 28 34 28 29 28 28 28 45 28 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 2 2 1 2 4 3 2 3 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...

> str(Data_test)

'data.frame':   657 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 1 2 3 3 2 3 2 2 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

# Mess things up, by "displacing" the factor values (i.e., 'Nothing' 
# will now be encoded as number 5, which was not present in the 
# training data)
Data_test_2 <- Data_test
Data_test_2$Title <- factor(
    as.character(Data_test_2$Title), 
    levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)

> str(Data_test_2)

'data.frame':   657 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 5 levels "Dr","Miss","Mr",..: 3 2 3 4 4 3 4 3 3 3 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

> pred_2 <- predict(model, Data_test_2)
Error in object$tables[[v]][, nd] : subscript out of bounds
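One way to avoid this, sketched here under the assumption that the training levels are the ones to trust, is to re-encode each factor in the new data using the training levels before calling predict. The vectors below are small made-up stand-ins for the Title column, not the actual Titanic data.

```r
# Training factor with the four levels the model was fitted on
train_title <- factor(c("Mr", "Miss", "Mrs", "Nothing", "Mr"),
                      levels = c("Miss", "Mr", "Mrs", "Nothing"))

# New data encoded with a displaced set of levels (extra "Dr" in front)
new_title <- factor(c("Mr", "Nothing", "Dr"),
                    levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing"))
codes_before <- as.integer(new_title)

# Re-encode by label, not by code, using the training levels.
# Values that were never seen in training become NA.
aligned_title <- factor(as.character(new_title),
                        levels = levels(train_title))
codes_after <- as.integer(aligned_title)

print(codes_before)
print(codes_after)
print(which(is.na(aligned_title)))  # rows that need separate handling
```

Rows that end up as NA carry a level the model has never seen; whether to drop them, map them to an existing level, or retrain with the extra level is a modelling decision that this sketch does not make.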