R: How to apply a naive Bayes model to new data

Tags: r, naivebayes

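I asked a question about this earlier this morning, but I deleted it and am reposting it here with better wording.

I created my first machine learning model using training and test data. I got a confusion matrix back and looked at some summary statistics.

I would now like to apply the model to new data to make predictions, but I don't know how.

Context: predicting monthly "churn" cancellations. The target variable is "churned", and it has two possible labels, "churned" and "not churned".

Here is my training and testing code: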

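library("klaR")   # NaiveBayes()
library("caret")  # confusionMatrix()

# import data
# stringsAsFactors = TRUE is needed on R >= 4.0 so that text columns are read as factors
test_data_imp <- read.csv("tdata.csv", stringsAsFactors = TRUE)

# subset only required vars
# had to remove "revenue" since all churned records are 0 (need last price point)
variables <- c("months_subscription", "nvk_medium", "org_type", "churned")
tdata <- test_data_imp[variables]

# training: random 75% / 25% split
rn_train <- sample(nrow(tdata), floor(nrow(tdata) * 0.75))
train <- tdata[rn_train, ]
test <- tdata[-rn_train, ]
model <- NaiveBayes(churned ~ ., data = train)

# testing
predictions <- predict(model, test)
# confusionMatrix() takes the predicted classes first and the true labels second
confusionMatrix(data = predictions$class, reference = test$churned)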

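This is most likely caused by a mismatch between the factor encoding in the training data (the variable tdata) and in the new data passed to the predict function (the variable pdata); typically, the new data contains factor levels that are not present in the training data. You have to enforce consistent encoding of the features yourself, because the predict function does not check it. I therefore suggest double-checking the levels of the features nvk_medium and org_type in both variables.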
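
The check suggested above can be sketched in base R. This is only an illustration: compare_factor_levels is a hypothetical helper (not part of klaR or caret), and tdata_toy and pdata_toy are made-up stand-ins for tdata and pdata.

```r
# Hypothetical helper: for every factor column of the training data, report
# which levels of the new data were never seen in training, and whether both
# data frames use exactly the same levels in the same order.
compare_factor_levels <- function(train, newdata) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  factor_cols <- intersect(factor_cols, names(newdata))
  lapply(setNames(factor_cols, factor_cols), function(col) {
    list(
      unseen_in_training = setdiff(levels(newdata[[col]]), levels(train[[col]])),
      same_encoding = identical(levels(train[[col]]), levels(newdata[[col]]))
    )
  })
}

# Made-up stand-ins for tdata and pdata
tdata_toy <- data.frame(
  org_type = factor(c("Community", "Sports clubs", "Community")),
  nvk_medium = factor(c("none", "unknown", "none"))
)
pdata_toy <- data.frame(
  org_type = factor(c("Community", "Religious", "Sports clubs")),
  nvk_medium = factor(c("none", "none", "unknown"))
)

report <- compare_factor_levels(tdata_toy, pdata_toy)
print(report$org_type)
```

Any column where same_encoding is FALSE deserves a closer look before calling predict on the new data.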

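The error message:

 Error in object$tables[[v]][, nd] : subscript out of bounds

is raised while evaluating a given feature (the v-th feature) of a data point, where nd is the numeric value of the factor corresponding to that feature. You also got warnings indicating that the posterior probabilities for all cases are zero for data points ("observations") 1, 2 and 3, but it is not clear whether that is also related to the factor encoding.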
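
To see why such an index can go out of bounds, here is a small base R sketch that mimics the indexing shown in the error message. It is not klaR's actual code: cond_table and its probabilities are made up, and the assumption is that the per-feature table has one column per factor level seen in training.

```r
# Made-up conditional probability table: rows are classes, columns are the
# factor levels seen in training.
cond_table <- matrix(
  c(0.1, 0.6, 0.2, 0.1,
    0.4, 0.1, 0.4, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("0", "1"), c("Miss", "Mr", "Mrs", "Nothing"))
)

# New data encoded with an extra level "Dr" in front of the others
new_levels <- c("Dr", "Miss", "Mr", "Mrs", "Nothing")
nd <- as.integer(factor("Nothing", levels = new_levels))  # integer code 5

# Indexing a 4-column table with column 5 fails
result <- tryCatch(cond_table[, nd], error = function(e) conditionMessage(e))
print(result)

# A shifted code that stays in range does not fail, but selects the wrong column
nd_mr <- as.integer(factor("Mr", levels = new_levels))  # integer code 3
picked <- colnames(cond_table)[nd_mr]
print(picked)
```

The second case is the more dangerous one: no error is raised, yet the probabilities for a different level are used.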

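To reproduce the error you are seeing, consider the following toy data (from http://amunategui.github.io/binary-outcome-modeling/), which has a set of features similar to those in your data: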

# Data setup
# From http://amunategui.github.io/binary-outcome-modeling/
# stringsAsFactors = TRUE keeps PClass and Sex as factors on R >= 4.0
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep = '\t',
                      stringsAsFactors = TRUE)
# Derive a coarse Title feature from the passenger name
titanicDF$Title <- as.factor(
    ifelse(grepl('Mr ', titanicDF$Name), 'Mr',
    ifelse(grepl('Mrs ', titanicDF$Name), 'Mrs',
    ifelse(grepl('Miss', titanicDF$Name), 'Miss', 'Nothing')))
)
# Impute missing ages with the median
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm = TRUE)
titanicDF$Survived <- as.factor(titanicDF$Survived)
titanicDF <- titanicDF[c('PClass', 'Age', 'Sex', 'Title', 'Survived')]

# Separate into training and test data
inds_train <- sample(1:nrow(titanicDF), round(0.5 * nrow(titanicDF)), replace = FALSE)
Data_train <- titanicDF[inds_train, , drop = FALSE]
Data_test <- titanicDF[-inds_train, , drop = FALSE]

Then everything works as expected:

model <- NaiveBayes(Survived ~ ., data = Data_train)

# This will work
pred_1 <- predict(model, Data_test)

> str(pred_1)
List of 2
 $ class    : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ...
  ..- attr(*, "names")= chr [1:657] "6" "7" "8" "9" ...
 $ posterior: num [1:657, 1:2] 0.8352 0.0216 0.8683 0.0204 0.0435 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:657] "6" "7" "8" "9" ...
  .. ..$ : chr [1:2] "0" "1"

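Comments:

What does the data in the variable pdata look like? Could you please add the result of head(pdata)?

Hi @tguzella, it is exactly the same as tdata, except that every instance of churned says "not churned" (since I want to predict which ones will churn).

Well, given the error, I am inclined to think that the data is not the same as tdata... The error seems to be triggered while processing a feature that is a factor. But if you don't show the data, it is basically impossible to tell what is going wrong.

Hi @tguzella, I was on my phone earlier, so I couldn't add the data. I have now added the head and str of pdata. Any pointers or help are very welcome.

Thank you very much, that makes a lot of sense. I looked at the medium and org type variables and found a long tail of levels with low counts, so I grouped them into higher-level categories, reducing the number of distinct values (levels?) to 6. Now everything works as expected! Thanks.
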
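
The grouping of rare levels described in the last comment could look roughly like the base R sketch below. The asker's actual grouping is not shown in the thread, so this is an assumption: top_levels and lump_levels are hypothetical helpers, the label "Other" is arbitrary, and the data is made up. The levels to keep are chosen on the training data only and then reused for the new data, so both end up with the same encoding.

```r
# Hypothetical helper: names of the n_keep most frequent levels of x
top_levels <- function(x, n_keep) {
  counts <- sort(table(x), decreasing = TRUE)
  names(counts)[seq_len(min(n_keep, length(counts)))]
}

# Hypothetical helper: map every value outside `keep` to `other`, and always
# return a factor with the same, fixed set of levels
lump_levels <- function(x, keep, other = "Other") {
  x_chr <- as.character(x)
  x_chr[!is.na(x_chr) & !(x_chr %in% keep)] <- other
  factor(x_chr, levels = unique(c(keep, other)))
}

# Made-up training column with a long tail of rare levels
train_org <- factor(c(
  rep("Community", 5), rep("Sports clubs", 4), rep("Association", 3),
  "Choir", "Chess club"
))
keep <- top_levels(train_org, n_keep = 3)
train_lumped <- lump_levels(train_org, keep)

# New data containing a level that never occurred in training
new_org <- factor(c("Community", "Religious congregations", "Association"))
new_lumped <- lump_levels(new_org, keep)

print(table(train_lumped))
print(new_lumped)
```

Because both factors are built from the same keep vector, their levels are identical, which is the consistency the answer asks for.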
The head and str of pdata, as added to the question, followed by the error raised when predicting on it:

> head(pdata)
  months_subscription nvk_medium                                org_type     churned
1                  26       none                               Community not churned
2                   8       none                            Sports clubs not churned
3                  30       none                            Sports clubs not churned
4                  19    unknown Religious congregations and communities not churned
5                  16       none              Association - Professional not churned
6                  10       none              Association - Professional not churned

> str(pdata)
'data.frame':   6433 obs. of  4 variables:
 $ months_subscription: int  26 8 30 19 16 10 3 5 14 2 ...
 $ nvk_medium         : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
 $ org_type           : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
 $ churned            : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...

Error in object$tables[[v]][, nd] : subscript out of bounds

In the toy example from the answer, the training and test sets have the same structure, with the same factor levels:

> str(Data_train)
'data.frame':   656 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 3 3 3 1 1 3 3 3 3 ...
 $ Age     : num  35 28 34 28 29 28 28 28 45 28 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 2 2 1 2 4 3 2 3 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...

> str(Data_test)
'data.frame':   657 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 1 2 3 3 2 3 2 2 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

If the factor encoding of the test data is then changed so that it no longer matches the training data, predict fails with the same error:

# Mess things up, by "displacing" the factor values (i.e., 'Nothing'
# will now be encoded as number 5, which was not present in the
# training data)
Data_test_2 <- Data_test
Data_test_2$Title <- factor(
    as.character(Data_test_2$Title),
    levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)

> str(Data_test_2)
'data.frame':   657 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 5 levels "Dr","Miss","Mr",..: 3 2 3 4 4 3 4 3 3 3 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

> pred_2 <- predict(model, Data_test_2)
Error in object$tables[[v]][, nd] : subscript out of bounds
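
One way to avoid this error is to re-encode the factor columns of the new data with the levels of the training data before calling predict. The base R sketch below is an illustration under assumptions: align_factor_levels is a hypothetical helper (not a klaR function), and the two small data frames are made up to mirror the Title column above. Values that never occurred in training become NA and need to be handled separately, for example by grouping them into an existing level.

```r
# Hypothetical helper: rebuild each factor column of `newdata` so that it uses
# exactly the levels (and therefore the integer codes) of the training data.
align_factor_levels <- function(newdata, train) {
  for (col in intersect(names(newdata), names(train))) {
    if (is.factor(train[[col]])) {
      newdata[[col]] <- factor(as.character(newdata[[col]]),
                               levels = levels(train[[col]]))
    }
  }
  newdata
}

# Made-up training data: Title has the levels Miss, Mr, Mrs, Nothing
train_toy <- data.frame(Title = factor(c("Miss", "Mr", "Mrs", "Nothing")))

# Made-up new data with a displaced encoding and an unseen value "Dr"
new_toy <- data.frame(
  Title = factor(c("Mr", "Dr", "Nothing"),
                 levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing"))
)

fixed_toy <- align_factor_levels(new_toy, train_toy)
print(levels(fixed_toy$Title))
print(as.integer(fixed_toy$Title))
```

With the toy example from the answer, the same idea would be align_factor_levels(Data_test_2, Data_train) followed by predict(model, ...) on the result.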