R - caret train(): "Error: Stopping" plus "Not all variable names used in object found in newdata"

r machine-learning

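I am trying to build a simple naive Bayes classifier. I want to use all variables as categorical predictors to predict whether a mushroom is edible.

I am using the caret package.

Here is my complete code: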

##################################################################################
# Prepare R and R Studio environment
##################################################################################

# Clear the R studio console
cat("\014")

# Remove objects from environment
rm(list = ls())

# Install and load packages if necessary
if (!require(tidyverse)) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}
if (!require(klaR)) {
  install.packages("klaR")
  library(klaR)
}

#################################

mushrooms <- read.csv("agaricus-lepiota.data", stringsAsFactors = TRUE, header = FALSE)

na.omit(mushrooms)

names(mushrooms) <- c("edibility", "capShape", "capSurface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat")

# convert bruises to a logical variable
mushrooms$bruises <- mushrooms$bruises == 't'

set.seed(1234)
split <- createDataPartition(mushrooms$edibility, p = 0.8, list = FALSE)

train <- mushrooms[split, ]
test <- mushrooms[-split, ]

predictors <- names(train)[2:20] #Create response and predictor data

x <- train[,predictors] #predictors
y <- train$edibility #response

train_control <- trainControl(method = "cv", number = 1) # Set up 1 fold cross validation

edibility_mod1 <- train( #train the model
  x = x,
  y = y,
  method = "nb", 
  trControl = train_control
)

x and y after the script has run:

> str(x)
'data.frame':   6500 obs. of  19 variables:
 $ capShape                : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
 $ capSurface              : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
 $ cap-color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
 $ bruises                 : logi  TRUE TRUE TRUE TRUE FALSE TRUE ...
 $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
 $ gill-attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
 $ gill-spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
 $ gill-size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
 $ gill-color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
 $ stalk-shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
 $ stalk-root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
 $ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ stalk-color-above-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ stalk-color-below-ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ veil-type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
 $ veil-color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
 $ ring-number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
 $ ring-type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...



> str(y)
 Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...

My environment is:

> R.version
               _                           
platform       x86_64-apple-darwin17.0     
arch           x86_64                      
os             darwin17.0                  
system         x86_64, darwin17.0          
status                                     
major          4                           
minor          0.3                         
year           2020                        
month          10                          
day            10                          
svn rev        79318                       
language       R                           
version.string R version 4.0.3 (2020-10-10)
nickname       Bunny-Wunnies Freak Out     
> RStudio.Version()
$citation

To cite RStudio in publications use:

  RStudio Team (2020). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {RStudio: Integrated Development Environment for R},
    author = {{RStudio Team}},
    organization = {RStudio, PBC},
    address = {Boston, MA},
    year = {2020},
    url = {http://www.rstudio.com/},
  }


$mode
[1] "desktop"

$version
[1] ‘1.3.1093’

$release_name
[1] "Apricot Nasturtium"

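What you are trying to do is a bit tricky. Most naive Bayes implementations, or at least the one you are using (from klaR, which is derived from e1071), use the normal distribution. You can see this in the Details section of the naiveBayes help page from e1071: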

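The standard naive Bayes classifier (at least this implementation) assumes independence of the predictor variables, and Gaussian distribution (given the target class) of metric predictors. For attributes with missing values, the corresponding table entries are omitted for prediction.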
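To make the Gaussian assumption for metric predictors concrete, here is a minimal sketch in base R. It uses the built-in iris data rather than the mushroom data, and the helper name posterior is an assumption made for this illustration; it is not the code that klaR or e1071 actually runs.

```r
# Sketch: Gaussian naive Bayes by hand, base R only.
# Uses the built-in iris data; `posterior` is a hypothetical helper.
data(iris)
predictors <- c("Sepal.Length", "Petal.Length")
classes <- levels(iris$Species)

# Class priors P(class), as a named numeric vector
prior <- sapply(classes, function(cl) mean(iris$Species == cl))

# Per-class mean and standard deviation of each metric predictor
stats <- lapply(classes, function(cl) {
  rows <- iris[iris$Species == cl, predictors]
  list(mean = sapply(rows, mean), sd = sapply(rows, sd))
})
names(stats) <- classes

# Posterior for one observation: prior times the product of
# independent Gaussian densities, normalised to sum to 1
posterior <- function(obs) {
  unnorm <- sapply(classes, function(cl) {
    dens <- dnorm(unlist(obs[predictors]), stats[[cl]]$mean, stats[[cl]]$sd)
    prior[[cl]] * prod(dens)
  })
  unnorm / sum(unnorm)
}

new_obs <- data.frame(Sepal.Length = 5.0, Petal.Length = 1.5)
round(posterior(new_obs), 3)
```

The product of densities is where the "naive" independence assumption enters: each predictor contributes its own factor given the class.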

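Your predictors are categorical, so this might be problematic. You can try setting usekernel = TRUE and adjust = 1 to force it towards normal, and avoid usekernel = FALSE, which throws the error.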

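Before that, we remove the column that has only 1 level and sort out the column names. In this case it is also easier to use the formula interface and avoid making dummy variables: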

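df <- train

# veil-type has a single level, so it carries no information; drop it
levels(df[["veil-type"]])
[1] "p"
df[["veil-type"]] <- NULL

# Replace hyphens with underscores so the column names work in a formula
colnames(df) <- gsub("-", "_", colnames(df))

# Hold usekernel and adjust fixed; vary only fL
Grid <- expand.grid(usekernel = TRUE, adjust = 1, fL = c(0.2, 0.5, 0.8))

mod1 <- train(edibility ~ ., data = df,
  method = "nb", trControl = trainControl(method = "cv", number = 5),
  tuneGrid = Grid
)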

 mod1
Naive Bayes 

6500 samples
  21 predictor
   2 classes: 'e', 'p' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 5200, 5200, 5200, 5200, 5200 
Resampling results across tuning parameters:

  fL   Accuracy   Kappa    
  0.2  0.9243077  0.8478624
  0.5  0.9243077  0.8478624
  0.8  0.9243077  0.8478624

Tuning parameter 'usekernel' was held constant at a value of TRUE

Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0.2, usekernel = TRUE and
 adjust = 1.
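The clean-up steps used above (dropping a single-level column and fixing the column names) can be written as a small reusable function. This is a sketch under assumptions: clean_predictors is a hypothetical helper, not part of caret or klaR, it runs on a toy data frame rather than the mushroom data, and it uses make.names(), which replaces hyphens with dots instead of the underscores used above.

```r
# Sketch: generic clean-up of a data frame of categorical predictors.
# `clean_predictors` is a hypothetical helper written for this example.
clean_predictors <- function(df) {
  # Turn logical columns into factors so every predictor has the same type
  for (nm in names(df)[vapply(df, is.logical, logical(1))]) {
    df[[nm]] <- factor(df[[nm]])
  }

  # Drop columns that take a single distinct value (for example veil-type)
  n_distinct <- vapply(df, function(col) length(unique(col)), integer(1))
  df <- df[, n_distinct > 1, drop = FALSE]

  # Make the names syntactic: "cap-color" becomes "cap.color"
  names(df) <- make.names(names(df), unique = TRUE)
  df
}

# Toy data with the same kinds of problems as the mushroom columns
toy <- data.frame(
  edibility   = factor(c("p", "e", "e", "p")),
  `cap-color` = factor(c("n", "y", "w", "w")),
  bruises     = c(TRUE, TRUE, FALSE, TRUE),
  `veil-type` = factor(c("p", "p", "p", "p")),
  check.names = FALSE
)
cleaned <- clean_predictors(toy)
str(cleaned)
```

Applying the same function to both the training and the test data keeps the column names consistent between fitting and prediction.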

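It may be a class imbalance problem in the target variable. Also, I am not sure whether the target variable needs to be a factor? It looks like you are reading it in as text...

Wouldn't I get a more explicit error for class imbalance? Either way, I will look into it. y is a factor; I have updated the question with the output to show that, in case it is useful.

The x and y output in the edited question shows that all x variables are factors except one logical variable. I will check for NAs, good idea.

If my predictors are non-metric, i.e. categorical/nominal/factor, why does the NB algorithm need to use a Gaussian distribution or a nonparametric kernel technique? I am new to this, so please let me know what I am missing. I am now trying the multinomial_naive_bayes() function, which I think may suit me better, but I do not know how to do the post-processing, as described in my other question.

The model needs to evaluate the conditional probability of an observation given the predictor values, and most implementations assume your predictors are Gaussian. You can see this in the linked blog post; the rest of that post explains how it works.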
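On the question of categorical predictors: in general, a naive Bayes classifier can estimate the class-conditional probabilities of factor predictors from frequency tables instead of a density. The sketch below shows the idea in base R with a Laplace correction, which is the role the fL tuning parameter appears to play in caret's "nb" method. The function names and the toy data are assumptions made for this illustration, not the klaR implementation.

```r
# Sketch: naive Bayes for factor predictors using frequency tables.
# `fit_categorical_nb` and `predict_categorical_nb` are hypothetical helpers.
fit_categorical_nb <- function(x, y, laplace = 1) {
  classes <- levels(y)
  prior <- sapply(classes, function(cl) mean(y == cl))
  # One table per predictor: rows are classes, columns are factor levels
  cond <- lapply(x, function(col) {
    counts <- table(y, col) + laplace   # Laplace-smoothed counts
    counts / rowSums(counts)            # P(level | class)
  })
  list(classes = classes, prior = prior, cond = cond)
}

predict_categorical_nb <- function(model, obs) {
  unnorm <- sapply(model$classes, function(cl) {
    # Look up P(observed level | class) for every predictor
    lik <- vapply(names(model$cond), function(v) {
      model$cond[[v]][cl, as.character(obs[[v]])]
    }, numeric(1))
    model$prior[[cl]] * prod(lik)
  })
  unnorm / sum(unnorm)
}

# Toy data, loosely modelled on two of the mushroom columns
toy <- data.frame(
  odor      = factor(c("n", "n", "f", "f", "n", "f")),
  gill_size = factor(c("b", "b", "n", "n", "b", "b"))
)
edibility <- factor(c("e", "e", "p", "p", "e", "p"))

model <- fit_categorical_nb(toy, edibility, laplace = 1)
post <- predict_categorical_nb(model, data.frame(odor = "f", gill_size = "n"))
round(post, 3)
```

Without the correction, a level never seen with a class would get probability zero and wipe out the whole product; adding a constant to every count avoids that.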