Problem training a Naive Bayes model in R (tags: r, r-caret)


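I am using the caret package (I do not have much experience with caret) to train a Naive Bayes model on my data, as outlined in the R code below. I run into a problem when the sentences are included and I execute "nb_model", because it produces a series of error messages: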

1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in 
predict.NaiveBayes(modelFit, newdata) : 
Not all variable names used in object found in newdata

2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in 
NaiveBayes.default(x, y, usekernel = FALSE, fL = param$fL, ...) :

Could you please advise on how to adjust the R code below to overcome this problem?

A quick sample of what the dataset looks like (10 variables):

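The dataset is all character data. Within that data there is a combination of easily encoded words (V2 - V10) and sentences, to which you could apply any amount of feature engineering and from which you could generate any number of features.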
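As a rough illustration of what "easily encoded" could mean here, the sketch below converts character labels into factors that share one fixed set of levels. The toy data frame and its values are invented for the example; only the column names V2 and V10 and the labels Positive, Negative, and Neutral come from the question.

```r
# Toy stand-in for two of the label columns (values are invented)
toy <- data.frame(
  V2  = c("Positive", "Negative", "Neutral", "Positive"),
  V10 = c("Positive", "Neutral", "Neutral", "Negative"),
  stringsAsFactors = FALSE
)

# Use one shared set of levels so every column is encoded the same way
sentiment_levels <- c("Negative", "Neutral", "Positive")
toy[] <- lapply(toy, factor, levels = sentiment_levels)

str(toy)
```

Fixing the levels explicitly means a label that happens to be absent from one column still exists as a level there, which keeps the columns comparable with each other.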

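To read up on text mining, check out the tm package, its documentation, or blogs with practical examples. Here is some material from the linked article.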

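First, I set stringsAsFactors = FALSE, because your V1 column contains a large number of unique sentences, and reading them as a factor would create one level per sentence.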

TrainSet <- read.csv(url("https://raw.githubusercontent.com/jcool12/dataset/master/textsentiment.csv?token=AA4LAP5VXI6I7FRKMT6HDPK6U5XBY"),
                     header = FALSE,
                     stringsAsFactors = FALSE)

library(caret)
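To see why that setting matters, here is a minimal sketch comparing the two ways a sentence column can be stored. The sentences are invented and merely stand in for V1.

```r
# Invented sentences standing in for the V1 column
sentences <- c("I love this city", "The service was poor", "Nothing special")

as_factor <- data.frame(V1 = sentences, stringsAsFactors = TRUE)
as_text   <- data.frame(V1 = sentences, stringsAsFactors = FALSE)

# One factor level per unique sentence: not a useful categorical variable
n_levels <- nlevels(as_factor$V1)
n_levels

# Plain character strings, ready for text processing
text_class <- class(as_text$V1)
text_class
```

With the real data the factor version would have roughly as many levels as there are rows, which is why the sentences are better kept as text and turned into features afterwards.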

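In this basic example you will get a few warnings that should not be ignored, simply because so few of the sentences in V1 contain the word "london". I would suggest using that column for things like sentiment analysis, term frequency / inverse document frequency (TF-IDF), and so on.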
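As a sketch of the kind of features that could be derived from the sentence column, the snippet below builds a few simple hand-made ones in base R. These are not TF-IDF and do not use the tm package; the sentences, the feature names, and the small negation word list are all invented for illustration.

```r
# Invented sentences standing in for the V1 column
sentences <- c(
  "London is lovely in spring",
  "The train to london was late again",
  "I would not recommend this hotel"
)

# Lowercase and split on whitespace to get the words of each sentence
words <- strsplit(tolower(sentences), "\\s+")

# Hypothetical word list used for the negation flag
negations <- c("not", "no", "never")

features <- data.frame(
  n_words      = lengths(words),
  n_chars      = nchar(sentences),
  has_london   = grepl("london", sentences, ignore.case = TRUE),
  has_negation = vapply(words, function(w) any(w %in% negations), logical(1))
)

features
```

Note that the keyword test here ignores case, which is a deliberate difference from a plain grepl("london", ...) call: a case-sensitive match would miss "London" at the start of a sentence.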

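This looks like a decent question, but please keep in mind that we do not use personal cloud links on Stack Overflow, because they lead to viruses and broken links. If the data is available from a package or a major website, use that; otherwise please write pseudo-data for the example. That being said, this error: "Not all variable names used in object found in newdata" means what it says. You have some variables in your training data that are missing from the new data. I think this often happens if you accidentally include the dependent variable as a predictor in the training data.

@Hack-R I will make sure to provide pseudo-data in the future. I have tried several ways to overcome this problem, but I have not found a solution. Please help using the R code above.

@Hack-R I have changed the link to GitHub, so importing the dataset into R should be easier, and I have provided a sample of what the dataset looks like above.

@Hack-R OK, thank you very much for your help.

@Hack-R OK, good to know. Thanks again.

Thank you for the detailed answer, it is very helpful for understanding what I need to do going forward. The sentences in the first column were classified as positive, negative, or neutral by a series of sentiment lexicons. I want to compare the manual classification (V10) with the lexicon results for positive, negative, and neutral. My understanding is that I need to include the sentences when making this comparison. Do you think I am right to do so? I hope that makes sense.

@jr134 Sure. In that case you may want to drop the original V1 column entirely?

So, if I understand correctly, I do not need to use the sentences at all and can just run V2-V10? I thought that leaving the sentences out would give an inaccurate picture, or have I misunderstood?
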
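To make the comment about missing variables concrete, here is a small sketch of how the predictor names seen at training time can differ from those in new data once a text column is expanded into dummy variables. The toy data frames are invented, and caret's internal handling may well differ; this only illustrates the kind of mismatch the error message describes and a setdiff() check for it.

```r
# Invented training and new data with partly different sentences
train_toy <- data.frame(V1 = c("good food", "bad service", "good food"),
                        stringsAsFactors = TRUE)
new_toy   <- data.frame(V1 = c("nice view", "bad service"),
                        stringsAsFactors = TRUE)

# Dummy-variable columns generated from each data set
train_cols <- colnames(model.matrix(~ V1 - 1, data = train_toy))
new_cols   <- colnames(model.matrix(~ V1 - 1, data = new_toy))

# Variables the model saw in training that are absent from the new data
missing_in_new <- setdiff(train_cols, new_cols)
missing_in_new
```

Running the same kind of check on the column names of the real training and test sets is a quick way to find out which variables a predict call is complaining about.
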
library(caret)

# Loading dataset
setwd("directory/path")
TrainSet = read.csv("textsent.csv", header = FALSE)

# Specifying an 80-20 train-test split
# Creating the training and testing sets
train = TrainSet[1:1200, ]
test = TrainSet[1201:1500, ]

# Setting up the resampling scheme with trainControl
train_ctrl = trainControl(
  method  = "cv", # Specifying cross-validation
  number  = 3     # Specifying 3-fold
)

nb_model = train(
  V10 ~ .,        # Specifying the response variable and the feature variables
  method = "nb",  # Specifying the model to use
  data = train,
  trControl = train_ctrl
)

# Get the predictions of your model in the test set
predictions = predict(nb_model, newdata = test)

# See the confusion matrix of your model in the test set
confusionMatrix(predictions, test$V10)
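
The split above takes the first 1200 rows in file order, which can bias the results if the rows happen to be sorted in some way. A hedged alternative is to shuffle the row indices first. The sketch below uses an invented data frame with the same number of rows as the indices in the question imply, and the seed value is arbitrary.

```r
set.seed(42)  # arbitrary seed, only so the split is reproducible

# Invented stand-in with 1500 rows, matching the row indices in the question
toy <- data.frame(id = 1:1500)

# Sample 80% of the row indices at random for training
n_train   <- round(0.8 * nrow(toy))
train_idx <- sample(nrow(toy), size = n_train)

train_toy <- toy[train_idx, , drop = FALSE]
test_toy  <- toy[-train_idx, , drop = FALSE]

c(train = nrow(train_toy), test = nrow(test_toy))
```

caret also provides createDataPartition(), which can produce a split that preserves the class proportions of the response; that may be preferable when some sentiment classes are rare.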

# Read everything as plain character columns (the sentences in V1 are not categories)
TrainSet <- read.csv(url("https://raw.githubusercontent.com/jcool12/dataset/master/textsentiment.csv?token=AA4LAP5VXI6I7FRKMT6HDPK6U5XBY"),
                     header = FALSE,
                     stringsAsFactors = FALSE)

library(caret)

## Feature Engineering
# V2 - V10: recode the sentiment labels
# (the columns are character, so the new values are stored as the
#  strings "0" and "1"; any other label is left unchanged)
TrainSet[TrainSet == "Negative"] <- 0
TrainSet[TrainSet == "Positive"] <- 1

# V1 - not sure what you wanted to do with this,
#     but here's a simple example of what
#     you could do
TrainSet$V1 <- grepl("london", TrainSet$V1) # TRUE if lowercase "london" appears in the sentence

# In reality you could probably generate 20+ decent features from this text:
#  word count and plenty more... see the tm package

# Specifying an 80-20 train-test split
# Creating the training and testing sets
train = TrainSet[1:1200, ]
test = TrainSet[1201:1500, ]

# Setting up the resampling scheme with trainControl
train_ctrl = trainControl(
  method  = "cv", # Specifying cross-validation
  number  = 3     # Specifying 3-fold
)

nb_model = train(
  V10 ~ .,        # Specifying the response variable and the feature variables
  method = "nb",  # Specifying the model to use
  data = train,
  trControl = train_ctrl
)

# Resampling: Cross-Validated (3 fold)
# Summary of sample sizes: 799, 800, 801
# Resampling results across tuning parameters:
#
#   usekernel  Accuracy   Kappa
#   FALSE      0.6533444  0.4422346
#    TRUE      0.6633569  0.4185751
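
For readers unfamiliar with the two columns in that output, the sketch below computes accuracy and Cohen's kappa by hand from a confusion table. It assumes caret's Kappa column is the usual unweighted Cohen's kappa; the labels and predictions are invented and have nothing to do with the real model.

```r
# Invented true labels and predictions
lvls      <- c("Neg", "Neu", "Pos")
actual    <- factor(c("Pos", "Pos", "Neg", "Neg", "Neu", "Neu", "Pos", "Neg"),
                    levels = lvls)
predicted <- factor(c("Pos", "Neg", "Neg", "Neg", "Neu", "Pos", "Pos", "Neu"),
                    levels = lvls)

conf_tab <- table(Predicted = predicted, Actual = actual)

# Share of cases on the diagonal, i.e. predicted correctly
n        <- sum(conf_tab)
accuracy <- sum(diag(conf_tab)) / n

# Agreement expected by chance, from the row and column totals
expected  <- sum(rowSums(conf_tab) * colSums(conf_tab)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

c(accuracy = accuracy, kappa = kappa_hat)
```

Kappa corrects accuracy for the agreement that would occur by chance, which is why it is lower than accuracy in the output above and why it is worth reporting when the classes are unbalanced.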