Naive Bayes sentiment analysis in R

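I am trying to do sentiment analysis on tweets in R with a Naive Bayes classifier. The tweets have been manually labelled as positive or negative. The packages I am using are tm, RWeka, RTextTools and e1071. The problem is that when I run the code, every tweet appears to be predicted as positive, as the confusion matrix shows, so I think there is either an error in my document-term matrix or I am missing something else altogether. I suspect the main issue is the very sparse document-term matrix, which might cause problems for the NB classifier, because this is what I get: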

<<DocumentTermMatrix (documents: 2289, terms: 8565)>>
Non-/sparse entries: 20052/19585233
Sparsity           : 100%
Maximal term length: 73
Weighting          : term frequency (tf)
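The reported 100% is a rounded figure. A minimal base R sketch of the arithmetic, using only the numbers printed in the summary above (the variable names are made up for illustration), shows that sparsity is about 99.9% and that each tweet contributes roughly nine distinct terms out of 8565 columns:

```r
# Figures copied from the DocumentTermMatrix summary above.
n_docs   <- 2289
n_terms  <- 8565
non_zero <- 20052
zero     <- 19585233

# Every cell of the matrix is either zero or non-zero.
total_cells <- n_docs * n_terms
stopifnot(total_cells == non_zero + zero)

sparsity      <- zero / total_cells   # share of cells that are 0
terms_per_doc <- non_zero / n_docs    # average non-zero cells per tweet

cat(sprintf("sparsity: %.4f%%\n", 100 * sparsity))
cat(sprintf("average distinct terms per tweet: %.2f\n", terms_per_doc))
```

Sparsity at this level is normal for short texts such as tweets, so on its own it does not show that the matrix was built incorrectly.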

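As your example is not reproducible: in your naiveBayes call, do the columns of mat1[1:1500, ] also need to be factors, and should you drop the first column, changing mat1[1:1500, ] to mat1[1:1500, -1]? Likewise, drop it from file[1501:2288, ] in the predict call, i.e. predict(classifier, file[1501:2288, -1]), otherwise you are using the class variable to predict itself. (Should it be file or mat in predict?) You may also want to look at the threshold option. I wish I had a better understanding of what it does, but by playing with it you can get more negative predictions.

@user20650: The columns of mat1[1:1500, ] do not need to be factors for the naiveBayes classifier, and removing the first column as you suggested made no difference. I think the main problem is the sparsity of the TDM, and I do not know how to reduce it.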
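The points raised in this exchange (which object goes to predict, whether the features should be factors, and how to reduce sparsity) can be tried together on a small scale. The following is only a sketch under stated assumptions, not the thread's confirmed fix: the `tweets` data frame is invented toy data, `to_presence` is a hypothetical helper, and `sparse = 0.70` and `laplace = 1` are arbitrary choices for this tiny corpus. As far as I can tell, `predict.naiveBayes` matches the columns of `newdata` to the training features by name and skips any it cannot find, so passing the raw data frame instead of rows of the term matrix would leave only the class priors to decide, which fits the all-positive result described in the question.

```r
library(tm)     # VCorpus, DocumentTermMatrix, removeSparseTerms
library(e1071)  # naiveBayes

# Hypothetical toy data standing in for the CSV: label first, text second.
tweets <- data.frame(
  label = c("positive", "negative", "positive", "negative",
            "positive", "negative", "positive", "negative"),
  text  = c("love this great phone", "hate this awful phone",
            "great day love it", "awful day hate it",
            "love the great weather", "hate the awful traffic",
            "what a great game", "what an awful game"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(VectorSource(tweets$text))
dtm <- DocumentTermMatrix(corpus, control = list(
  tolower = TRUE, removePunctuation = TRUE,
  removeNumbers = TRUE, stopwords = TRUE))

# Keep only terms that occur in more than 30% of the documents
# (sparse = 0.70 is an arbitrary choice for this toy corpus).
dtm_small <- removeSparseTerms(dtm, sparse = 0.70)

# Hypothetical helper: recode counts as a present/absent factor so that
# naiveBayes() builds frequency tables instead of fitting Gaussians.
to_presence <- function(m) {
  as.data.frame(lapply(as.data.frame(m), function(col) {
    factor(ifelse(col > 0, "yes", "no"), levels = c("no", "yes"))
  }))
}

features <- to_presence(as.matrix(dtm_small))
labels   <- factor(tweets$label)

train <- 1:6
test  <- 7:8

classifier <- naiveBayes(features[train, ], labels[train], laplace = 1)

# Predict from the term features of the held-out rows, not from the raw
# data frame that still contains the label and the text.
predicted <- predict(classifier, features[test, ])
print(table(actual = labels[test], predicted = predicted))
```

On a corpus the size of the one in the question, a much higher `sparse` value (for example 0.99) would probably be needed to avoid discarding almost every term; the right cut-off is something to tune, not a fixed rule.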

library(tm)     # Corpus, DocumentTermMatrix
library(e1071)  # naiveBayes

# The file has two columns: the first states positive or negative,
# and the second holds the tweet text itself.
file <- read.csv("twitter4242_2_1a.csv")
tweetsCorpus <- Corpus(VectorSource(file[,2])) # selecting the tweets from the 2nd column
tweetsTDM <- DocumentTermMatrix(tweetsCorpus,
    control = list(
    asPlain = TRUE,
    stopwords = TRUE,
    tolower = TRUE,
    removeNumbers = TRUE,
    stemWords = FALSE,
    removePunctuation = TRUE,
    stripWhitespace = TRUE))
    #tokenize = NGramTokenizer)) I'm not sure if the tokenizer should be included or not, but I get the same result regardless.

# creating a matrix from tweetsTDM
mat1 <- as.matrix(tweetsTDM)
# training the NB classifier on the first 1500 rows, with the factor
# from the first column (positive/negative) as the class
classifier <- naiveBayes(mat1[1:1500,], as.factor(file[1:1500,1]))
# predicting the remaining rows in file, based on the classifier model
predicted <- predict(classifier, file[1501:2288,])
# confusion matrix
table(file[1501:2288,1], predicted)

          predicted
           negative positive
  negative        0      324
  positive        0      464
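To put the confusion matrix above in numbers, here is a small base R sketch that rebuilds it from the printed counts (the object names are made up for illustration). The accuracy comes out at about 59%, identical to what one would get by always answering with the majority class, and no negative tweet is recovered at all. That is consistent with a model that is effectively ignoring its features.

```r
# Confusion matrix copied from the output above
# (rows = actual class, columns = predicted class).
confusion <- matrix(c(0, 324,
                      0, 464),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual    = c("negative", "positive"),
                                    predicted = c("negative", "positive")))

# Share of test tweets classified correctly.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Accuracy of always predicting the most frequent actual class.
majority_baseline <- max(rowSums(confusion)) / sum(confusion)

# Share of actual negatives that were predicted as negative.
negative_recall <- confusion["negative", "negative"] /
  sum(confusion["negative", ])

cat(sprintf("accuracy:          %.3f\n", accuracy))
cat(sprintf("majority baseline: %.3f\n", majority_baseline))
cat(sprintf("negative recall:   %.3f\n", negative_recall))
```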