Feature selection with Naive Bayes in R


I built a classifier with Naive Bayes. The goal is to predict 4 factors from text. The data looks like this:

 'data.frame':  387 obs. of  2 variables:
 $ reviewText: chr  "I love this. I have a D800. I am mention my camera to make sure that you understand that this product is not ju"| __truncated__ "I hate buying larger gig memory cards - because there's always that greater risk of losing the photos, and/or r"| __truncated__ "These chromebooks are really a pretty nice idea -- Almost no maintaince (no maintaince?), no moving parts, smal"| __truncated__ "Purchased, as this drive allows a much speedier read/write and is just below a full SSD (they need to drop the "| __truncated__ ...
 $ pragmatic : Factor w/ 4 levels "-1","0","1","9": 4 4 4 3 3 4 3 3 3...

I used the caret package for the classification. The classification code is shown below:

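library(tm)        # Corpus, tm_map, DocumentTermMatrix, findFreqTerms
library(caret)     # createDataPartition, trainControl, train, confusionMatrix
library(magrittr)  # %>%
# method="nb" in caret also needs the klaR package to be installed

sms_corpus <- Corpus(VectorSource(sms_raw$text))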
sms_corpus_clean <- sms_corpus %>%
    tm_map(content_transformer(tolower)) %>% 
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords(kind="en")) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

train_index <- createDataPartition(sms_raw$type, p=0.5, list=FALSE)
sms_raw_train <- sms_raw[train_index,]
sms_raw_test <- sms_raw[-train_index,]
sms_corpus_clean_train <- sms_corpus_clean[train_index]
sms_corpus_clean_test <- sms_corpus_clean[-train_index]
sms_dtm_train <- sms_dtm[train_index,]
sms_dtm_test <- sms_dtm[-train_index,]

sms_dict <- findFreqTerms(sms_dtm_train, lowfreq= 5) 
sms_train <- DocumentTermMatrix(sms_corpus_clean_train, list(dictionary=sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_clean_test, list(dictionary=sms_dict))

# Turn term counts into an Absent/Present indicator per document
convert_counts <- function(x) {
    x <- ifelse(x > 0, 1, 0)
    factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}
sms_train <- sms_train %>% apply(MARGIN=2, FUN=convert_counts)
sms_test <- sms_test %>% apply(MARGIN=2, FUN=convert_counts)


ctrl <- trainControl(method="cv", number=10)  # 10-fold cross-validation
set.seed(8)
sms_model1 <- train(sms_train, sms_raw_train$type, method="nb",
                trControl=ctrl)


sms_predict1 <- predict(sms_model1, sms_test)
cm1 <- confusionMatrix(sms_predict1, sms_raw_test$type)
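
One detail of `convert_counts` is easy to miss: `apply()` does not keep factors. As documented in `?apply`, factor results are coerced to a character array, so `sms_train` and `sms_test` end up as character matrices of "Absent"/"Present". A minimal base-R sketch on a made-up 3x2 count matrix (the `counts` object and its term names are hypothetical, not taken from the question):

```r
# Hypothetical toy count matrix standing in for a DocumentTermMatrix
counts <- matrix(c(0, 2, 1,
                   3, 0, 0),
                 nrow = 3,
                 dimnames = list(NULL, c("camera", "drive")))

convert_counts <- function(x) {
    x <- ifelse(x > 0, 1, 0)
    factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}

binary <- apply(counts, MARGIN = 2, FUN = convert_counts)

# apply() coerces the factor results to character
is.character(binary)
binary
```

If factor columns are needed downstream, they would have to be rebuilt after the `apply()` call, for example column by column in a data frame.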

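When I predict all 4 variables separately, I get better results. The classification code is the same as above, except that a different target is used in place of df$sensorial.

Accuracy is a misleading metric here. In the multi-label confusion matrix you posted, if you look only at label -1 versus all other labels, your accuracy is about 89%. You predicted -1 only once (incorrectly), and you misclassified -1 as another label 20 times (9+11). In every other case you correctly separated -1 from the rest, which gives 171/192 ≈ 89% accuracy. Of course, this does not mean the model works as intended; it simply assigns "other" to almost every case. The same mechanism is why you see higher accuracy figures in the single-label classifications.

For a detailed overview of the class imbalance problem and potential ways to mitigate it, see the linked overview. The linked thread is also very relevant to your case, so I suggest you take a look.

Thanks for the link. You're right. When I look at recall and precision for the overall prediction, the results for the values 9, 0 and -1 are very bad. According to the information in the link you posted, the most effective approach is transforming the term frequencies combined with over- and undersampling. But I don't know how to do that in R. There seems to be no package for it. In the thread you posted, the guy (who had the same problem) classified each value separately (as I did). Do you think that is the best solution for now?

@Banjo The caret library you are using has very easy-to-use implementations of class imbalance fixes. See the linked tutorial. As for the second question, it doesn't matter.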
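
As a rough illustration of the caret route mentioned in the last comment: caret ships `upSample()` and `downSample()`, and `trainControl()` accepts a `sampling` argument so that the resampling happens inside each cross-validation fold. The sketch below is an assumption about how this could be wired up, not the commenter's exact recipe. The toy predictor `x` is hypothetical, and the class counts (5, 1, 35, 153) are back-computed from the training proportions printed further down (194 rows).

```r
library(caret)

set.seed(8)
# Class counts implied by the training-set proportions (194 rows in total)
type <- factor(rep(c("-1", "0", "1", "9"), times = c(5, 1, 35, 153)))
# Hypothetical single predictor, only here so upSample() has something to resample
x <- data.frame(term_present = sample(c("Absent", "Present"),
                                      length(type), replace = TRUE))

# Oversample every class (with replacement) up to the size of the largest one
balanced <- upSample(x = x, y = type, yname = "type")
table(balanced$type)

# Alternatively, let caret resample inside each CV fold during train()
ctrl_up <- trainControl(method = "cv", number = 10, sampling = "up")
```

With only one training document in class 0, oversampling can only repeat that single document, so it balances the class counts without adding any new information about that class.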
Confusion matrix posted in the question (four-class model):

          Reference
Prediction -1  0  1  9
        -1  0  0  1  0
        0   0  0  0  0
        1   9  5 33 25
        9  11  3 33 72
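
The one-vs-rest argument from the answer can be checked directly on this matrix. The base-R sketch below re-enters the counts by hand (the object names are illustrative) and computes overall accuracy, the accuracy of "-1 versus everything else", and per-class recall and precision. With caret, similar per-class figures should also be available from the `byClass` element of the `confusionMatrix()` result.

```r
# Confusion matrix from the question: rows = prediction, columns = reference
labels <- c("-1", "0", "1", "9")
cm <- matrix(c( 0,  0,  1,  0,
                0,  0,  0,  0,
                9,  5, 33, 25,
               11,  3, 33, 72),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = labels, Reference = labels))

n <- sum(cm)

# Overall accuracy across the four classes
overall_accuracy <- sum(diag(cm)) / n

# One-vs-rest accuracy for "-1": its errors are the off-diagonal
# cells in the "-1" row and the "-1" column
errors_m1 <- sum(cm["-1", ]) + sum(cm[, "-1"]) - 2 * cm["-1", "-1"]
ovr_accuracy_m1 <- (n - errors_m1) / n

# Per-class recall and precision
recall    <- diag(cm) / colSums(cm)
precision <- diag(cm) / rowSums(cm)   # NaN for class 0, which is never predicted

round(c(overall = overall_accuracy, one_vs_rest_m1 = ovr_accuracy_m1), 3)
round(rbind(recall, precision), 3)
```

The one-vs-rest accuracy for -1 comes out near 89% even though recall for -1 and for 0 is zero, which is the point the answer makes about accuracy under class imbalance.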

prop.table(table(sms_raw_train$type))
         -1           0           1           9 
0.025773196 0.005154639 0.180412371 0.788659794 

modelweights <- ifelse(sms_raw_train$type == -1, 
             (1/table(sms_raw_train$type)[1]) * 0.25, 
             ifelse(sms_raw_train$type == 0, 
             (1/table(sms_raw_train$type)[2]) * 0.25,
             ifelse(sms_raw_train$type == 1, 
             (1/table(sms_raw_train$type)[3]) * 0.25,
             ifelse(sms_raw_train$type == 9, 
             (1/table(sms_raw_train$type)[4]) * 0.25,9))))    
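
The nested `ifelse()` gives every observation a weight of 0.25 divided by the size of its class, so each class contributes the same total weight. Assuming that is the intent, the same weights can be obtained by indexing the class-count table with the labels. In the sketch below `type` is a stand-in for `sms_raw_train$type`, rebuilt from the proportions printed above (5, 1, 35 and 153 rows out of 194). How the weights are then passed to the model is not shown in the source.

```r
# Stand-in for sms_raw_train$type, rebuilt from the printed class proportions
type <- factor(rep(c("-1", "0", "1", "9"), times = c(5, 1, 35, 153)))

class_counts <- table(type)

# 0.25 / (size of the observation's own class), looked up by label
modelweights <- as.numeric(0.25 / class_counts[as.character(type)])

# Each class carries a total weight of 0.25, and all weights sum to 1
tapply(modelweights, type, sum)
sum(modelweights)
```

Looking the weight up by label avoids repeating one `ifelse()` branch per class and keeps working if the set of levels changes.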

              Reference
    Prediction -1  0  1  9
            -1  1  0  1  1
            0   1  0  1  0
            1  11  3 32 20
            9   7  5 33 76