Machine learning: log-likelihood for text classification

Tags: machine-learning, data-mining, probability, bayesian, text-mining

I am implementing the Naive Bayes algorithm for text classification. I have about 1000 training documents and 400 test documents. I think I have implemented the training part correctly, but I am confused about the testing part. Here is, briefly, what I have done:

In my training function:

vocabularySize = GetUniqueTermsInCollection(); // get all unique terms in the entire collection

spamModelArray[vocabularySize];
nonspamModelArray[vocabularySize];

for each training_file{
        class = GetClassLabel(); // 0 = spam, 1 = non-spam
        document = GetDocumentID();

        counterTotalTrainingDocs++;

        if(class == 0){
                counterTotalSpamTrainingDocs++;
        }

        for each term in document{
                freq = GetTermFrequency(); // how many times this term appears in this document
                id = GetTermID(); // unique id of the term

                if(class == 0){ // SPAM
                        spamModelArray[id] += freq;
                        totalNumberofSpamWords += freq; // total term occurrences in spam training docs
                }else{ // NON-SPAM
                        nonspamModelArray[id] += freq;
                        totalNumberofNonSpamWords += freq; // total term occurrences in non-spam training docs
                }
        }//for
}//for

// Normalize the counts into per-class term probabilities once,
// after all training files have been processed.
for i in vocabularySize{
        spamModelArray[i] = spamModelArray[i]/totalNumberofSpamWords;
        nonspamModelArray[i] = nonspamModelArray[i]/totalNumberofNonSpamWords;
}//for

priorProb = counterTotalSpamTrainingDocs/counterTotalTrainingDocs; // prior probability of spam documents
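The training step above can be sketched in Python. The toy corpus, tokenization, and variable names below are made up for illustration and are not part of the original code:

```python
from collections import Counter

# Hypothetical toy corpus: (class_label, tokens); 0 = spam, 1 = non-spam.
training_docs = [
    (0, ["buy", "cheap", "pills", "cheap"]),
    (0, ["cheap", "offer", "buy"]),
    (1, ["meeting", "schedule", "project"]),
    (1, ["project", "report", "meeting", "meeting"]),
]

spam_counts, ham_counts = Counter(), Counter()
n_spam_docs = 0
for label, tokens in training_docs:
    if label == 0:
        n_spam_docs += 1
        spam_counts.update(tokens)  # accumulate per-term frequencies
    else:
        ham_counts.update(tokens)

total_spam_words = sum(spam_counts.values())
total_ham_words = sum(ham_counts.values())

# Normalize once, after counting, into per-class term probabilities P(w|c).
spam_model = {w: c / total_spam_words for w, c in spam_counts.items()}
ham_model = {w: c / total_ham_words for w, c in ham_counts.items()}

prior_spam = n_spam_docs / len(training_docs)
print(prior_spam)  # → 0.5 for this toy corpus
```

Note that the normalization happens exactly once, outside the per-document loop; normalizing inside the loop would divide the counts repeatedly.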
I think I have understood and implemented the training part correctly, but I am not sure whether I implemented the testing part correctly. There, I try to go through each test document and compute logP(spam|d) and logP(non-spam|d) for it. I then compare these two quantities to determine the class (spam/non-spam).

In my testing function:

for each testing_file{
        document = GetDocumentID();

        logProbabilityofSpam = 0;
        logProbabilityofNonSpam = 0;

        for each term in document{
                freq = GetTermFrequency(); // how many times this term appears in this document
                id = GetTermID(); // unique id of the term

                // logP(w1 w2 .. wn | c) = sum_j C(wj) * logP(wj | c)
                logProbabilityofSpam += freq*log(spamModelArray[id]);
                logProbabilityofNonSpam += freq*log(nonspamModelArray[id]);
        }//for

        // Compare the two log scores to determine the class of this document
        if(logProbabilityofNonSpam + log(1-priorProb) > logProbabilityofSpam + log(priorProb)){ // argmax[logP(d|ck) + logP(ck)]
                newclass = 1; // non-spam
        }else{
                newclass = 0; // spam
        }
}//for
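The test loop above can be sketched the same way. The per-class models and the prior below are hypothetical numbers, not trained values; also note that a term absent from a model would make `log()` fail on zero, which is why real implementations add smoothing:

```python
import math

# Hypothetical per-class term probabilities and prior (normally produced
# by the training step).
spam_model = {"buy": 0.4, "cheap": 0.4, "meeting": 0.2}
ham_model = {"buy": 0.1, "cheap": 0.1, "meeting": 0.8}
prior_spam = 0.5

def classify(tokens):
    # Start from the log priors, then add freq * log P(w|c) per term.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1 - prior_spam)
    for term in tokens:
        # log(0) would blow up for unseen terms; real code needs smoothing.
        # Here we simply skip terms missing from either model.
        if term in spam_model and term in ham_model:
            log_spam += math.log(spam_model[term])
            log_ham += math.log(ham_model[term])
    # argmax over log P(d|c) + log P(c): 0 = spam, 1 = non-spam
    return 0 if log_spam > log_ham else 1

print(classify(["buy", "cheap", "cheap"]))  # → 0 (spam-looking document)
print(classify(["meeting", "meeting"]))     # → 1 (non-spam-looking document)
```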

My question is: I want to return the probability of each class instead of exact 1s and 0s (spam/non-spam). For example, I want to see something like newclass = 0.8684212 so that I can apply a threshold later. But this is where I get confused. How can I calculate this probability for each document? Can I use the log probabilities to calculate it?

According to the Naive Bayes probability model, the probability that data described by a set of features {F1, F2, ..., Fn} belongs to class C is

P(C|F) = P(C) * (P(F1|C) * P(F2|C) * ... * P(Fn|C)) / P(F1, ..., Fn)
You have all of these terms (in logarithmic form) except for the 1/P(F1, ..., Fn) term, since that term is not used in the Naive Bayes classifier that you are implementing. (Strictly speaking, a MAP classifier.)

You would also have to collect the frequencies of the features, and from them calculate

P(F1, ..., Fn) = P(F1) * ... * P(Fn)
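Assuming feature independence as in the formula above, the document marginal can be accumulated in log space just like the class-conditional scores. The pooled counts below are made-up illustration, not data from the question:

```python
import math
from collections import Counter

# Hypothetical term frequencies pooled over the whole training collection.
overall_counts = Counter({"buy": 2, "cheap": 3, "meeting": 4, "project": 3})
total = sum(overall_counts.values())

def log_marginal(tokens):
    # log P(F1, ..., Fn) = sum_i log P(Fi), under the independence assumption
    return sum(math.log(overall_counts[t] / total) for t in tokens)

print(log_marginal(["cheap", "buy"]))  # log(3/12) + log(2/12)
```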

Thanks for your answer. Suppose I also collect P(F1, ..., Fn); I would then compute P(C|F) as follows, e.g. for a non-spam document: P(C|F) = logProbabilityofNonSpam + log(1-priorProb) - log(P(F1 ... Fn)). But is that still in log form, rather than a probability between 0 and 1? I am a bit confused.

Yes, it is in log form. You have to exponentiate it. (But beware of small values, since e^x may underflow to zero.)
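One way to get a 0-to-1 probability from the two log scores without the underflow mentioned above is the log-sum-exp trick: in the two-class case, P(F1, ..., Fn) is just the sum of the two joint probabilities, so subtracting the larger log score before exponentiating keeps both exponentials in range. A sketch, with hypothetical log values:

```python
import math

def posterior_from_logs(log_spam, log_ham):
    """Convert joint log scores log P(d, spam) and log P(d, non-spam)
    into P(spam | d) without underflow, via the log-sum-exp trick."""
    m = max(log_spam, log_ham)  # shift so the largest exponent is 0
    denom = math.exp(log_spam - m) + math.exp(log_ham - m)
    return math.exp(log_spam - m) / denom

# Hypothetical log scores from the test loop. Exponentiating these
# directly would underflow to 0.0, making the naive ratio 0/0.
print(posterior_from_logs(-1000.0, -1002.0))  # → ≈0.881
```

Equivalently, in the two-class case this reduces to the sigmoid 1 / (1 + e^(log_ham - log_spam)).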