Python - Is there a way to extract the maximum a posteriori probabilities from scikit-learn's multinomial Naive Bayes, based on the Stanford NLP research paper?


I am trying to reproduce the results of the paper from the link.

The link explains how multinomial Naive Bayes can be used for text classification.

I tried to reproduce the example using scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.naive_bayes import MultinomialNB

#TRAINING SET
dftrain = pd.DataFrame(data=np.array([["Chinese Beijing Chinese", "Chinese Chinese Shanghai", "Chinese Macao", "Tokyo Japan Chinese"], 
["yes", "yes", "yes", "no"]]))

dftrain = dftrain.T
dftrain.columns = ['text', 'label']

#TEST SET
dftest = pd.DataFrame(data=np.array([["Chinese Chinese Chinese Tokyo Japan"]]))
dftest.columns = ['text']

count_vectorizer = CountVectorizer(min_df=0, token_pattern=r"\b\w+\b", stop_words = None)
count_train = count_vectorizer.fit_transform(dftrain['text'])
count_test = count_vectorizer.transform(dftest['text'])

clf = MultinomialNB()
clf.fit(count_train, dftrain['label'])
clf.predict(count_test)
The output is correctly printed as:

array(['yes'],
  dtype='<U3')
But when I run clf.predict_proba(count_test), I get:

array([[ 0.31024139,  0.68975861]])
I take this to mean:

P(test document belongs to label 'no') = 0.31024139
P(test document belongs to label 'yes') = 0.68975861
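
A quick way to confirm which column maps to which label is the fitted clf.classes_ attribute, since the columns of predict_proba follow its (sorted) order:

print(clf.classes_)
# expected: ['no' 'yes'] -> column 0 is the probability of 'no', column 1 that of 'yes'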

So scikit-learn predicts that the text belongs to the label yes.

My question is: why are the probabilities different? In the paper, P(yes | test document) = 0.0003 > P(no | test document) = 0.0001, yet I do not see the numbers 0.0003 and 0.0001 anywhere; instead I see 0.31024139 and 0.68975861.

Am I missing something? Does it have something to do with the class_prior parameter?
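
For what it is worth, the class priors the fitted model is actually using can be inspected through the class_log_prior_ attribute (this check assumes the clf fitted above):

import numpy as np

# fitted (log) class priors, in clf.classes_ order (['no', 'yes'])
print(np.exp(clf.class_log_prior_))
# expected: [0.25 0.75] -- the same 1/4 and 3/4 priors used in the paper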

I did read the documentation:

Apparently, the parameters are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting.
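
To see those smoothed relative frequencies concretely, one sketch is to exponentiate the fitted feature_log_prob_ attribute (this assumes the clf and count_vectorizer fitted above; get_feature_names_out is the name in recent scikit-learn versions, older ones use get_feature_names):

import numpy as np

# rows follow clf.classes_ (['no', 'yes']); with the default alpha=1 each entry is
# (count of word in class + 1) / (total words in class + vocabulary size)
cond_prob = np.exp(clf.feature_log_prob_)
for word, p_no, p_yes in zip(count_vectorizer.get_feature_names_out(), cond_prob[0], cond_prob[1]):
    print(word, p_no, p_yes)
# e.g. 'chinese' in class 'yes': (5 + 1) / (8 + 6) = 3/7, exactly as in the paper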


What I want to know is whether there is any way I can reproduce and see the results from the research paper (0.0003 and 0.0001).

This has to do with how the probabilities are produced and what they mean. The numbers 0.0003 and 0.0001 are not normalised, i.e. they do not sum to 1. If you normalise those values, you get exactly the same result.

See the snippet below:

In [63]: clf.predict_proba(count_test)
Out[63]: array([[ 0.31024139,  0.68975861]])

In [64]: p = (3/4)*((3/7)**3)*(1/14)*(1/14)

In [65]: p
Out[65]: 0.00030121377997263036

In [66]: p0 = (1/4)*((2/9)**3)*(2/9)*(2/9)

In [67]: p0
Out[67]: 0.00013548070246744223

#normalised values
In [68]: p/(p0+p)
Out[68]: 0.6897586117634674

In [69]: p0/(p0+p)
Out[69]: 0.3102413882365326
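
For completeness, the same unnormalised quantities can be read off the fitted classifier instead of being typed in by hand. The following is only a sketch built on the public fitted attributes class_log_prior_ and feature_log_prob_ of MultinomialNB, reusing the clf and count_test objects from the question:

import numpy as np

# log P(c) + sum over words of count(word, doc) * log P(word | c)
# -- the unnormalised joint log-probability of the test document under each class
joint_log = count_test.toarray() @ clf.feature_log_prob_.T + clf.class_log_prior_

print(np.exp(joint_log))
# columns follow clf.classes_ (['no', 'yes']), so this should print
# roughly [[0.00013548 0.00030121]] -- the paper's unnormalised values

print(np.exp(joint_log) / np.exp(joint_log).sum(axis=1, keepdims=True))
# roughly [[0.31024139 0.68975861]] -- the same output as clf.predict_proba(count_test)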