How can I increase MultinomialNB()'s accuracy score with sklearn and show the result in a graph with matplotlib?
The dataset I'm working with looks like this: the screenshot I attached shows 16 rows and 12 columns, but the full dataset actually contains 521 rows and 12 columns. All of the columns hold categorical variables, so I encoded them with LabelEncoder. The columns are:
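- Column 1: "Menarche start early"
- Column 2: "Oral contraception"
- Column 3: "Diet maintain"
- Column 4: "Affected by breast cancer"
- Column 5: "Affected by cervical cancer?"
- Column 6: "Cancer history in family?"
- Column 7: "Education?"
- Column 8: "Age of husband"
- Column 9: "Menopause end age?"
- Column 10: "Food contains high fat?"
- Column 11: "Abortion?"
- Column 12: "Affected by ovarian cancer?"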
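I want both the R-squared and the adjusted R-squared values to be close to 1, but I don't know how to get there efficiently. I'm new to this field and have never worked with a dataset in which every variable is categorical and none is numeric. Please help me improve my model with the naive Bayes algorithm. If you spot any mistakes in my model, please point them out, and please also share resources and tutorials with code examples that would help me build data visualization plots from my model. Here is my project code: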
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('RiskFactor.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 11].values
#dummy_x = dataset.iloc[:, [0,6,7,8]].values
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
label_x = LabelEncoder()
X[:,0] = label_x.fit_transform(X[:,0] ) #Menarche start early
label_x = LabelEncoder()
X[:,1] = label_x.fit_transform(X[:,1] )
label_x = LabelEncoder()
X[:,2] = label_x.fit_transform(X[:,2] )
label_x = LabelEncoder()
X[:,3] = label_x.fit_transform(X[:,3] )
label_x = LabelEncoder()
X[:,4] = label_x.fit_transform(X[:,4] )
label_x = LabelEncoder()
X[:,5] = label_x.fit_transform(X[:,5] )
label_x = LabelEncoder()
X[:,6] = label_x.fit_transform(X[:,6] ) #Education
label_x = LabelEncoder()
X[:,7] = label_x.fit_transform(X[:,7] ) #Age of Husband
label_x = LabelEncoder()
X[:,8] = label_x.fit_transform(X[:,8] ) #Menopause End age?
label_x = LabelEncoder()
X[:,9] = label_x.fit_transform(X[:,9] )
label_x = LabelEncoder()
X[:,10] = label_x.fit_transform(X[:,10] )
# categorical_features was removed in scikit-learn 0.22; by default every column is encoded
onehotencoder = OneHotEncoder()
X = onehotencoder.fit_transform(X).toarray()
#avoiding dummy variable trap by removing extra columns
X = X[: ,[1,2,3,4,5,6,7,8,9,10,11,12,14,15,17,18,20,21,22,23,24,25,26]]
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
# Splitting the dataset into the Training set and Test set
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=18)
from sklearn.naive_bayes import GaussianNB,BernoulliNB,MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
print(classifier)
y_expect = y_test
#predicting the test set result
y_pred = classifier.predict(X_test)
#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix,accuracy_score
cm = confusion_matrix (y_test, y_pred)
print(accuracy_score(y_expect,y_pred))
# finding P value from statsmodels
# OLS is exposed by statsmodels.api; newer releases no longer provide it in formula.api
import statsmodels.api as sm
regressor_OLS = sm.OLS(endog=y, exog=X).fit()
print(regressor_OLS.summary())
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use scikit-learn's default cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.
        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
        Refer to the :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    train_sizes : array-like, optional
        Training-set sizes passed on to ``learning_curve``.
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shaded bands show one standard deviation around each mean score
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
estimator = MultinomialNB()
title = "Learning Curves (Naive Bayes classifier ALGORITHM)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 25% of the data randomly selected as a
# validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.25, random_state=17)
#cv = None
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv,
n_jobs=1)
plt.show()
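The encoding and scoring steps in the script above can also be condensed into one scikit-learn pipeline and scored with cross-validation. The sketch below is only an illustration, not the code from the question: it replaces RiskFactor.csv with a small synthetic DataFrame (the column names, values, and target are made up), and it assumes a scikit-learn version whose OneHotEncoder accepts string columns directly (0.20 or later). Note that R-squared and adjusted R-squared are regression metrics; for a classifier such as MultinomialNB, accuracy is the score to look at.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for RiskFactor.csv: 200 rows, 3 made-up categorical columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "oral_pill": rng.choice(["yes", "no"], size=200),
    "education": rng.choice(["primary", "secondary", "higher"], size=200),
    "high_fat_food": rng.choice(["yes", "no"], size=200),
})

# Hypothetical binary target: follows one feature, with about 10% of labels flipped
y = (df["oral_pill"] == "yes").to_numpy().astype(int)
flip = rng.random(200) < 0.1
y = np.where(flip, 1 - y, y)

# One-hot encode inside the pipeline, then fit multinomial naive Bayes
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), MultinomialNB())

# 5-fold cross-validated accuracy (stratified folds are used for classifiers)
scores = cross_val_score(model, df, y, cv=5, scoring="accuracy")
print(scores)
print(scores.mean())
```

Putting the encoder inside the pipeline means it is refit on the training part of every fold, so the cross-validated accuracy is not influenced by the held-out rows.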
I solved this problem by using PCA. Here is the code:
# -*- coding: utf-8 -*-
"""
Created on Tue Jul 31 22:38:32 2018
@author: MOBASSIR
"""
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('ovarian.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 11].values
#dummy_x = dataset.iloc[:, [0,6,7,8]].values
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
label_x1 = LabelEncoder()
X[:,0] = label_x1.fit_transform(X[:,0] ) #Menarche start early
label_x2 = LabelEncoder()
X[:,1] = label_x2.fit_transform(X[:,1] )
label_x3 = LabelEncoder()
X[:,2] = label_x3.fit_transform(X[:,2] )
label_x4 = LabelEncoder()
X[:,3] = label_x4.fit_transform(X[:,3] )
label_x5 = LabelEncoder()
X[:,4] = label_x5.fit_transform(X[:,4] )
label_x6 = LabelEncoder()
X[:,5] = label_x6.fit_transform(X[:,5] )
label_x7 = LabelEncoder()
X[:,6] = label_x7.fit_transform(X[:,6] ) #Education
label_x8 = LabelEncoder()
X[:,7] = label_x8.fit_transform(X[:,7] ) #Age of Husband
label_x9 = LabelEncoder()
X[:,8] = label_x9.fit_transform(X[:,8] ) #Menopause End age?
label_x10 = LabelEncoder()
X[:,9] = label_x10.fit_transform(X[:,9] )
label_x11 = LabelEncoder()
X[:,10] = label_x11.fit_transform(X[:,10] )
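# categorical_features was removed in scikit-learn 0.22; ColumnTransformer one-hot
# encodes columns 0, 6, 7, 8 and appends the remaining columns on the right
from sklearn.compose import ColumnTransformer
onehotencoder = ColumnTransformer([('onehot', OneHotEncoder(), [0, 6, 7, 8])],
                                  remainder='passthrough', sparse_threshold=0)
X = onehotencoder.fit_transform(X).astype(float)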
# Avoiding the Dummy Variable Trap
"""
idx_to_delete = [0, 13, 16, 19]
X = [i for i in range(X.shape[-1]) if i not in idx_to_delete]
X = X[:, 1:]
df = pd.DataFrame(X, dtype='float64')
df = pd.to_numeric(X)
"""
#avoiding dummy variable trap by removing extra columns
#X = X[: ,[1,2,3,4,5,6,7,8,9,10,11,12,14,15,17,18,20,21,22,23,24,25,26]]
"""
#4,8,10,12,18,21,22,23 for dropped columns
#5,9,11,13,19,22,23,24 for dropped columns
#1,4,5,6 == 2,5,6,7
X = X[: ,[9,11,23,24]]
"""
#24,21,19,18,17,14,12,10,8,7,6 ,4 ,3 ,2,1 for undropped column
#25,22,20,19,18,15,13,11,9,8,7 ,5 ,4 ,3,2
#2,5,6,8,12,15
X = X[: ,[9,13,16,18,19]]
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
"""
onehotencoder = OneHotEncoder()
y= onehotencoder.fit_transform(y).toarray()
"""
# Splitting the dataset into the Training set and Test set
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
#Applying naive bayes classifier
from sklearn.naive_bayes import GaussianNB,BernoulliNB,MultinomialNB
classifier = BernoulliNB()
classifier.fit(X_train, y_train)
print(classifier)
y_expect = y_test
#predicting the test set result
y_pred = classifier.predict(X_test)
#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix,accuracy_score
cm = confusion_matrix (y_test, y_pred)
print(accuracy_score(y_expect,y_pred))
# finding P value from statsmodels
# OLS is exposed by statsmodels.api; newer releases no longer provide it in formula.api
import statsmodels.api as sm
regressor_OLS = sm.OLS(endog=y, exog=X).fit()
print(regressor_OLS.summary())
from sklearn.model_selection import cross_val_score
ck = BernoulliNB()
scores = cross_val_score(ck,X,y,cv=10, scoring='accuracy')
print (scores)
print (scores.mean())
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    '''Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use scikit-learn's default cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.
        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
        Refer to the :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    train_sizes : array-like, optional
        Training-set sizes passed on to ``learning_curve``.'''
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shaded bands show one standard deviation around each mean score
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
estimator = BernoulliNB()
title = "Learning Curves (Naive Bayes classifier ALGORITHM)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 25% of the data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=1)
plt.show()
#End of Bayes theorem
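plt.rcParams['font.size'] = 14
# Histogram of the predicted probability of the positive class (class 1);
# y_pred holds class labels, so predict_proba is used for probabilities instead
y_pred_prob = classifier.predict_proba(X_test)[:, 1]
plt.hist(y_pred_prob, bins = 8)
plt.xlim(0, 1)
plt.title('Predicted probabilities')
plt.xlabel('Affected by ovarian cancer?(predicted)')
plt.ylabel('frequency')
plt.show()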
from sklearn.metrics import recall_score,precision_score
# Print the scores; a bare expression shows nothing when run as a script
print(recall_score(y_test, y_pred, average='macro'))
print(precision_score(y_test, y_pred, average='micro'))
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
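Since the question also asks how to show the result in a graph, one common option is to draw the confusion matrix as a heatmap. This is a minimal sketch under stated assumptions: plot_confusion is a hypothetical helper, and the label arrays are made up so the block runs on its own; in the script above, y_test and y_pred would take their place.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix


def plot_confusion(cm, class_names, title):
    """Draw a confusion matrix as an annotated heatmap and return the figure.

    Hypothetical helper written for this illustration only.
    """
    fig, ax = plt.subplots()
    im = ax.imshow(cm, cmap="Blues")
    fig.colorbar(im, ax=ax)
    ax.set_xticks(range(len(class_names)))
    ax.set_yticks(range(len(class_names)))
    ax.set_xticklabels(class_names)
    ax.set_yticklabels(class_names)
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    ax.set_title(title)
    # Write the count inside each cell (row = true class, column = predicted class)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center")
    return fig


# Made-up labels for illustration only
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 1, 0, 1, 0])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
acc_demo = accuracy_score(y_true_demo, y_pred_demo)
fig = plot_confusion(cm_demo, ["not affected", "affected"],
                     f"Confusion matrix (accuracy = {acc_demo:.2f})")
# Call plt.show() afterwards to display the figure interactively
```

The heatmap shows at a glance which class the model confuses, something a single accuracy number hides.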