Python: shuffling improves accuracy - sklearn - multinomial Naive Bayes
I am trying to measure the accuracy of the multinomial Naive Bayes algorithm in scikit-learn. The code is as follows:
import numpy as np
import random
from sklearn import naive_bayes
from sklearn.preprocessing import LabelBinarizer
from collections import Counter

# possible values for each feature column, and for the target
dim0 = ['high', 'low', 'med', 'vhigh']
dim1 = ['high', 'low', 'med', 'vhigh']
dim2 = ['2', '3', '4', '5more']
dim3 = ['2', '4', 'more']
dim4 = ['big', 'med', 'small']
dim5 = ['high', 'low', 'med']
target = ['acc', 'good', 'unacc', 'vgood']
dimensions = [dim0, dim1, dim2, dim3, dim4, dim5, target]

# function to read the dataset
def readDataSet(fname):
    f = open(fname, 'r')
    dataset = []
    for line in f:
        words = []
        tokenized = line.strip().split(',')
        if len(tokenized) != 7:
            continue
        for w in tokenized:
            words.append(w)
        dataset.append(np.array(words))
    return np.array(dataset)

# split the dataset into X - features and Y - labels / targets
# assumes the last column of the data is the target
def XYfromDataset(dataset):
    X = []
    Y = []
    for d in dataset:
        X.append(np.array(d[:-1]))
        Y.append(d[-1])
    return np.array(X), np.array(Y)

def splitXY(X, Y, perc):
    splitpos = int(len(X) * perc)
    X_train = X[:splitpos]
    X_test = X[splitpos:]
    Y_train = Y[:splitpos]
    Y_test = Y[splitpos:]
    return (X_train, Y_train, X_test, Y_test)

# map each categorical value to its index in the value list
def mapDimension(dimen, mapping):
    res = []
    for d in dimen:
        res.append(float(mapping.index(d)))
    return np.array(res)

def runTrails(dataset, split=0.66):
    random.shuffle(dataset, random.random)
    (X, Y) = XYfromDataset(dataset)
    (X_train, Y_train, X_test, Y_test) = splitXY(X, Y, split)
    mnb = naive_bayes.MultinomialNB()
    mnb.fit(X_train, Y_train)
    score = mnb.score(X_test, Y_test)
    mnb = None
    return score

dataset = readDataSet('car.txt')
print "Class distributution:", Counter(dataset[:, 6])

for d in range(dataset.shape[1]):
    dataset[:, d] = mapDimension(dataset[:, d], dimensions[d])
dataset = dataset.astype(float)

score = 0.0
num_trails = 10
for t in range(num_trails):
    acc = runTrails(dataset)
    print "Trail", t, "Accuracy:", acc
    score += acc
print score / num_trails
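As an aside, the hand-rolled splitXY/splitXY pipeline above can also be expressed with scikit-learn's train_test_split, which shuffles the rows safely before splitting. This is a sketch with toy stand-in data, assuming a current scikit-learn where train_test_split lives in sklearn.model_selection:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the mapped feature matrix and label vector.
X = np.arange(20).reshape(10, 2).astype(float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 66% training split, shuffled with a reproducible seed; stratify=y keeps
# the class mix similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, shuffle=True, random_state=0, stratify=y)
print(len(X_train), len(X_test))
```

Because the shuffle happens on a fresh permutation of indices rather than in place, the original arrays are left untouched between runs.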
The dataset can be found at:
I am confused by the program's output:
Trail 0 Accuracy: 0.758503401361
Trail 1 Accuracy: 0.84693877551
Trail 2 Accuracy: 0.926870748299
Trail 3 Accuracy: 0.96768707483
Trail 4 Accuracy: 0.979591836735
Trail 5 Accuracy: 0.996598639456
Trail 6 Accuracy: 1.0
Trail 7 Accuracy: 1.0
Trail 8 Accuracy: 1.0
Trail 9 Accuracy: 1.0
0.947619047619
If I remove the random.shuffle call from the runTrails method, the output is as follows:
Class distributution: Counter({'unacc': 1210, 'acc': 384, 'good': 69, 'vgood': 65})
Trail 0 Accuracy: 0.583333333333
Trail 1 Accuracy: 0.583333333333
Trail 2 Accuracy: 0.583333333333
Trail 3 Accuracy: 0.583333333333
Trail 4 Accuracy: 0.583333333333
Trail 5 Accuracy: 0.583333333333
Trail 6 Accuracy: 0.583333333333
Trail 7 Accuracy: 0.583333333333
Trail 8 Accuracy: 0.583333333333
Trail 9 Accuracy: 0.583333333333
0.583333333333
I understand that shuffling affects the algorithm's accuracy on this dataset, because the dataset is ordered by class; that is why the first trial's accuracy is only around 70%.
But why does the accuracy keep improving from trial to trial? That makes no sense to me.
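One known pitfall that would explain this behavior: Python's random.shuffle swaps elements with `x[i], x[j] = x[j], x[i]`, and on a 2-D NumPy array those element accesses return views, so each swap can overwrite one row with a copy of another. Every in-place shuffle therefore duplicates some rows, and since runTrails shuffles the same array on every trial, duplicates accumulate until test rows increasingly also appear in the training split. A minimal sketch on a toy array (not the car dataset):

```python
import random
import numpy as np

random.seed(0)

# A 2-D array with 100 distinct rows.
a = np.arange(300).reshape(100, 3)
random.shuffle(a)  # swaps rows through NumPy views -> rows get duplicated
print("unique rows after random.shuffle:", len(np.unique(a, axis=0)))

# np.random.shuffle permutes the rows of a 2-D array correctly.
b = np.arange(300).reshape(100, 3)
np.random.shuffle(b)
print("unique rows after np.random.shuffle:", len(np.unique(b, axis=0)))
```

After random.shuffle the array typically has far fewer than 100 distinct rows, while np.random.shuffle preserves all of them, so replacing the call with np.random.shuffle(dataset) would keep the dataset intact between trials.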
If the algorithm kept training on the same model it would perform better, but here I create a new instance each time and only shuffle the dataset.

It looks as if the original dataset is growing by duplicating samples, which increases the chance that test samples also appear in the training set, so the model may effectively be overfitting in the later iterations. Reading the code, though, I cannot identify the culprit. You should print the sizes of the X_train, X_test, Y_train and Y_test arrays inside runTrails to check this hypothesis.

@ogrisel I printed the lengths: len(X_train), len(Y_train), len(X_test), len(Y_test) give 1140 1140 588 588 on every run of runTrails.

Found the bug - I have to set the fit_prior parameter of the constructor to False, or the class priors are taken into account. That still doesn't explain why the priors change from one run to the next. When it comes to priors I'm clueless - could you first explain what a prior is, and why it would or wouldn't change from one run to the next?
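On the priors raised in the comments: MultinomialNB with the default fit_prior=True estimates each class prior from the class frequencies in the training split, so whenever the training split's class mix changes between runs, the learned priors change with it; fit_prior=False forces a uniform prior instead. A small sketch of the difference on toy imbalanced data (assuming a current scikit-learn API):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# 5 samples of class 0, 1 sample of class 1: heavily imbalanced.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 2], [1, 2], [2, 0]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1])

fitted = MultinomialNB(fit_prior=True).fit(X, y)
print(np.exp(fitted.class_log_prior_))   # priors from class frequencies: [5/6, 1/6]

uniform = MultinomialNB(fit_prior=False).fit(X, y)
print(np.exp(uniform.class_log_prior_))  # uniform priors: [0.5, 0.5]
```

With a class distribution as skewed as this car dataset (1210 of 1728 samples are 'unacc'), the fitted prior dominates predictions whenever the training split's class mix shifts, which is why the priors differ from run to run if the splits differ.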