Python 2.7: Can't generate ROC-AUC curves from a Naive Bayes classifier
Tags: python-2.7, matplotlib, machine-learning, scikit-learn, roc
I am trying to predict ethnicity from features derived from certain variables. From my previous question, I learned to fit the ROC curve with the decision function or the predicted probabilities rather than the actual predictions. I was able to generate a ROC-AUC plot with an SVM classifier using the following code:
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import csv
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# multi_class : str, {'ovr', 'multinomial'}
#$$
lr = LogisticRegression()
#lr = LogisticRegression(penalty='l2', class_weight='auto', solver='lbfgs', multi_class='multinomial')
nb = MultinomialNB(fit_prior=False)
#$$
svm = LinearSVC(class_weight='auto')
dv = DictVectorizer()
# Get csv file into data frame
data = pd.read_csv("FamilySearchData_All_OCT2015_newEthnicity_filledEthnicity_processedName_trimmedCol.csv", header=0, encoding="utf-8")
df = DataFrame(data)
# Class list
ethnicity2 = ['fr', 'en', 'ir', 'sc', 'others', 'ab', 'rus', 'ch', 'it', 'ja']
Ab_group = ['fr', 'en', 'ir', 'sc', 'others', 'ab', 'rus', 'ch', 'it', 'ja', 'fn', 'metis', 'inuit']
Ab_lang = ['fr', 'en', 'ir', 'sc', 'others', 'ab', 'rus', 'ch', 'it', 'ja', 'x', 'y']
############################################
########## CONTROL ROOM ####################
# change-tag: '#$$'
# Output file name decoration
# Total N = 5031794
#$$
featureUsed = 8
#$$
subsample_size = 50000
#$$
ethnicity_var = 'ethnicity2' # Ab_group, Ab_tribe, Ab_lang
count = 0
# Declaration
print 'No. features=', featureUsed
print 'N=', subsample_size, 'Training_N=', subsample_size/2, 'Test_N=', subsample_size/2
print 'ethnicity_var:', ethnicity_var
#$$
print ethnicity2
#$$
print 'ML classifier:', 'svm = LinearSVC(class_weight=\'auto\')'
print ''
print '//////////////////////////////////////////////////////'
print ''
try:
    #$$
    for i in ethnicity2:
        count+=1
        ethnicity_tar = str(i) # fr, en, ir, sc, others, ab, rus, ch, it, ja
        # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
        ############################################
        ############################################
        def ethnicity_target(row):
            try:
                if row[ethnicity_var] == ethnicity_tar:
                    return 1
                else:
                    return 0
            except: return None
        df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
        print '1=', ethnicity_tar
        print '0=', 'non-'+ethnicity_tar
        # Random sampling a smaller dataframe for debugging
        rows = random.sample(df.index, subsample_size)
        df = df.ix[rows] # Warning!!!! overwriting original df
        print 'Class count:'
        print df['ethnicity_scan'].value_counts()
        # Assign X and y variables
        X = df.raw_name.values
        y = df.ethnicity_scan.values
        # Feature extraction functions
        def feature_full_name(nameString):
            #... codes omitted
        # Transform format of X variables, and spit out a numpy array for all features
        my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
        my_dict2 = [list_to_dict(feature_twoLetters(feature_full_last_name(i))) for i in X]
        my_dict3 = [list_to_dict(feature_threeLetters(feature_full_last_name(i))) for i in X]
        my_dict4 = [list_to_dict(feature_fourLetters(feature_full_last_name(i))) for i in X]
        my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]
        my_dict6 = [list_to_dict(feature_twoLetters(feature_full_first_name(i))) for i in X]
        my_dict7 = [list_to_dict(feature_threeLetters(feature_full_first_name(i))) for i in X]
        my_dict8 = [list_to_dict(feature_fourLetters(feature_full_first_name(i))) for i in X]
        all_dict = []
        for i in range(0, len(my_dict)):
            temp_dict = dict(my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
                + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items())
            all_dict.append(temp_dict)
        newX = dv.fit_transform(all_dict)
        # Separate the training and testing data sets
        half_cut = int(len(df)/2.0)*-1
        X_train = newX[:half_cut]
        X_test = newX[half_cut:]
        y_train = y[:half_cut]
        y_test = y[half_cut:]
        # Fitting X and y into model, using training data
        #$$
        svm.fit(X_train, y_train)
        # Making predictions using trained data
        #$$
        y_train_predictions = svm.predict(X_train)
        #$$
        y_test_predictions = svm.predict(X_test)
        #print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
        print 'Accuracy:',(y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])
        print 'Classification report:'
        print classification_report(y_test, y_test_predictions)
        #print sk_confusion_matrix(y_train, y_train_predictions)
        print 'Confusion matrix:'
        print sk_confusion_matrix(y_test, y_test_predictions)
        #print y_test[1:20]
        #print y_test_predictions[1:20]
        #print y_test[1:10]
        #print np.bincount(y_test)
        #print np.bincount(y_test_predictions)
        # Find and plot AUC
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
        roc_auc = auc(false_positive_rate, true_positive_rate)
        # Find and plot AUC
        y_score = svm.fit(X_train, y_train).decision_function(X_test)
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_score)
        roc_auc = auc(false_positive_rate, true_positive_rate)
        print 'AUC-'+ethnicity_tar+'=',roc_auc
        # Get different color each graph line
        colorSet = ['navy', 'greenyellow', 'deepskyblue', 'darkviolet', 'crimson',
                    'darkslategray', 'indigo', 'brown', 'orange', 'palevioletred', 'mediumseagreen',
                    'k', 'darkgoldenrod', 'g', 'midnightblue', 'c', 'y', 'r', 'b', 'm', 'lawngreen',
                    'mediumturquoise', 'lime', 'teal', 'olive', 'sienna', 'sandybrown']
        color = colorSet[count-1]
        # Plotting
        plt.title('ROC')
        plt.plot(false_positive_rate, true_positive_rate, c=color, label=('AUC-'+ethnicity_tar+'= %0.2f'%roc_auc))
        plt.legend(loc='lower right', prop={'size':8})
        plt.plot([0,1],[0,1], color='lightgrey', linestyle='--')
        plt.xlim([-0.05,1.0])
        plt.ylim([0.0,1.05])
        plt.ylabel('True Positive Rate')
        plt.xlabel('False Positive Rate')
        #plt.show()
        # Save ROC graphs
        plt.savefig('TESTROCXXX.jpg')
        print ''
        print '//////////////////////////////////////////////////////'
        print ''
except Exception as e:
    print 'Error:', str(e)
    print ''
    print '//////////////////////////////////////////////////////'
    print ''
But when I tried to use the Naive Bayes classifier instead, I made the following changes:
nb.fit(X_train, y_train) # from svm.fit(X_train, y_train)
y_train_predictions = nb.predict(X_train) # from y_train_predictions = svm.predict(X_train)
y_test_predictions = nb.predict(X_test) # from y_test_predictions = svm.predict(X_test)
y_score = nb.fit(X_train, y_train).predict_proba(X_test) # from y_score = svm.fit(X_train, y_train).decision_function(X_test)
However, I got an error:
Error: bad input shape (25000L, 2L)
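Edit: after adding [:,1] as suggested, I now show 4 ROC plots. The last two are from NB, and they look strange.
I forgot to mention in that answer that when you use predict_proba results for a ROC curve, you need to select one of the two possible columns: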
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_score[:,1])
This should work.
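To see why the extra index is needed, here is a minimal sketch on synthetic count data (an assumption for illustration, not the asker's name features). It compares the shape of `decision_function` output with that of `predict_proba` output: the latter has one column per class, ordered as in `classes_`, while `roc_curve` expects a one-dimensional score for the positive class.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 200 samples, 20 count features.
# The first 5 features have a higher rate for the positive class.
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
X = rng.poisson(lam=1.0 + 2.0 * y[:, None] * (np.arange(20) < 5), size=(200, 20))

X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

svm = LinearSVC().fit(X_train, y_train)
nb = MultinomialNB(fit_prior=False).fit(X_train, y_train)

svm_score = svm.decision_function(X_test)  # 1-D: one score per sample
nb_proba = nb.predict_proba(X_test)        # 2-D: one column per class

# Select the column that belongs to the positive class (label 1).
pos_col = list(nb.classes_).index(1)
nb_score = nb_proba[:, pos_col]

fpr, tpr, _ = roc_curve(y_test, nb_score)
nb_auc = auc(fpr, tpr)
```

Looking the column up through `classes_` rather than hard-coding `1` is a small safety measure; for 0/1 labels the two are equivalent.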
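Added: well, this is Naive Bayes, and in most cases it should not beat LR. Its model is simpler than LR's, and it cannot capture interactions between features (which, by the way, is why it is called naive). In ML papers, authors often use NB only as a baseline for accuracy: they show the results of the simplest ML algorithm and compare more advanced algorithms against it.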
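A baseline comparison of this kind could be sketched as follows. The data is synthetic and the variable names are hypothetical. Which model scores higher depends entirely on the data, so no particular outcome is claimed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic count data (assumption): 6 of 30 features carry signal.
rng = np.random.RandomState(42)
n = 1000
y = rng.randint(0, 2, size=n)
informative = np.arange(30) < 6
X = rng.poisson(lam=1.0 + 1.0 * y[:, None] * informative, size=(n, 30))

half = n // 2
X_train, X_test = X[:half], X[half:]
y_train, y_test = y[:half], y[half:]

models = {
    'NB': MultinomialNB(),
    'LR': LogisticRegression(max_iter=1000),
}

# Fit each model once and score it on the same held-out half.
aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

Evaluating both models on an identical split keeps the comparison fair; `aucs` then holds one AUC per model.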
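See here:
On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.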
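One point worth spelling out: a ROC curve depends only on how the scores rank the samples, not on their absolute values, so poorly calibrated probabilities do not by themselves invalidate the curve. The sketch below uses made-up scores (an assumption, not NB output) and pushes them toward 0 and 1 with a strictly increasing function. The AUC is unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(1)
y_true = rng.randint(0, 2, size=500)

# Made-up scores in (0, 1) that are moderately informative about y_true.
scores = np.clip(0.5 + 0.2 * (2 * y_true - 1) + 0.25 * rng.randn(500), 0.01, 0.99)

# Push the same scores toward 0 and 1 with a strictly increasing map,
# mimicking over-confident probability estimates.
extreme = 1.0 / (1.0 + np.exp(-20.0 * (scores - 0.5)))

# The ordering of samples is preserved, so both AUC values agree.
auc_original = roc_auc_score(y_true, scores)
auc_extreme = roc_auc_score(y_true, extreme)
```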
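On which line does the error occur? What shapes do X_test and X_train have? Also, why are you calling fit on nb (and svm) a second time? You have already trained the model on that data. You can simply call nb.predict_proba(X_test) on the last line.
You are importing nltk in your example. Is that still necessary?
@Olologin The error occurs after the predict_proba statement, specifically at "false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_score)". X_test and X_train are sparse matrices of shape (25000, 63470).
@colidyre No, it is not necessary, but it doesn't matter.
I added it (see my edit above). It does produce ROC plots and AUC values, but they look strange (somewhat like in my previous question).
So the strange-looking NB graphs are probably not caused by a coding mistake on my part? My worry is that I made a mistake somewhere without knowing it. One more question: 'fr' always gets a very high AUC, even though I deliberately used a very small data set of 1k (50:50 train/test). Is this expected? ('fr' is one of the most frequent classes, about 1/3 of the total data.)
I didn't check your code, but the graphs look fine. Did you train the classifiers as OneVsAll? I.e., did you train each classifier to distinguish its class from all the others?
pred_proba[:,1] is the key. If you pass pred_proba without selecting a single column, you get an error because a two-dimensional array is passed to a parameter that expects a one-dimensional one.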
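Pulling the comments together, a one-vs-rest loop that fits each model only once and scores it with the positive-class column might look like the sketch below. The data is synthetic, and only the class names are borrowed from the question. Everything else is a hypothetical stand-in for the asker's pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(7)
classes = ['fr', 'en', 'ir']  # subset of the question's class list
labels = rng.choice(classes, size=600, p=[0.4, 0.4, 0.2])

# Synthetic features (assumption): each class raises the rate of its
# own block of 5 count features.
n_features = 15
class_index = np.array([classes.index(lab) for lab in labels])
boost = (np.arange(n_features) // 5 == class_index[:, None])
X = rng.poisson(lam=1.0 + 2.0 * boost)

half = len(labels) // 2
X_train, X_test = X[:half], X[half:]

auc_per_class = {}
for target in classes:
    y = (labels == target).astype(int)  # 1 = target class, 0 = the rest
    y_train, y_test = y[:half], y[half:]
    nb = MultinomialNB(fit_prior=False)
    nb.fit(X_train, y_train)  # fit once; no second fit is needed
    y_score = nb.predict_proba(X_test)[:, 1]
    auc_per_class[target] = roc_auc_score(y_test, y_score)
```

Creating a fresh classifier per target keeps the iterations independent, and the train/test split is made once, outside the loop, so every class is evaluated on the same rows.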