Python: Plotting a multi-class ROC curve for DecisionTreeClassifier

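I am trying to plot a ROC curve with classifiers other than svm.SVC, which is the one used in the documentation. My code works for svm.SVC; however, after I switched to KNeighborsClassifier, MultinomialNB, and DecisionTreeClassifier, I keep getting an error from
check_consistent_length(y_true, y_score)
Found input variables with inconsistent numbers of samples: [26632, 53264]

Here is my code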

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
import sys
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
# Import some data to play with
df = pd.read_csv("E:\\autodesk\\Hourly and weather categorized2.csv")
X =df[['TTI','Max TemperatureF','Mean TemperatureF','Min TemperatureF',' Min Humidity']].values
y = df['TTI_Category'].as_matrix()
y=y.reshape(-1,1)
# Binarize the output
y = label_binarize(y, classes=['Good','Bad'])
n_classes = y.shape[1]

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()

roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 1
plt.plot(fpr[0], tpr[0], color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[0])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
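I suspect the error occurs on these lines
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
but I am a beginner with ROC curves, so could someone please guide me through this traceback? Thank you very much for your time and help. By the way, here is the full traceback. I hope my explanation is clear enough.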

Traceback (most recent call last):

  File "<ipython-input-1-16eb0db9d4d9>", line 1, in <module>
    runfile('C:/Users/Think/Desktop/Python Practice/ROC with decision tree.py', wdir='C:/Users/Think/Desktop/Python Practice')

  File "C:\Users\Think\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "C:\Users\Think\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/Users/Think/Desktop/Python Practice/ROC with decision tree.py", line 47, in <module>
    fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())

  File "C:\Users\Think\Anaconda2\lib\site-packages\sklearn\metrics\ranking.py", line 510, in roc_curve
    y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)

  File "C:\Users\Think\Anaconda2\lib\site-packages\sklearn\metrics\ranking.py", line 302, in _binary_clf_curve
    check_consistent_length(y_true, y_score)

  File "C:\Users\Think\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 173, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])

ValueError: Found input variables with inconsistent numbers of samples: [26632, 53264]

You need to use the predict_proba function of DecisionTreeClassifier:

Example:

from itertools import cycle  # needed for cycle() in the plotting loop below

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)

classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)


fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Plot one ROC curve per class
lw = 2  # line width used by the plot calls below
colors = cycle(['blue', 'red', 'green'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()
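
The question also computes a micro-average curve, which the example above leaves out. As a minimal sketch on the same iris setup (the variable names `fpr_micro`, `tpr_micro`, and `roc_auc_micro` are only illustrative), the flattened arrays line up as long as `y_test` and `y_score` have the same shape, so it may be worth checking that before calling `ravel()`:

```python
from sklearn import datasets
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
# One indicator column per class: shape (150, 3)
y = label_binarize(iris.target, classes=[0, 1, 2])
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, y, test_size=0.5, random_state=0)

classifier = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

# ravel() only pairs labels with scores correctly if the shapes match
if y_test.shape != y_score.shape:
    raise ValueError("shape mismatch: %r vs %r" % (y_test.shape, y_score.shape))

# Micro-average: treat every (sample, class) cell as one binary decision
fpr_micro, tpr_micro, _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc_micro = auc(fpr_micro, tpr_micro)
print("micro-average AUC: %0.2f" % roc_auc_micro)
```

If the shape check fails, the same "inconsistent numbers of samples" error from the question would be raised inside `roc_curve`.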

The problem was solved by adding this line to the original code:
y_resampled = label_binarize(y_resampled, classes=['Good','Bad','Ok'])
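
One plausible reading of why this helps, consistent with 53264 = 2 × 26632 in the error message: with only two classes, `label_binarize` returns a single column, while `predict_proba` returns one column per class, so the flattened arrays differ in length by a factor of two. The sketch below uses made-up toy labels (not the asker's data) to show the shapes:

```python
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

# Toy labels, for illustration only
labels = np.array(['Good', 'Bad', 'Good', 'Bad', 'Good', 'Bad'])

# Two classes: a single indicator column
y_two = label_binarize(labels, classes=['Good', 'Bad'])
print(y_two.shape)    # (6, 1)

# Three classes: one column per listed class
y_three = label_binarize(labels, classes=['Good', 'Bad', 'Ok'])
print(y_three.shape)  # (6, 3)

# A binary classifier still reports two probability columns
X = np.arange(6, dtype=float).reshape(-1, 1)
tree = DecisionTreeClassifier(random_state=0).fit(X, y_two.ravel())
proba = tree.predict_proba(X)
print(proba.shape)    # (6, 2)

# For a truly binary problem, score with the positive-class column only
fpr, tpr, _ = roc_curve(y_two.ravel(), proba[:, 1])
```

With three listed classes the label matrix and the score matrix both have three columns, so `y_test.ravel()` and `y_score.ravel()` have the same length.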

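I did not get any error running this code on the iris data in scikit-learn, which also has 3 classes. If the error still occurs, can you upload your data?

Thanks @Vive Kumar. The problem has been solved. I binarized x_resampled and y_resampled to fix it, but thank you very much anyway for taking the time to look at my code.

For a decision tree, there is no decision_function. You need to use predict_prob...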
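
To illustrate the last comment, here is a small sketch that checks which scoring method an estimator offers. `get_scores` is a hypothetical helper written for this example, not part of scikit-learn:

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def get_scores(fitted_clf, X):
    """Hypothetical helper: return continuous scores usable by roc_curve."""
    if hasattr(fitted_clf, "decision_function"):
        return fitted_clf.decision_function(X)
    return fitted_clf.predict_proba(X)


# Report which estimators expose decision_function
for clf in (SVC(), DecisionTreeClassifier(), KNeighborsClassifier()):
    print(type(clf).__name__,
          "has decision_function:", hasattr(clf, "decision_function"))

# Usage: a fitted tree falls back to predict_proba
iris = datasets.load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
tree_scores = get_scores(tree, iris.data)
print(tree_scores.shape)
```

Note that in a binary problem `decision_function` returns a 1-D array while `predict_proba` returns two columns, so the two outputs are not always interchangeable without reshaping.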