How to write a confusion matrix in Python?
I wrote a confusion matrix calculation code in Python:
def conf_mat(prob_arr, input_arr):
    # confusion matrix
    conf_arr = [[0, 0], [0, 0]]
    for i in range(len(prob_arr)):
        if int(input_arr[i]) == 1:
            if float(prob_arr[i]) < 0.5:
                conf_arr[0][1] = conf_arr[0][1] + 1
            else:
                conf_arr[0][0] = conf_arr[0][0] + 1
        elif int(input_arr[i]) == 2:
            if float(prob_arr[i]) >= 0.5:
                conf_arr[1][0] = conf_arr[1][0] + 1
            else:
                conf_arr[1][1] = conf_arr[1][1] + 1
    accuracy = float(conf_arr[0][0] + conf_arr[1][1]) / (len(input_arr))
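prob_arr is the array returned by my classification code; a sample looks like this:
[1.0, 1.0, 1.0, 0.41592955657342651, 1.0, 0.0053405015805891975, 4.5321494433440449e-299, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.70943426182688163, 1.0, 1.0, 1.0, 1.0]
input_arr holds the original class labels of the data set and looks like this:
[2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1]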
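What my code tries to do is this: I get prob_arr and input_arr, and for each class (1 and 2) I check whether the instances were misclassified.
But my code only works for two classes. If I run it on multi-class data, it does not work. How can I make this work for multiple classes?
For example, for a data set with three classes, it should return:
[[21, 7, 3], [3, 38, 6], [5, 4, 19]]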
You should map each class to a row in the confusion matrix.
Here the mapping is trivial:
def row_of_class(classe):
    return {1: 0, 2: 1}[classe]
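In your loop, compute expected_row and correct_row, and increment conf_arr[expected_row][correct_row]. You will even end up with less code than you started with.
This function creates confusion matrices for any number of classes: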
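def create_conf_matrix(expected, predicted, n_classes):
    # Classes must be integer indices 0 .. n_classes - 1.
    # Rows are indexed by the predicted class, columns by the expected class.
    m = [[0] * n_classes for i in range(n_classes)]
    for pred, exp in zip(predicted, expected):
        m[pred][exp] += 1
    return m

def calc_accuracy(conf_matrix):
    t = sum(sum(l) for l in conf_matrix)
    # float() avoids integer division under Python 2
    return sum(conf_matrix[i][i] for i in range(len(conf_matrix))) / float(t)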
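In contrast to your function above, you have to extract the predicted classes from your classification results before calling the function, i.e. something like the following (in your code, a probability of at least 0.5 means class 1):
[1 if p >= .5 else 2 for p in classifications]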
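Putting these pieces together on the sample data from the question, a minimal sketch might look like the following. It assumes that a probability of at least 0.5 means class 1 (as in the question's code) and that rows index the expected class and columns the predicted class; the helper name build_conf_matrix is made up for this illustration.

```python
def row_of_class(cls):
    # Map the class labels 1 and 2 to the matrix indices 0 and 1.
    return {1: 0, 2: 1}[cls]


def build_conf_matrix(expected, predicted, n_classes):
    # Hypothetical helper: rows = expected class, columns = predicted class.
    m = [[0] * n_classes for _ in range(n_classes)]
    for exp, pred in zip(expected, predicted):
        m[row_of_class(exp)][row_of_class(pred)] += 1
    return m


prob_arr = [1.0, 1.0, 1.0, 0.41592955657342651, 1.0, 0.0053405015805891975,
            4.5321494433440449e-299, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
            0.70943426182688163, 1.0, 1.0, 1.0, 1.0]
input_arr = [2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1]

# Assumption: p >= 0.5 means class 1, otherwise class 2.
predicted = [1 if p >= 0.5 else 2 for p in prob_arr]
conf = build_conf_matrix(input_arr, predicted, 2)
accuracy = sum(conf[i][i] for i in range(2)) / float(len(input_arr))
print(conf)      # [[11, 2], [4, 1]]
print(accuracy)  # 12 correct out of 18
```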
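In general, you will want to change your probability array. Instead of having one number per instance and classifying based on whether or not it is greater than 0.5, you will want a list of scores (one per class), and then take the largest score as the chosen class (a.k.a. argmax).
You could use a dictionary to hold the probabilities for each classification:
prob_arr = [{classification_id: probability}, ...]
Choosing a classification would be something like:
for instance_scores in prob_arr:
    predicted_classes = [cls for (cls, score) in instance_scores.items() if score == max(instance_scores.values())]
This handles the case where two classes have the same score. You can get a single class by choosing the first one in that list, but how you handle that depends on what you are classifying.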
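As an illustration of the argmax step, here is a small sketch with made-up scores for three classes. The function name predict_classes and the tie-breaking rule (lowest class label wins) are assumptions chosen for this example, not part of the answer.

```python
def predict_classes(prob_arr):
    # For each instance, pick the class with the highest score.
    # Ties are broken by taking the smallest class label (an arbitrary choice).
    predicted = []
    for instance_scores in prob_arr:
        best = max(instance_scores.values())
        candidates = sorted(c for c, s in instance_scores.items() if s == best)
        predicted.append(candidates[0])
    return predicted


# Hypothetical per-instance scores: {class_id: probability}
sample_scores = [
    {1: 0.7, 2: 0.2, 3: 0.1},
    {1: 0.1, 2: 0.3, 3: 0.6},
    {1: 0.4, 2: 0.4, 3: 0.2},  # tie between classes 1 and 2
]
result = predict_classes(sample_scores)
print(result)  # [1, 3, 1]
```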
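Once you have your list of predicted classes and a list of expected classes, you can use code like the functions shown earlier to create the confusion array and calculate the accuracy.
You can make your code more concise and (sometimes) faster by using numpy. For example, in the two-class case your function can be rewritten in terms of boolean arrays, where: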
actual = (numpy.array(input_arr) == 2)
predicted = (numpy.array(prob_arr) < 0.5)
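For illustration, the two boolean arrays can be combined as follows. This is a sketch rather than the code the answer originally pointed to; the function name accuracy, the sample values, and the layout of the 2x2 matrix are assumptions.

```python
import numpy

# Hypothetical sample data: probabilities and class labels (1 or 2).
prob_arr = [0.9, 0.2, 0.8, 0.1, 0.6]
input_arr = [1, 2, 2, 2, 1]

actual = (numpy.array(input_arr) == 2)     # True where the true class is 2
predicted = (numpy.array(prob_arr) < 0.5)  # True where class 2 is predicted


def accuracy(actual, predicted):
    # Fraction of instances where prediction and truth agree.
    return (actual == predicted).sum() / float(len(actual))


# Rows = actual (class 1, class 2), columns = predicted (class 1, class 2).
conf = numpy.array([
    [(~actual & ~predicted).sum(), (~actual & predicted).sum()],
    [(actual & ~predicted).sum(), (actual & predicted).sum()],
])
print(accuracy(actual, predicted))  # 0.8
print(conf)
```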
scikit-learn (which I recommend using anyway) has this included in the metrics module:
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 0, 0, 0, 1, 1, 0, 2, 2]
>>> confusion_matrix(y_true, y_pred)
array([[3, 0, 0],
       [1, 1, 1],
       [1, 1, 1]])
Scikit-learn provides a confusion_matrix function:
from sklearn.metrics import confusion_matrix
y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
confusion_matrix(y_actu, y_pred)
It outputs a NumPy array:
array([[3, 0, 0],
       [0, 1, 2],
       [2, 1, 3]])
But you can also create a confusion matrix using Pandas:
import pandas as pd
y_actu = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2], name='Actual')
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2], name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)
You will get a (nicely labeled) Pandas DataFrame:
Predicted  0  1  2
Actual
0          3  0  0
1          0  1  2
2          2  1  3
If you add margins=True, like:
df_confusion = pd.crosstab(y_actu, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
you will also get the sum for each row and column:
Predicted  0  1  2  All
Actual
0          3  0  0    3
1          0  1  2    3
2          2  1  3    6
All        5  2  5   12
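You can also get a normalized confusion matrix, with each row divided by its row total (computed on the crosstab without margins). Use div with axis=0 so that the row sums are aligned with the rows; a plain df_confusion / df_confusion.sum(axis=1) aligns them with the columns instead:
df_conf_norm = df_confusion.div(df_confusion.sum(axis=1), axis=0)
Predicted         0         1         2
Actual
0          1.000000  0.000000  0.000000
1          0.000000  0.333333  0.666667
2          0.333333  0.166667  0.500000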
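Row normalization divides each count by the total of its row, so every row with at least one observation sums to 1. A minimal plain-Python sketch on the same counts (the function name normalize_rows is made up for this example):

```python
def normalize_rows(matrix):
    # Divide every entry by the sum of its row; empty rows stay all zeros.
    normalized = []
    for row in matrix:
        total = float(sum(row))
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


counts = [[3, 0, 0], [0, 1, 2], [2, 1, 3]]  # rows = actual, columns = predicted
norm = normalize_rows(counts)
for row in norm:
    print(["%.6f" % v for v in row])
```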
You can plot this confusion matrix using:
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(df_confusion, title='Confusion matrix', cmap=plt.cm.gray_r):
    plt.matshow(df_confusion, cmap=cmap) # imshow
    #plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(df_confusion.columns))
    plt.xticks(tick_marks, df_confusion.columns, rotation=45)
    plt.yticks(tick_marks, df_confusion.index)
    #plt.tight_layout()
    plt.ylabel(df_confusion.index.name)
    plt.xlabel(df_confusion.columns.name)

plot_confusion_matrix(df_confusion)
Or plot the normalized confusion matrix (each row divided by its row total) using:
df_conf_norm = df_confusion.div(df_confusion.sum(axis=1), axis=0)
plot_confusion_matrix(df_conf_norm)
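You might also be interested in the pandas_ml project and its pip package.
With this package, a confusion matrix can be pretty-printed and plotted.
You can binarize a confusion matrix and get class statistics such as TP, TN, FP, FN, ACC, TPR, FPR, FNR, TNR (SPC), LR+, LR-, DOR, PPV, FDR, FOR, NPV, as well as some overall statistics: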
In [1]: from pandas_ml import ConfusionMatrix
In [2]: y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
In [3]: y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
In [4]: cm = ConfusionMatrix(y_actu, y_pred)
In [5]: cm.print_stats()
Confusion Matrix:
Predicted  0  1  2  __all__
Actual
0          3  0  0        3
1          0  1  2        3
2          2  1  3        6
__all__    5  2  5       12
Overall Statistics:
Accuracy: 0.583333333333
95% CI: (0.27666968568210581, 0.84834777019156982)
No Information Rate: ToDo
P-Value [Acc > NIR]: 0.189264302376
Kappa: 0.354838709677
Mcnemar's Test P-Value: ToDo
Class Statistics:
Classes                                0          1          2
Population                             12         12         12
P: Condition positive                  3          3          6
N: Condition negative                  9          9          6
Test outcome positive                  5          2          5
Test outcome negative                  7          10         7
TP: True Positive                      3          1          3
TN: True Negative                      7          8          4
FP: False Positive                     2          1          2
FN: False Negative                     0          2          3
TPR: (Sensitivity, hit rate, recall)   1          0.3333333  0.5
TNR=SPC: (Specificity)                 0.7777778  0.8888889  0.6666667
PPV: Pos Pred Value (Precision)        0.6        0.5        0.6
NPV: Neg Pred Value                    1          0.8        0.5714286
FPR: Fall-out                          0.2222222  0.1111111  0.3333333
FDR: False Discovery Rate              0.4        0.5        0.4
FNR: Miss Rate                         0          0.6666667  0.5
ACC: Accuracy                          0.8333333  0.75       0.5833333
F1 score                               0.75       0.4        0.5454545
MCC: Matthews correlation coefficient  0.6831301  0.2581989  0.1690309
Informedness                           0.7777778  0.2222222  0.1666667
Markedness                             0.6        0.3        0.1714286
Prevalence                             0.25       0.25       0.5
LR+: Positive likelihood ratio         4.5        3          1.5
LR-: Negative likelihood ratio         0          0.75       0.75
DOR: Diagnostic odds ratio             inf        4          2
FOR: False omission rate               0          0.2        0.4285714
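The TP, TN, FP and FN rows above can be derived directly from the confusion matrix by treating each class in turn as "positive" and all other classes as "negative". A minimal sketch of that one-vs-rest computation, assuming rows are actual classes and columns are predicted classes (class_counts is a name made up for this example, not part of any library):

```python
def class_counts(matrix):
    # One-vs-rest counts per class; rows = actual, columns = predicted.
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    stats = {}
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # rest of the actual-k row
        fp = sum(matrix[i][k] for i in range(n)) - tp   # rest of the predicted-k column
        tn = total - tp - fn - fp
        stats[k] = {"TP": tp, "TN": tn, "FP": fp, "FN": fn}
    return stats


cm_counts = [[3, 0, 0], [0, 1, 2], [2, 1, 3]]
per_class = class_counts(cm_counts)
for k in sorted(per_class):
    print(k, per_class[k])
```

Rates such as recall (TP / (TP + FN)) or precision (TP / (TP + FP)) then follow from these four counts.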
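I noticed that a new Python library for confusion matrices has been released; it may be worth a look.
If you don't want scikit-learn to do the work for you: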
import numpy
actual = numpy.array(actual)
predicted = numpy.array(predicted)
# calculate the confusion matrix; labels is numpy array of classification labels
cm = numpy.zeros((len(labels), len(labels)))
for a, p in zip(actual, predicted):
    cm[a][p] += 1
# also get the accuracy easily with numpy
accuracy = (actual == predicted).sum() / float(len(actual))
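Or take a look at a more complete implementation.
I wrote a simple class to build a confusion matrix without needing to depend on a machine learning library.
The class can be used like this: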
labels = ["cat", "dog", "velociraptor", "kraken", "pony"]
confusionMatrix = ConfusionMatrix(labels)
confusionMatrix.update("cat", "cat")
confusionMatrix.update("cat", "dog")
...
confusionMatrix.update("kraken", "velociraptor")
confusionMatrix.update("velociraptor", "velociraptor")
confusionMatrix.plot()
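The class ConfusionMatrix:
import pylab
import collections
import numpy as np

class ConfusionMatrix:
    def __init__(self, labels):
        self.labels = labels
        self.confusion_dictionary = self.build_confusion_dictionary(labels)

    def update(self, predicted_label, expected_label):
        self.confusion_dictionary[expected_label][predicted_label] += 1

    def build_confusion_dictionary(self, label_set):
        expected_labels = collections.OrderedDict()
        for expected_label in label_set:
            expected_labels[expected_label] = collections.OrderedDict()
            for predicted_label in label_set:
                expected_labels[expected_label][predicted_label] = 0.0
        return expected_labels

    def convert_to_matrix(self, dictionary):
        length = len(dictionary)
        confusion_dictionary = np.zeros((length, length))
        i = 0
        for row in dictionary:
            j = 0
            for column in dictionary:
                confusion_dictionary[i][j] = dictionary[row][column]
                j += 1
            i += 1
        return confusion_dictionary

    def get_confusion_matrix(self):
        matrix = self.convert_to_matrix(self.confusion_dictionary)
        return self.normalize(matrix)

    def normalize(self, matrix):
        # min-max scaling of all entries to the range [0, 1]
        amin = np.amin(matrix)
        amax = np.amax(matrix)
        return [[(((y - amin) * (1 - 0)) / (amax - amin)) for y in x] for x in matrix]

    def plot(self):
        matrix = self.get_confusion_matrix()
        pylab.figure()
        pylab.imshow(matrix, interpolation='nearest', cmap=pylab.cm.jet)
        pylab.title("Confusion Matrix")
        for i, vi in enumerate(matrix):
            for j, vj in enumerate(vi):
                pylab.text(j, i + .1, "%.1f" % vj, fontsize=12)
        pylab.colorbar()
        classes = np.arange(len(self.labels))
        pylab.xticks(classes, self.labels)
        pylab.yticks(classes, self.labels)
        pylab.ylabel('Expected label')
        pylab.xlabel('Predicted label')
        pylab.show()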
# A Simple Confusion Matrix Implementation
def confusionmatrix(actual, predicted, normalize = False):
    """
    Generate a confusion matrix for multiple classification
    @params:
        actual      - a list of integers or strings for known classes
        predicted   - a list of integers or strings for predicted classes
        normalize   - optional boolean for matrix normalization
    @return:
        matrix      - a 2-dimensional list of pairwise counts
    """
    unique = sorted(set(actual))
    matrix = [[0 for _ in unique] for _ in unique]
    imap = {key: i for i, key in enumerate(unique)}
    # Generate Confusion Matrix
    for p, a in zip(predicted, actual):
        matrix[imap[p]][imap[a]] += 1
    # Matrix Normalization
    if normalize:
        sigma = sum([sum(matrix[imap[i]]) for i in unique])
        matrix = [row for row in map(lambda i: list(map(lambda j: j / sigma, i)), matrix)]
    return matrix
cm = confusionmatrix(
    [1, 1, 2, 0, 1, 1, 2, 0, 0, 1], # actual
    [0, 1, 1, 0, 2, 1, 2, 2, 0, 2]  # predicted
)
# And The Output
print(cm)
[[2, 1, 0], [0, 2, 1], [1, 2, 1]]
#   Actual
#   0  1  2
#   #  #  #
 [[2, 1, 0],   # 0
  [0, 2, 1],   # 1  Predicted
  [1, 2, 1]]   # 2
cm = confusionmatrix(
    ["B", "B", "C", "A", "B", "B", "C", "A", "A", "B"], # actual
    ["A", "B", "B", "A", "C", "B", "C", "C", "A", "C"]  # predicted
)
# And The Output
print(cm)
[[2, 1, 0], [0, 2, 1], [1, 2, 1]]
cm = confusionmatrix(
    ["B", "B", "C", "A", "B", "B", "C", "A", "A", "B"], # actual
    ["A", "B", "B", "A", "C", "B", "C", "C", "A", "C"], # predicted
    normalize = True
)
# And The Output
print(cm)
[[0.2, 0.1, 0.0], [0.0, 0.2, 0.1], [0.1, 0.2, 0.1]]
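Note that confusionmatrix builds its class list from actual only, so a label that occurs only in predicted would raise a KeyError. One possible variation, sketched here under the made-up name confusionmatrix_all_labels (it is not part of the answer), builds the index from both lists and keeps the same orientation (rows = predicted, columns = actual):

```python
def confusionmatrix_all_labels(actual, predicted):
    # Collect labels from both lists so unseen predicted labels get a row too.
    labels = sorted(set(actual) | set(predicted))
    imap = {key: i for i, key in enumerate(labels)}
    matrix = [[0 for _ in labels] for _ in labels]
    # Rows = predicted class, columns = actual class.
    for p, a in zip(predicted, actual):
        matrix[imap[p]][imap[a]] += 1
    return labels, matrix


# "Z" is predicted but never appears among the actual labels.
found_labels, cm_all = confusionmatrix_all_labels(["A", "A", "B"], ["A", "B", "Z"])
print(found_labels)  # ['A', 'B', 'Z']
print(cm_all)        # [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
```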
# Actual & Predicted Classes
actual = ["A", "B", "C", "C", "B", "C", "C", "B", "A", "A", "B", "A", "B", "C", "A", "B", "C"]
predicted = ["A", "B", "B", "C", "A", "C", "A", "B", "C", "A", "B", "B", "B", "C", "A", "A", "C"]
# Initialize Performance Class
performance = Performance(actual, predicted)
# Print Confusion Matrix
performance.tabulate()
===================================
        Aᴬ      Bᴬ      Cᴬ
Aᴾ      3       2       1
Bᴾ      1       4       1
Cᴾ      1       0       4
Note: classᴾ = Predicted, classᴬ = Actual
===================================
# Print Normalized Confusion Matrix
performance.tabulate(normalized = True)
===================================
        Aᴬ      Bᴬ      Cᴬ
Aᴾ      17.65%  11.76%  5.88%
Bᴾ      5.88%   23.53%  5.88%
Cᴾ      5.88%   0.00%   23.53%
Note: classᴾ = Predicted, classᴬ = Actual
===================================
import numpy as np

def compute_confusion_matrix(true, pred):
    '''Computes a confusion matrix using numpy for two np.arrays
    true and pred.
    Results are identical (and similar in computation time) to:
    "from sklearn.metrics import confusion_matrix"
    However, this function avoids the dependency on sklearn.'''
    K = len(np.unique(true)) # Number of classes
    result = np.zeros((K, K))
    for i in range(len(true)):
        result[true[i]][pred[i]] += 1
    return result
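import numpy as np
classes = 3
true = np.random.randint(0, classes, 50)
pred = np.random.randint(0, classes, 50)
# minlength guarantees classes * classes bins even if the last (true, pred) pairs never occur
np.bincount(true * classes + pred, minlength=classes * classes).reshape((classes, classes))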
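The bincount approach works because each (true, pred) pair is encoded as the single integer true * classes + pred, which is exactly the flat index of cell [true, pred] in a classes-by-classes matrix. A small deterministic sketch, reusing the label lists from the scikit-learn example (the variable names codes and cm_bincount are made up here):

```python
import numpy as np

classes = 3
true = np.array([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
pred = np.array([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])

# Flat index of cell [true, pred] in a row-major classes x classes matrix.
codes = true * classes + pred
# Count each flat index; minlength pads missing trailing cells with zeros.
cm_bincount = np.bincount(codes, minlength=classes * classes).reshape((classes, classes))
print(cm_bincount)
```

Rows correspond to the true class and columns to the predicted class, and labels must be integers in the range 0 to classes - 1.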
def confusion_matrix(actual, predicted):
    classes = np.unique(np.concatenate((actual, predicted)))
    # np.int was removed from recent NumPy releases; the built-in int works everywhere
    confusion_mtx = np.empty((len(classes), len(classes)), dtype=int)
    for i, a in enumerate(classes):
        for j, p in enumerate(classes):
            confusion_mtx[i, j] = np.where((actual == a) * (predicted == p))[0].shape[0]
    return confusion_mtx
actual = np.array([1,1,1,1,0,0,0,0])
predicted = np.array([1,1,1,1,0,0,0,1])
confusion_matrix(actual,predicted)
0 1
0 3 1
1 0 4
actual = np.array(["a","a","a","a","b","b","b","b"])
predicted = np.array(["a","a","a","a","b","b","b","a"])
confusion_matrix(actual,predicted)
0 1
0 4 0
1 1 3
actual = np.array(["a","a","a","a","b","b","b","b"])
predicted = np.array(["a","a","a","a","b","b","b","z"]) # <-- notice the 3rd class, "z"
confusion_matrix(actual,predicted)
0 1 2
0 4 0 0
1 0 3 1
2 0 0 0
actual = np.array(["a","a","a","x","x","b","b","b"]) # <-- notice the 4th class, "x"
predicted = np.array(["a","a","a","a","b","b","b","z"])
confusion_matrix(actual,predicted)
0 1 2 3
0 3 0 0 0
1 0 2 0 1
2 1 1 0 0
3 0 0 0 0
def get_confusion_matrix(l1, l2):
    assert len(l1) == len(l2), "Two lists have different size."
    K = len(np.unique(l1))
    # create label-index value
    label_index = dict(zip(np.unique(l1), np.arange(K)))
    result = np.zeros((K, K))
    for i in range(len(l1)):
        result[label_index[l1[i]]][label_index[l2[i]]] += 1
    return result
def confusionMatrix(actual, pred):
    # actual and pred must be boolean NumPy arrays of the same length
    TP = (actual==pred)[actual].sum()
    TN = (actual==pred)[~actual].sum()
    FP = (actual!=pred)[~actual].sum()
    FN = (actual!=pred)[actual].sum()
    return [[TP, TN], [FP, FN]]
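The function above returns its counts as [[TP, TN], [FP, FN]], which differs from the more common layout with actual classes in rows and predicted classes in columns. A hedged sketch of the same counts in that conventional layout, assuming boolean NumPy arrays where True marks the positive class (binary_counts is a name invented for this example):

```python
import numpy as np


def binary_counts(actual, pred):
    # Assumes boolean NumPy arrays of equal length; True = positive class.
    tp = int((actual & pred).sum())
    tn = int((~actual & ~pred).sum())
    fp = int((~actual & pred).sum())
    fn = int((actual & ~pred).sum())
    # Rows = actual (negative, positive), columns = predicted (negative, positive).
    return [[tn, fp], [fn, tp]]


actual_flags = np.array([True, True, False, False, True])
pred_flags = np.array([True, False, False, True, True])
counts_2x2 = binary_counts(actual_flags, pred_flags)
print(counts_2x2)  # [[1, 1], [1, 2]]
```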