Python: (efficiently) creating fake truth/predicted values from a confusion matrix

python numpy pandas

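For testing purposes, I need to create fake truth/predicted values from a confusion matrix.

My confusion matrix is stored in a Pandas DataFrame, created with: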

labels = ['N', 'L', 'R', 'A', 'P', 'V']
df = pd.DataFrame([
    [1971, 19, 1, 8, 0, 1],
    [16, 1940, 2, 23, 9, 10],
    [8, 3, 181, 87, 0, 11],
    [2, 25, 159, 1786, 16, 12],
    [0, 24, 4, 8, 1958, 6],
    [11, 12, 29, 11, 11, 1926] ], columns=labels, index=labels)
df.index.name = 'Actual'
df.columns.name = 'Predicted'

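I'm assuming that the index holds the actual values and the columns hold the predicted values.

This confusion matrix looks like: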

Predicted     N     L    R     A     P     V
Actual
N          1971    19    1     8     0     1
L            16  1940    2    23     9    10
R             8     3  181    87     0    11
A             2    25  159  1786    16    12
P             0    24    4     8  1958     6
V            11    12   29    11    11  1926

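I'm looking for an efficient way to create two NumPy arrays, y_true and y_predict, that would produce such a confusion matrix.

My first idea was to start by creating NumPy arrays of the right size.

So I did: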

N_all = df.sum().sum()

y_true = np.empty(N_all)
y_pred = np.empty(N_all)

But I don't know how to fill these two NumPy arrays efficiently.

The same code should also work for a binary confusion matrix such as:

labels = [False, True]
df = pd.DataFrame([
    [5, 3],
    [2, 7]], columns=labels, index=labels)
df.index.name = 'Actual'
df.columns.name = 'Predicted'

This binary confusion matrix looks like:

Predicted  False  True
Actual
False          5      3
True           2      7

If you want to recreate it exactly, you can use the following function:

def create_arrays(df):
    # Unstack to make tuples of actual,pred,count
    df = df.unstack().reset_index()

    # Pull the value labels and counts
    actual = df['Actual'].values
    predicted = df['Predicted'].values
    totals = df.iloc[:,2].values

    # Use list comprehension to create original arrays
    y_true = [[curr_val]*n for (curr_val, n) in zip(actual, totals)]
    y_predicted = [[curr_val]*n for (curr_val, n) in zip(predicted, totals)]

    # They come nested so flatten them
    y_true = [item for sublist in y_true for item in sublist]
    y_predicted = [item for sublist in y_predicted for item in sublist]

    return y_true, y_predicted

We can check that this produces the expected result:

import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ['N', 'L', 'R', 'A', 'P', 'V']
df = pd.DataFrame([
    [1971, 19, 1, 8, 0, 1],
    [16, 1940, 2, 23, 9, 10],
    [8, 3, 181, 87, 0, 11],
    [2, 25, 159, 1786, 16, 12],
    [0, 24, 4, 8, 1958, 6],
    [11, 12, 29, 11, 11, 1926] ], columns=labels, index=labels)
df.index.name = 'Actual'
df.columns.name = 'Predicted'

# Recreate the original confusion matrix and check for equality
y_t, y_p = create_arrays(df)
conf_mat = confusion_matrix(y_t,y_p)
check_labels = np.unique(y_t)

df_new = pd.DataFrame(conf_mat, columns=check_labels, index=check_labels).loc[labels, labels]
df_new.index.name = 'Actual'
df_new.columns.name = 'Predicted'

df == df_new

Output:

Predicted     N     L     R     A     P     V
Actual                                       
N          True  True  True  True  True  True
L          True  True  True  True  True  True
R          True  True  True  True  True  True
A          True  True  True  True  True  True
P          True  True  True  True  True  True
V          True  True  True  True  True  True

For the binary case:

# And for the binary
labels = ['False', 'True']
df = pd.DataFrame([
    [5, 3],
    [2, 7]], columns=labels, index=labels)
df.index.name = 'Actual'
df.columns.name = 'Predicted'

# Recreate the original confusion matrix and check for equality
y_t, y_p = create_arrays(df)
conf_mat = confusion_matrix(y_t,y_p)
check_labels = np.unique(y_t)

df_new = pd.DataFrame(conf_mat, columns=check_labels, index=check_labels).loc[labels, labels]
df_new.index.name = 'Actual'
df_new.columns.name = 'Predicted'

df == df_new

Predicted False  True
Actual               
False      True  True
True       True  True
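
Note that the binary check above uses the strings 'False' and 'True' as labels, while the question uses real booleans. With booleans, .loc[labels, labels] treats the list as a mask rather than as labels, so only the True row and column survive. A minimal sketch of one way around this, assuming reindex is an acceptable substitute for .loc and using pd.crosstab in place of sklearn's confusion_matrix (both are my substitutions, not part of the answer above; the y_true and y_pred lists are written by hand purely for illustration):

```python
import pandas as pd

# Same binary matrix as in the question, with real booleans as labels
labels = [False, True]
df = pd.DataFrame([[5, 3], [2, 7]], columns=labels, index=labels)
df.index.name = 'Actual'
df.columns.name = 'Predicted'

# A list of booleans passed to .loc acts as a row/column mask, not as labels,
# so this keeps only the last row and the last column
masked = df.loc[labels, labels]

# Hand-written arrays that match the matrix, for illustration only
y_true = [False] * 8 + [True] * 9
y_pred = [False] * 5 + [True] * 3 + [False] * 2 + [True] * 7

# Rebuild the matrix with pd.crosstab, then order it by label with reindex,
# which always interprets its arguments as labels
df_new = pd.crosstab(
    pd.Series(y_true, name='Actual'),
    pd.Series(y_pred, name='Predicted'),
)
df_new = df_new.reindex(index=labels, columns=labels, fill_value=0)

# Compare raw counts so that index dtypes cannot interfere
same = (df_new.to_numpy() == df.to_numpy()).all()
```

If you prefer to stay with sklearn, confusion_matrix also accepts a labels argument that fixes the row and column order, which should make the .loc step unnecessary.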

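So you take the matrix as input and need to generate the vectors y_true and y_predicted? Entries with matching labels should get the same value in both vectors, and entries with differing labels should be pushed to each vector as their predicted/true value. I think I'm missing the question.

You could add

df_comp = df != df_new

and

assert df_comp.sum().sum() == 0

but it doesn't work for the binary case:

Predicted  False  True
Actual
False          5      3
True           2      7

Predicted  True
Actual
True          7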

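The y_true array can be created using

sum_true = df.sum(axis=1).shift(1).fillna(0)
sum_predicted = df.sum(axis=0).shift(1).fillna(0)
index = np.arange(N_all)
s_true = pd.Series(index=index)
s_true[sum_true] = sum_true.index
s_true = s_true.fillna(method='ffill')
y_true = s_true.values

but I still don't know how to create y_ efficiently. I'm using a Pandas Series, but pure NumPy would probably be more efficient.

@SCL Does the function above work? Are you passing booleans rather than strings? If so, that will cause it to fail.

@scls I'm confused... I wrote the function above to create both arrays using almost nothing but the stdlib, so isn't it fast?

No, sorry, it doesn't work for the binary case. It raises

ValueError: Can only compare identically-labeled DataFrame objects

df.columns is Index([False, True], dtype='object') but df_new.columns is Index([True], dtype='object'). For loops are slower than NumPy calls (C).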
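
Following up on the point about loops versus NumPy calls, here is a rough sketch of a loop-free variant built on np.repeat and np.tile. The function name create_arrays_np is hypothetical, and the sketch assumes the DataFrame has actual labels on the index and predicted labels on the columns, as in the question. It has not been benchmarked against the list-comprehension version.

```python
import numpy as np
import pandas as pd

def create_arrays_np(df):
    """Expand a confusion matrix (rows = actual, columns = predicted) into label arrays."""
    counts = df.to_numpy()
    n_rows, n_cols = counts.shape
    # One (actual, predicted) label pair per cell, in row-major order
    # so that the pairs line up with counts.ravel()
    actual = np.repeat(df.index.to_numpy(), n_cols)
    predicted = np.tile(df.columns.to_numpy(), n_rows)
    # Repeat each pair as many times as its cell count
    flat = counts.ravel()
    return np.repeat(actual, flat), np.repeat(predicted, flat)

# The six-class matrix from the question
labels = ['N', 'L', 'R', 'A', 'P', 'V']
df = pd.DataFrame([
    [1971, 19, 1, 8, 0, 1],
    [16, 1940, 2, 23, 9, 10],
    [8, 3, 181, 87, 0, 11],
    [2, 25, 159, 1786, 16, 12],
    [0, 24, 4, 8, 1958, 6],
    [11, 12, 29, 11, 11, 1926]], columns=labels, index=labels)

y_true, y_pred = create_arrays_np(df)
```

Because the labels are taken from the index and columns by position and never used as lookup keys, this should behave the same way for string and boolean labels.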