Scikit-learn: precision-recall curve and thresholds in sklearn

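I would like to know how sklearn decides how many thresholds to use in the precision-recall curve. There is another post on this topic, which points to the source code, where I found this example: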

import numpy as np
from sklearn.metrics import precision_recall_curve
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
This gives:

>>> precision
array([0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
array([1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.35, 0.4 , 0.8 ])

Could someone explain how these recall and precision values are obtained, by showing me what is actually computed?

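I know I'm a bit late, but I had the same doubt, and the link you provided cleared it up. Roughly speaking, here is what happens inside sklearn's precision_recall_curve() implementation: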

  • The decision scores are sorted in descending order, and the labels are reordered to match:

    desc_score_indices = np.argsort(y_scores, kind="mergesort")[::-1]
    y_scores = y_scores[desc_score_indices]
    y_true = y_true[desc_score_indices]
    
    You will get:

    y_scores, y_true
    (array([0.8 , 0.4 , 0.35, 0.1 ]), array([1, 0, 1, 0]))
    
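  • The sklearn implementation then excludes duplicate values of y_scores (there are no duplicates in this example).

    Since there are no duplicates, you will get: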

    distinct_value_indices, threshold_idxs 
    (array([0, 1, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64))
    
  • Next, you can compute the number of true positives and false positives, from which you can in turn compute precision and recall:

    # tps at index i being the number of positive samples assigned a score >= thresholds[i]
    tps = np.cumsum(y_true)[threshold_idxs]
    # fps at index i being the number of negative samples assigned a score >= thresholds[i], sklearn computes it as fps = 1 + threshold_idxs - tps
    fps = np.cumsum(1 - y_true)[threshold_idxs]
    y_scores = y_scores[threshold_idxs]
    
    After this step, you will have two arrays holding the number of true positives and false positives for each score considered:

    tps, fps
    (array([1, 1, 2, 2], dtype=int32), array([0, 1, 1, 2], dtype=int32))
    
  • Finally, you can compute precision and recall:

    precision = tps / (tps + fps)
    # tps[-1] being the total number of positive samples
    recall = tps / tps[-1]
    
    precision, recall
    (array([1.        , 0.5       , 0.66666667, 0.5       ]), array([0.5, 0.5, 1. , 1. ]))
    
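    An important point, which makes the thresholds array shorter than y_scores (even though y_scores contains no duplicates), is the one made in the link you referenced. Basically, the index of the first occurrence of recall equal to 1 defines the length of the thresholds array (here that index is 2, which corresponds to a length of 3, and this is why thresholds has length 3).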

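    One last point: precision and recall have length 4 because a precision value of 1 and a recall value of 0 are appended to the arrays obtained, so that the precision-recall curve starts on the y-axis. Both arrays are also reversed before that, which is why recall is returned in decreasing order.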
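The steps above can be put together in a rough end-to-end sketch that uses only NumPy. The helper name pr_curve_sketch is made up for illustration and is not part of sklearn. The two lines that build distinct_value_indices and threshold_idxs, and the final searchsorted/slice step, are a reading of the older sklearn source rather than something shown above. The sketch assumes binary 0/1 labels with at least one positive sample. More recent scikit-learn releases may no longer truncate at the first full-recall point, in which case they return one more threshold than this sketch does.

```python
import numpy as np


def pr_curve_sketch(y_true, y_scores):
    """Hypothetical helper retracing the steps above (not sklearn's API)."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)

    # 1. Sort scores in descending order and reorder the labels to match.
    order = np.argsort(y_scores, kind="mergesort")[::-1]
    y_scores, y_true = y_scores[order], y_true[order]

    # 2. Keep the last index of each run of equal scores, plus the final index.
    distinct_value_indices = np.where(np.diff(y_scores))[0]
    threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

    # 3. Cumulative true/false positive counts at each candidate threshold.
    tps = np.cumsum(y_true)[threshold_idxs]
    fps = 1 + threshold_idxs - tps
    candidates = y_scores[threshold_idxs]

    precision = tps / (tps + fps)
    recall = tps / tps[-1]  # tps[-1] is the total number of positives

    # 4. Truncate at the first index where recall reaches 1, reverse the
    #    arrays, and append the (precision=1, recall=0) end point.
    last_ind = tps.searchsorted(tps[-1])
    sl = slice(last_ind, None, -1)
    return np.r_[precision[sl], 1.0], np.r_[recall[sl], 0.0], candidates[sl]


# The example from the question.
precision, recall, thresholds = pr_curve_sketch([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(precision, recall, thresholds)

# An example with tied scores: the three 0.5 scores collapse into one threshold.
print(pr_curve_sketch([0, 1, 1, 0, 1], [0.2, 0.5, 0.5, 0.5, 0.9]))
```

On the question's data this reproduces the three arrays quoted in the question. In the second call, five samples yield only two thresholds (0.5 and 0.9), because the tied scores are merged and the curve is cut at the first point where recall reaches 1.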