Scikit-learn: sklearn precision_recall_curve and thresholds
I would like to know how sklearn decides how many thresholds to use in a precision-recall curve. There is another post on this, which mentions the source code where I found this example:
import numpy as np
from sklearn.metrics import precision_recall_curve
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
This gives:
>>> precision
array([0.66666667, 0.5 , 1. , 1. ])
>>> recall
array([1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.35, 0.4 , 0.8 ])
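Can someone explain how these recall and precision values are obtained, by showing me what is actually computed?
I know I am a bit late, but I had the same doubt, and the link you provided cleared it up. Roughly speaking, here is what happens in precision_recall_curve(), following the sklearn implementation. First, the scores are sorted in decreasing order: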
desc_score_indices = np.argsort(y_scores, kind="mergesort")[::-1]
y_scores = y_scores[desc_score_indices]
y_true = y_true[desc_score_indices]
You will get:
y_scores, y_true
(array([0.8 , 0.4 , 0.35, 0.1 ]), array([1, 0, 1, 0]))
The sklearn implementation then excludes duplicate values of y_scores (there are none in this example).
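A minimal sketch of how such duplicates can be skipped, assuming the scores are already sorted in decreasing order. It is modelled on the np.diff approach in sklearn's source as I remember it, so treat it as an illustration rather than the exact library code. tied_scores is a made-up input, added only to show what a tie does.

```python
import numpy as np

# Running example, already sorted by decreasing score.
y_scores = np.array([0.8, 0.4, 0.35, 0.1])
y_true = np.array([1, 0, 1, 0])

# Positions where the score differs from the next one.
distinct_value_indices = np.where(np.diff(y_scores))[0]
# Append the last position so the lowest score is also a candidate threshold.
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

# Hypothetical input with a tie: only the last index of a run of equal
# scores survives, so 0.4 contributes one candidate threshold, not two.
tied_scores = np.array([0.8, 0.4, 0.4, 0.1])
tied_idxs = np.r_[np.where(np.diff(tied_scores))[0], tied_scores.size - 1]

print(threshold_idxs)  # [0 1 2 3]
print(tied_idxs)       # [0 2 3]
```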
Since there are no duplicates, you will get:
distinct_value_indices, threshold_idxs
(array([0, 1, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64))
# tps at index i being the number of positive samples assigned a score >= thresholds[i]
tps = np.cumsum(y_true)[threshold_idxs]
# fps at index i being the number of negative samples assigned a score >= thresholds[i], sklearn computes it as fps = 1 + threshold_idxs - tps
fps = np.cumsum(1 - y_true)[threshold_idxs]
y_scores = y_scores[threshold_idxs]
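After this step you have two arrays holding the number of true positives and false positives for each score considered, from which precision and recall are then computed: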
tps, fps
(array([1, 1, 2, 2], dtype=int32), array([0, 1, 1, 2], dtype=int32))
precision = tps / (tps + fps)
# tps[-1] being the total number of positive samples
recall = tps / tps[-1]
precision, recall
(array([1. , 0.5 , 0.66666667, 0.5 ]), array([0.5, 0.5, 1. , 1. ]))
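As a cross-check, the same numbers can be obtained straight from the definitions, treating every sample with score >= threshold as a predicted positive. This is only a sketch in plain Python; precision_recall_at is a hypothetical helper written for this example, not part of sklearn.

```python
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]


def precision_recall_at(threshold):
    """Precision and recall when 'score >= threshold' means predicted positive."""
    predicted = [s >= threshold for s in y_scores]
    tp = sum(1 for p, t in zip(predicted, y_true) if p and t == 1)
    fp = sum(1 for p, t in zip(predicted, y_true) if p and t == 0)
    total_positives = sum(y_true)
    # tp + fp is never zero here because each threshold is one of the scores.
    return tp / (tp + fp), tp / total_positives


# One entry per candidate threshold, from the highest score to the lowest.
results = {thr: precision_recall_at(thr) for thr in sorted(y_scores, reverse=True)}
for thr, (p, r) in results.items():
    print(thr, round(p, 4), r)
```

Read from the highest threshold to the lowest, the pairs are (1, 0.5), (0.5, 0.5), (0.6667, 1) and (0.5, 1), matching the cumulative-sum computation.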
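An important point, which makes the thresholds array shorter than y_scores even when y_scores contains no duplicates, is the one made in the link you quoted. Basically, the index of the first occurrence of recall equal to 1 defines the length of the thresholds array (here index 2, corresponding to length 3, which is why thresholds has length 3).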
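Finally, precision and recall have length 4 because a precision value of 1 and a recall value of 0 are appended to the arrays obtained, so that the precision-recall curve starts on the y-axis.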
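These last two steps can be sketched as follows, starting from the tps and fps values of the walkthrough. This is my reconstruction of the logic that yields the output shown in the question; newer sklearn releases may return a different number of thresholds, so do not take it as the behaviour of any particular version. The *_out names are chosen for this example only.

```python
import numpy as np

# Values from the walkthrough (scores sorted in decreasing order).
y_scores = np.array([0.8, 0.4, 0.35, 0.1])
tps = np.array([1, 1, 2, 2])
fps = np.array([0, 1, 1, 2])

precision = tps / (tps + fps)
recall = tps / tps[-1]

# First position at which every positive sample has been found (recall == 1).
last_ind = tps.searchsorted(tps[-1])
# Keep positions last_ind down to 0, reversed so that recall is decreasing.
sl = slice(last_ind, None, -1)

# Append the (precision=1, recall=0) point that anchors the curve on the y-axis.
precision_out = np.r_[precision[sl], 1]
recall_out = np.r_[recall[sl], 0]
thresholds_out = y_scores[sl]

print(precision_out)   # [0.66666667 0.5        1.         1.        ]
print(recall_out)      # [1.  0.5 0.5 0. ]
print(thresholds_out)  # [0.35 0.4  0.8 ]
```

The score 0.1 is dropped because lowering the threshold past 0.35 cannot raise recall any further, which is why three thresholds remain for four samples.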