Deciding threshold for glm logistic regression model in R


I have some data with predictors and a binary target. For example:

df <- data.frame(a=sort(sample(1:100,30)), b= sort(sample(1:100,30)), 
                 target=c(rep(0,11),rep(1,4),rep(0,4),rep(1,11)))

Now I try to predict the output (for the example, the same data should suffice).
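A minimal sketch of this step, assuming the model is a binomial glm of target on a and b. The formula, the object names model and probs, and the seed are assumptions made for illustration:

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(a = sort(sample(1:100, 30)), b = sort(sample(1:100, 30)),
                 target = c(rep(0, 11), rep(1, 4), rep(0, 4), rep(1, 11)))

# Assumed model: binomial glm of target on both predictors
model <- glm(target ~ a + b, data = df, family = binomial)

# type = "response" returns probabilities rather than log-odds
probs <- predict(model, newdata = df, type = "response")
head(probs)
```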

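This generates a vector of probabilities. But I want to predict the actual class. I could use round() on the probabilities, but this assumes that anything below 0.5 is class "0" and anything above 0.5 is class "1". Is this a correct assumption, even when the class populations may not be equal (or close to equal)? Or is there a way to estimate this threshold?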

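The gold standard for determining good model parameters, including "what threshold should I set" for logistic regression, is cross-validation.

The general idea is to hold out one or more parts of your training set and choose the threshold that maximizes the number of correct classifications on the held-out set.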
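The idea can be sketched in base R as follows. This is only an illustration under stated assumptions: the model formula target ~ a + b, the choice of 5 folds, the threshold grid and the seed are all made up for the example, and accuracy is used as the criterion because the answer speaks of correct classifications.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(a = sort(sample(1:100, 30)), b = sort(sample(1:100, 30)),
                 target = c(rep(0, 11), rep(1, 4), rep(0, 4), rep(1, 11)))

k <- 5                                             # number of folds (assumed)
fold <- sample(rep(seq_len(k), length.out = nrow(df)))
oof <- numeric(nrow(df))                           # out-of-fold probabilities

for (i in seq_len(k)) {
  held_out <- fold == i
  # Fit on the other folds, predict only the held-out rows
  fit <- glm(target ~ a + b, data = df[!held_out, ], family = binomial)
  oof[held_out] <- predict(fit, newdata = df[held_out, ], type = "response")
}

# Accuracy of the held-out predictions at each candidate threshold
thresholds <- seq(0.05, 0.95, by = 0.05)
accuracy <- sapply(thresholds, function(t) mean((oof > t) == df$target))
best_threshold <- thresholds[which.max(accuracy)]
best_threshold
```

Because the threshold is tuned on the held-out predictions, the accuracy at best_threshold is an optimistic estimate; a separate test set would be needed to report an unbiased error.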

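The best threshold (or cutoff) point to use in glm models is the point that maximizes both specificity and sensitivity. This threshold might not give the highest prediction accuracy in your model, but it will not be biased towards positives or negatives.

The ROCR package contains functions that can help you do this. Check the performance() function in this package; it will get you what you are looking for. Here is a picture of what you can expect to get:

[Figure: sensitivity and specificity plotted against the cutoff]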

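After finding the cutoff point, I normally write a function myself to count the data points whose predicted value is above the cutoff and match them against the group they belong to.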
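Such a function might look like the sketch below. The name classify_at_cutoff and the probabilities and labels are hypothetical, made up purely to show the shape of the result; it is not the answerer's own function.

```r
# Hypothetical helper: cross-tabulate predicted class against actual class
classify_at_cutoff <- function(probs, labels, cutoff) {
  # Predict class 1 when the probability is above the cutoff
  predicted <- factor(as.integer(probs > cutoff), levels = c(0, 1))
  actual <- factor(labels, levels = c(0, 1))
  table(predicted = predicted, actual = actual)
}

probs  <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.90)   # made-up probabilities
labels <- c(0, 0, 1, 1, 0, 1)                     # made-up true classes
tab <- classify_at_cutoff(probs, labels, cutoff = 0.5)
tab
```

The diagonal of the table holds the correctly classified points, so sum(diag(tab)) / sum(tab) gives the accuracy at that cutoff.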

You can try the following:

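library(ROCR)

# pred is assumed to be a ROCR prediction object, e.g. prediction(probabilities, labels)
perfspec <- performance(prediction.obj = pred, measure = "spec", x.measure = "cutoff")
plot(perfspec)

# Overlay sensitivity on the same plot region
par(new = TRUE)
perfsens <- performance(prediction.obj = pred, measure = "sens", x.measure = "cutoff")
plot(perfsens)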
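The snippet above relies on a pred object that it does not define. A minimal sketch of how it might be built, assuming a binomial glm of target on a and b fitted to the example data (the formula, the seed and the names model, probs and cutoffs are assumptions):

```r
library(ROCR)

set.seed(1)  # assumption: seed added only for reproducibility
df <- data.frame(a = sort(sample(1:100, 30)), b = sort(sample(1:100, 30)),
                 target = c(rep(0, 11), rep(1, 4), rep(0, 4), rep(1, 11)))

model <- glm(target ~ a + b, data = df, family = binomial)  # assumed formula
probs <- predict(model, newdata = df, type = "response")    # probabilities in [0, 1]

pred <- prediction(probs, df$target)   # ROCR prediction object
perfsens <- performance(pred, measure = "sens", x.measure = "cutoff")
perfspec <- performance(pred, measure = "spec", x.measure = "cutoff")

cutoffs <- unlist(perfsens@x.values)
range(cutoffs[is.finite(cutoffs)])     # finite cutoffs stay within [0, 1]
```

Note that predict() on a glm defaults to type = "link", which returns log-odds; scores on that scale give cutoffs well outside the 0 to 1 range.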

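Twelve methods are implemented in the function PresenceAbsence::optimal.thresholds of the PresenceAbsence package.

This is also covered in Freeman, E. A. and Moisen, G. G. (2008). A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecological Modelling, 217(1-2), 48-58.
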
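To programmatically get the threshold in your data with the closest sensitivity and specificity values (i.e. the crossover in the graph above), you can use this code, which gets pretty close:
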
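library(ROCR)
# PREDS: predicted probabilities; LABELS: true 0/1 classes
predictions <- prediction(PREDS, LABELS)
sens_perf <- performance(predictions, "sens")
spec_perf <- performance(predictions, "spec")
# Columns: cutoff, measure value
sens <- cbind(unlist(sens_perf@x.values), unlist(sens_perf@y.values))
spec <- cbind(unlist(spec_perf@x.values), unlist(spec_perf@y.values))
# Cutoff of the sensitivity point lying closest (L1 distance) to any specificity point
sens[which.min(apply(sens, 1, function(x) min(colSums(abs(t(spec) - x))))), 1]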
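The crossover is only one possible criterion; another is the cutoff where the sum of sensitivity and specificity is largest. The base R sketch below computes both by hand, without ROCR. The probabilities and labels are made up for illustration, and the convention of predicting class 1 at or above the cutoff is an assumption.

```r
# Made-up probabilities and true classes, for illustration only
probs  <- c(0.05, 0.20, 0.30, 0.45, 0.55, 0.60, 0.70, 0.85, 0.95)
labels <- c(0,    0,    1,    0,    0,    1,    1,    1,    1)

cutoffs <- sort(unique(probs))
# Assumed convention: predict class 1 when the probability is at or above the cutoff
sens <- sapply(cutoffs, function(t) mean(probs[labels == 1] >= t))
spec <- sapply(cutoffs, function(t) mean(probs[labels == 0] <  t))

crossover_cutoff <- cutoffs[which.min(abs(sens - spec))]  # sensitivity closest to specificity
max_sum_cutoff   <- cutoffs[which.max(sens + spec)]       # largest sensitivity + specificity
c(crossover = crossover_cutoff, max_sum = max_sum_cutoff)
```

The two criteria need not agree, so the choice between them should follow from how costly false positives are relative to false negatives.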

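There are different criteria; one is the point where the sum of sensitivity and specificity is maximal.

@adibender Thanks! But it would surely be incorrect to use the population fraction as the threshold, right? That is, if 30% of cases in the population are "0" and 70% are "1", a naive estimate would be to use 0.3 as the threshold. But that is not a logical way to approach this?

Since we would be tuning the threshold parameter on the cross-validation data, ostensibly this would require a third, held-out evaluation set to report an unbiased expected error?

@user2175594, yes, that is correct. Traditionally, you have at least three separate partitions of your data: training, validation, and testing (evaluation). However, if you are doing something like k-fold cross-validation, then training and validation are essentially the same set repartitioned in multiple ways.

Can you provide more specific code that generates the graph above? Also, how can the cutoff be between 0 and 14 for probabilities that take values between 0 and 1?

I added the baseR/ggplot approach below!
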

# Base R approach: replicate the first graph, given predictions <- prediction(PREDS, LABELS)
sens_perf <- performance(predictions, "sens")
spec_perf <- performance(predictions, "spec")
plot(unlist(sens_perf@x.values), unlist(sens_perf@y.values),
     type="l", lwd=2, ylab="Sensitivity", xlab="Cutoff")
par(new=TRUE)
# Second curve drawn without its own axes; the right-hand axis is added below
plot(unlist(spec_perf@x.values), unlist(spec_perf@y.values),
     type="l", lwd=2, col='red', ylab="", xlab="", axes=FALSE)
axis(4, at=seq(0,1,0.2))
mtext("Specificity", side=4, padj=-2, col='red')

# ggplot2 approach, given the same predictions object
library(ggplot2)

sens <- data.frame(x=unlist(performance(predictions, "sens")@x.values), 
                   y=unlist(performance(predictions, "sens")@y.values))
spec <- data.frame(x=unlist(performance(predictions, "spec")@x.values), 
                   y=unlist(performance(predictions, "spec")@y.values))

ggplot(sens, aes(x, y)) + 
  geom_line() + 
  # colour set outside aes() so the line is drawn in literal red
  geom_line(data=spec, aes(x, y), col="red") +
  scale_y_continuous(sec.axis = sec_axis(~., name = "Specificity")) +
  labs(x='Cutoff', y="Sensitivity") +
  theme(axis.title.y.right = element_text(colour = "red"), legend.position="none")
