R - Comparing the performance of two types while controlling for an interaction


r statistics


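I have been programming in R and have a dataset containing the results (success or not) of two machine learning algorithms, each of which has been tried with different amounts of parameters. A sample is provided below: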

type success paramater_amount
a1     0       15639
a1     0       18623
a1     1       19875
a2     1       12513
a2     1       10256
a2     0       12548

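I now want to compare the two algorithms to see which one has the best overall performance. But there is a catch. It is known that the higher the paramater_amount value, the higher the chance of success. Looking at the parameter amounts the two algorithms were tested with, one can also see that a1 was tested with higher parameter amounts than a2. That makes it unfair to simply count the number of successes of each algorithm.

What would be a good way to handle this situation?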
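To make the imbalance described above concrete, here is a minimal sketch in base R that rebuilds the six sample rows and compares the raw success rate with the average parameter amount per algorithm. The variable names `naive_rate` and `mean_params` are illustrative and not part of the original post; the column spelling `paramater_amount` is kept as posted.

```r
# Sample rows copied from the question (column spelling kept as posted)
df <- data.frame(
  type = c("a1", "a1", "a1", "a2", "a2", "a2"),
  success = c(0, 0, 1, 1, 1, 0),
  paramater_amount = c(15639, 18623, 19875, 12513, 10256, 12548)
)

# Naive comparison: raw success rate per algorithm
naive_rate <- tapply(df$success, df$type, mean)

# Imbalance check: average parameter amount per algorithm
mean_params <- tapply(df$paramater_amount, df$type, mean)

print(naive_rate)
print(mean_params)
```

With only six rows this proves nothing statistically; it only shows that the raw rates and the parameter amounts differ between the two groups at the same time, which is why the raw rates alone are hard to interpret.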

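I will give you an answer, but with no guarantee that what I say is right. In fact, to get a more precise answer you should provide more information about the algorithms and the rest of the setup. I would also suggest migrating this question to Cross Validated.

In fact, your question is a statistical one. In statistics we look for sparsity: at a given level of performance we prefer a simple model over a very complex one, because we worry about overfitting:

One approach is to compare performance against model complexity, as in this toy example: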

library(tidyverse)  # loads dplyr and ggplot2

set.seed(123)
# number of simulated runs for each model
n <- 1000

# model 1: success (1) or failure (0), each with probability 0.5
performance_1 <- round(runif(n))
complexity_1 <- round(rnorm(n, mean = n, sd = 50))

# model 2: values in [0, 0.6] round to 1 only above 0.5, so successes are rarer
performance_2 <- round(runif(n, min = 0, max = 0.6))
complexity_2 <- round(rnorm(n, mean = n, sd = 50))

df <- data.frame(performance = c(performance_1, performance_2),
                 complexity = c(complexity_1, complexity_2),
                 models = as.factor(c(rep(1, n), rep(2, n))))

# number of successes at each complexity value, per model
temp <- df %>%
  group_by(complexity, models) %>%
  summarise(perf = sum(performance), .groups = "drop")

# smoothed success count against complexity, one curve per model
ggplot(temp, aes(x = complexity, y = perf, colour = models, fill = models)) +
  geom_smooth() +
  theme_classic()
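
A complementary way to compare the two algorithms at equal parameter amounts is a logistic regression with the algorithm type and the parameter amount as predictors. The following is only a sketch under stated assumptions, not a method taken from the original answer: the data are simulated, and the sample size, the means and the coefficients in `log_odds` are made-up values chosen so that a1 is tried with more parameters while a2 is better at a given parameter amount.

```r
set.seed(123)
n <- 1000  # hypothetical number of runs per algorithm

# Simulated data shaped like the question: a1 is tried with larger
# parameter amounts than a2 (all numbers here are assumptions)
sim <- data.frame(
  type = factor(rep(c("a1", "a2"), each = n)),
  paramater_amount = c(rnorm(n, mean = 18000, sd = 2000),
                       rnorm(n, mean = 12000, sd = 2000))
)

# Assumed truth for the simulation: more parameters help, and a2 is
# better than a1 at the same parameter amount
log_odds <- -6 + 0.0004 * sim$paramater_amount + 1 * (sim$type == "a2")
sim$success <- rbinom(nrow(sim), size = 1, prob = plogis(log_odds))

# Logistic regression: effect of type adjusted for parameter amount
fit <- glm(success ~ type + paramater_amount, family = binomial, data = sim)
print(summary(fit)$coefficients)

# Predicted success probability for both algorithms at the same
# (arbitrarily chosen) parameter amount
newdata <- data.frame(
  type = factor(c("a1", "a2"), levels = levels(sim$type)),
  paramater_amount = 15000
)
pred <- predict(fit, newdata = newdata, type = "response")
print(pred)
```

The `typea2` coefficient estimates the difference in log-odds of success between a2 and a1 at a fixed parameter amount. If the benefit of extra parameters might differ between the algorithms, replacing `+` with `*` in the formula adds a type by parameter amount interaction term.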
