Permutations and combinations of all columns in R


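When selecting a model in R, I want to check every combination of columns. My dataset has 8 columns, and the code below lets me check some models but not all of them. Models such as columns 1+6 or 1+2+5 are never covered by this loop. Is there a better way to do this?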

best_model <- rep(0,3) #store the best model in this array
for(i in 1:8){
  for(j in 1:8){
    for(x in k){
      diabetes_prediction <- knn(train = diabetes_training[, i:j], test = diabetes_test[, i:j], cl = diabetes_train_labels, k = x)
      accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction)/183
      if( best_model[1] < accuracy[x] ){
        best_model[1] = accuracy[x]
        best_model[2] = i
        best_model[3] = j
      }
    }
  }
}

best_model

Solved it. Here is the code with explanatory comments:
# find the best model for this data
library(class)        # knn()
library(binaryLogic)  # as.binary()

number_of_columns_to_model <- ncol(diabetes_training) - 1
best_model <- c()
best_model_accuracy <- 0
# Parenthesise the upper bound: `:` binds tighter than `-`, so 2:2^n-1 means (2:2^n) - 1
for (i in 1:(2^number_of_columns_to_model - 1)) {
  # i = 0 is skipped because it selects no columns
  # convert i to binary, e.g. i = 5 gives combination = 0 0 0 0 0 1 0 1
  combination <- as.binary(i, n = number_of_columns_to_model)
  model <- c()
  for (col in seq_along(combination)) {
    # keep column `col` when its bit is set in the combination
    if (combination[col])
      model <- c(model, col)
  }
  for (x in k) {
    # for the columns chosen by model, compute the accuracy for each k (k = 1:27)
    # `with = FALSE` assumes diabetes_training and diabetes_test are data.tables
    diabetes_prediction <- knn(train = diabetes_training[, model, with = FALSE], test = diabetes_test[, model, with = FALSE], cl = diabetes_train_labels, k = x)
    accuracy[x] <- 100 * sum(diabetes_test_labels == diabetes_prediction) / length(diabetes_test_labels)
    if (best_model_accuracy < accuracy[x]) {
      best_model_accuracy <- accuracy[x]
      best_model <- model
      print(model)
    }
  }
}
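
The bitmask idea in the answer above can also be tried without an extra package. The following is a minimal sketch, assuming base R only; `subset_from_mask` is a hypothetical helper name that does not appear in the original answer, and it reads bits from least significant to most significant, so the column order differs from `as.binary` even though the same subsets are produced.

```r
# Hypothetical helper: map an integer mask to the column indices whose bit is set.
# intToBits() returns 32 raw bits, least significant bit first.
subset_from_mask <- function(mask, n_cols) {
  bits <- as.logical(intToBits(mask))[seq_len(n_cols)]
  which(bits)
}

n_cols <- 8L
# Masks 1 .. 2^n - 1 cover every non-empty subset of the columns exactly once
all_models <- lapply(1:(2^n_cols - 1), subset_from_mask, n_cols = n_cols)

length(all_models)   # number of candidate models
all_models[[5]]      # mask 5 is binary 101, i.e. columns 1 and 3
```

Each element of `all_models` can then be used directly as a column index, for example `diabetes_training[, all_models[[5]]]` on a plain data frame.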

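This answer is not complete, but it may get you started. You want to be able to subset by every possible subset of the columns. So instead of i:j for some i and j, you want to subset with c(1,6), c(1,2,5), and so on.
Using the sets package, you can build the power set (the set of all subsets) of a set. That is the easy part. I am new to R, so the hard part for me is understanding the difference between sets, lists, vectors, and so on. I am used to Mathematica, where they are all the same.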
  library(sets)
  my.set <- 1:8  # you want column indices from 1 to 8
  my.power.set <- set_power(my.set)  # this creates the set of all subsets of those indices
  my.names <- c("a")  #I don't know how to index into sets, so I created names (that are numbers, but of type characters)
  for(i in 1:length(my.power.set)) {my.names[i] <- as.character(i)}
  names(my.power.set) <- my.names
  my.indices <- vector("list",length(my.power.set)-1)
  for(i in 2:length(my.power.set)) {my.indices[i-1] <- as.vector(my.power.set[[my.names[i]]])} #this is the line I couldn't get to work
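
The code above stops at the step of turning the power set into plain index vectors. One way to get an equivalent list is sketched below using base R's `combn`. This is a swapped-in technique rather than a repair of the `sets` code, and the variable name `my.indices` is reused only for continuity.

```r
cols <- 1:8

# For each subset size m, combn() enumerates all size-m combinations;
# simplify = FALSE keeps every combination as its own integer vector.
my.indices <- unlist(
  lapply(seq_along(cols), function(m) combn(cols, m, simplify = FALSE)),
  recursive = FALSE
)

length(my.indices)   # 2^8 - 1 non-empty subsets
my.indices[[9]]      # the first two-column subset
```

The list contains c(1, 6) and c(1, 2, 5) among its elements, which are exactly the models the nested i:j loop in the question could not reach.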

I trained on Pima.tr and tested on Pima.te. KNN accuracy was 78% with preprocessed predictors and 80% without preprocessing (because some variables have a strong influence).
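
A comparison of this kind could be set up roughly as follows. This is a sketch under assumptions: it takes "preprocessing" to mean centring and scaling, uses `Pima.tr` and `Pima.te` from the MASS package, and borrows k = 15 from the discussion further down. The resulting accuracies are not guaranteed to match the figures quoted above.

```r
library(MASS)   # Pima.tr, Pima.te
library(class)  # knn()

# Every column except the label `type` is a predictor
predictors <- setdiff(names(Pima.tr), "type")

# Centre and scale with training-set statistics only,
# then apply the same transformation to the test set.
train_x <- scale(Pima.tr[, predictors])
test_x  <- scale(Pima.te[, predictors],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

set.seed(1)  # knn() breaks ties at random
pred_raw    <- knn(Pima.tr[, predictors], Pima.te[, predictors],
                   cl = Pima.tr$type, k = 15)
pred_scaled <- knn(train_x, test_x, cl = Pima.tr$type, k = 15)

acc_raw    <- mean(pred_raw == Pima.te$type)     # accuracy without preprocessing
acc_scaled <- mean(pred_scaled == Pima.te$type)  # accuracy with preprocessing
```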

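An accuracy of 80% is in line with a logistic regression model. Logistic regression does not require preprocessing the variables.
Random forest and logistic regression both give hints about which variables to drop, so you do not need to run every possible combination.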
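
As an illustration of how logistic regression can give such hints, here is a minimal sketch assuming the `Pima.tr` and `Pima.te` data from MASS and a 0.5 probability cut-off (the cut-off is an assumption, not something stated in the answer). The coefficient table is only a rough guide to which variables matter.

```r
library(MASS)  # Pima.tr, Pima.te

# Logistic regression on all predictors; `type` has levels "No" and "Yes"
fit <- glm(type ~ ., data = Pima.tr, family = binomial)
coef_table <- summary(fit)$coefficients  # estimates, z statistics, p-values
coef_table

# Test-set accuracy at a 0.5 probability cut-off
prob <- predict(fit, newdata = Pima.te, type = "response")
pred <- ifelse(prob > 0.5, "Yes", "No")
acc_glm <- mean(pred == Pima.te$type)
```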
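Another approach is to look at a scatterplot matrix.

For npreg, glu, bmi, and age, you can sense a difference between type 0 and type 1.

You also notice that ped and age are highly skewed, and that there may be an outlying data point between skin and the other variables (you may want to remove that observation before going further).
The box plot of skin against type shows an extreme outlier for the Yes type (try removing it).
You also notice that most of the boxes for the Yes type sit higher than those for the No type, so the variables are likely to add predictive power to the model (you can confirm this with a Wilcoxon rank-sum test).
The high correlation between skin and bmi means you can use one or the other, or an interaction of the two.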
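
The two checks mentioned above, the Wilcoxon rank-sum test and the skin/bmi correlation, could be run along these lines. This is a sketch that assumes the `Pima.tr` data from MASS. `exact = FALSE` is passed because the data contain ties.

```r
library(MASS)  # Pima.tr

predictors <- setdiff(names(Pima.tr), "type")
is_yes <- Pima.tr$type == "Yes"

# Wilcoxon rank-sum test per predictor: does its distribution differ by type?
p_values <- sapply(predictors, function(v) {
  x <- Pima.tr[[v]]
  wilcox.test(x[is_yes], x[!is_yes], exact = FALSE)$p.value
})
sort(p_values)

# Pearson correlation between skin and bmi
skin_bmi_cor <- cor(Pima.tr$skin, Pima.tr$bmi)
skin_bmi_cor
```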
Another way to reduce the number of predictors is PCA.
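
A sketch of what that might look like with `prcomp`, again assuming the `Pima.tr` and `Pima.te` data from MASS. Keeping three components and using k = 15 are illustrative assumptions, not recommendations from the answer.

```r
library(MASS)   # Pima.tr, Pima.te
library(class)  # knn()

predictors <- setdiff(names(Pima.tr), "type")

# Fit PCA on the training predictors only, on standardised variables
pca <- prcomp(Pima.tr[, predictors], center = TRUE, scale. = TRUE)
summary(pca)$importance["Cumulative Proportion", ]

n_comp   <- 3  # assumption: chosen for illustration only
train_pc <- pca$x[, 1:n_comp]
# Project the test set with the training rotation, centre, and scale
test_pc  <- predict(pca, newdata = Pima.te[, predictors])[, 1:n_comp]

set.seed(1)  # knn() breaks ties at random
pred_pc <- knn(train_pc, test_pc, cl = Pima.tr$type, k = 15)
acc_pc  <- mean(pred_pc == Pima.te$type)
```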
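Do you need anything else? Just one question, by the way: what is your k? I assume 183 is the number of observations, right?

Yes. I solved it myself and posted the answer below :)

This works fine for me, but so far I have been using some of the alternatives from the other answers. Thanks for your help.

I am sure you are happy to have solved the programming problem, but you still have a serious statistical problem. This approach is very sensitive to sampling. In general, if you use conventional significance levels, the results are misleading as far as the "best model" is concerned. The approach does not account for multiple comparisons and greatly overstates goodness-of-fit measures.

I agree with what you say. Can you suggest what should be done to mitigate the problem?

Read up on the "multiple comparisons problem", and consider judging model comparisons by a criterion that properly accounts for the extra degrees of freedom you spend in an "all models" comparison. It should not be just the degrees of freedom of the variables (levels minus 1); it should be set higher. Also look at penalization methods. A more appropriate venue is CrossValidated.com.

I tried scaling for KNN, but it gives 100% accuracy every time, and I strongly doubt that is real.

Can you share your data?

Data. I can see why 100% looks suspicious. The readings (measurements) of diabetic patients may be very different from those of non-diabetic patients, which would explain the 100% accuracy. For example, blood glucose levels differ greatly between non-diabetic and diabetic patients.

I scaled and centered and did not get 100%. If you train on Pima.tr and test on Pima.te, you should get a maximum accuracy of 0.7289. Your confusion matrix should be close to this.

As for finding the best model: on second thought, you also have to find the best k. In my case, k = 15 gave the best result, which is still only as good as the logistic regression result.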
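
One common way to soften the selection problem raised in these comments is to choose the column subset and k by cross-validation on the training data, and to touch the test set only once at the end. The sketch below uses `knn.cv` (leave-one-out) from the class package on `Pima.tr`. It is a swapped-in technique, not something proposed in the thread, and the odd-k grid up to 27 is an assumption. The cross-validated accuracy of the winning model is still optimistic because it was picked as the maximum over many candidates, so the single test-set figure is the more honest estimate.

```r
library(MASS)   # Pima.tr, Pima.te
library(class)  # knn(), knn.cv()

predictors <- setdiff(names(Pima.tr), "type")
train_x <- scale(Pima.tr[, predictors])
test_x  <- scale(Pima.te[, predictors],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# All non-empty subsets of the predictor names
subsets <- unlist(
  lapply(seq_along(predictors), function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)
ks <- seq(1, 27, by = 2)  # assumption: odd k from 1 to 27

set.seed(1)  # ties are broken at random
best <- list(acc = -Inf)
for (s in subsets) {
  for (k in ks) {
    # Leave-one-out predictions computed on the training data only
    cv_pred <- knn.cv(train_x[, s, drop = FALSE], cl = Pima.tr$type, k = k)
    acc <- mean(cv_pred == Pima.tr$type)
    if (acc > best$acc) best <- list(acc = acc, cols = s, k = k)
  }
}

# The test set is used once, after the model has been selected
final_pred <- knn(train_x[, best$cols, drop = FALSE],
                  test_x[, best$cols, drop = FALSE],
                  cl = Pima.tr$type, k = best$k)
test_acc <- mean(final_pred == Pima.te$type)
```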