R 为什么我的模型在使用knn()时如此精确,其中k=1?

R 为什么我的模型在使用knn()时如此精确,其中k=1?,r,plotly,r-caret,knn,R,Plotly,R Caret,Knn,我目前正在使用基因组表达水平、年龄和吸烟强度水平来预测肺癌患者的生存天数。我有少量的数据;173名患者和20438个变量,包括基因表达水平(共20436个)。我已经将我的数据分成了测试和训练,使用80:20的比例。数据中没有缺少的值 我正在使用knn()来训练模型。下面是代码的样子: prediction <- knn(train = trainData, test = testData, cl = trainAnswers, k=1) 因为K=1是最复杂的knn模型。它具有最灵活的决策

我目前正在使用基因组表达水平、年龄和吸烟强度水平来预测肺癌患者的生存天数。我有少量的数据;173名患者和20438个变量,包括基因表达水平(共20436个)。我已经将我的数据分成了测试和训练,使用80:20的比例。数据中没有缺少的值

我正在使用knn()来训练模型。下面是代码的样子:

prediction <- knn(train = trainData, test = testData, cl = trainAnswers, k=1)

因为K=1是最复杂的knn模型。它具有最灵活的决策边界。这会造成过盈。在训练数据范围内,它的表现会很好,因为在坚持集上表现不佳(但并非总是如此)。

几件事:1。如果没有任何示例数据,我们无法处理代码。2.这听起来像是你在过度拟合或处理非常不平衡的数据,或是你应该首先检查的其他东西。我们不可能知道,因为我们没有您的数据或任何输出。3.这更适合于,但在做其他事情之前,我建议您退一步,了解k-最近邻算法,k的含义,以及该算法的适用范围。有很多很好的教程和书籍——我用过,其中有一小段是关于
knn
——而且应该有关于交叉验证的材料可供学习。然而,问题似乎不是交叉验证。看起来我可能是过度拟合了数据,但是我在训练和测试中得分很高。过度装配通常在训练中表现良好,但在测试中表现不佳。不管怎样,我用完全不同的数据进行了测试,得到了相似的结果。我想我可以分享一些数据。Lemme编辑我的原始帖子。不是交叉验证交叉验证是堆栈交换上的统计板。这个问题更适合那里,因为它不是严格意义上的编程问题好吧,我会把问题贴在那里。我该怎么处理这个?
# install.packages(c('caret', 'skimr', 'RANN', 'randomForest', 'fastAdaboost', 'gbm', 'xgboost', 'caretEnsemble', 'C50', 'earth'))
library(caret)

# Gather the data and store it in variables
LUAD <- read.csv('/Users/username/Documents/ClinicalData.csv')
geneData <- read.csv('/Users/username/Documents/GenomicExpressionLevelData.csv')
geneData <- data.frame(geneData)
row.names(geneData) = geneData$X
geneData <- geneData[2:514]
colNamesGeneData <- gsub(".","-",colnames(geneData),fixed = TRUE)
colnames(geneData) = colNamesGeneData

# Organize the data
# Important columns are 148 (smoking), 123 (OS Month, basically how many days old), and the gene data. And column 2 (barcode).
LUAD = data.frame(LUAD$patient, LUAD$TOBACCO_SMOKING_HISTORY_INDICATOR, LUAD$OS_MONTHS, LUAD$days_to_death)[complete.cases(data.frame(LUAD$patient, LUAD$TOBACCO_SMOKING_HISTORY_INDICATOR, LUAD$OS_MONTHS, LUAD$days_to_death)), ]
rownames(LUAD)=LUAD$LUAD.patient
LUAD <- LUAD[2:4]

# intersect(rownames(LUAD),colnames(geneData))
# ind=which(colnames(geneData)=="TCGA-778-7167-01A-11R-2066-07")
gene_expression=geneData[, rownames(LUAD)]

# Merge the two datasets to use the geneomic expression levels in your model
LUAD <- data.frame(LUAD,t(gene_expression))
LUAD.days_to_death <- LUAD[,3]
LUAD <- LUAD[,c(1:2,4:20438)]
LUAD <- data.frame(LUAD.days_to_death,LUAD)

set.seed(401)

# Number of Rows in the training data (createDataPartition(dataSet, percentForTraining, boolReturnAsList))
trainRowNum <- createDataPartition(LUAD$LUAD.days_to_death, p=0.8, list=FALSE)

# Training/Test Dataset
trainData <- LUAD[trainRowNum, ]
testData <- LUAD[-trainRowNum, ]

x = trainData[, c(2:20438)]
y = trainData$LUAD.days_to_death
v = testData[, c(2:20438)]
w = testData$LUAD.days_to_death

# Imputing missing values into the data
preProcess_missingdata_model <- preProcess(trainData, method='knnImpute')
library(RANN)
if (anyNA(trainData)) {
    trainData <- predict(preProcess_missingdata_model, newdata = trainData)
}
anyNA(trainData)

# Normalizing the data
preProcess_range_model <- preProcess(trainData, method='range')
trainData <- predict(preProcess_range_model, newdata = trainData)
trainData$LUAD.days_to_death <- y
apply(trainData[,1:20438], 2, FUN=function(x){c('min'=min(x), 'max'=max(x))})

preProcess_range_model_Test <- preProcess(testData, method='range')
testData <- predict(preProcess_range_model_Test, newdata = testData)
testData$LUAD.days_to_death <- w
apply(testData[,1:20438], 2, FUN=function(v){c('min'=min(v), 'max'=max(v))})

# To uncomment, select the text and press 'command' + 'shift' + 'c'
# set.seed(401)
# options(warn=-1)
# subsets <- c(1:10)
# ctrl <- rfeControl(functions = rfFuncs,
#                    method = "repeatedcv",
#                    repeats = 5,
#                    verbose = TRUE)
# lmProfile <- rfe(x=trainData[1:20437], y=trainAnswers,
#                  sizes = subsets,
#                  rfeControl = ctrl)
# lmProfile

trainAnswers <- trainData[,1]
testAnswers <- testData[,1]

library(class)
prediction <- knn(train = trainData, test = testData, cl = trainAnswers, k=1)

#install.packages("plotly")
library(plotly)
Test_Question_Number <- c(1:32)
prediction2 <- data.frame(prediction[1:32])
prediction2 <- as.numeric(as.vector(prediction2[c(1:32),]))
data <- data.frame(Test_Question_Number, prediction2, testAnswers)
names(data) <- c("Test Question Number","Prediction","Answer")

p <- plot_ly(data, x = ~Test_Question_Number, y = ~prediction2, type = 'bar', name = 'Prediction') %>%
    add_trace(y = ~testAnswers, name = 'Answer') %>%
    layout(yaxis = list(title = 'Days to Death'), barmode = 'group')
p
merge <- data.frame(prediction2,testAnswers)

difference <- abs((merge[,1])-(merge[,2]))
difference <- sort(difference)
meanDifference <- mean(difference)
medianDifference <- median(difference)
modeDifference <- names(table(difference))[table(difference)==max(table(difference))]
cat("Mean difference:", meanDifference, "\n")
cat("Median difference:", medianDifference, "\n")
cat("Mode difference:", modeDifference,"\n")
days_to_death    OS_MONTHS
121              3.98

NACC1   2001.5708   2363.8063   1419.879
NACC2   58.2948     61.8157     43.4386
NADK    706.868     1053.4424   732.1562
NADSYN1 1628.7634   912.1034    638.6471
NAE1    832.8825    793.3014    689.7123
NAF1    140.3264    165.4858    186.355
NAGA    1523.3441   1524.4619   1858.9074
NAGK    983.6809    899.869     1168.2003
NAGLU   621.3457    510.9453    1172.511
NAGPA   346.9762    257.5654    275.5533
NAGS    460.7732    107.2116    321.9763
NAIF1   217.1219    202.5108    132.3054
NAIP    101.2305    87.8942     77.261
NALCN   13.9628     36.7031     48.0809
NAMPT   3245.6584   1257.8849   5465.6387