glm: outlier detection and removal in R
I built a binary logistic model. The response variable is binary. There are four predictors: two binary and two integer-valued. I want to find the outliers and remove them. To do that, I created some plots:
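# assumed definitions: the question does not show how hat.ep, rstudent.ep and id were created
hat.ep <- hatvalues(model_logit)
rstudent.ep <- rstudent(model_logit)
id <- seq_along(hat.ep)
par(mfrow = c(2,2))
plot(hat.ep,rstudent.ep,col="#E69F00", main="hat-values versus studentized residuals",
xlab="Hat value", ylab="Studentized residual")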
dffits.ep <- dffits(model_logit)
plot(id,dffits.ep,type="l", col="#E69F00", main="Index Plot",
xlab="Identification", ylab="DFFITS")
cov.ep <- covratio(model_logit)
plot(id,cov.ep,type="l",col="#E69F00", main="Covariance Ratio",
xlab="Identification", ylab="Covariance Ratio")
cook.ep <- cooks.distance(model_logit)
plot(id,cook.ep,type="l",col="#E69F00", main="Cook's Distance",
xlab="Identification", ylab="Cook's Distance")
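One way to turn plots like these into a list of candidate observations is to compare each diagnostic with a rule-of-thumb cutoff. The sketch below is only an illustration under stated assumptions: the simulated data, the names dat, x1 to x4, diag_df and model_trimmed, and the cutoffs (2p/n for hat values, |r| > 2 for studentized residuals, 4/n for Cook's distance) are not part of the question. Flagged points are candidates for inspection, not automatic deletions.

```r
# Simulated stand-in for the asker's data (assumption): 2 binary and 2 integer predictors
set.seed(1)
n <- 200
dat <- data.frame(
  x1 = rbinom(n, 1, 0.5),
  x2 = rbinom(n, 1, 0.3),
  x3 = rpois(n, 4),
  x4 = sample(1:10, n, replace = TRUE)
)
eta <- -1 + 0.8 * dat$x1 - 0.5 * dat$x2 + 0.3 * dat$x3 - 0.1 * dat$x4
dat$y <- rbinom(n, 1, plogis(eta))

model_logit <- glm(y ~ x1 + x2 + x3 + x4, data = dat, family = binomial)

# Collect the diagnostics in one table, one row per observation
p <- length(coef(model_logit))  # number of estimated coefficients, intercept included
diag_df <- data.frame(
  id    = seq_len(n),
  hat   = hatvalues(model_logit),
  rstud = rstudent(model_logit),
  cook  = cooks.distance(model_logit)
)

# Rule-of-thumb cutoffs (assumptions, adjust to taste)
diag_df$flag <- diag_df$hat > 2 * p / n |
  abs(diag_df$rstud) > 2 |
  diag_df$cook > 4 / n

flagged <- diag_df[diag_df$flag, ]
print(flagged)

# Refit without the flagged rows and compare the coefficients
model_trimmed <- glm(y ~ x1 + x2 + x3 + x4, data = dat[!diag_df$flag, ],
                     family = binomial)
print(round(cbind(full = coef(model_logit), trimmed = coef(model_trimmed)), 3))
```

If the coefficients barely move after the refit, the flagged observations are probably not worth removing.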
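This answer comes late, and I am not sure whether you have already found a solution. Since I do not have your data to work with, I will try to answer the question with some dummy data and two custom functions. For a given continuous variable, outliers are the observations that lie more than 1.5 * IQR below the first quartile or above the third quartile, where the IQR (interquartile range) is the difference between the 75th and 25th percentiles. I also suggest looking at this post, which contains far better solutions than my rough answer.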
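The 1.5 * IQR rule described above can be sketched in a few lines. The function name iqr_outliers and the toy vector are made up for illustration, and k = 1.5 is the usual boxplot convention rather than anything specific to this question.

```r
# Return the indices of values outside [Q1 - k * IQR, Q3 + k * IQR].
# iqr_outliers is a hypothetical helper name; NAs are ignored.
iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  which(x < q[1] - fence | x > q[2] + fence)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100, NA)
iqr_outliers(x)  # index of the value 100
```

Because which() drops NA comparisons, missing values are never reported as outliers.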
You may also find the identify() function useful.
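As a rough sketch of how identify() could be used on a diagnostic plot: it lets you click on points and labels them with their observation number. The simulated model below is an assumption made for illustration, and the call is guarded with interactive() because identify() needs an interactive graphics device. How you end the selection depends on the device.

```r
# Toy logistic model (assumption, not the asker's data)
set.seed(42)
x <- rnorm(50)
y <- rbinom(50, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)
cook <- cooks.distance(fit)

# Index plot of Cook's distance
plot(seq_along(cook), cook, type = "h",
     xlab = "Observation", ylab = "Cook's distance")

if (interactive()) {
  # Click near the spikes of interest; the indices of the chosen points are returned
  picked <- identify(seq_along(cook), cook, labels = names(cook))
  print(picked)
}
```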
> df <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), Y = c(runif(1000),rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))
> head(df)
X Y Z
1 NA 0.8651 0.2784
2 -0.06838 0.4700 2.0483
3 -0.18734 0.9887 1.8353
4 -0.05015 0.7731 2.4464
5 0.25010 0.9941 1.3979
6 -0.26664 0.6778 1.1277
> boxplot(df$Y) # notice the outliers above the top whisker
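# this function returns, for each column, the indices of the outlier values
# note: it uses a mean +/- cutoff * sd rule, not the 1.5 * IQR rule
> findOutlier <- function(data, cutoff = 3) {
## Column-wise means and standard deviations, ignoring NAs
means <- sapply(data, mean, na.rm = TRUE)
sds <- sapply(data, sd, na.rm = TRUE)
## Identify the cells further than cutoff * sd from the column mean, in either direction;
## SIMPLIFY = FALSE always returns a list with one element per column
result <- mapply(function(d, m, s) {
which(abs(d - m) > cutoff * s)
}, data, means, sds, SIMPLIFY = FALSE)
result
}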
# check for outliers
> outliers <- findOutlier(df)
# custom function to remove outliers
> removeOutlier <- function(data, outliers) {
result <- mapply(function(d, o) {
res <- d
res[o] <- NA
return(res)
}, data, outliers)
return(as.data.frame(result))
}
> filterData <- removeOutlier(df, outliers)
> boxplot(filterData$Y)
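Note that removeOutlier masks the outlying cells with NA rather than deleting rows. If whole rows should be dropped before refitting a model, something like the following sketch could be used. The small masked data frame is invented for illustration. With R's default na.action (na.omit), glm() would also drop incomplete rows on its own.

```r
# Toy data frame with masked cells (assumption, stands in for filterData)
masked <- data.frame(
  X = c(0.2, NA, -0.5, 1.1),
  Y = c(0.4, 0.9, NA, 0.1),
  Z = c(1.3, 0.7, 2.1, 0.8)
)

# Keep only the rows without any NA
kept <- masked[complete.cases(masked), ]
nrow(kept)  # rows 1 and 4 remain
```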