高效清除R中数据帧中的缺失值_R_Dataframe_Missing Data_Data Cleaning

高效清除R中数据帧中的缺失值

r dataframe

高效清除R中数据帧中的缺失值,r,dataframe,missing-data,data-cleaning,R,Dataframe,Missing Data,Data Cleaning,之后，如果成功使用，则生成所有数据帧列系数： > dat2=subset(dat1, dat1$V4!='?') Error in `[.data.table`(x, r, vars, with = FALSE) : i evaluates to a logical vector length 339 but there are 184 rows. Recycling of logical i is no longer allowed as it hides more bugs th

之后，如果成功使用，则生成所有

数据帧

列

系数

：

> dat2=subset(dat1, dat1$V4!='?')
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  i evaluates to a logical vector length 339 but there are 184 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.

dat1[] <- lapply(dat1, mostFreq)

以下是

str（dat1）

：

str（dat1）类“data.table”和“data.frame”：339 obs。在18个变量中： $V1:int 1 1。。。 $V2:int 1 1 2 2 2。。。 $V3：系数w/3级“1”、“2”和“？”：1 2 1。。。 $V4：系数w/4级“1”、“2”、“3”和“？”：4 4 1 1。。。 $V5：系数w/4级“1”、“2”、“3”、“3”、“？”：3 3 1 2。。。 $V6:int 2 2 1 1 1。。。 $V7:int 2 2。。。 $V8:int 1 2 1 2 2 2 2。。。 $V9:int 2 2 1 2 2 2。。。 $V10:int 2 2。。。 $V11:int 2 1 2 2 2。。。 $V12:int 2 2 2 2。。。 $V13：系数w/3级“1”、“2”、“2”？：2 2 3。。。 $V14:int 2 2 1 1 1。。。 $V15:int 2 1 2 1 2 1。。。 $V16：系数w/3级“1”、“2”、“2”？：2。。。 $V17:int 2 1 2 2 2 2。。。 $V18:int 2 2。。。 -属性（*，“.internal.selfref”）=

以下函数将所有

NA

和

'？'

值替换为最常见的列值。然后，只需将它重叠到data.frame

> str (dat1)
Classes ‘data.table’ and 'data.frame':  339 obs. of  18 variables:
 $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V2 : int  1 1 1 1 1 1 2 2 2 2 ...
 $ V3 : Factor w/ 3 levels "1","2","?": 1 1 2 2 2 2 1 1 1 1 ...
 $ V4 : Factor w/ 4 levels "1","2","3","?": 4 4 2 4 4 4 1 1 1 1 ...
 $ V5 : Factor w/ 4 levels "1","2","3","?": 3 3 3 3 3 3 1 1 1 2 ...
 $ V6 : int  2 2 1 1 1 1 1 1 2 1 ...
 $ V7 : int  2 2 2 2 2 2 2 2 2 2 ...
 $ V8 : int  1 2 2 1 1 2 2 2 2 2 ...
 $ V9 : int  2 2 2 1 1 2 2 2 2 2 ...
 $ V10: int  2 2 2 2 2 2 2 2 2 2 ...
 $ V11: int  2 1 2 2 2 2 2 2 2 2 ...
 $ V12: int  2 2 2 2 2 1 2 2 2 2 ...
 $ V13: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 2 1 2 2 3 ...
 $ V14: int  2 2 2 2 2 2 1 2 1 1 ...
 $ V15: int  2 1 2 2 2 1 1 2 2 1 ...
 $ V16: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 1 2 2 2 2 ...
 $ V17: int  2 1 1 1 1 1 2 2 2 2 ...
 $ V18: int  2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr>

set.seed(1234)    # Make the results reproducible
n <- 300
x <- replicate(6, sample(c(NA, '?', 1:2), n, TRUE))
y <- replicate(6, sample(c(NA, '?', 1:3), n, TRUE))
dat1 <- cbind.data.frame(x, y, stringsAsFactors = FALSE)
dat1 <- dat1[, sample(ncol(dat1))]
names(dat1) <- paste0('V', 1:12)
str(dat1)

编辑。

如果首先读取数据设置

na.strings='？'

，则可以简化上述函数

dat1[] <- lapply(dat1, factor)

测试数据。

由于您尚未发布示例数据集，因此我将创建一个类似于问题所描述的数据集

mostFreq2 <- function(x){
  tbl <- table(x, useNA = "no")
  x[is.na(x)] <- names(tbl)[which.max(tbl)]
  x
}

set.seed（1234）#使结果重现
n以下函数将所有NA
和“？”
值替换为最常见的列值。然后，只需将它重叠到data.frame
> str (dat1)
Classes ‘data.table’ and 'data.frame':  339 obs. of  18 variables:
 $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V2 : int  1 1 1 1 1 1 2 2 2 2 ...
 $ V3 : Factor w/ 3 levels "1","2","?": 1 1 2 2 2 2 1 1 1 1 ...
 $ V4 : Factor w/ 4 levels "1","2","3","?": 4 4 2 4 4 4 1 1 1 1 ...
 $ V5 : Factor w/ 4 levels "1","2","3","?": 3 3 3 3 3 3 1 1 1 2 ...
 $ V6 : int  2 2 1 1 1 1 1 1 2 1 ...
 $ V7 : int  2 2 2 2 2 2 2 2 2 2 ...
 $ V8 : int  1 2 2 1 1 2 2 2 2 2 ...
 $ V9 : int  2 2 2 1 1 2 2 2 2 2 ...
 $ V10: int  2 2 2 2 2 2 2 2 2 2 ...
 $ V11: int  2 1 2 2 2 2 2 2 2 2 ...
 $ V12: int  2 2 2 2 2 1 2 2 2 2 ...
 $ V13: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 2 1 2 2 3 ...
 $ V14: int  2 2 2 2 2 2 1 2 1 1 ...
 $ V15: int  2 1 2 2 2 1 1 2 2 1 ...
 $ V16: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 1 2 2 2 2 ...
 $ V17: int  2 1 1 1 1 1 2 2 2 2 ...
 $ V18: int  2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr>

set.seed(1234)    # Make the results reproducible
n <- 300
x <- replicate(6, sample(c(NA, '?', 1:2), n, TRUE))
y <- replicate(6, sample(c(NA, '?', 1:3), n, TRUE))
dat1 <- cbind.data.frame(x, y, stringsAsFactors = FALSE)
dat1 <- dat1[, sample(ncol(dat1))]
names(dat1) <- paste0('V', 1:12)
str(dat1)

编辑。
如果首先读取数据设置na.strings='？'
，则可以简化上述函数
dat1[] <- lapply(dat1, factor)

测试数据。
由于您尚未发布示例数据集，因此我将创建一个类似于问题所描述的数据集
mostFreq2 <- function(x){
  tbl <- table(x, useNA = "no")
  x[is.na(x)] <- names(tbl)[which.max(tbl)]
  x
}

set.seed（1234）#使结果重现
n虽然有点“黑”，但这应该能让你达到目的。我在你的data.frame中没有看到任何NA
> str (dat1)
Classes ‘data.table’ and 'data.frame':  339 obs. of  18 variables:
 $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V2 : int  1 1 1 1 1 1 2 2 2 2 ...
 $ V3 : Factor w/ 3 levels "1","2","?": 1 1 2 2 2 2 1 1 1 1 ...
 $ V4 : Factor w/ 4 levels "1","2","3","?": 4 4 2 4 4 4 1 1 1 1 ...
 $ V5 : Factor w/ 4 levels "1","2","3","?": 3 3 3 3 3 3 1 1 1 2 ...
 $ V6 : int  2 2 1 1 1 1 1 1 2 1 ...
 $ V7 : int  2 2 2 2 2 2 2 2 2 2 ...
 $ V8 : int  1 2 2 1 1 2 2 2 2 2 ...
 $ V9 : int  2 2 2 1 1 2 2 2 2 2 ...
 $ V10: int  2 2 2 2 2 2 2 2 2 2 ...
 $ V11: int  2 1 2 2 2 2 2 2 2 2 ...
 $ V12: int  2 2 2 2 2 1 2 2 2 2 ...
 $ V13: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 2 1 2 2 3 ...
 $ V14: int  2 2 2 2 2 2 1 2 1 1 ...
 $ V15: int  2 1 2 2 2 1 1 2 2 1 ...
 $ V16: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 1 2 2 2 2 ...
 $ V17: int  2 1 1 1 1 1 2 2 2 2 ...
 $ V18: int  2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr>

set.seed(1234)    # Make the results reproducible
n <- 300
x <- replicate(6, sample(c(NA, '?', 1:2), n, TRUE))
y <- replicate(6, sample(c(NA, '?', 1:3), n, TRUE))
dat1 <- cbind.data.frame(x, y, stringsAsFactors = FALSE)
dat1 <- dat1[, sample(ncol(dat1))]
names(dat1) <- paste0('V', 1:12)
str(dat1)

库（dplyr）
图书馆（stringr）
dat1虽然有点“黑”，但这应该能让你达到目的。我在你的data.frame中没有看到任何NA
> str (dat1)
Classes ‘data.table’ and 'data.frame':  339 obs. of  18 variables:
 $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V2 : int  1 1 1 1 1 1 2 2 2 2 ...
 $ V3 : Factor w/ 3 levels "1","2","?": 1 1 2 2 2 2 1 1 1 1 ...
 $ V4 : Factor w/ 4 levels "1","2","3","?": 4 4 2 4 4 4 1 1 1 1 ...
 $ V5 : Factor w/ 4 levels "1","2","3","?": 3 3 3 3 3 3 1 1 1 2 ...
 $ V6 : int  2 2 1 1 1 1 1 1 2 1 ...
 $ V7 : int  2 2 2 2 2 2 2 2 2 2 ...
 $ V8 : int  1 2 2 1 1 2 2 2 2 2 ...
 $ V9 : int  2 2 2 1 1 2 2 2 2 2 ...
 $ V10: int  2 2 2 2 2 2 2 2 2 2 ...
 $ V11: int  2 1 2 2 2 2 2 2 2 2 ...
 $ V12: int  2 2 2 2 2 1 2 2 2 2 ...
 $ V13: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 2 1 2 2 3 ...
 $ V14: int  2 2 2 2 2 2 1 2 1 1 ...
 $ V15: int  2 1 2 2 2 1 1 2 2 1 ...
 $ V16: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 1 2 2 2 2 ...
 $ V17: int  2 1 1 1 1 1 2 2 2 2 ...
 $ V18: int  2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr>

set.seed(1234)    # Make the results reproducible
n <- 300
x <- replicate(6, sample(c(NA, '?', 1:2), n, TRUE))
y <- replicate(6, sample(c(NA, '?', 1:3), n, TRUE))
dat1 <- cbind.data.frame(x, y, stringsAsFactors = FALSE)
dat1 <- dat1[, sample(ncol(dat1))]
names(dat1) <- paste0('V', 1:12)
str(dat1)

库（dplyr）
图书馆（stringr）
dat1不要在子集（dat1，V4！='？'）中使用数据帧名称。
。不要在子集（dat1，V4！='？'）中使用数据帧名称。
。如果使用fread
中的参数na.strings
，可以简化此过程。那么就不需要使用i
或drop.levels
@Moody\u Mudskipper你是对的，我假设OP想要清理数据，而完全忘记了na.strings
。谢谢至于stringsAsFactors
，它在数据创建代码中dat2@Avi“多”是什么意思？如果使用fread
中的参数na.strings
，则可以简化此操作。那么就不需要使用i
或drop.levels
@Moody\u Mudskipper你是对的，我假设OP想要清理数据，而完全忘记了na.strings
。谢谢至于stringsAsFactors
，它在数据创建代码中dat2@Avi你说的很多是什么意思？编辑。应已进行temp编辑。应该是temp