How to tell readr::read_csv to guess a double column correctly
I have runoff data with many zero values and only occasional non-zero double values. readr::read_csv guesses an integer column type because of the many zeros. How can I make read_csv guess the correct double type? I do not know the mapping of variable names to types beforehand, so I cannot supply a name-to-type mapping. Here is a small example:
library(readr)
# create a column of doubles with many zeros (runoff data)
#dsTmp <- data.frame(x = c(rep(0.0, 2), 0.5)) # this works
dsTmp <- data.frame(x = c(rep(0.0, 1e5), 0.5))
write_csv(dsTmp, "tmp/dsTmp.csv")
# 0.0 is written as 0
# read_csv now guesses integer instead of double and reports
# a parsing failure.
ans <- read_csv("tmp/dsTmp.csv")
# the last value is NA instead of 0.5
tail(ans)
Here are two techniques. (Data preparation is at the bottom; $hp and $vs and beyond are integer columns.)
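Note: I add cols(.default = col_guess()) to most of the first calls so that we do not get the large message about what read_csv found the columns to be. It can be omitted, at the cost of a noisier console.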
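First technique: force all columns to double with the cols(.default = ...) setting. This works safely as long as you know there are no non-numbers in the file: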
read_csv("mtcars.csv", col_types = cols(.default = col_double()))
# Warning in rbind(names(probs), probs_f) :
# number of columns of result is not a multiple of vector length (arg 1)
# Warning: 32 parsing failures.
### ...snip...
# See problems(...) for more details.
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 NA 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 NA 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 NA 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 NA 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 NA 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 NA 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 NA 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 NA 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 NA 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 NA 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows
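The final read of the second technique (below) passes an object named types, but the code that builds it does not appear in the text as preserved here. The following is a sketch of how it could be constructed, modeled on the read_csvDouble function posted later in this thread; the variable name intcols is an assumption, and the first three lines only recreate the mtcars.csv file from the data-preparation step so that the block runs on its own.

```r
library(readr)

# Recreate the example file from the data-preparation step
mt <- mtcars
mt$cyl <- paste0("c", mt$cyl)
write_csv(mt, "mtcars.csv")

# Read a single row only, to obtain the guessed column specification
types <- attr(
  read_csv("mtcars.csv", n_max = 1, col_types = cols(.default = col_guess())),
  "spec"
)

# Flag the columns that were guessed as integer (intcols is a hypothetical name)
intcols <- sapply(types$cols, identical, col_integer())

# Replace only those collectors with double collectors; other types stay untouched
types$cols[intcols] <- replicate(sum(intcols), col_double())
```

Because only the integer collectors are swapped, the character column cyl keeps its type instead of turning into NA as in the first technique.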
And the final read of the second technique, noting that $hp and beyond are now <dbl> (unlike in the data-preparation read below):
read_csv("mtcars.csv", col_types = types)
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 c6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 c6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 c4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 c6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 c8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 c6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 c8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 c4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 c4 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 c6 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows
Data: see the data-preparation code at the end of this page.
data.table::fread seems to handle this problem well:
library(data.table)
write_csv(dsTmp, ttfile <- tempfile())
ans <- fread(ttfile)
tail(ans)
# x
# 1: 0.0
# 2: 0.0
# 3: 0.0
# 4: 0.0
# 5: 0.0
# 6: 0.5
I transferred the code from r2evans's solution into a small function:
read_csvDouble <- function(
### read_csv but read guessed integer columns as double
... ##<< further arguments to \code{\link{read_csv}}
, n_max = Inf ##<< see \code{\link{read_csv}}
, col_types = cols(.default = col_guess()) ##<< see \code{\link{read_csv}}
## the default suppresses the type guessing messages
){
##details<< Sometimes, double columns are guessed as integer, e.g. with
## runoff data where there are many zeros and only occasionally
## positive values that can be recognized as double.
## This function modifies \code{read_csv} by changing guessed integer
## columns to double columns.
#https://stackoverflow.com/questions/52934467/how-to-tell-readrread-csv-to-guess-double-column-correctly
colTypes <- read_csv(..., n_max = 3, col_types = col_types) %>% attr("spec")
isIntCol <- map_lgl(colTypes$cols, identical, col_integer())
colTypes$cols[isIntCol] <- replicate(sum(isIntCol), col_double())
##value<< tibble as returned by \code{\link{read_csv}}
ans <- read_csv(..., n_max = n_max, col_types = colTypes)
ans
}
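A usage sketch for the function above. It assumes that readr, purrr (for map_lgl) and magrittr (for %>%) are attached, and it repeats the function in compact form only so that the block runs on its own. The 2001-row data frame is a hypothetical, smaller stand-in for the runoff data in the question.

```r
library(readr)
library(purrr)     # provides map_lgl
library(magrittr)  # provides %>%

# Compact copy of read_csvDouble, without the inline documentation comments
read_csvDouble <- function(..., n_max = Inf,
                           col_types = cols(.default = col_guess())) {
  # Guess the column types from the first few rows only
  colTypes <- read_csv(..., n_max = 3, col_types = col_types) %>% attr("spec")
  # Switch every guessed integer column to double
  isIntCol <- map_lgl(colTypes$cols, identical, col_integer())
  colTypes$cols[isIntCol] <- replicate(sum(isIntCol), col_double())
  # Read the full file with the adjusted specification
  read_csv(..., n_max = n_max, col_types = colTypes)
}

# Hypothetical stand-in for the runoff data: many zeros, one non-zero value last
ds <- data.frame(x = c(rep(0.0, 2000), 0.5))
csv_path <- tempfile(fileext = ".csv")
write_csv(ds, csv_path)

ans <- read_csvDouble(csv_path)
tail(ans)
```

Note that the file argument has to be passed through ..., since it is forwarded unchanged to both read_csv calls.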
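Comments:
You could try increasing the guess_max argument so that read_csv looks further into the file for values before guessing.
Could you give data.table::fread() a try?
Is there a reason why read.csv() is not an option?
@12b345b6b78 base R's read.csv is slow.
Thanks, r2evans. Your solution 2 solved my problem. I transferred your code into a small function.
data.table::fread does indeed work well. But I do not like adding more package dependencies, and read_csv is already used in many places in the project.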
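The guess_max suggestion from the comments could look like the sketch below. By default read_csv bases its guess on at most the first 1000 rows, so a non-zero value beyond that range is not seen. The file and the names ds and csv_path are hypothetical stand-ins for the runoff data; scanning every row can be slow for very large files.

```r
library(readr)

# Hypothetical stand-in for the runoff data: the only non-zero value
# sits beyond the first 1000 rows
ds <- data.frame(x = c(rep(0.0, 2000), 0.5))
csv_path <- tempfile(fileext = ".csv")
write_csv(ds, csv_path)

# Let the type guesser inspect every row instead of only the first 1000
ans <- read_csv(csv_path,
                guess_max = nrow(ds),
                col_types = cols(.default = col_guess()))
tail(ans)
```

This keeps the type guessing for all other columns and needs no knowledge of the column names, at the price of a longer guessing pass.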
Data preparation for the mtcars examples:
library(readr)
mt <- mtcars
mt$cyl <- paste0("c", mt$cyl) # for fun: turns cyl into a character column
write_csv(mt, "mtcars.csv")
read_csv("mtcars.csv", col_types = cols(.default = col_guess()))
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <chr> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
# 1 21 c6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 c6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 c4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 c6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 c8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 c6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 c8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 c4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 c4 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 c6 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows