R: Read a random subset of lines from a file with fread or an equivalent tool?

I have a very large multi-gigabyte file that is too costly to load into memory. However, the ordering of the rows in the file is not random. Is there a way to read in a random subset of the rows using something like fread?

Something like this, for example:

data <- fread("data_file", nrows_sample = 90000)

However, this doesn't work for me. Any ideas?

If your data file happens to be a text file, this solution using the LaF package could be quite useful:

library(LaF)

# Prepare dummy data
mat <- matrix(sample(letters, 10 * 1000000, replace = TRUE), nrow = 1000000)

dim(mat)
#[1] 1000000      10

write.table(mat, "tmp.csv",
    row.names = FALSE,
    sep = ",",
    quote = FALSE)

# Read 90,000 random lines
start <- Sys.time()
random_mat <- sample_lines(filename = "tmp.csv",
    n = 90000,
    nlines = 1000000)
random_mat <- do.call("rbind", strsplit(random_mat, ","))
Sys.time() - start
#Time difference of 1.135546 secs    

dim(random_mat)
#[1] 90000    10
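
If you'd rather end up with a data.table than a character matrix, one option is to hand the sampled lines straight to fread through its text argument (available in data.table 1.11.0 and later). A minimal sketch, assuming the same tmp.csv as above:

library(LaF)
library(data.table)

# Sample 90,000 raw lines as unparsed strings
lines <- sample_lines(filename = "tmp.csv", n = 90000, nlines = 1000000)

# Let fread parse the sampled lines; header = FALSE because these are
# data rows (note the file's real header line could itself be sampled,
# so you may want to drop any row that matches it)
random_dt <- fread(text = lines, header = FALSE)
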
Using the tidyverse (as opposed to data.table), you could do the following:

library(readr)
library(purrr)
library(dplyr)

# Generate some random offsets between 1 and the number of rows your file has,
# assuming you can ballpark the number of rows in your file.
#
# Generating 900 integers because we'll grab 10 rows for each start,
# giving us a total of 9000 rows in the final data frame
start_at <- floor(runif(900, min = 1, max = (n_rows_in_your_file - 10)))

# Sort the offsets so reads proceed sequentially through the file
start_at <- sort(start_at)

# Read in 10 rows at a time, starting at your random offsets,
# binding results rowwise into a single data frame.
# col_names = FALSE keeps read_csv from consuming the first row of
# each chunk as a header (and keeps the chunks bind-compatible)
sample_of_rows <- map_dfr(start_at, ~ read_csv("data_file", n_max = 10, skip = .x, col_names = FALSE))
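
One practical wrinkle: the snippet assumes you already know n_rows_in_your_file. If you'd rather measure it than ballpark it, here is a minimal sketch using LaF's determine_nlines (wc -l from a shell works too), which counts lines without loading the file into memory:

library(LaF)

# Count lines without reading the file into memory;
# subtract 1 if the file has a header row
n_rows_in_your_file <- determine_nlines("data_file") - 1

Also note that this approach samples 900 blocks of 10 consecutive rows rather than 9000 independent rows; since the question says the rows are ordered, neighbouring rows may be correlated, so the sample_lines approach above is preferable when you need a truly uniform sample.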