How can I use fread() like readLines(), without automatic column detection?
I have a 5 GB .dat file (more than 10 million lines). Each line has a format like aaaa bb cccc0123 xxx kkkkk or aaaa bbbcccc1234xxkkkk, for example. Because readLines performs poorly when reading large files, I chose fread() to read this file, but it raised an error:
library("data.table")
x <- fread("test.DAT")
Error in fread("test.DAT") :
Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.DAT") :
Unable to find 5 lines with expected number of columns (+ middle)
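Here is a trick. You can use a sep value that you know does not occur in the file. Doing so forces fread() to read each whole line as a single column. We can then extract that column as an atomic vector ([[1L]] below). Here is an example on a csv file, where I use ? as the sep. This way it behaves like readLines(), only much faster.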
f <- fread("Batting.csv", sep= "?", header = FALSE)[[1L]]
head(f)
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
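The same idea can be checked on a small file shaped like the one in the question. This is a minimal sketch, not the answerer's exact code: the temporary file and its four sample lines are made up for illustration, and the extra arguments quote = "", strip.white = FALSE and colClasses = "character" are my own additions, on the assumption that you want fread() to leave quotes, surrounding whitespace and numeric-looking lines untouched, as readLines() does.

```r
library(data.table)

# Hypothetical sample data: irregular lines like those in the question.
tmp <- tempfile(fileext = ".dat")
writeLines(c(
  "aaaa bb cccc0123 xxx kkkkk",
  "aaaa bbbcccc1234xxkkkk",
  "aaaa bb cccc0456 xxx kkkkk",
  "aaaa bbbcccc5678xxkkkk"
), tmp)

# Read every line as one character column, then pull that column out.
lines_fread <- fread(
  tmp,
  sep = "\n",               # newline as separator: one column per line
  header = FALSE,           # do not treat the first line as column names
  quote = "",               # assumption: disable quote processing
  strip.white = FALSE,      # assumption: keep leading/trailing spaces
  colClasses = "character"  # assumption: never convert lines to numbers
)[[1L]]

# Compare against base R on the same file.
lines_base <- readLines(tmp)
identical(lines_fread, lines_base)
```

On a real multi-gigabyte file, the behaviour around blank lines and embedded quotes may still differ from readLines(), so it is worth comparing the line counts of both approaches on a sample first.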
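Note: As @Cath mentioned in the comments, you can also simply use the newline character \n as the sep value.

This should be upvoted much more. It is a neat trick; in my case it actually worked with sep='~'.

Why not use sep="\n"?

@Cath - Yes, I suppose that would work too.

I know this thread is old, but how do I process these lines now? What is an efficient way to get the lines into a data.table?

@lukehawk - If you have a character vector like the one above, you can do fread(paste(f, collapse = "\n")). Otherwise, I would just read the file directly with fread.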
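As a rough sketch of that last suggestion: a character vector of delimited lines can be parsed into a data.table either by collapsing it into one string, as in the comment, or through fread()'s text argument, which takes the lines directly. The three-line vector f below is a shortened, made-up stand-in for the Batting.csv lines, and text = is only available in reasonably recent data.table releases.

```r
library(data.table)

# Hypothetical stand-in for a character vector of raw csv lines.
f <- c(
  "playerID,yearID,stint,teamID",
  "abercda01,1871,1,TRO",
  "addybo01,1871,1,RC1"
)

# Variant from the comment: collapse the lines into one newline-separated string.
dt_paste <- fread(paste(f, collapse = "\n"))

# Variant using the text argument: pass the lines as they are.
dt_text <- fread(text = f)

dt_text
```

Both variants let fread() detect the separator and column types in the usual way, so this only helps once the lines share a consistent layout.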
For comparison, the same lines read with readLines():
head(readLines("Batting.csv"))
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"