fread(): reading a table with \r\r\n as line endings

Tags: r, performance, data.table, line-endings

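I have tab-delimited tables in text files where every line ends with \r\r\n (0x0D 0x0D 0x0A). If I try to read such a file with fread(), it says:

Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

But I did not download these files; I already have them.

So far, I have found a workaround that first reads the file with read.table() (which treats the \r\r\n combination as a single end-of-line marker) and then converts the resulting data.frame to a data.table: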

mydt

I'd suggest using the GNU utility tr to strip those unnecessary \r characters, e.g.:
cat("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6", file = "test.csv")
fread("test.csv")
## Error in fread("test.csv") : 
##  Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

system("tr -d '\r' < test.csv > test2.csv")
fread("test2.csv")
##    a b c
## 1: 1 2 3
## 2: 4 5 6

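Thanks for your solution, but ideally I would like to find a way that does not use external utilities (and yes, I am on Windows).

OK, I have added an R-only method plus a comparison. See the edited answer.

You can put that system call into fread(). Do fread("cat test.csv | tr -d '\r'") and you can skip the system() step, or something like that. The quotes around \r may not be necessary.

Thanks @RichardScriven, I just found out that fread() can accept shell commands directly. However, it does not work for me without the quotes around \r.

Good tip from @DirtySockSniffer, the quotes should be added. Not only do the quotes seem to be necessary, though, but also a double-escaped backslash, i.e. fread("cat f.csv | tr -d '\\r'"). That is the only way I got it to work on Linux.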
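The escaping question in the comments above comes down to what R puts into the string before the shell ever sees it. The sketch below is only an illustration: the file name test_escape.csv is made up, and the fread() call assumes a data.table version whose fread() has the cmd argument and a system where tr is on the PATH.

```r
library(data.table)

# "\r" inside an R string literal is already a real carriage-return byte,
# while "\\r" is a backslash followed by the letter r, which tr interprets itself.
cmd_single <- "tr -d '\r' < test_escape.csv"
cmd_double <- "tr -d '\\r' < test_escape.csv"

# cat() shows the command as the shell will receive it
cat(cmd_double, "\n")

# write a small \r\r\n file in binary mode (illustrative file name)
writeBin(charToRaw("a,b,c\r\r\n1,2,3\r\r\n4,5,6\r\r\n"), "test_escape.csv")

# only attempt the shell pipeline when tr is actually available
if (nzchar(Sys.which("tr"))) {
    dt <- fread(cmd = cmd_double)
    print(dt)
}
```

Passing the command through cmd rather than as the first argument makes it explicit that a shell command, not a file name, is intended.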
# create a 100,000 x 5 sample dataset with lines ending in \r\r\n
delim <- "\r\r\n"
# header written without spaces, because scan() in fread2() below splits on whitespace
sample.txt <- paste0("a,b,c,d,e", delim)
for (i in 1:100000) {
    sample.txt <- paste0(sample.txt,
                        paste(round(runif(5)*100), collapse = ","),
                        delim)
}
cat(sample.txt, file = "sample.csv")


# R-only function that strips the extra \r characters before calling fread()
fread2 <- function(filename) {
    # scan() splits on whitespace (space, tab, \r, \n), so each element is one line
    # as long as the fields themselves contain no spaces
    tmp <- scan(file = filename, what = "character", quiet = TRUE)
    # drop any empty strings
    tmp <- tmp[tmp != ""]
    # paste the lines back together with the \n character
    tmp <- paste(tmp, collapse = "\n")
    fread(tmp)
}

# OP's approach using read.table(), which is slower
readcsvMethod <- function(myfilename)
    data.table(read.table(myfilename, header = TRUE, sep = ',', fill = TRUE))

library(data.table)
library(microbenchmark)
microbenchmark(OPcsv = readcsvMethod("sample.csv"),
               freadScan = fread2("sample.csv"),
               freadtr = fread("tr -d \'\\r\' < sample.csv"),
               unit = "relative")
## Unit: relative
##           expr      min       lq     mean   median       uq      max neval
##          OPcsv 1.331462 1.336524 1.340037 1.365397 1.366041 1.249223   100
##      freadScan 1.532169 1.581195 1.624354 1.673691 1.676596 1.355434   100
##        freadtr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
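
For Windows users who want to avoid external utilities altogether, another R-only route is to work on the raw bytes instead of going through scan(). The following is a minimal sketch and not part of the original answer: the helper name fread_strip_cr and the file name test_crcrlf.csv are made up for illustration, and it assumes a data.table version whose fread() accepts the text argument. It has not been benchmarked against the methods above.

```r
library(data.table)

# Hypothetical helper: read a file whose lines end in \r\r\n by dropping
# every carriage-return byte before parsing.
fread_strip_cr <- function(filename, ...) {
    # read the whole file as raw bytes, so no text-mode translation happens
    bytes <- readBin(filename, what = "raw", n = file.size(filename))
    # 0x0D is the carriage return (\r); keep every other byte
    bytes <- bytes[bytes != as.raw(0x0D)]
    # hand the cleaned text to fread() through its text argument
    fread(text = rawToChar(bytes), ...)
}

# usage: write a small \r\r\n file in binary mode, then read it back
writeBin(charToRaw("a,b,c\r\r\n1,2,3\r\r\n4,5,6\r\r\n"), "test_crcrlf.csv")
dt <- fread_strip_cr("test_crcrlf.csv")
print(dt)
```

Note that this removes every \r in the file, including any that sit inside quoted fields, and it holds the whole file in memory, so it suits files like the ones described in the question rather than arbitrary input.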