Reading a large fixed-format text file in R


I am trying to read a fixed-format text file larger than 70 MB into R. For smaller files (under 1 MB), I can use the read.fwf function, as shown below:

condodattest1a <- read.fwf(impfile1,widths=testcsv3$Varlen,col.names=testcsv3$Varname)

At least part of my problem is that read.fwf reads all of my data in as factors, and I end up exceeding the memory limit on my computer.
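One way to avoid the factor conversion is to pass colClasses through read.fwf, which forwards extra arguments to read.table. The sketch below uses a made-up three-field layout and a temporary file; the column names, widths and classes are assumptions for illustration only. With the objects from the question, the equivalent would presumably be colClasses=as.character(testcsv3$Varclass), assuming Varclass holds valid R class names.

```r
# Toy fixed-width file: 2-char code, 3-digit count, 5-char price (hypothetical layout)
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB12312.50", "CD45607.25"), tmp)

widths  <- c(2, 3, 5)
classes <- c("character", "integer", "numeric")

dat <- read.fwf(tmp,
                widths = widths,
                col.names = c("code", "count", "price"),
                colClasses = classes,      # forwarded to read.table
                stringsAsFactors = FALSE,  # redundant once colClasses is set; shown for clarity
                buffersize = 1000)         # number of lines processed per pass

str(dat)
unlink(tmp)
```

A smaller buffersize lowers the amount of raw text held at once, at the cost of more passes.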

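I tried using read.table as a way to specify the format of each variable, but it seems that I need a text delimiter for that function to work. Section 3.3 of the linked document suggests that I could use sep to identify the column where each variable starts.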

However, I have not been able to get this to work with the following command:

condodattest1b <- read.table(impfile1,sep=testcsv3$Varsep,col.names=testcsv3$Varname, colClasses=testcsv3$Varclass)

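At this point, all I want to do is have the data come into R as something other than factors. I hope this will limit the amount of memory I use and allow me to actually read the file in. I would appreciate any suggestions. I know the Fortran format of every variable and the column at which each variable starts.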
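Note that sep in read.table is a single field-separator character, not a set of column positions, so start columns have to be turned into field widths before read.fwf can use them. A minimal sketch, assuming contiguous fields and a known record length (the starts and line_len values here are invented for illustration):

```r
# Hypothetical start columns of three fields and the total record length
starts   <- c(1, 3, 6)
line_len <- 10

# Width of each field = distance to the next start (or to the end of the record)
widths <- diff(c(starts, line_len + 1))
print(widths)

# Use the derived widths on a toy file
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB12312.50", "CD45607.25"), tmp)
dat <- read.fwf(tmp, widths = widths,
                colClasses = c("character", "integer", "numeric"))
unlink(tmp)
```

If the layout has unused columns between fields, read.fwf accepts negative widths to skip them.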

Thanks,

Warren

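Maybe this code works for you. You have to fill varlen with the field widths and colclasses with the corresponding type strings (e.g. "numeric", "character", "integer").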

my.readfwf <- function(filename, varlen, colclasses) {
  # Start and end character positions of each field
  sidx <- cumsum(c(1, head(varlen, -1)))
  eidx <- sidx + varlen - 1
  # Read the whole file, one string per line
  filecontent <- scan(filename, character(0), sep = "\n", quiet = TRUE)
  if (any(diff(nchar(filecontent)) != 0))
    stop("line lengths differ!")
  res <- vector("list", length(varlen))
  for (i in seq_along(varlen)) {
    # substring() is vectorized over lines, so sapply() is not needed
    res[[i]] <- substring(filecontent, sidx[i], eidx[i])
    # Coerce the field to the requested type
    mode(res[[i]]) <- colclasses[i]
  }
  attributes(res) <- list(names = paste("V", seq_along(res), sep = ""),
                          row.names = seq_along(res[[1]]),
                          class = "data.frame")
  return(res)
}
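The function above still loads every line of the file into memory at once. If that is the bottleneck, a variation is to read the file in chunks through a connection. The following is only a sketch under assumptions: read_fwf_chunked and chunk_lines are hypothetical names, fields are assumed contiguous, and the combined data frame must still fit in memory (the saving is in not holding all of the raw text at the same time).

```r
# Hypothetical helper: parse a fixed-width file chunk by chunk
read_fwf_chunked <- function(path, widths, classes, chunk_lines = 50000L) {
  starts <- cumsum(c(1, head(widths, -1)))
  ends   <- starts + widths - 1
  con <- file(path, open = "r")
  on.exit(close(con))
  chunks <- list()
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break          # end of file reached
    cols <- lapply(seq_along(widths), function(i) {
      x <- trimws(substring(lines, starts[i], ends[i]))
      mode(x) <- classes[i]                 # coerce to the requested type
      x
    })
    names(cols) <- paste0("V", seq_along(cols))
    chunks[[length(chunks) + 1L]] <- as.data.frame(cols, stringsAsFactors = FALSE)
  }
  do.call(rbind, chunks)
}

# Usage on a toy file, two lines per chunk
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB12312.50", "CD45607.25", "EF00100.10",
             "GH99999.99", "IJ01003.00"), tmp)
dat <- read_fwf_chunked(tmp, widths = c(2, 3, 5),
                        classes = c("character", "integer", "numeric"),
                        chunk_lines = 2L)
unlink(tmp)
```

Filtering or aggregating each chunk inside the loop, instead of keeping all of them, would reduce memory use further.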

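Take a look at the image below. Alternatively, it may be worth creating a database and accessing the data with RODBC. Have a look at mnel's recent answer.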


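For reference, the read.fortran attempt using the objects from the question, and the error it produced:

condodattest1c <- read.fortran(impfile1,lengths=testcsv3$Varlen, format=testcsv3$Varforfmt, col.names=testcsv3$Varname)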

Error in processFormat(format) : missing lengths for some fields
In addition: Warning messages:
1: In processFormat(format) : NAs introduced by coercion
2: In processFormat(format) : NAs introduced by coercion
3: In processFormat(format) : NAs introduced by coercion
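The "missing lengths for some fields" error suggests that read.fortran could not parse a width out of one or more of the format strings. As documented, read.fortran takes the widths from the format strings themselves (for example "A2", "I3", "F5.0", "2X") and has no lengths argument of its own. A possible cause, which is only a guess here, is that Varforfmt is stored as a factor or uses descriptors read.fortran does not recognize; converting with as.character() and checking the strings would be a first step. A minimal sketch with an invented layout:

```r
# Toy file: 2-char code, 3-digit integer, 5-char real with an explicit decimal point
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB12312.50", "CD45607.25"), tmp)

# Each format string carries its own width; "F5.0" means width 5 with d = 0,
# so no decimal scaling is applied to the values
fmt <- c("A2", "I3", "F5.0")

dat <- read.fortran(tmp, format = fmt,
                    col.names = c("code", "count", "price"))
str(dat)
unlink(tmp)
```

Because read.fortran defaults to as.is = TRUE, character fields come back as character vectors, not factors.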
