
Operations using ffdfwith in R

I am using ff with R because I have a huge dataset (about 16 GB) to process. As a test case, I read roughly 1 million records from the file and wrote them out as an ff database:


system.time(te3 <- read.csv.ffdf(file="testdata.csv", sep = ",", header=TRUE, first.rows=10000, next.rows=50000, colClasses=c("numeric","numeric","numeric","numeric")))



## Both ffdfwith and with evaluate the expression chunk by chunk.
te3$odfips <- ffdfwith(te3, ofips*100000 + dfips)
te3$odfips <- with(te3, ofips*100000 + dfips)
## It is better to restrict the ffdf to the columns the expression needs;
## otherwise the other columns are loaded into RAM as well, which is unnecessary.
## Restricting the columns speeds things up.
te3$odfips <- ffdfwith(te3[c("ofips","dfips")], ofips*100000 + dfips)
te3$odfips <- with(te3[c("ofips","dfips")], ofips*100000 + dfips)
## ffdfwith reads options("ffbatchbytes") and works out how many rows of your ffdf
## fit in one batch without exceeding options("ffbatchbytes"), and hence RAM.
## The new variable is therefore created in chunks.
## To set the chunk size yourself, pass the by argument to with; it is passed on
## to chunk (see ?chunk). For example, the variable below is created
## in chunks of 100000 records.
te3$odfips <- with(te3[c("ofips","dfips")], ofips*100000 + dfips, by = 100000)

## Because the Ops * and + are implemented in ffbase for ff vectors, you can also do this:
te3$odfips <- te3$ofips * 100000 + te3$dfips
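
To make "chunkwise" concrete, here is a minimal base-R sketch of the idea: split the row range into blocks of at most `by` rows and evaluate the expression on one block at a time. It uses an ordinary data.frame with made-up values, and `chunk_ranges` is a hypothetical helper written for this illustration; it is not ff's `chunk` and not how ffbase is actually implemented.

```r
# Hypothetical helper: start/end row indices for blocks of at most `by` rows.
chunk_ranges <- function(n, by) {
  starts <- seq(1, n, by = by)
  ends <- pmin(starts + by - 1, n)
  Map(function(s, e) c(from = s, to = e), starts, ends)
}

# Made-up stand-in for the ffdf; only the two columns the expression needs.
df <- data.frame(ofips = c(1001, 1003, 6037, 36061, 48201),
                 dfips = c(1003, 6037, 1001, 48201, 36061))

# Fill the result block by block, so only `by` rows are touched at a time.
odfips <- numeric(nrow(df))
for (r in chunk_ranges(nrow(df), by = 2)) {
  idx <- r[["from"]]:r[["to"]]
  odfips[idx] <- with(df[idx, ], ofips * 100000 + dfips)
}

# Compare the chunked result with the all-at-once computation.
print(identical(odfips, df$ofips * 100000 + df$dfips))
```

With an ffdf the same pattern matters because each block is read from disk into RAM only while it is being processed, which is why the block size has to stay within the memory budget.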

Error in if (by < 1) stop("'by' must be > 0") :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In chunk.default(from = 1L, to = 1000000L, by = 2293760000, maxindex = 1000000L) :
  NAs introduced by coercion
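
One plausible reading of this error: the warning shows `by = 2293760000`, which is larger than the biggest value R's integer type can hold (`.Machine$integer.max`, 2147483647). Coercing such a value to integer yields NA, and an NA condition in `if (by < 1)` fails with "missing value where TRUE/FALSE needed". The base-R sketch below only reproduces that mechanism; that chunk performs the coercion in exactly this way internally is an assumption.

```r
# Value taken from the warning message above.
by <- 2293760000

# It exceeds the largest value R's integer type can hold.
exceeds_int <- by > .Machine$integer.max

# Coercing it to integer gives NA (R also emits a coercion warning,
# silenced here so the sketch runs quietly).
by_int <- suppressWarnings(as.integer(by))

# An NA condition makes if() fail; capture the error message instead of stopping.
msg <- tryCatch(
  {
    if (by_int < 1) stop("'by' must be > 0")
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(exceeds_int)
print(by_int)
print(msg)
```

If this is indeed the cause, passing an explicit, smaller `by` (such as `by = 100000`) should keep the chunk size within integer range.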
